Cw-CONLI Algorithm
Research Question
Should retrieval quality influence how strict NLI verification is?
Standard CoNLI (Microsoft, 2024) uses a fixed threshold: the same verification strictness applies whether the retrieved evidence is highly relevant or barely related. Cw-CONLI introduces confidence-weighted thresholds that adapt to the retrieval score.
Experimental Conditions
| Condition | Name | Description |
|---|---|---|
| C1 | RAG-only | No NLI verification (baseline) |
| C2 | Static CONLI | Fixed threshold regardless of retrieval confidence |
| C3 | Cw-CONLI | Adaptive threshold weighted by retrieval confidence |
Threshold Functions
C3 is implemented as three mathematical variants of the threshold function T(rs), where rs denotes the retrieval confidence score:
Tiered (Step Function)
T(rs) = T_strict if rs < pivot, else T_lenient
A binary step controlled by the pivot parameter. Simple and interpretable but discontinuous at the pivot.
Parameters: pivot, T_strict, T_lenient
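A minimal sketch of the tiered variant, assuming rs is normalized to [0, 1]; the default parameter values are illustrative, not taken from the experiments:

```python
def tiered_threshold(rs: float, t_strict: float = 0.9,
                     t_lenient: float = 0.5, pivot: float = 0.6) -> float:
    """Step function: strict verification below the pivot, lenient at or above it."""
    return t_strict if rs < pivot else t_lenient
```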
Square Root (Continuous)
T(rs) = T_strict − (T_strict − T_lenient) · √rs
A smooth, monotonically decreasing function. Because √rs is concave, even modest improvements at low retrieval scores produce meaningful threshold relaxation.
Parameters: T_strict, T_lenient
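The same kind of sketch for the square-root variant. The endpoints pin T(0) = T_strict and T(1) = T_lenient, so rs must be normalized to [0, 1] (an assumption; the defaults are again illustrative):

```python
import math

def sqrt_threshold(rs: float, t_strict: float = 0.9, t_lenient: float = 0.5) -> float:
    """Concave relaxation: T(0) = t_strict, T(1) = t_lenient."""
    return t_strict - (t_strict - t_lenient) * math.sqrt(rs)
```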
Sigmoid (Logistic)
T(rs) = T_lenient + (T_strict − T_lenient) / (1 + exp(k · (rs − pivot)))
An S-shaped logistic transition. Parameter k controls steepness. Most flexible but largest hyperparameter space.
Parameters: T_strict, T_lenient, pivot, k
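And a sketch of the sigmoid variant (illustrative defaults; k and pivot are free hyperparameters in the source):

```python
import math

def sigmoid_threshold(rs: float, t_strict: float = 0.9, t_lenient: float = 0.5,
                      pivot: float = 0.6, k: float = 10.0) -> float:
    """Logistic transition: approaches t_strict for rs << pivot, t_lenient for rs >> pivot."""
    return t_lenient + (t_strict - t_lenient) / (1.0 + math.exp(k * (rs - pivot)))
```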
Classification Rule
For all variants, the verdict logic is identical:
```python
# T is whichever threshold variant is configured (tiered, square root, or sigmoid).
if nli_score >= T(rs):
    verdict = "VERIFIED"
else:
    verdict = "HALLUCINATION"
```
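As an illustration (the parameter and score values are hypothetical), the same NLI score can yield opposite verdicts depending on retrieval confidence:

```python
def verdict(nli_score: float, rs: float, threshold_fn) -> str:
    """Shared verdict rule, parameterized by any of the three threshold variants."""
    return "VERIFIED" if nli_score >= threshold_fn(rs) else "HALLUCINATION"

# With the illustrative sigmoid defaults above, an entailment score of 0.7
# passes under strong retrieval but is flagged under weak retrieval.
print(verdict(0.7, rs=0.9, threshold_fn=sigmoid_threshold))  # VERIFIED       (T ≈ 0.52)
print(verdict(0.7, rs=0.2, threshold_fn=sigmoid_threshold))  # HALLUCINATION  (T ≈ 0.89)
```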
Relationship to Prior Work
| Aspect | Microsoft CoNLI | AFLHR Lite (Cw-CONLI) |
|---|---|---|
| NLI engine | GPT-3.5/GPT-4 via prompting | RoBERTa-large-MNLI (local) |
| Retrieval | None — source document given | FAISS + sentence embeddings |
| Detection | Sentence → entity-level (Azure NER) | Sentence decomposition + sliding-window NLI |
| Mitigation | GPT rewrites flagged text | Detection only |
| Infrastructure | Azure OpenAI + Azure Text Analytics | Runs locally on CPU |
| Novel addition | — | Confidence-weighted adaptive thresholds |
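The table names FAISS with sentence embeddings as the retrieval layer. A hedged sketch of where the retrieval confidence rs could come from follows; the model name matches the v2 table below, but the scoring and clamping scheme here is an assumption, not the project's confirmed pipeline:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # embedding model from the v2 table

# Index the evidence corpus with normalized embeddings so inner product = cosine.
docs = ["The Eiffel Tower is in Paris.", "Mount Everest is the tallest mountain."]
doc_vecs = model.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# rs = similarity of the best-matching evidence passage for the claim.
query_vec = model.encode(["Where is the Eiffel Tower?"], normalize_embeddings=True).astype("float32")
scores, _ = index.search(query_vec, k=1)
rs = max(float(scores[0][0]), 0.0)  # cosine can be negative; clamping to [0, 1] is an assumption
```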
v2 Engineering Improvements
| Improvement | Problem Addressed | Approach |
|---|---|---|
| Sliding-window NLI | RoBERTa 512-token limit truncates long premises | Split into overlapping 400-token windows, take max entailment (sketched below) |
| Claim decomposition | Whole-response NLI misses partial hallucinations | Split into sentences, verify each, take min score (sketched below) |
| BGE embeddings | MiniLM-L6-v2 is outdated (2021) | Upgrade to BAAI/bge-small-en-v1.5 |
| Temperature scaling | Raw NLI softmax outputs may be uncalibrated | Investigated (Guo et al., 2017); the fitted T = 10.0 hit the search boundary, so scaling was disabled |
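A hedged sketch of the sliding-window and claim-decomposition rows, assuming the roberta-large-mnli checkpoint named earlier; the 200-token stride and the helper names are assumptions (the table only says the windows overlap):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()
ENTAILMENT = model.config.label2id["ENTAILMENT"]  # index 2 for this checkpoint

def entailment_score(premise: str, hypothesis: str,
                     window: int = 400, stride: int = 200) -> float:
    """Max entailment probability over overlapping token windows of the premise."""
    tokens = tokenizer.encode(premise, add_special_tokens=False)
    best = 0.0
    for start in range(0, max(len(tokens), 1), stride):
        # Decode the window back to text; slightly lossy, but keeps the sketch simple.
        chunk = tokenizer.decode(tokens[start:start + window])
        inputs = tokenizer(chunk, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        best = max(best, probs[0, ENTAILMENT].item())
        if start + window >= len(tokens):
            break  # this window already reached the end of the premise
    return best

def response_score(premise: str, claims: list[str]) -> float:
    """Min over decomposed claims: one unsupported sentence drags the whole response down."""
    return min(entailment_score(premise, c) for c in claims)
```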