Cw-CONLI Algorithm

Research Question

Should retrieval quality influence how strict NLI verification is?

Standard CONLI (Microsoft, 2024) uses a fixed threshold — the same verification strictness regardless of whether the retrieved evidence is highly relevant or barely related. Cw-CONLI introduces confidence-weighted thresholds that adapt based on retrieval score.

Experimental Conditions

| Condition | Name | Description |
|---|---|---|
| C1 | RAG-only | No NLI verification (baseline) |
| C2 | Static CONLI | Fixed threshold regardless of retrieval confidence |
| C3 | Cw-CONLI | Adaptive threshold weighted by retrieval confidence |

Threshold Functions

C3 is implemented with three threshold-function variants:

Tiered (Step Function)

T(rs) = T_strict    if rs < pivot
        T_lenient   otherwise

A binary step controlled by the pivot parameter. Simple and interpretable but discontinuous at the pivot.

Parameters: pivot, T_strict, T_lenient
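A minimal Python sketch of the tiered variant; the default pivot and threshold values are illustrative, not the experiment's actual settings:

```python
def tiered_threshold(rs: float, pivot: float = 0.5,
                     t_strict: float = 0.9, t_lenient: float = 0.7) -> float:
    """Step function: strict NLI threshold below the pivot, lenient at or above it."""
    return t_strict if rs < pivot else t_lenient
```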

Square Root (Continuous)

T(rs) = T_strict − (T_strict − T_lenient) · √rs

A smooth, monotonically decreasing function. Because √rs is steepest near zero, the concave shape means even modest retrieval improvements produce meaningful threshold relaxation.

Parameters: T_strict, T_lenient
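The square-root variant as a sketch; again, the default T_strict/T_lenient values are illustrative:

```python
import math

def sqrt_threshold(rs: float, t_strict: float = 0.9,
                   t_lenient: float = 0.7) -> float:
    """Concave, monotonically decreasing: T_strict at rs=0, T_lenient at rs=1."""
    return t_strict - (t_strict - t_lenient) * math.sqrt(rs)
```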

Sigmoid (Logistic)

T(rs) = T_lenient + (T_strict − T_lenient) / (1 + exp(k · (rs − pivot)))

An S-shaped logistic transition. Parameter k controls steepness. Most flexible but largest hyperparameter space.

Parameters: T_strict, T_lenient, pivot, k
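A sketch of the sigmoid variant; the default parameter values are illustrative. At rs = pivot the threshold sits exactly midway between T_strict and T_lenient:

```python
import math

def sigmoid_threshold(rs: float, t_strict: float = 0.9, t_lenient: float = 0.7,
                      pivot: float = 0.5, k: float = 10.0) -> float:
    """Logistic transition from T_strict (low rs) toward T_lenient (high rs);
    k controls how sharply the threshold drops around the pivot."""
    return t_lenient + (t_strict - t_lenient) / (1.0 + math.exp(k * (rs - pivot)))
```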

Classification Rule

For all variants, the verdict logic is identical:

if nli_score >= T(rs):
    verdict = VERIFIED
else:
    verdict = HALLUCINATION
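Putting a threshold function and the verdict rule together, assuming the square-root variant and illustrative parameter defaults:

```python
import math

def sqrt_threshold(rs: float, t_strict: float = 0.9,
                   t_lenient: float = 0.7) -> float:
    return t_strict - (t_strict - t_lenient) * math.sqrt(rs)

def classify(nli_score: float, rs: float) -> str:
    """Verdict rule shared by all variants: compare the NLI score to T(rs)."""
    return "VERIFIED" if nli_score >= sqrt_threshold(rs) else "HALLUCINATION"
```

With these defaults, the same NLI score of 0.85 passes when retrieval is decent (rs = 0.25, threshold 0.8) but is flagged when retrieval is poor (rs = 0.0, threshold 0.9), which is the point of the adaptive threshold.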

Relationship to Prior Work

| | Microsoft CoNLI | AFLHR Lite (Cw-CONLI) |
|---|---|---|
| NLI engine | GPT-3.5/GPT-4 via prompting | RoBERTa-large-MNLI (local) |
| Retrieval | None (source document given) | FAISS + sentence embeddings |
| Detection | Sentence → entity-level (Azure NER) | Sentence decomposition + sliding-window NLI |
| Mitigation | GPT rewrites flagged text | Detection only |
| Infrastructure | Azure OpenAI + Azure Text Analytics | Runs locally on CPU |
| Novel addition | None | Confidence-weighted adaptive thresholds |

v2 Engineering Improvements

| Improvement | Problem Addressed | Approach |
|---|---|---|
| Sliding-window NLI | RoBERTa 512-token limit truncates long premises | Split into overlapping 400-token windows, take max entailment |
| Claim decomposition | Whole-response NLI misses partial hallucinations | Split into sentences, verify each, take min score |
| BGE embeddings | MiniLM-L6-v2 is outdated (2021) | Upgrade to BAAI/bge-small-en-v1.5 |
| Temperature scaling | Raw NLI softmax outputs may be uncalibrated | Investigated (Guo et al., 2017); fitted T = 10.0 sat at the search boundary, so scaling was disabled |
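The sliding-window split described above can be sketched as follows. The 400-token window is from the table; the 200-token stride, the list-of-tokens input, and the max aggregation helper are assumptions for illustration (the tokenizer and NLI scorer are omitted):

```python
def sliding_windows(tokens: list, window: int = 400, stride: int = 200) -> list:
    """Split a long token sequence into overlapping windows so each fits
    the NLI model's context limit. Stride of 200 gives 50% overlap."""
    windows = []
    for start in range(0, max(1, len(tokens) - window + stride), stride):
        windows.append(tokens[start:start + window])
    return windows

def max_entailment(window_scores: list) -> float:
    """Aggregate per-window entailment scores by taking the max: a claim
    counts as supported if any evidence window entails it."""
    return max(window_scores)
```

Per the claim-decomposition row, the per-sentence scores of a response would then be aggregated with min rather than max, so one unsupported sentence flags the whole response.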