Experimental Results

Evaluated on HaluEval (10K QA + 10K Summarisation samples) with a 70/30 dev/test split.
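For reproducibility, here is a minimal sketch of the 70/30 dev/test split; the seed, the stratification key, and the use of `train_test_split` are assumptions, since only the 70/30 ratio appears in this section.

```python
# Hedged sketch of the 70/30 dev/test split. The seed (42) and per-task
# stratification are assumptions; only the 70/30 ratio comes from the text.
from sklearn.model_selection import train_test_split

def split_halueval(samples):
    # samples: list of dicts with a "task" key ("qa" or "summarisation")
    dev, test = train_test_split(
        samples,
        test_size=0.30,
        random_state=42,
        stratify=[s["task"] for s in samples],
    )
    return dev, test
```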

Summary

| Metric | v2 (standard) | v2 (realistic) |
|---|---|---|
| Combined F1 (best C3) | 0.6998 | 0.6558 |
| QA F1 | 0.770 (C2 = C3) | |
| QA over-flagging | 11.2% (C3 Tiered) | |
| Summarisation F1 | ≈ 0.663 (fixed from 99%+ FPR in v1) | |
| FPR (C2 → C3 Sqrt) | | 100% → 44.9% |
| C3 vs C2 significant? | No (p = 1.0) | Yes (p = 0.000014) |

Combined Test Set — Standard (n = 6,000)

| Condition | F1 | Precision | Recall | Over-flagging |
|---|---|---|---|---|
| C1 (RAG-only) | 0.000 | 0.000 | 0.000 | 0.0% |
| C2 (Static CONLI) | 0.6988 | 0.6014 | 0.8338 | 55.3% |
| C3 Tiered | 0.6998 | 0.6010 | 0.8374 | 55.7% |
| C3 Sqrt | 0.6998 | 0.6008 | 0.8378 | 55.7% |
| C3 Sigmoid | 0.6992 | 0.6005 | 0.8368 | 55.7% |

QA Test Set (n = 2,959)

| Condition | F1 | Precision | Recall | FPR |
|---|---|---|---|---|
| C2 (Static) | 0.7702 | 0.8418 | 0.7098 | 13.6% |
| C3 Tiered | 0.7679 | 0.8629 | 0.6917 | 11.2% |
| C3 Sqrt | 0.7618 | 0.8776 | 0.6729 | 9.5% |
| C3 Sigmoid | 0.7687 | 0.8221 | 0.7218 | 15.9% |

Realistic Retrieval — Shared Index (n = 5,918)

This experiment uses a shared FAISS index across all QA samples (realistic RAG conditions) instead of per-sample ground-truth retrieval.
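
A minimal sketch of the shared-index setup is below. The BGE checkpoint name, the value of k, and the helper function are assumptions; only "shared FAISS index" and "BGE embeddings" come from this section.

```python
# Minimal sketch of shared-index retrieval (assumed BGE checkpoint and k).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed BGE variant

# One index over the passages of ALL QA samples, pooled, instead of a fresh
# per-sample index built from that sample's ground-truth passage.
passages = ["Passage for sample 1 ...", "Passage for sample 2 ..."]  # pooled corpus
emb = encoder.encode(passages, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(emb, dtype="float32"))

def retrieve(query: str, k: int = 5):
    """Return (passage, score) pairs; scores feed the C3 adaptive thresholds."""
    q = encoder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), min(k, len(passages)))
    return [(passages[i], float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]
```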

| Condition | F1 | Precision | Recall | FPR |
|---|---|---|---|---|
| C2 (Static) | 0.6701 | 0.5041 | 0.9993 | 100.0% |
| C3 Tiered | 0.6558 | 0.5236 | 0.8773 | 81.2% |
| C3 Sqrt | 0.6414 | 0.6067 | 0.6803 | 44.9% |
| C3 Sigmoid | 0.6400 | 0.5604 | 0.7460 | 59.5% |

[Figure: F1 comparison across conditions, realistic retrieval]

[Figure: Retrieval score distribution, realistic retrieval]

Statistical Significance (McNemar's Test)

| Split | Statistic | p-value | Significant? |
|---|---|---|---|
| Combined (standard) | 0.000 | 1.0 | No |
| QA | 0.544 | 0.461 | No |
| Summarisation | 0.056 | 0.814 | No |
| Realistic | 18.884 | 0.000014 | Yes |
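
McNemar's test operates on the paired per-sample decisions of C2 and C3, using only the discordant cells of the 2×2 correctness table. A minimal sketch with `statsmodels` follows; the continuity-corrected chi-square variant is an assumption consistent with the reported statistic values.

```python
# McNemar's test on paired C2/C3 per-sample decisions. Using the
# continuity-corrected chi-square form (exact=False) is an assumption.
from statsmodels.stats.contingency_tables import mcnemar

def compare(correct_c2, correct_c3):
    """correct_c2/correct_c3: per-sample booleans, True = condition was right."""
    a = sum(x and y for x, y in zip(correct_c2, correct_c3))          # both right
    b = sum(x and not y for x, y in zip(correct_c2, correct_c3))      # only C2 right
    c = sum(y and not x for x, y in zip(correct_c2, correct_c3))      # only C3 right
    d = sum(not x and not y for x, y in zip(correct_c2, correct_c3))  # both wrong
    res = mcnemar([[a, b], [c, d]], exact=False, correction=True)
    return res.statistic, res.pvalue
```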

Key Finding

On HaluEval's standard per-sample retrieval, C3 and C2 converge to the same operating point: retrieval scores cluster too tightly (QA std = 0.075) for adaptive thresholds to separate the conditions. Under realistic shared-index retrieval, the difference is significant: C2 flags 100% of correct responses as hallucinations, while C3 Sqrt cuts over-flagging to 44.9%.
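
For intuition, here is a hedged sketch of what score-dependent schedules of this shape could look like. The base threshold, breakpoints, and all constants are assumptions; only the variant names (Tiered, Sqrt, Sigmoid) and the fact that they consume retrieval scores come from this section.

```python
# Hypothetical C3 threshold schedules: map a retrieval score in [0, 1] to an
# NLI decision threshold. C2 uses the static BASE value everywhere. All
# constants here are illustrative assumptions, not the calibrated values.
import math

BASE = 0.5  # assumed static C2 threshold

def tiered(score: float) -> float:
    # Step schedule: stricter (flags more) when retrieval support is weak.
    if score < 0.3:
        return BASE - 0.1
    if score > 0.7:
        return BASE + 0.1
    return BASE

def sqrt_schedule(score: float) -> float:
    # Relax the threshold smoothly as retrieval confidence grows.
    return BASE + 0.2 * (math.sqrt(max(score, 0.0)) - 0.5)

def sigmoid_schedule(score: float, k: float = 10.0, mid: float = 0.5) -> float:
    # Same idea with a soft step centred at the midpoint score.
    return BASE + 0.2 * (1.0 / (1.0 + math.exp(-k * (score - mid))) - 0.5)
```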

The primary contribution is v2 engineering: sliding-window NLI makes summarisation functional (v1's 99%+ FPR drops to F1 ≈ 0.663), claim decomposition catches partial hallucinations, and BGE embeddings improve retrieval fidelity.
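
A minimal sketch of the sliding-window NLI idea is below: long summarisation contexts are scored window by window, and a claim counts as supported by its best window instead of a truncated context. The NLI checkpoint, window size, stride, and max-aggregation are all assumptions; the section only states that windowing fixed summarisation.

```python
# Sliding-window NLI sketch (assumed model, window size, and aggregation).
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def windows(tokens, size=300, stride=150):
    # Overlapping word windows over the full context.
    for start in range(0, max(len(tokens) - size, 0) + 1, stride):
        yield " ".join(tokens[start:start + size])

def max_entailment(context: str, claim: str) -> float:
    # Score the claim against every window and keep the best entailment,
    # rather than truncating the context (the assumed source of v1's FPR).
    scores = []
    for w in windows(context.split()):
        out = nli({"text": w, "text_pair": claim}, top_k=None)
        scores.append(next(s["score"] for s in out if s["label"] == "ENTAILMENT"))
    return max(scores) if scores else 0.0
```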

Running Experiments

```bash
# Full v2 pipeline (recommended)
python run_v2.py  # ~4h GPU, ~24h CPU

# Step by step
python calibrate.py --split dev
python evaluate.py --precompute --split dev --version v2
python evaluate.py --precompute --split test --version v2
python tune.py --split dev --version v2
python evaluate.py --condition C3 --split test --version v2
python analyze.py --split test --version v2
```