System Architecture

AFLHR Lite uses a two-layer pipeline: RAG retrieval followed by NLI verification.

Pipeline Stages

Stage 1 — Retrieve Evidence

The user query is embedded with a sentence-transformer model and compared against a FAISS index of pre-embedded knowledge passages. The top-k most similar passages are returned together with a retrieval score (cosine similarity) measuring how closely each passage matches the query; a minimal sketch follows the model list below.

  • v1: all-MiniLM-L6-v2 (384 dimensions)
  • v2: BAAI/bge-small-en-v1.5 (384 dimensions, better quality)
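
A minimal sketch of this stage, assuming a tiny in-memory index; the passages and query are placeholder data, and the model ID is the v2 embedder listed above:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # v2 embedder, 384-dim

passages = ["The Eiffel Tower is in Paris.",
            "FAISS performs fast vector similarity search."]
emb = model.encode(passages, normalize_embeddings=True)

# Inner product over unit-normalized vectors == cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype=np.float32))

query = model.encode(["Where is the Eiffel Tower?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype=np.float32), 2)  # top-k = 2
print(scores[0], [passages[i] for i in ids[0]])
```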

Stage 2 — Generate Response

The retrieved passages and original query are sent to an LLM (Llama-3.1-8B-Instant via Groq API) to generate a natural language response. In offline mode, a mock response is used instead.
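
A hedged sketch of this stage using the `groq` client package; the prompt template and the `generate_response` helper are illustrative assumptions, not the project's actual code:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def generate_response(query: str, passages: list[str]) -> str:
    # Hypothetical prompt template: ground the answer in retrieved evidence.
    context = "\n\n".join(passages)
    completion = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```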

Stage 3 — Verify via NLI

The generated response (hypothesis) is checked against the retrieved passages (premise) using RoBERTa-large-MNLI. The model outputs an entailment probability — how strongly the evidence supports the response.
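
A minimal sketch of this check with HuggingFace Transformers; the helper name `entailment_prob` is ours, not the project's:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_prob(premise: str, hypothesis: str) -> float:
    # Pair-encode premise + hypothesis; truncate to the model's 512-token limit.
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    # Look up the ENTAILMENT class index from the model config.
    return probs[nli.config.label2id["ENTAILMENT"]].item()
```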

v2 improvements (both sketched after this list):

  • Sliding-window NLI — premises longer than 512 tokens are split into overlapping 400-token windows with 200-token stride. The maximum entailment score across windows is used.
  • Claim decomposition — the response is split into individual sentences. Each claim is verified independently, and the minimum score (weakest link) becomes the final NLI score.
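
A sketch of both refinements, reusing `tokenizer` and `entailment_prob` from the snippet above; the regex sentence splitter is a simplifying assumption:

```python
import re

def windowed_entailment(premise: str, hypothesis: str,
                        window: int = 400, stride: int = 200) -> float:
    tokens = tokenizer.tokenize(premise)
    if len(tokens) <= window:
        return entailment_prob(premise, hypothesis)
    scores = []
    for start in range(0, len(tokens), stride):
        chunk = tokenizer.convert_tokens_to_string(tokens[start:start + window])
        scores.append(entailment_prob(chunk, hypothesis))
        if start + window >= len(tokens):
            break
    return max(scores)  # best-supporting window wins

def claim_level_score(premise: str, response: str) -> float:
    # Decompose the response into sentence-level claims.
    claims = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    # Weakest link: the least-supported claim sets the final score.
    return min(windowed_entailment(premise, c) for c in claims)
```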

Stage 4 — Adaptive Verdict

The NLI score is compared against a threshold T(rs) that depends on retrieval confidence. This is the Cw-CONLI mechanism:

  • If nli_score >= T(rs) → VERIFIED
  • If nli_score < T(rs) → HALLUCINATION

When retrieval confidence is low, the threshold is strict (harder to pass). When retrieval confidence is high, the threshold is lenient (evidence is trusted).
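
The exact form of T(rs) lives in config.py and is not reproduced here; the linear interpolation below is a hypothetical stand-in that illustrates the mechanism:

```python
def adaptive_threshold(rs: float, strict: float = 0.9,
                       lenient: float = 0.5) -> float:
    # HYPOTHETICAL: linear interpolation, not the project's actual T(rs).
    rs = max(0.0, min(1.0, rs))              # clamp retrieval score to [0, 1]
    return strict - (strict - lenient) * rs  # low rs -> strict, high rs -> lenient

def verdict(nli_score: float, rs: float) -> str:
    return "VERIFIED" if nli_score >= adaptive_threshold(rs) else "HALLUCINATION"
```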

Models

| Component | v1 | v2 | Purpose |
| --- | --- | --- | --- |
| Embeddings | all-MiniLM-L6-v2 | BAAI/bge-small-en-v1.5 | Semantic similarity for retrieval |
| NLI Verifier | RoBERTa-large-MNLI | same | Entailment scoring |
| LLM Generator | Llama-3.1-8B-Instant (Groq) | same | Response generation |

Project Structure

Shaun_FYP/
├── api.py # FastAPI backend (v1 + v2 engine support)
├── engine.py # Core AFLHREngine class
├── config.py # Configuration, model IDs, thresholds
├── dataset.py # HaluEval dataset loader
├── evaluate.py # Evaluation harness (precompute + conditions)
├── tune.py # Grid search hyperparameter tuning
├── analyze.py # Results analysis, plots, McNemar's test
├── calibrate.py # NLI temperature scaling (investigated, disabled)
├── run_v2.py # Automated v2 experiment pipeline
├── frontend/ # React + Vite frontend
│ ├── src/components/ # CircularGauge, VerdictStamp, ThresholdPanel, ...
│ ├── src/pages/ # VerifyPage, ExplorePage, AboutPage
│ └── src/styles/ # Design system (theme.js, global.css)
├── docs/ # Docusaurus documentation (this site)
└── results/ # Experiment outputs (CSVs, JSONs, figures)

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Python 3.10+, FastAPI, Uvicorn |
| ML/NLP | sentence-transformers, FAISS, HuggingFace Transformers |
| LLM | Llama-3.1-8B-Instant via Groq API |
| Frontend | React 18, Vite 5, React Router 7, Framer Motion, Recharts |
| Documentation | Docusaurus |
| Build | GNU Make, pip, npm |