System Architecture

AFLHR Lite uses a two-layer pipeline: RAG retrieval followed by NLI verification.

Pipeline Stages

Stage 1 — Retrieve Evidence

The user query is embedded with a sentence-transformer model and compared against a FAISS index of pre-embedded knowledge passages. The top-k most similar passages are returned together with a retrieval score (cosine similarity) measuring how closely each passage matches the query; a minimal sketch follows the model list below.

  • v1: all-MiniLM-L6-v2 (384 dimensions)
  • v2: BAAI/bge-small-en-v1.5 (384 dimensions, better quality)
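
A minimal sketch of this stage, assuming a tiny in-memory index; the passages and query are placeholder data, and the model ID is the v2 embedder listed above:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # v2 embedder, 384-dim

passages = ["The Eiffel Tower is in Paris.",
            "FAISS performs fast vector similarity search."]
emb = model.encode(passages, normalize_embeddings=True)

# Inner product over unit-normalized vectors == cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype=np.float32))

query = model.encode(["Where is the Eiffel Tower?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype=np.float32), 2)  # top-k = 2
print(scores[0], [passages[i] for i in ids[0]])
```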

Stage 2 — Generate Response

The retrieved passages and original query are sent to an LLM (Llama-3.1-8B-Instant via Groq API) to generate a natural language response. In offline mode, a mock response is used instead.
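
A hedged sketch of this stage using the `groq` client package; the prompt template and the `generate_response` helper are illustrative assumptions, not the project's actual code:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def generate_response(query: str, passages: list[str]) -> str:
    # Hypothetical prompt template: ground the answer in retrieved evidence.
    context = "\n\n".join(passages)
    completion = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```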

Stage 3 — Verify via NLI

The generated response (hypothesis) is checked against the retrieved passages (premise) using RoBERTa-large-MNLI. The model outputs an entailment probability — how strongly the evidence supports the response.
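
A minimal sketch of this check with HuggingFace Transformers; the helper name `entailment_prob` is ours, not the project's:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_prob(premise: str, hypothesis: str) -> float:
    # Pair-encode premise + hypothesis; truncate to the model's 512-token limit.
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    # Look up the ENTAILMENT class index from the model config.
    return probs[nli.config.label2id["ENTAILMENT"]].item()
```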

v2 improvements (both sketched after this list):

  • Sliding-window NLI — premises longer than 512 tokens are split into overlapping 400-token windows with 200-token stride. The maximum entailment score across windows is used.
  • Claim decomposition — the response is split into individual sentences. Each claim is verified independently, and the minimum score (weakest link) becomes the final NLI score.
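
A sketch of both refinements, reusing `tokenizer` and `entailment_prob` from the snippet above; the regex sentence splitter is a simplifying assumption:

```python
import re

def windowed_entailment(premise: str, hypothesis: str,
                        window: int = 400, stride: int = 200) -> float:
    tokens = tokenizer.tokenize(premise)
    if len(tokens) <= window:
        return entailment_prob(premise, hypothesis)
    scores = []
    for start in range(0, len(tokens), stride):
        chunk = tokenizer.convert_tokens_to_string(tokens[start:start + window])
        scores.append(entailment_prob(chunk, hypothesis))
        if start + window >= len(tokens):
            break
    return max(scores)  # best-supporting window wins

def claim_level_score(premise: str, response: str) -> float:
    # Decompose the response into sentence-level claims.
    claims = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    # Weakest link: the least-supported claim sets the final score.
    return min(windowed_entailment(premise, c) for c in claims)
```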

Stage 4 — Adaptive Verdict

The NLI score is compared against a threshold T(rs) that depends on retrieval confidence. This is the Cw-CONLI mechanism:

  • If nli_score >= T(rs) → VERIFIED
  • If nli_score < T(rs) → HALLUCINATION

When retrieval confidence is low, the threshold is strict (harder to pass). When retrieval confidence is high, the threshold is lenient (evidence is trusted).
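
The exact form of T(rs) lives in config.py and is not reproduced here; the linear interpolation below is a hypothetical stand-in that illustrates the mechanism:

```python
def adaptive_threshold(rs: float, strict: float = 0.9,
                       lenient: float = 0.5) -> float:
    # HYPOTHETICAL: linear interpolation, not the project's actual T(rs).
    rs = max(0.0, min(1.0, rs))              # clamp retrieval score to [0, 1]
    return strict - (strict - lenient) * rs  # low rs -> strict, high rs -> lenient

def verdict(nli_score: float, rs: float) -> str:
    return "VERIFIED" if nli_score >= adaptive_threshold(rs) else "HALLUCINATION"
```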

Models

| Component | v1 | v2 | Purpose |
| --- | --- | --- | --- |
| Embeddings | all-MiniLM-L6-v2 | BAAI/bge-small-en-v1.5 | Semantic similarity for retrieval |
| NLI Verifier | RoBERTa-large-MNLI | same | Entailment scoring |
| LLM Generator | Llama-3.1-8B-Instant (Groq) | same | Response generation |

Project Structure

Shaun_FYP/
├── api.py # FastAPI backend (v1 + v2 engine support)
├── engine.py # Core AFLHREngine class
├── config.py # Configuration, model IDs, thresholds
├── dataset.py # HaluEval dataset loader
├── evaluate.py # Evaluation harness (precompute + conditions)
├── tune.py # Grid search hyperparameter tuning
├── analyze.py # Results analysis, plots, McNemar's test
├── calibrate.py # NLI temperature scaling (investigated, disabled)
├── run_v2.py # Automated v2 experiment pipeline
├── frontend/ # React + Vite frontend
│ ├── src/components/ # CircularGauge, VerdictStamp, ThresholdPanel, ...
│ ├── src/pages/ # VerifyPage, ExplorePage, AboutPage
│ └── src/styles/ # Design system (theme.js, global.css)
├── docs/ # Docusaurus documentation (this site)
└── results/ # Experiment outputs (CSVs, JSONs, figures)

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Python 3.10+, FastAPI, Uvicorn |
| ML/NLP | sentence-transformers, FAISS, HuggingFace Transformers |
| LLM | Llama-3.1-8B-Instant via Groq API |
| Frontend | React 18, Vite 5, React Router 7, Framer Motion, Recharts |
| Documentation | Docusaurus |
| Build | GNU Make, pip, npm |