FinLaw-UK
Graph-augmented RAG for UK financial regulation
- Timeline: Sep 2024 to Sep 2025
- Role: Solo MSc dissertation, University of Bradford
- Status: Research
- Primary stack: Mistral 7B · Neo4j · Sentence Transformers · RAGAS
What this project tackles
UK financial regulation is a moving target. The FCA Handbook alone runs to thousands of pages, cross-referenced with MiFID II, the PRA Rulebook, and binding technical standards. Compliance teams burn hours threading citations across documents, and naive LLM lookups hallucinate confidently in exactly the places that matter most.
Off-the-shelf RAG fails here for two reasons. Dense retrieval surfaces semantically similar passages but misses the regulatory entity graph — a single rule is meaningful only in the context of its parent chapter, the obligated entities, and the cross-references it triggers. And without faithfulness evaluation, you can't tell a polished answer from a hallucinated one.
FinLaw-UK was my MSc dissertation: a graph-augmented RAG pipeline that retrieves both semantically and structurally, generates with a small open-weight model, and audits every answer against retrieved context using RAGAS.
System design
The pipeline begins with bulk ingestion of FCA Handbook chapters and adjacent regulatory documents. Each section is chunked at the smallest semantic unit — typically a single rule or sub-rule — then run through a parallel two-stream extraction: Sentence Transformer embeddings into a vector index, and entity/relationship extraction into a Neo4j knowledge graph that captures Rule → Chapter, Rule → Entity, and Rule → Cross-reference relationships.
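As a rough illustration of the graph stream described above, the sketch below derives Rule → Chapter and Rule → Cross-reference edges from a single chunk. It is a minimal sketch, not the dissertation's extractor: the regex for Handbook-style references, the `Chunk` type, the edge labels, and the sample rule text are all assumptions made for this example. Rule → Entity extraction and the embedding stream are left out.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for Handbook-style references such as "COBS 9.2.1R":
# a module code, dotted numbers, and an optional provision-type letter.
REF_PATTERN = re.compile(r"\b([A-Z]{2,6}) (\d+(?:\.\d+)+)([A-Z]?)\b")


@dataclass
class Chunk:
    rule_id: str  # e.g. "COBS 9.2.1R"
    text: str     # the rule or sub-rule text


def chapter_of(rule_id: str) -> str:
    """Derive the parent chapter from a rule ID, e.g. 'COBS 9.2.1R' -> 'COBS 9'."""
    module, number = rule_id.split(" ", 1)
    return f"{module} {number.split('.')[0]}"


def extract_edges(chunk: Chunk) -> list[tuple[str, str, str]]:
    """Return (source, relationship, target) triples for one chunk."""
    edges = [(chunk.rule_id, "IN_CHAPTER", chapter_of(chunk.rule_id))]
    seen = set()
    for match in REF_PATTERN.finditer(chunk.text):
        target = f"{match.group(1)} {match.group(2)}{match.group(3)}"
        # Skip self-references and repeated mentions of the same rule.
        if target != chunk.rule_id and target not in seen:
            seen.add(target)
            edges.append((chunk.rule_id, "CROSS_REFERENCES", target))
    return edges


# Illustrative text only, not a quotation from the Handbook.
sample = Chunk("COBS 9.2.1R", "A firm must also comply with COBS 9.2.2R and SYSC 4.1.1R.")
for edge in extract_edges(sample):
    print(edge)
```

In a real pipeline these triples would be written to Neo4j while the same chunk text is embedded into the vector index, so both stores share the rule ID as a join key.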
At query time, dense retrieval pulls the top-K candidate chunks. The candidates' graph nodes are then expanded one hop in Neo4j to add adjacent rules, parent chapter context, and any cross-referenced sections — the structural context that pure vector search loses. The expanded set is re-ranked by relevance and trimmed to fit Mistral 7B-Instruct's context window.
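The query-time flow above can be sketched with an in-memory adjacency map standing in for Neo4j. This is a simplified sketch under stated assumptions: the function names, the relevance scores, the token counts, and the greedy trimming policy are all hypothetical, and the actual re-ranking method may differ.

```python
def expand_one_hop(seeds: list[str], graph: dict[str, list[str]]) -> list[str]:
    """Return the seed nodes plus every node directly adjacent to a seed."""
    expanded = list(seeds)
    seen = set(seeds)
    for seed in seeds:
        for neighbour in graph.get(seed, []):
            if neighbour not in seen:
                seen.add(neighbour)
                expanded.append(neighbour)
    return expanded


def rerank_and_trim(
    candidates: list[str],
    scores: dict[str, float],
    token_counts: dict[str, int],
    budget: int,
) -> list[str]:
    """Sort by relevance (highest first) and greedily keep chunks that fit the budget."""
    ranked = sorted(candidates, key=lambda c: scores.get(c, 0.0), reverse=True)
    kept, used = [], 0
    for chunk_id in ranked:
        cost = token_counts[chunk_id]
        # A chunk that does not fit is skipped; smaller, lower-ranked ones may still fit.
        if used + cost <= budget:
            kept.append(chunk_id)
            used += cost
    return kept


# Toy data: two dense-retrieval hits and their graph neighbours.
graph = {"R1": ["CH1", "R2"], "R3": ["CH1"]}
expanded = expand_one_hop(["R1", "R3"], graph)
scores = {"R1": 0.9, "R2": 0.8, "CH1": 0.7, "R3": 0.5}
tokens = {"R1": 100, "R2": 100, "CH1": 150, "R3": 100}
print(rerank_and_trim(expanded, scores, tokens, budget=300))
```

Note that the neighbour `R2` was never returned by dense retrieval, yet it ends up in the final context because the graph placed it next to a seed.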
Generation runs locally on Mistral 7B-Instruct with strict citation-required prompting: every claim must reference a chunk ID, and every chunk ID must exist in the retrieved set. Post-generation, every response goes through a RAGAS evaluator that scores faithfulness (does the answer stay grounded in retrieved context?) and answer relevance (does it actually address the query?).
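The citation rule can be enforced mechanically after generation. The sketch below assumes a hypothetical `[chunk:ID]` marker format and treats each sentence as one claim; neither detail comes from the project itself, and the real prompt may use a different citation syntax.

```python
import re

# Hypothetical citation marker, e.g. "[chunk:c1]".
CITATION = re.compile(r"\[chunk:([A-Za-z0-9_.\-]+)\]")


def validate_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Check that every sentence cites a chunk and every cited chunk was retrieved."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited_ids = set(CITATION.findall(answer))
    unknown = sorted(cited_ids - set(retrieved_ids))
    return {
        "ok": not uncited and not unknown,
        "uncited_sentences": uncited,
        "unknown_ids": unknown,
    }


answer = (
    "Firms must assess suitability [chunk:c1]. "
    "Records must be kept [chunk:c9]. "
    "This is widely known."
)
print(validate_citations(answer, {"c1", "c2"}))
```

A failed check could trigger a regeneration or flag the answer before it reaches the RAGAS evaluator, which then scores grounding rather than citation syntax.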
The result: 0.76 faithfulness and 0.74 answer relevance on a held-out evaluation set, with a 19% accuracy gain over a vector-only baseline.
Key technical decisions
— Graph expansion over reranking
A cross-encoder reranker would have improved relevance at fixed K, but the failure mode wasn't ranking; it was missing context. Graph expansion captures the 'Rule X is meaningless without Rule Y next to it' pattern, which a reranker cannot recover because it only reorders candidates that were already retrieved.
— Mistral 7B over a frontier model
A 70B+ model would likely lift answer quality, but sending financial text to a frontier API creates a vendor dependency and a data egress problem the project couldn't accept. Mistral 7B-Instruct ran locally on a single GPU and showed that the structural retrieval improvements hold even with a small generator.
— Neo4j over a vector-only store
Pinecone or Qdrant alone would have been faster to ship, but the regulatory cross-reference graph is the actual moat. Storing it in a graph DB lets retrieval expand structurally — a query for one rule pulls in the chapter, the entities, and the cross-references in a single Cypher hop.
— RAGAS over BLEU/ROUGE
Surface metrics reward fluent paraphrasing. RAGAS instead uses an LLM judge: faithfulness is scored against the retrieved context and answer relevance against the query, which is what actually matters for regulatory text, where one wrong citation can be a liability.
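To make the Neo4j decision above more concrete, here is a sketch of what a one-hop expansion query might look like when called from Python. The node labels (`Rule`, `Chapter`, `Entity`), the relationship types, and the property names are assumptions for illustration, not the project's actual schema. The function accepts any session-like object with the official driver's `run(...)` and `single()` methods, so it can be exercised without a live database.

```python
# Hypothetical schema: labels, relationship types, and properties are illustrative.
ONE_HOP_QUERY = """
MATCH (r:Rule {id: $rule_id})
OPTIONAL MATCH (r)-[:IN_CHAPTER]->(c:Chapter)
OPTIONAL MATCH (r)-[:APPLIES_TO]->(e:Entity)
OPTIONAL MATCH (r)-[:CROSS_REFERENCES]->(x:Rule)
RETURN r.id AS rule,
       c.id AS chapter,
       collect(DISTINCT e.name) AS entities,
       collect(DISTINCT x.id) AS cross_refs
"""


def fetch_rule_context(session, rule_id: str):
    """Fetch a rule's chapter, obligated entities, and cross-references in one query."""
    record = session.run(ONE_HOP_QUERY, rule_id=rule_id).single()
    if record is None:
        return None  # no rule with that ID in the graph
    return {
        "rule": record["rule"],
        "chapter": record["chapter"],
        "entities": list(record["entities"]),
        "cross_refs": list(record["cross_refs"]),
    }
```

With the official driver, `session` would come from `GraphDatabase.driver(...).session()`. The `OPTIONAL MATCH` clauses keep the rule in the result even when it has no entities or cross-references, which is also the sparse-adjacency case worth detecting.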
What it delivers
The +19% accuracy gain over a vector-only baseline came primarily from queries where the answer required understanding regulatory hierarchy — exactly the cases where graph expansion adds context that dense retrieval misses on its own.
An average faithfulness of 0.76 means that, on average, about three in four claims in an answer are supported by the retrieved context; the unsupported remainder is the prompt-engineering target for next-iteration work. Answer relevance at 0.74 tracks closely, which is consistent with the model staying on-topic when it stays grounded, although aggregate scores alone cannot confirm that link.
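For reference, RAGAS scores faithfulness per answer as the share of the answer's claims that the retrieved context supports, and the headline figure is the mean over the evaluation set. The arithmetic can be sketched as follows; the claim counts are invented for illustration and are not the project's data.

```python
def faithfulness(supported_claims: int, total_claims: int) -> float:
    """Per-answer faithfulness: supported claims divided by all claims in the answer."""
    if total_claims <= 0:
        raise ValueError("an answer must contain at least one claim")
    return supported_claims / total_claims


def mean_score(scores: list[float]) -> float:
    """Dataset-level score: the arithmetic mean of the per-answer scores."""
    return sum(scores) / len(scores)


# Invented (supported, total) claim counts for three answers.
per_answer = [faithfulness(3, 4), faithfulness(4, 4), faithfulness(2, 4)]
print(per_answer, mean_score(per_answer))
```

Because the headline number is a mean of per-answer fractions, the same score can come from many slightly ungrounded answers or a few badly ungrounded ones, so the per-answer distribution is worth inspecting alongside it.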
What I'd do next
If I were continuing this past the dissertation, the next move would be two-pronged. First, replace the current evaluation harness with a structured legal-reasoning benchmark: RAGAS catches faithfulness drift but not legal-specific failure modes like jurisdictional misapplication. Second, ship a confidence-aware UI that surfaces uncertainty when graph expansion returns sparse adjacency, so users know when the system is reasoning from rich versus thin context.