Saquib Hasnain
AI product system · In progress

NextGen Capital RAG Intelligence

The difference between a chatbot demo and a trustworthy AI product: grounded answers, retrieval quality, and a real evaluation loop.

Answers are grounded in cited sources, and quality is measured rather than assumed.

The problem

A customer calls. They're frustrated — their autopay was cancelled and they want to know why. You explain it. Then they want their payoff amount. And then, since they're already on the phone, they ask how to close the account entirely. Three questions. Completely different sources. Empathy needed throughout. Clock ticking.

That's a normal call. There are hundreds of them every day.

The problem isn't that agents don't know the policies. It's that finding them takes real time — and the information isn't all in one place. Process documents spread across internal portals, Word files on shared drives, maybe a database lookup for account-specific details. And when two documents address the same policy, which version is current? The one from last quarter, or the updated one someone circulated last month? Under time pressure, on a live call, these aren't small details.

What would you actually want?

Imagine the customer's question hits the agent's screen — and right next to it, an answer appears. Grounded in actual policy, with citations. Something the agent can read in five seconds and decide: send it, edit it, or skip it.

And if the knowledge base doesn't have a reliable answer? You'd want to know that immediately — so the agent can flag it for escalation (handle it on the call, or follow up via email) instead of burning three minutes searching for something that isn't there.

So, what is NextGen Capital RAG?

Honestly, the long-term vision is bigger. I think someday this becomes an independent, customer-facing agent that handles routine queries end to end, with human agents called in only as the last line of support. Not removed from the loop, but deployed where they actually matter.

In financial services especially, you need that backstop. This might look like a failure mode — the system saying "I can't answer this." It isn't. I would much rather have a handoff than a confident wrong answer about someone's account.

But being practical: while I was designing it, I had two things in mind.

First, the ability to test it thoroughly before trusting it with anything real: evals, filters, retrieval scoring, run comparisons. Second, a V1 that's agent-facing and fully in the agent's control: the customer query appears in the chat window, the RAG pipeline proposes a response, and the agent sends it as is, edits it first, or skips it. If something's wrong, they flag it. That feedback improves the system. That's the loop.

How it works

A question comes in. Before anything touches the knowledge base, a validator screens it — prompt injection, off-topic queries, anything outside the supported domain gets stopped early.
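To make the screening step concrete, here is a minimal sketch of a pre-retrieval validator. The regex patterns, the keyword list, and the name `screen_query` are all assumptions for illustration; the actual rules in the project aren't described here and are likely richer than a keyword check.

```python
import re

# Hypothetical rules: the real validator's patterns are not shown in this write-up.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your |the )?system prompt", re.I),
]
# Hypothetical vocabulary standing in for "the supported domain".
DOMAIN_KEYWORDS = {"autopay", "payoff", "account", "loan", "card", "payment", "fee", "balance"}


def screen_query(text: str) -> tuple[bool, str]:
    """Return (allowed, reason). Runs before anything touches the knowledge base."""
    if any(pattern.search(text) for pattern in INJECTION_PATTERNS):
        return False, "prompt_injection"
    words = set(re.findall(r"[a-z]+", text.lower()))
    if not words & DOMAIN_KEYWORDS:
        return False, "off_topic"
    return True, "ok"
```

The useful property is ordering: a rejected query never reaches the retriever, so it costs nothing downstream and cannot influence the generated answer.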

For account-specific questions, a separate lookup runs against a demo customer database — realistic account data (balances, due dates, account type, credit limits) built specifically so you can test account queries without touching real customer information.
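As a rough illustration of the account path, the sketch below builds a throwaway in-memory SQLite database and looks up one synthetic account. The table layout, the `DEMO-001` record, and the function names are assumptions; only the field types (balance, due date, account type, credit limit) come from the description above.

```python
import sqlite3
from typing import Optional


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding synthetic account data only."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    conn.execute(
        "CREATE TABLE accounts ("
        "account_id TEXT PRIMARY KEY, account_type TEXT, "
        "balance REAL, due_date TEXT, credit_limit REAL)"
    )
    # Made-up record for illustration; not real customer data.
    conn.execute(
        "INSERT INTO accounts VALUES (?, ?, ?, ?, ?)",
        ("DEMO-001", "credit_card", 1250.40, "2025-07-15", 5000.0),
    )
    return conn


def lookup_account(conn: sqlite3.Connection, account_id: str) -> Optional[dict]:
    """Return the account as a dict, or None if the id is unknown."""
    row = conn.execute(
        "SELECT * FROM accounts WHERE account_id = ?", (account_id,)
    ).fetchone()
    return dict(row) if row else None
```

The parameterized query (`?` placeholders) matters even in a demo: the account id ultimately comes from user-facing input.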

For everything else: an intent router classifies the question. Is this a simple product fact, a policy lookup, or something that requires synthesizing across multiple documents? That classification matters — different question types need different handling, and routing them correctly is what keeps the quality consistent.
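The three categories can be sketched as a small router. This version uses keyword cues purely as a stand-in; the cue lists and names below are hypothetical, and a production classifier could just as well be model-based.

```python
from enum import Enum


class Intent(Enum):
    PRODUCT_FACT = "product_fact"
    POLICY_LOOKUP = "policy_lookup"
    MULTI_DOC_SYNTHESIS = "multi_doc_synthesis"


# Hypothetical cue lists, chosen only to make the example deterministic.
SYNTHESIS_CUES = ("compare", "difference between", "versus")
POLICY_CUES = ("policy", "penalty", "allowed", "eligible", "dispute")


def classify_intent(question: str) -> Intent:
    """Keyword stand-in for the intent router described above."""
    q = question.lower()
    if any(cue in q for cue in SYNTHESIS_CUES):
        return Intent.MULTI_DOC_SYNTHESIS
    if any(cue in q for cue in POLICY_CUES):
        return Intent.POLICY_LOOKUP
    return Intent.PRODUCT_FACT
```

Synthesis is checked first on purpose: a question like "compare the penalty rules" mentions a policy cue but still needs the multi-document path.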

The retriever pulls matching chunks, a reranker scores them for relevance, and then an evidence sufficiency gate runs before generation. If what came back isn't strong enough to actually answer the question, the system flags it for escalation rather than generating something weak. If the evidence holds, the answer comes back structured — answer text, confidence, inline citations tied to specific source documents. Every turn is logged.
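A minimal sketch of the sufficiency gate and the structured answer, under stated assumptions: the 0.6 threshold, the use of the top rerank score as "confidence", and every name here are illustrative choices, not the project's actual values.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    doc_id: str
    text: str
    rerank_score: float  # assumed to be normalized to [0, 1]


@dataclass
class Answer:
    text: str
    confidence: float
    citations: list[str] = field(default_factory=list)
    escalate: bool = False


MIN_TOP_SCORE = 0.6  # hypothetical threshold; would be tuned against eval runs


def gate_and_answer(
    chunks: list[Chunk], generate: Callable[[list[Chunk]], str]
) -> Answer:
    """Escalate when evidence is weak; otherwise generate from strong chunks only."""
    ranked = sorted(chunks, key=lambda c: c.rerank_score, reverse=True)
    if not ranked or ranked[0].rerank_score < MIN_TOP_SCORE:
        # The generator is never called on weak evidence.
        return Answer(text="", confidence=0.0, escalate=True)
    supporting = [c for c in ranked if c.rerank_score >= MIN_TOP_SCORE]
    return Answer(
        text=generate(supporting),
        confidence=ranked[0].rerank_score,
        citations=[c.doc_id for c in supporting],
    )
```

Here `generate` stands in for the LLM call. Keeping it behind the gate is the point: an escalation costs no generation tokens and produces no text that could be mistaken for an answer.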

Now you might say: this is still RAG.

And I'd say yes. But not the kind you put together in 20 minutes.

When I started building it, the core seemed straightforward. Then I hit the actual problems.

Multi-part questions. A customer asks "what's my payoff amount and is there a penalty if I pay off early?" — that's two separate questions requiring two different retrieval paths. The first hits the account database for a live balance. The second hits the knowledge base for prepayment policy. The system detects the split, runs each sub-question through its own pipeline, and synthesizes a single coherent answer. You'd never know unless you looked at the trace.
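Here is a deliberately naive sketch of the split-and-route step, using the payoff example above. Splitting on "and" and the cue list are assumptions made to keep the example small; real decomposition needs to be smarter, since "terms and conditions" should not be split.

```python
import re

# Hypothetical cues marking a sub-question as account-specific.
ACCOUNT_CUES = ("my payoff", "my balance", "my due date", "my credit limit")


def split_question(question: str) -> list[str]:
    """Naive decomposition: break on ' and ' and on question marks."""
    parts = re.split(r"\s+and\s+|\?\s*", question)
    return [p.strip() for p in parts if p.strip()]


def route(sub_question: str) -> str:
    """Send account-specific parts to the database, the rest to the knowledge base."""
    q = sub_question.lower()
    return "account_db" if any(cue in q for cue in ACCOUNT_CUES) else "knowledge_base"


def plan(question: str) -> list[tuple[str, str]]:
    """Return (sub_question, pipeline) pairs for later synthesis into one answer."""
    return [(part, route(part)) for part in split_question(question)]
```

For the example question, `plan` yields one `account_db` step for the payoff amount and one `knowledge_base` step for the prepayment penalty, which is the split that shows up in the trace.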

Multi-document sources — with version conflicts. The knowledge base spans credit card products, personal loan policies, operations guides, compliance documents, and regulatory guidance. You can't apply one chunking strategy across all of them. A fee schedule and a compliance notice have completely different structures. And when two documents address the same policy topic, you need to know which one is authoritative. Section-aware chunking is the default, but chunking strategy is configurable per index — because the same approach that works for a product FAQ will quietly destroy a compliance document's citation chain.
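To show what section-aware chunking and version resolution might look like, here is a small sketch. It assumes documents mark sections with `#`-style headings and that the most recent effective date wins a conflict; both are assumptions, and the names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SectionChunk:
    doc_id: str
    section: str
    text: str
    effective: date


def chunk_by_section(doc_id: str, body: str, effective: date) -> list[SectionChunk]:
    """One chunk per heading, so each citation can point at a named section."""
    chunks: list[SectionChunk] = []
    title, lines = "Preamble", []
    for line in body.splitlines():
        if line.startswith("#"):
            if lines:  # close the previous section before starting a new one
                chunks.append(SectionChunk(doc_id, title, "\n".join(lines), effective))
            title, lines = line.lstrip("# ").strip(), []
        elif line.strip():
            lines.append(line.strip())
    if lines:
        chunks.append(SectionChunk(doc_id, title, "\n".join(lines), effective))
    return chunks


def pick_authoritative(candidates: list[SectionChunk]) -> SectionChunk:
    """Assumed rule: when chunks cover the same topic, the newest effective date wins."""
    return max(candidates, key=lambda c: c.effective)
```

Carrying `section` and `effective` on every chunk is what keeps the citation chain intact: the answer can name the document, the section, and which version it came from.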

Multiple LLMs, not just one. Simple product facts route to a lighter, cheaper model. Complex multi-document synthesis gets a stronger one. The system supports OpenAI, Groq, OpenRouter, and Cerebras as providers, including open-weight models like gpt-oss-120b available through Groq. Not every query needs GPT-4. Building against a single provider bakes in assumptions you'll regret later.
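One way to keep provider choice out of the pipeline code is a plain routing table. The sketch below is an assumption about shape, not the project's config: the intent keys, the placeholder model names, and which model serves which intent are all illustrative. Only gpt-oss-120b on Groq is named in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    provider: str
    model: str


# Hypothetical mapping; placeholder names mark models this write-up does not specify.
ROUTES = {
    "product_fact": ModelChoice("groq", "light-model-placeholder"),
    "policy_lookup": ModelChoice("groq", "gpt-oss-120b"),
    "multi_doc_synthesis": ModelChoice("openai", "strong-model-placeholder"),
}
# Unknown intents fall back to the strongest option rather than the cheapest.
DEFAULT = ROUTES["multi_doc_synthesis"]


def choose_model(intent: str) -> ModelChoice:
    """Look up the provider and model for a classified intent."""
    return ROUTES.get(intent, DEFAULT)
```

With the mapping in data rather than code, swapping a provider becomes a config change that an eval run can verify before it ships.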

The point isn't the chatbot

I'm not excited about this because it uses AI. I don't care if it's a chatbot, a search bar, or a panel that just returns a cited paragraph. The form doesn't matter.

What matters is this: wrong information in financial services has consequences. An agent gives an incorrect payoff amount — the customer makes a payment they didn't intend, or misses one they should have made. That's a credit impact. That's a follow-up call. That's a complaint. Agents aren't making those mistakes because they're careless. They're making them because the information environment they work in is genuinely hard to navigate.

A system that retrieves the right answer quickly, cites where it came from, and says "I don't know" when it doesn't — that's useful regardless of what you call it.

And once it's accurate enough, the time adds up: for a support agent handling 20 complex queries a day, each query drops from 20 minutes to 15 when retrieval is instant and the answer is right there. That's 5 minutes saved on each of 20 queries, or 100 minutes reclaimed daily. It's time they can spend on what actually needs a human: helping a frustrated customer feel heard, explaining a difficult situation with care, making a judgment call that no document can make for them.
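The arithmetic behind that figure, written out so the inputs are easy to swap for your own team's numbers. The function name is hypothetical; the inputs are the ones quoted above.

```python
def minutes_reclaimed(queries_per_day: int, minutes_before: int, minutes_after: int) -> int:
    """Daily minutes saved when each query gets faster by a fixed amount."""
    return queries_per_day * (minutes_before - minutes_after)


print(minutes_reclaimed(20, 20, 15))  # 20 queries x 5 minutes saved = 100
```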

The eval loop

Here's the real work: run the test suite. Something breaks, or scores poorly. Capture it. Analyse what went wrong — was it the wrong chunk retrieved? A bad prompt? The wrong model for that query type? Update the code or the prompt. Run it again.

Every eval run is named and stored. The eval-compare command shows exactly how metrics moved between runs: retrieval accuracy, citation support, correct escalation rate, LLM judge scores across multiple dimensions. There's a 65-question golden dataset covering product facts, policy questions, compliance cases, multi-document synthesis, account lookups, questions that should escalate, and adversarial inputs that should be blocked. The Conversation Review page lets you promote real test cases directly into the golden dataset — which is how it grew from 50 to 65 examples.
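The core of a run comparison is small. This sketch is not the eval-compare command itself, just an assumed shape for it: each run is a dict of metric name to score, and the output is the per-metric movement. The metric keys mirror the ones listed above; the numbers in the usage example are made up for illustration and are not real results.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-metric delta (candidate minus baseline) for metrics present in both runs."""
    return {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in baseline
        if metric in candidate
    }


def regressions(deltas: dict[str, float]) -> list[str]:
    """Metrics that got worse, assuming higher is better for every metric."""
    return sorted(metric for metric, delta in deltas.items() if delta < 0)


# Illustrative numbers only.
run_a = {"retrieval_accuracy": 0.80, "citation_support": 0.90, "correct_escalation_rate": 0.75}
run_b = {"retrieval_accuracy": 0.85, "citation_support": 0.88, "correct_escalation_rate": 0.75}
deltas = compare_runs(run_a, run_b)
print(deltas)
print(regressions(deltas))
```

Surfacing regressions separately is the habit worth copying: a prompt change that lifts retrieval accuracy while quietly hurting citation support should be visible before it merges.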

That's how you build a system that actually improves. A new support agent might take two years to really know the product. A system with a tight feedback loop can get there in months — and unlike a human, it gets tested on 65 scenarios before it ever talks to a customer.

Where it stands

228 tests passing. 22 sprints after the MVP. The question was never "is it done?" It was always "can you tell if it's getting better?" Here, you can.