Saquib Hasnain
AI product system · Live

DocBridge AI

The step RAG tutorials skip: turning policy PDFs, scanned notices, and agent shorthand into clean, structured output that's actually safe to embed.

Documents are cleaned, confidence-scored, and routed before ingestion — not after hallucinations appear.

The problem

Pick any RAG tutorial. It will start with a clean markdown file. Nicely structured, headers in the right places, ready to chunk and embed. You follow along, it works, and you think — okay, I can build this.

Then you try it on your actual documents.

A policy PDF from 2018, scanned and converted, where the headers got extracted as body text and every page has a footer that shows up as a stray paragraph. An agent notes spreadsheet where the whole resolution column reads like cust acct bal avail 0, auth decl, chrgbck pndng — escl to sup. A Word document where someone left Track Changes on, so half the content is buried under strikethroughs.

That's what I would have ended up feeding into two different pipelines — NextGen Capital RAG and AI Servicing Intelligence. Different projects, same problem waiting to happen: before anything could be retrieved or clustered or analyzed, the input would need to be readable first.

What would you actually want?

Something that handles the cleaning before the pipeline even starts. Whatever format the document comes in — PDF, Word, scanned notice, spreadsheet with shorthand — you'd want it normalized into clean output, with a clear signal about which files you can trust and which ones need a human to look at before they go anywhere near a vector database.

And you'd want to catch the bad ones before ingestion, not after your system starts citing phantom sections from a garbled OCR output.

So, what is DocBridge AI?

A document normalization pipeline that runs before the RAG layer. I built it because I could see the input problem coming while working on my other projects — better to solve it once, properly, than patch it inside each pipeline separately.

It's live on Streamlit Community Cloud. This is a proof of concept, so it's capped at 5 files per session — but the architecture is stateless per document, so scaling it up is mostly an infrastructure decision, not a pipeline one. Upload your files, configure processing options, and get cleaned outputs with a confidence score and routing decision for each one.

How it works

Two modes depending on what you're feeding in.

Document mode handles unstructured files — PDFs, Word docs, Markdown. Each file goes through type detection, content extraction (PyMuPDF for readable PDFs, Tesseract OCR for scanned ones, python-docx for Word files), and text cleaning. The output is a clean markdown file with YAML frontmatter: doc ID, title, source format, extraction method, confidence score, processing status. Tables from Word files are rendered as markdown pipe tables in their correct document position. The output is ready to hand off to a RAG ingestion pipeline.
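As a rough illustration of that output shape, here is a minimal sketch of writing the frontmatter header using only the standard library. The field names (`doc_id`, `source_format`, and so on) and the `DocMeta` and `render_markdown` helpers are hypothetical; they mirror the items listed above rather than DocBridge's actual schema, and the sketch assumes single-line string values.

```python
from dataclasses import dataclass


@dataclass
class DocMeta:
    # Hypothetical field names mirroring the frontmatter items listed above.
    doc_id: str
    title: str
    source_format: str
    extraction_method: str
    confidence: float
    status: str


def _yaml_str(value: str) -> str:
    # Double-quote and escape so values containing ':' or '#' stay valid YAML.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_markdown(meta: DocMeta, body: str) -> str:
    """Return a markdown document with a YAML frontmatter header."""
    lines = [
        "---",
        f"doc_id: {_yaml_str(meta.doc_id)}",
        f"title: {_yaml_str(meta.title)}",
        f"source_format: {_yaml_str(meta.source_format)}",
        f"extraction_method: {_yaml_str(meta.extraction_method)}",
        f"confidence: {meta.confidence:.2f}",
        f"status: {_yaml_str(meta.status)}",
        "---",
        "",
        body.strip(),
        "",
    ]
    return "\n".join(lines)
```

Keeping the metadata in the same file as the cleaned text means a downstream ingestion step can filter on `status` or `confidence` without a separate lookup.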

Tabular mode handles structured data where the content is informal: agent notes, customer interaction logs, support records in CSV or Excel. Each row gets expanded, so "cust acct bal avail 0, auth decl, chrgbck pndng" becomes "Customer account balance available is 0, authorization declined, chargeback pending." Original columns stay, and cleaned versions are added alongside, with per-row confidence scores and quality flags.
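The "original columns stay, cleaned versions sit alongside" layout could look something like the sketch below. Everything named here is an assumption for illustration: the `_cleaned` column suffix, the `quality_flags` values, the `review_below` threshold, and the `expand` callable that stands in for the real expander.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

# Returns (cleaned_text, confidence) for one cell; stands in for the real expander.
Expander = Callable[[str], Tuple[str, float]]


def clean_rows(csv_text: str, text_column: str, expand: Expander,
               review_below: float = 0.85) -> List[Dict[str, object]]:
    """Add cleaned text, confidence, and flags next to the original columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        original = (row.get(text_column) or "").strip()
        flags = []
        if not original:
            # Nothing to expand: record the gap instead of calling the expander.
            cleaned, confidence = "", 0.0
            flags.append("empty_input")
        else:
            cleaned, confidence = expand(original)
            if confidence < review_below:
                flags.append("needs_review")
        out = dict(row)  # original columns are kept untouched
        out[f"{text_column}_cleaned"] = cleaned
        out["confidence"] = confidence
        out["quality_flags"] = ";".join(flags)
        rows.append(out)
    return rows
```

Because the original cell is never overwritten, a reviewer can always compare the expansion against what the agent actually typed.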

Shorthand expansion uses two layers: a curated banking and fintech glossary first — fast, deterministic, auditable — then GPT-4o-mini for anything context-dependent or ambiguous. The processing report logs which terms went through each layer.
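A minimal sketch of the glossary-first, model-second idea follows. The glossary entries are a toy subset I made up from the example above, and `fallback` is a hypothetical callable standing in for the GPT-4o-mini call; the sketch works token by token and makes no real API request, which may differ from how the actual pipeline prompts the model.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Toy glossary for illustration only; the real curated glossary is not shown here.
GLOSSARY: Dict[str, str] = {
    "cust": "customer", "acct": "account", "bal": "balance",
    "avail": "available", "auth": "authorization", "decl": "declined",
    "chrgbck": "chargeback", "pndng": "pending",
}

TOKEN = re.compile(r"[A-Za-z]+")

# fallback(token, full_text) returns an expansion, or None to leave the token alone.
Fallback = Callable[[str, str], Optional[str]]


def expand_shorthand(
    text: str,
    glossary: Dict[str, str] = GLOSSARY,
    fallback: Optional[Fallback] = None,
) -> Tuple[str, Dict[str, List[str]]]:
    """Expand shorthand and report which layer handled each term."""
    report: Dict[str, List[str]] = {"glossary": [], "fallback": []}

    def replace(match: "re.Match[str]") -> str:
        token = match.group(0)
        # Layer 1: deterministic glossary lookup.
        expansion = glossary.get(token.lower())
        if expansion is not None:
            report["glossary"].append(token)
            return expansion
        # Layer 2: only tokens the glossary missed reach the fallback.
        if fallback is not None:
            guess = fallback(token, text)
            if guess:
                report["fallback"].append(token)
                return guess
        return token

    return TOKEN.sub(replace, text), report
```

Called on "cust acct bal avail 0, auth decl, chrgbck pndng" with no fallback, this returns "customer account balance available 0, authorization declined, chargeback pending". It substitutes word for word, so the grammatical smoothing in the earlier example (the inserted "is") would have to come from the model layer.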

No pipeline is complete without a quality check

Each file comes out with a confidence score and a routing decision. A score of 0.85 or above is approved: output written, ready to ingest. A score from 0.60 to 0.84 is flagged for a human check, and anything below 0.60 is strongly flagged. If extraction failed entirely, the file is rejected with a reason in the report.
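Those bands map naturally onto a small routing function. The sketch below is one way to write it; the `Route` names and the use of `None` to mean "extraction failed" are my assumptions, and it treats everything from 0.60 up to (but not including) 0.85 as flagged so that no score falls between bands.

```python
from enum import Enum
from typing import Optional


class Route(str, Enum):
    # Hypothetical labels for the four outcomes described above.
    APPROVED = "approved"
    FLAGGED = "flagged"
    STRONGLY_FLAGGED = "strongly_flagged"
    REJECTED = "rejected"


def route(confidence: Optional[float]) -> Route:
    """Map a confidence score to a routing decision.

    None is an assumed sentinel meaning extraction failed entirely.
    """
    if confidence is None:
        return Route.REJECTED
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= 0.85:
        return Route.APPROVED
    if confidence >= 0.60:
        return Route.FLAGGED
    return Route.STRONGLY_FLAGGED
```

Only `Route.APPROVED` files would go on to ingestion; everything else waits for a person or is dropped with a reason.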

I added this step because bad input doesn't announce itself. A garbled OCR result or a wall of unexpanded shorthand gets embedded quietly, and the answers it produces degrade gradually — slightly wrong, hard to trace, easy to miss until someone flags something that doesn't add up. Catching quality issues before ingestion beats untangling them afterward.

Where it stands

38 tests passing. Deployed. Ready to be put to work when the other projects need it — which is also why it ended up as its own repo instead of a module buried inside either one.