Build Pipelines

Financial Document Table Extraction

Table Transformer + TrOCR detecting and extracting structured tables from scanned PDFs with 96%+ accuracy.

The problem

Financial tables in scanned PDFs, with merged cells and nested structures, resist standard OCR and don't come out as usable structured data.

A merged cell turns one number into three garbled fragments.
Someone still re-types figures from a PDF into a spreadsheet by hand.
OCR gets the digits right but loses which row and column they belonged to.

What NEO built

NEO combined Table Transformer for table detection with TrOCR for cell-level OCR, post-processing the result into clean CSV/JSON.

Table TransformerTrOCRPDF parsing

The result

96%+ extraction accuracy

Hits 96%+ table extraction accuracy across multi-page financial documents, including irregular table structures.

From the blog · 8 min

Automated Table Extraction from Financial Documents Using Transformer Models

A deep dive into our production financial OCR pipeline that uses Microsoft Table Transformer and TrOCR to detect, extract, and validate structured tables from PDFs and scanned documents with 96%+ accuracy.

Try this in your workspace

Paste this into NEO chat to kick off the same workflow on your own data.

NEO chat

Extract the tables out of these scanned financial PDFs, even the ones with merged cells, and give me clean structured CSV/JSON instead of raw OCR text.

Paste it in · review the plan · get the diff

Get NEO

Financial Document Table Extraction

Automated Table Extraction from Financial Documents Using Transformer Models

Try this in your workspace

More Build Pipelines use cases

Multimodal RAG System

Real-Time Voice Translation