AI / Document Intelligence

Financial Document Research System

Industry: Higher Education / Academic Research
Year: 2025–2026

Research Pipeline Dashboard showing real-time processing status of 4,444 ticker/year pairs

The Challenge

A university research team needed to find, retrieve, and extract specific accounting data from the annual reports of approximately 850 publicly listed companies, across 5 fiscal years — over 4,400 reports in total.

Each report required searching for the correct document online, downloading the PDF, then manually locating and extracting 18 specific data points. Previous iterations of this research had shown that each report took roughly 10 minutes of focused work to process, and a researcher could only sustain that pace for a few hours before both productivity and accuracy began to decline.

The math was daunting: 4,400+ reports at 10 minutes each meant approximately 700 hours of manual work — effectively 6 to 9 months of dedicated effort for a single research cycle. And with accuracy degrading over time, the quality of the extracted data was always a concern.
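The estimate can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the range of 4 to 6 focused hours per day and the 21 working days per month are assumptions chosen to reflect "a few hours" of sustainable work, not figures reported by the research team.

```python
# Back-of-the-envelope workload estimate for manual processing.
REPORTS = 4_400            # "4,400+ reports" from the text
MINUTES_PER_REPORT = 10    # pace observed in earlier research iterations

# Assumptions (not from the source): sustainable focused hours per day
# and working days per month.
FOCUSED_HOURS_PER_DAY = (4, 6)
WORKING_DAYS_PER_MONTH = 21


def total_hours(reports: int, minutes_each: int) -> float:
    """Total manual effort in hours."""
    return reports * minutes_each / 60


def months_of_effort(hours: float, hours_per_day: float) -> float:
    """Months needed at a given number of focused hours per working day."""
    return hours / hours_per_day / WORKING_DAYS_PER_MONTH


hours = total_hours(REPORTS, MINUTES_PER_REPORT)
slow = months_of_effort(hours, FOCUSED_HOURS_PER_DAY[0])
fast = months_of_effort(hours, FOCUSED_HOURS_PER_DAY[1])
print(f"{hours:.0f} hours, roughly {fast:.1f} to {slow:.1f} months")
```

Under these assumptions the result is about 733 hours, or roughly 5.8 to 8.7 months, which is consistent with the 700-hour and 6-to-9-month figures quoted above.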

The Solution

We built a custom research pipeline that combines AI and purpose-built tooling, using each where it is strongest:

1. AI-Driven Search & Selection

AI evaluates search results from public sources, identifies the correct annual report for each company-year pair, and selects the best download option — eliminating hours of manual searching.
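A minimal sketch of how a selection step like this could be structured. The names `Candidate`, `build_selection_prompt`, and `parse_selection`, and the JSON reply format, are hypothetical and not taken from the actual pipeline. The model call itself is left out, so the sketch only shows how candidates might be presented to a model and how its answer might be validated.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Candidate:
    """One search result that might be the annual report we want."""
    title: str
    url: str
    snippet: str


def build_selection_prompt(ticker: str, year: int, candidates: list[Candidate]) -> str:
    """Build a prompt asking the model to pick the correct report, or none."""
    lines = [
        f"Find the {year} annual report for the company with ticker {ticker}.",
        'Reply with JSON only: {"choice": <index or null>, "reason": <string>}.',
        "Candidates:",
    ]
    for i, c in enumerate(candidates):
        lines.append(f"[{i}] {c.title} | {c.url} | {c.snippet}")
    return "\n".join(lines)


def parse_selection(reply: str, candidates: list[Candidate]) -> Candidate | None:
    """Validate the model's reply; return None if it is unusable."""
    try:
        choice = json.loads(reply).get("choice")
    except (json.JSONDecodeError, AttributeError):
        return None  # not JSON, or JSON that is not an object
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(choice, int) and not isinstance(choice, bool):
        if 0 <= choice < len(candidates):
            return candidates[choice]
    return None
```

Treating any malformed or out-of-range reply as "no selection" lets the pair be retried or flagged for manual review rather than downloading the wrong document.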

2. Local PDF Parsing

In-house parsing infrastructure extracts text from the downloaded PDFs — faster, cheaper, and more reliable than sending entire documents to an AI service.

3. AI Data Extraction

AI reads the parsed text and extracts the 18 target data points from each report — accounting values, commentary, and contextual notes — with source page citations for verification.
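Page citations make automated spot checks possible. The sketch below shows one simple way such a check might work, assuming the parsed text is kept as one string per page. `ExtractedValue`, `normalise`, and `citation_holds` are hypothetical names, and the sketch treats commas as thousands separators, which will not hold for every report.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ExtractedValue:
    """One extracted data point with the page it was found on."""
    name: str
    value: str
    page: int  # 1-based page number cited by the model


def normalise(text: str) -> str:
    """Drop commas, collapse whitespace, and lowercase before comparing."""
    return " ".join(text.replace(",", "").split()).lower()


def citation_holds(item: ExtractedValue, pages: list[str]) -> bool:
    """Return True if the cited page exists and contains the value."""
    if not 1 <= item.page <= len(pages):
        return False
    return normalise(item.value) in normalise(pages[item.page - 1])
```

A value that fails this check is not necessarily wrong (it may be formatted differently on the page), but it is a natural candidate for manual review.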

4. Semantic Search (RAG)

All parsed report text is vector-embedded and stored in PostgreSQL with pgvector. Researchers can query the full corpus in natural language — cross-company comparisons or single-company deep dives.
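As an illustration of the two query modes, the sketch below pairs example pgvector queries with an in-memory equivalent of the same ranking. The table name `report_chunks`, its columns, and the psycopg-style named parameters are assumptions, not the production schema. `<=>` is pgvector's cosine-distance operator.

```python
import math

# Hypothetical schema. `<=>` orders rows by cosine distance to the query vector.
CROSS_COMPANY_SQL = """
SELECT ticker, fiscal_year, page, chunk_text
FROM report_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""

SINGLE_COMPANY_SQL = """
SELECT ticker, fiscal_year, page, chunk_text
FROM report_chunks
WHERE ticker = %(ticker)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query, chunks, k, ticker=None):
    """In-memory stand-in for the SQL above: optional filter, then rank."""
    pool = [c for c in chunks if ticker is None or c["ticker"] == ticker]
    return sorted(pool, key=lambda c: cosine_distance(query, c["embedding"]))[:k]
```

In this sketch, the only difference between the cross-company and single-company modes is the `WHERE ticker = ...` filter. The page number returned with each chunk is what would allow answers to carry source citations.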

The system includes a management dashboard for monitoring pipeline progress, batch processing, and run status, plus a web-based query interface where researchers type questions and get answers with source citations.

RAG Query Interface — researchers ask natural language questions about annual reports with cross-company and single-company modes

The query interface supports cross-company comparisons and single-company exploration with natural language questions.

The Results

4,444

ticker/year pairs processed

~700 hrs

of tedious manual work avoided

18

data points extracted per report

✓ Months of tedious, repetitive manual work replaced by an automated pipeline
✓ Consistent extraction accuracy across all reports — no fatigue-related decline
✓ Every extracted value includes a source citation with page number for verification
✓ Parsed data can be re-analysed to extract new data points beyond the original spec — no need to reprocess the source documents
✓ Full corpus is searchable using both traditional keyword search and AI-powered semantic search
✓ Pipeline is reusable — adding new fiscal years or companies is incremental, not a restart
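The incremental behaviour in the last point can be sketched as a set difference between the ticker/year pairs requested and those already processed. The function name `pending_pairs` and the representation of completed work as a set of tuples are assumptions made for illustration.

```python
from itertools import product


def pending_pairs(tickers, years, completed):
    """Return ticker/year pairs that still need processing, in a stable order."""
    return [
        pair
        for pair in product(sorted(tickers), sorted(years))
        if pair not in completed
    ]


# Adding fiscal year 2024 and a new company only queues the missing pairs.
done = {("AAA", 2022), ("AAA", 2023)}
todo = pending_pairs(["AAA", "BBB"], [2022, 2023, 2024], done)
```

Here `todo` contains only `("AAA", 2024)` and the three `BBB` pairs, so nothing that has already been processed is queued again.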

Technology

Python, Claude API (Anthropic), PostgreSQL + pgvector, In-house PDF Parsing, Vector Embeddings, Flask Web App

Have a similar challenge?

If your team is spending hours extracting data from documents, searching through reports, or doing repetitive research — there's probably a better way. Let's talk.
