How I Built the Financial Statement Analyzer
Schema-first AI extraction, ratio computation and competitor benchmarking — architecture and lessons from building an automated financial analysis pipeline.
The problem this tool was built to solve is a familiar one in enterprise finance: annual report analysis is slow, inconsistent, and depends entirely on the analyst doing it. A thorough review of a single set of financial statements — income statement, balance sheet, cash flow — takes hours. Comparing against three competitors takes a day. Repeating that across a portfolio of suppliers or investment targets is genuinely impractical without a team.
The Financial Statement Analyzer automates the extraction, computation, and benchmarking process. A user uploads one or more PDF annual reports. Within minutes the system returns a structured analysis: key financial ratios, trend lines across periods, and a benchmarked view against peer companies. The output is consistent, auditable, and reproducible: the same analysis run on the same document produces the same result every time.
Architecture
The pipeline has four stages. Stage one is ingestion: PDFs are parsed with pdfplumber to recover structured tables, and a pre-processing step cleans the text before it reaches the model. Stage two is extraction: an OpenAI model with a structured output schema identifies and pulls the financial line items (revenue, EBITDA, net profit, total assets, total liabilities, operating cash flow, and so on) for each reporting period. The schema-first approach is the critical design decision here.
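The ingestion stage can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the helper names clean_cell, parse_amount, and extract_table_rows are hypothetical, and the cleaning rules (parentheses as negatives, commas as thousands separators) are assumptions about typical annual-report tables. The only pdfplumber calls used are pdfplumber.open, pdf.pages, and page.extract_tables.

```python
import re
from typing import Optional


def clean_cell(cell: Optional[str]) -> str:
    """Collapse whitespace and line breaks inside a table cell."""
    if cell is None:
        return ""
    return re.sub(r"\s+", " ", cell).strip()


def parse_amount(text: str) -> Optional[float]:
    """Turn a reported figure such as '(1,234.5)' into -1234.5.

    Returns None for blanks, dashes, and anything else non-numeric.
    """
    cleaned = clean_cell(text).replace(",", "")
    # Assumption: a figure wrapped in parentheses is a negative amount.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    cleaned = cleaned.strip("()")
    if not re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


def extract_table_rows(pdf_path: str) -> list[list[str]]:
    """Collect cleaned rows from every table pdfplumber finds in the PDF."""
    import pdfplumber  # third-party; imported lazily so the helpers above stay stdlib-only

    rows: list[list[str]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                rows.extend([clean_cell(cell) for cell in row] for row in table)
    return rows
```

Cleaning cells before they reach the model keeps stray line breaks and formatting noise out of the prompt. Whether numeric parsing belongs here or after extraction is a design choice this sketch does not settle.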
Stage three is computation: once the raw line items are extracted, a deterministic Python layer calculates the ratios — gross margin, EBITDA margin, return on equity, current ratio, debt-to-equity, and others — without involving the model. This separation is deliberate. Arithmetic done by a language model introduces error risk; arithmetic done by code is reliable. Stage four is benchmarking: when multiple companies are analysed, the system produces a comparison table and a narrative summary highlighting where each company sits relative to the group.
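A deterministic computation layer of this kind might look like the sketch below. The LineItems container, the safe_div helper, and the field names are hypothetical. The formulas are common textbook definitions chosen for illustration (return on equity on period-end equity, debt-to-equity as total liabilities over equity) and may differ from the conventions the tool actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineItems:
    """Extracted values for one company and period; None means not reported."""
    revenue: Optional[float] = None
    gross_profit: Optional[float] = None
    ebitda: Optional[float] = None
    net_profit: Optional[float] = None
    total_equity: Optional[float] = None
    total_liabilities: Optional[float] = None
    current_assets: Optional[float] = None
    current_liabilities: Optional[float] = None


def safe_div(numerator: Optional[float], denominator: Optional[float]) -> Optional[float]:
    """Divide, returning None when an input is missing or the denominator is zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def compute_ratios(items: LineItems) -> dict[str, Optional[float]]:
    """Pure arithmetic on extracted values; no model call is involved."""
    return {
        "gross_margin": safe_div(items.gross_profit, items.revenue),
        "ebitda_margin": safe_div(items.ebitda, items.revenue),
        "return_on_equity": safe_div(items.net_profit, items.total_equity),
        "current_ratio": safe_div(items.current_assets, items.current_liabilities),
        "debt_to_equity": safe_div(items.total_liabilities, items.total_equity),
    }
```

Propagating None rather than substituting zero means a missing line item yields a missing ratio, so a gap in the source document cannot be mistaken for a real value.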
The Schema-First Design Decision
The most important design choice in the pipeline was to define the output schema before writing a single prompt. Early experiments with free-form extraction produced outputs whose structure varied with how the annual report was formatted: some reports put revenue at the top of the income statement, others buried it inside segment breakdowns. Free-form extraction handled neither case consistently.
The schema locks the output into a specific JSON structure with typed fields and explicit null handling for values that do not appear in a given report. The model's job becomes classification and extraction into a known format, not generation of a variable structure. This made the downstream computation layer deterministic and made the output auditable — you can always trace a ratio back to the exact extracted value and the line item it came from.
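A schema with typed fields and explicit nulls could be expressed as below. This is a sketch assuming Pydantic v2. The model names, field names, and the parse_model_output helper are hypothetical and not taken from the actual project.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class PeriodLineItems(BaseModel):
    # Reject keys the schema does not define instead of silently dropping them.
    model_config = ConfigDict(extra="forbid")

    period: str  # for example "FY2023"
    currency: str
    # No defaults: in Pydantic v2 each key must be present, and null is the
    # explicit way to say "not reported in this document".
    revenue: Optional[float]
    ebitda: Optional[float]
    net_profit: Optional[float]
    total_assets: Optional[float]
    total_liabilities: Optional[float]
    operating_cash_flow: Optional[float]


class CompanyExtraction(BaseModel):
    model_config = ConfigDict(extra="forbid")

    company_name: str
    periods: list[PeriodLineItems]


def parse_model_output(raw_json: str) -> CompanyExtraction:
    """Validate the model's JSON reply; raises ValidationError if it drifts from the schema."""
    return CompanyExtraction.model_validate_json(raw_json)
```

Because every field is required but nullable, a reply that omits a line item fails validation, while a reply that reports it as null passes. That distinction is what lets a downstream ratio be traced back to either a concrete extracted value or a documented absence.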
FastAPI Backend and React Frontend
The backend is a FastAPI application that handles file upload, manages the extraction pipeline asynchronously, and serves results via a REST API. FastAPI was chosen for its native async support and Pydantic integration — the same schema used to validate the model output is also used to validate the API response, keeping the types consistent across the stack. The React frontend provides the upload interface, progress tracking, and a results view with sortable ratio tables and period-over-period trend charts.
Capabilities and Outcomes
The system handles multi-document analysis, up to ten companies simultaneously, and produces both a per-company deep-dive and a cross-company benchmarking view. It supports reports in English and, with prompt adjustments, handles reports prepared under different accounting standards (IFRS and US GAAP). Analysis that previously took half a day per company now takes under five minutes per batch. The consistency benefit is arguably more valuable than the speed benefit: the same analytical framework applies to every company, every time, without the variation that comes from different analysts applying different conventions.
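A cross-company benchmarking view can be built from the computed ratios with plain Python. The sketch below is illustrative only: the benchmark function, its parameters, and the choice of the group median as the reference point are assumptions, not the tool's documented method.

```python
from statistics import median
from typing import Optional


def benchmark(
    ratios_by_company: dict[str, dict[str, Optional[float]]],
    ratio: str,
    higher_is_better: bool = True,
) -> list[dict]:
    """Rank companies on one ratio and show each one's gap to the group median."""
    # Companies with no value for this ratio are left out rather than ranked last.
    available = {
        name: ratios[ratio]
        for name, ratios in ratios_by_company.items()
        if ratios.get(ratio) is not None
    }
    if not available:
        return []
    group_median = median(available.values())
    ordered = sorted(available.items(), key=lambda pair: pair[1], reverse=higher_is_better)
    return [
        {"company": name, "value": value, "rank": rank, "vs_median": value - group_median}
        for rank, (name, value) in enumerate(ordered, start=1)
    ]
```

For ratios where lower is better, such as debt-to-equity, passing higher_is_better=False reverses the ranking. Ties are not handled specially in this sketch: tied companies get consecutive ranks in input order.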