Capability 3 | Unstructured Data Extraction: Unlocking What's Trapped

Clinicians and administrators often expect their electronic health record (EHR) to solve three core problems: organization, standardization, and interoperability. In practice, the EHR often falls short of that expectation.

A significant portion of patient data still moves between facilities as documents: PDFs and faxes of scanned or exported medical records, rather than structured, usable data. Even when information is "available," it is often locked in formats that are difficult to parse, search, or analyze. The result is a persistent gap between where healthcare data lives and how it can actually be used.

The status quo: the thousand-page PDF

When a facility requests medical records from another organization, the expectation is simple: the data arrives neatly organized in the EHR, labs in the lab section, imaging in imaging, notes in the appropriate clinical folders.

In reality, that's not what happens. Instead, organizations often receive a large fax packet containing an entire patient history: encounters, labs, imaging, notes, and more, all concatenated into a single document that can span hundreds or even thousands of pages. Sorting through this monolithic packet becomes a time-intensive, error-prone manual task.

Upon receipt, these records are effectively paper records, subject to all the quirks and variations of that format. Some are handwritten; others are exported from the sending facility's EHR. The challenge is compounded by the fact that every EHR represents these records differently. There is no consistent structure, layout, or standard, only a multitude of formats shaped by how each system renders its data.

The human layer leaves an interoperability gap

To make this usable, clinical staff must perform a manual process known as indexation. They review the document page by page, identify what each section contains, and upload it into the correct location in the EHR.

Indexation is a critical step toward making healthcare data usable. Traditional indexation results in the proper categorization of patient health data: labs end up in the labs folder, for instance. The challenge with this method, however, is that the data often still lives as PDFs of text or handwritten notes, rather than as data that can be combined with other sources.

To be truly useful, this data must serve two purposes:

Human-readable, so clinicians can quickly understand and act on it
Machine-readable (e.g., FHIR-compatible), so it can power analytics, workflows, and downstream systems

Traditional indexation fails to meet this bar because data from one source cannot easily be combined with data from another. Labs live on the source page where they were originally printed, so lab results from multiple facilities cannot be trended together in a single view. Instead, they are separated by encounter and trapped inside a PDF.
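To make "machine-readable" concrete, here is a minimal sketch of what a single lab value might look like once it is lifted off the page into a FHIR R4-style Observation. The helper name `lab_to_observation`, the patient and document identifiers, and the lab value are all made up for illustration; this is not Predoc's actual output schema.

```python
import json


def lab_to_observation(patient_id, loinc_code, display, value, unit, taken_at, source_doc_id):
    """Build a minimal FHIR R4-style Observation for one lab value (illustrative only)."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{"system": "http://loinc.org", "code": loinc_code, "display": display}]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": taken_at,
        "valueQuantity": {
            "value": value,
            "unit": unit,
            "system": "http://unitsofmeasure.org",
            "code": unit,
        },
        # Provenance: point back at the document the value was read from.
        "derivedFrom": [{"reference": f"DocumentReference/{source_doc_id}"}],
    }


# Hypothetical identifiers and values, chosen only for the example.
obs = lab_to_observation(
    patient_id="example-patient",
    loinc_code="2345-7",
    display="Glucose [Mass/volume] in Serum or Plasma",
    value=98,
    unit="mg/dL",
    taken_at="2024-01-15",
    source_doc_id="fax-packet-1-page-212",
)
print(json.dumps(obs, indent=2))
```

Once a value is in this shape, a glucose result from one facility and a glucose result from another share the same code and units, which is what makes them comparable in a single view.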

One key component of Predoc's data pipeline was purposefully built to solve these specific pain points: our indexation service. This service handles entire fax packets, splitting them into individual medical records; categorizing the contents of each individual medical record; and extracting the data in a way that's FHIR-compatible and human-readable with provenance back to the original source.

How Predoc's Curated Data Layer Solves for This

1. Data enrichment

Medical documents contain many different layout elements that make them hard to capture with a single representation. For example, a single page may contain:

Narrative text (e.g., provider notes)
Structured tables (e.g., lab results)
Images (e.g., scans, charts, embedded radiology images)
Handwriting (e.g., annotations, signatures)
Forms or checkboxes (semi-structured data)

As a result, traditional document parsing methods such as Optical Character Recognition (OCR) fall short: they cannot capture the variety of context inside medical documents, which degrades every downstream artifact. Capturing the document faithfully is critical for ensuring that clinicians have complete and accurate information. In the past, we relied on OCR to extract the text in a document, and we found that it yielded low accuracy and missed a lot of relevant data. Missing data at extraction can have enormous consequences when a clinician is trying to make a medical decision. A single lab value could make or break a diagnosis.

Rather than forcing medical documents into a single modality, we represent each fax packet using a custom, in-house format that captures not just the text, but also the visual elements and their precise location within the document. This creates a richer representation of the data, and one that preserves both textual and visual elements. It allows vision-language models to interpret the document more effectively, while also grounding every extracted element within the full structure of the original record.
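As a rough illustration of the idea, the sketch below models a page as a list of typed elements, each with a bounding box. The class names, the element kinds, and the normalized coordinate convention are assumptions made for this example; the actual in-house format is not described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle in page coordinates normalized to [0, 1] (an assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "BoundingBox") -> bool:
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass
class PageElement:
    kind: str                   # "text", "table", "image", "handwriting", or "form"
    page: int                   # 1-based page number within the fax packet
    box: BoundingBox            # where the element sits on the page
    text: Optional[str] = None  # transcribed content, if any


@dataclass
class PacketPage:
    number: int
    elements: List[PageElement] = field(default_factory=list)

    def elements_in(self, region: BoundingBox) -> List[PageElement]:
        """Return the elements that lie entirely inside a region of the page."""
        return [e for e in self.elements if region.contains(e.box)]


# Made-up page content for illustration.
page = PacketPage(number=212, elements=[
    PageElement("text", 212, BoundingBox(0.05, 0.05, 0.95, 0.20), "Patient seen for follow-up."),
    PageElement("table", 212, BoundingBox(0.05, 0.30, 0.95, 0.60), "Glucose 98 mg/dL"),
    PageElement("handwriting", 212, BoundingBox(0.60, 0.85, 0.95, 0.95)),
])

lower_half = BoundingBox(0.0, 0.5, 1.0, 1.0)
print([e.kind for e in page.elements_in(lower_half)])
```

Keeping the location alongside the content is what lets an extracted value be traced back to the exact spot on the page it came from.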

2. Segmentation

The next task is to identify the individual medical documents in the fax packet. Not all of the thousands of faxed pages are clinical. Cover sheets, demographics, intra-office communications, and other administrative documents may be included, and these need to be separated out at indexation.

Even the clinical information doesn't always arrive in the best shape. Documents may be out of order, or sections may be duplicated. In some cases, a single clinical document may span non-contiguous pages, or documents may even be nested within each other: for example, a discharge summary that includes embedded lab reports or imaging interpretations.

Segmentation separates administrative from clinical data, normalizes complex document hierarchies, and stitches each document back together from its pages, creating a document-centric view of the fax packet. Imagine 1,000 pages sorted into piles based on the category of information they contain (e.g., a pile of labs, a pile of progress notes, etc.).

These documents end up being the physical segments we display in-platform and return to integrators categorized for EHR upload.
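The "piles" picture can be sketched in a few lines. The example below assumes an upstream step has already labeled every page with a document identifier and a category; the `segment` function, the category names, and the labels are hypothetical. Handling duplicated sections and nested documents is deliberately left out.

```python
from collections import defaultdict

# Categories treated as administrative rather than clinical (assumed names).
ADMIN_CATEGORIES = {"coversheet", "demographics", "office_communication"}


def segment(page_labels):
    """Group labeled pages into documents, then sort the documents into piles.

    page_labels: iterable of (page_number, document_id, category) tuples.
    Returns (clinical, administrative), each mapping category to a list of
    (document_id, sorted_page_numbers).
    """
    documents = defaultdict(lambda: {"category": None, "pages": []})
    for page_number, document_id, category in page_labels:
        doc = documents[document_id]
        doc["category"] = category
        doc["pages"].append(page_number)  # pages need not be contiguous

    clinical, administrative = defaultdict(list), defaultdict(list)
    for document_id, doc in documents.items():
        pile = administrative if doc["category"] in ADMIN_CATEGORIES else clinical
        pile[doc["category"]].append((document_id, sorted(doc["pages"])))
    return dict(clinical), dict(administrative)


# A five-page packet in which one lab report is split across pages 2 and 4.
labels = [
    (1, "cover-1", "coversheet"),
    (2, "labs-1", "labs"),
    (3, "note-1", "progress_note"),
    (4, "labs-1", "labs"),
    (5, "note-2", "progress_note"),
]
clinical, administrative = segment(labels)
print(clinical["labs"])
print(administrative)
```

Note that the lab report is reassembled from pages 2 and 4 even though a progress note sits between them, which is the non-contiguous case described above.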

3. Classification, data extraction, and validation

The final step is extracting the data. We use a multi-step process that:

Identifies the relevant entities to extract (labs, progress notes, encounters, etc.),
Extracts the data for each identified entity, and
Verifies the extracted data against the source document. If any issues are detected, we escalate that specific extraction to an expert for human review.
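The verify-or-escalate step can be illustrated with a deliberately simple check: an extraction counts as verified only if both its name and its value appear in the text of the page it claims to come from. This substring test is a stand-in chosen for the example, not the validation logic actually used, and the `Extraction` and `triage` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    entity: str       # e.g. "lab"
    name: str         # e.g. "Glucose"
    value: str        # the value as printed, e.g. "98"
    source_page: int  # page the extraction claims to come from


def triage(extractions, page_text):
    """Split extractions into verified ones and ones needing human review."""
    verified, needs_review = [], []
    for ex in extractions:
        text = page_text.get(ex.source_page, "").lower()
        grounded = ex.name.lower() in text and ex.value.lower() in text
        (verified if grounded else needs_review).append(ex)
    return verified, needs_review


# Made-up page text and extractions for illustration.
page_text = {212: "Glucose 98 mg/dL   Sodium 140 mmol/L"}
extractions = [
    Extraction("lab", "Glucose", "98", 212),
    Extraction("lab", "Sodium", "104", 212),  # digits transposed relative to the source
]

verified, needs_review = triage(extractions, page_text)
print([ex.name for ex in verified])
print([ex.name for ex in needs_review])
```

Only the extraction that fails the check is escalated, so reviewer time is spent on the specific values in doubt rather than on the whole packet.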

4. Output

The end result is structured patient data derived from the raw fax packet. This patient data is verified for authenticity and is human-auditable via our internal review service. This structured patient data can now be combined with other physical and digital records, creating a unified view of the patient's data.

Each item extracted from the fax packet is also tied back to the source record as a spliced document, that is, an EHR-ready segment of the document attached to an EHR-friendly category. Think: labs go in the "labs" folder, and so on. The good news: no clinical or administrative context is lost along the way. Lab values exist as tabular data, which can be combined and visualized alongside lab values from other sources, with provenance to the original source document.
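Once lab values are rows rather than pixels, trending them across facilities is a small operation. The sketch below uses made-up rows and a hypothetical `trend` helper; the field names and source labels are assumptions for the example.

```python
from datetime import date

# Rows extracted from two different fax packets (all values are made up).
facility_a = [
    {"test": "Glucose", "value": 98, "unit": "mg/dL", "date": date(2024, 1, 15), "source": "packet-a, page 212"},
    {"test": "Glucose", "value": 110, "unit": "mg/dL", "date": date(2024, 6, 2), "source": "packet-a, page 640"},
]
facility_b = [
    {"test": "Glucose", "value": 104, "unit": "mg/dL", "date": date(2024, 3, 9), "source": "packet-b, page 17"},
    {"test": "Sodium", "value": 140, "unit": "mmol/L", "date": date(2024, 3, 9), "source": "packet-b, page 17"},
]


def trend(test_name, *sources):
    """Merge rows for one test across sources, ordered by collection date."""
    rows = [row for source in sources for row in source if row["test"] == test_name]
    return sorted(rows, key=lambda row: row["date"])


for row in trend("Glucose", facility_a, facility_b):
    # Each row keeps its provenance, so a reviewer can open the source page.
    print(row["date"].isoformat(), row["value"], row["unit"], "<-", row["source"])
```

The merged series interleaves results from both facilities by date while every row still names the page it was read from.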

The Bottom Line

Indexation sits at the core of the curated data Predoc provides customers. During this step, we transform what arrives as a massive, unstructured fax packet into something usable. What would traditionally take clinical staff hours or days to accomplish before a patient visit is done automatically by Predoc, preparing usable patient data for downstream workflows. Notably, each piece of data is also tied back to its original source, preserving context and auditability. The result is patient data that is immediately usable, both for clinicians reviewing records and for systems powering analytics, workflows, and AI, without the traditional administrative burden.
