
Capability 2: Terminology Mapping — Speaking the Same Clinical Language

How Predoc resolves fragmented clinical vocabularies into a standardized, queryable language

Raihan Nayeem

Director of Data Engineering

April 15, 2026
This post is part of a series of deep dives on the capabilities that power Predoc's curated data layer. Read the overview by Raihan Nayeem →

Healthcare data doesn't just vary in format; it varies in language. The same concept can be represented across ICD-10, SNOMED CT, LOINC, and CPT, and in many different spellings, abbreviations, and free-text variations.

Terminology mapping ensures that all of these variations resolve to a standardized clinical concept. Consider Hemoglobin, for example. It might be represented as Hemoglobin, Hgb, Hb, HgB, Hemoglobin A1c, HbA1c, or even a misspelling like hemaglobin.

In structured systems, it may also map to different LOINC (Logical Observation Identifiers Names and Codes) codes depending on whether it's a standard hemoglobin or an A1c, the specimen type, and the measurement method. Without proper mapping, these can be treated as entirely separate concepts.

This is Capability #2 in building a curated data layer: not just extracting data, but making sure it speaks a consistent clinical language, per USCDI (United States Core Data for Interoperability) vocabulary standards.

Because without that, systems duplicate what should be unified, analytics miss what should be obvious, and clinical and financial workflows fall out of sync.

Most healthcare data pipelines don't solve this — they either assume the data already comes clean, or ignore the duplicates.

But in reality, terminology is where fragmentation begins; if you don't fix it here, you carry that inconsistency all the way through the system.

Data Without a Shared Language

Take a simple medication: metformin, commonly used to treat Type 2 diabetes. Across different sources, the related information about metformin might appear like this:

  • In a provider note (unstructured text): "Patient continues metformin 500 twice daily"
  • In another document: "Metformin HCl 500 mg BID"
  • In a medication list: "Glucophage 500mg" (the brand name)
  • In a pharmacy feed: NDC Code for a specific package of Metformin
  • In a standardized system: Normalized to RxNorm code for ingredient, clinical drug, or branded drug concept
  • And in some cases: Just "Metformin," with no dosage, no frequency, and no code

A properly mapped system can often normalize these different representations to the appropriate RxNorm concept level, allowing related medication data to be grouped and interpreted consistently across sources. When the source includes enough detail, that normalization may resolve to a more specific clinical drug concept. When it does not, the system may only be able to normalize to a broader ingredient-level concept. For example, an NDC may identify a packaged product, while RxNorm may represent the ingredient, a branded drug, or a more specific clinical drug at a different level of granularity. Free text may indicate that a patient is taking metformin, but it does not always provide enough information to support a fully specified regimen with confidence.

Without terminology mapping, these may be treated as different medications (brand vs generic), separate entries (due to formatting differences), or incomplete records (missing codes or structure).
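The granularity-aware fallback described above can be sketched in a few lines. This is a hedged illustration, not Predoc's implementation: the concept labels are placeholders rather than real RxCUIs, and real normalization would consult RxNorm itself rather than a regex:

```python
# Illustrative sketch of granularity-aware medication normalization.
# Concept labels are placeholders, NOT real RxNorm identifiers.
import re

def normalize_medication(text: str) -> dict:
    """Map a free-text medication mention to the most specific concept the
    text supports: a clinical-drug concept when strength is present, else a
    broader ingredient-level concept."""
    t = text.lower()
    # Brand name resolves to the same ingredient as the generic.
    if "metformin" not in t and "glucophage" not in t:
        return {"level": "unmapped", "concept": None}
    strength = re.search(r"(\d+)\s*mg", t)
    if strength:
        return {"level": "clinical_drug",
                "concept": f"metformin {strength.group(1)} mg oral product"}
    return {"level": "ingredient", "concept": "metformin"}
```

With this shape, "Metformin HCl 500 mg BID" and "Glucophage 500mg" collapse to the same clinical-drug concept, while a bare "Metformin" mention only normalizes to the ingredient level, mirroring the confidence distinction the paragraph above describes.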

What's missing from most of these representations is the standardized clinical code — the identifier that tells every downstream system exactly what that medication is. And that's a problem, because in modern healthcare, data without standardized codes is data that loses meaning.

Interoperability frameworks such as USCDI strongly favor the use of standard vocabularies such as RxNorm, LOINC, SNOMED CT, and ICD-10 in the appropriate contexts. In practice, the more clinical data can be anchored to recognized coding systems, the more reliably organizations can exchange, reconcile, and analyze it across systems.

Most Systems Don't Solve This

Many downstream systems — EHRs, data lakes, interoperability layers, analytics tools — are designed to work best when the incoming data are already structured and coded. In other words, they assume the data already comes clean. They rely on EHR-generated data with embedded codes or HIE feeds that are already standardized.

The challenge with EHR-generated data is that it often arrives in C-CDA (Consolidated Clinical Document Architecture) format, which is XML with extensive tagging. C-CDA is optimized for exchange, not for querying: it is deeply nested, and extracting values requires understanding and navigating that structure. In addition, EHR-generated data often contains PDFs, binary files filled with unstructured or semi-structured text that cannot be queried directly.

The assumption of clean data breaks immediately when you introduce PDFs from prior providers, faxed documents, and free-text clinical notes.

In these cases, the data are rich in description, but they are missing standardized codes. Many systems stop at acquisition of the data and will either store it as-is (unstructured, unstandardized), or attempt shallow normalization without true clinical mapping. Ask any physician who has searched a PDF for a lab value embedded in a progress note.

How Predoc's Curated Data Layer Solves for This

At Predoc, terminology mapping isn't a single step — it's a layered system.

1. Code Inferencing: Creating Structure Where None Exists

A large portion of extracted data, especially from documents, does not come with codes. The first challenge is whether we can infer the correct clinical code from the available information.

This happens in three steps:

Deterministic Matching: If the data is complete (e.g., medication name + dosage form), it can be matched directly to a known code.

Probabilistic / AI-Assisted Matching: If the data is incomplete or inconsistent, the system identifies the best candidates and uses models to determine the most likely match.

Clinical Context Application: In more ambiguous cases, additional clinical context — such as data type, adjacent notes, units, specimen, care setting, etc. — is applied to refine the mapping.

Each step builds toward a confidence score, with deterministic matching yielding the highest confidence. Only high-confidence mappings are used externally; if the system is not confident, it does not invent certainty. This is critical in healthcare, where incorrect data is often worse than missing data.
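The three steps above can be sketched as a layered pipeline. All function names, candidate tables, scores, and the 0.85 threshold below are illustrative assumptions, not Predoc internals; the point is the shape of the fallback chain and the confidence gate at the end:

```python
# Hedged sketch of a layered code-inference pipeline; tables, scores, and
# the threshold are illustrative assumptions, not Predoc's actual values.
EXACT_TABLE = {"metformin hcl 500 mg oral tablet": ("RxNorm", "metformin-500-tab")}

def deterministic_match(text):
    """Step 1: complete data matches directly to a known code."""
    code = EXACT_TABLE.get(text.strip().lower())
    return (code, 0.99) if code else (None, 0.0)

def probabilistic_match(text):
    """Step 2: stand-in for a model ranking candidate codes for
    incomplete or inconsistent input."""
    if "metformin" in text.lower():
        return (("RxNorm", "metformin-ingredient"), 0.80)
    return (None, 0.0)

def apply_clinical_context(candidate, score, context):
    """Step 3: adjacent signals (data type, care setting, units)
    refine the mapping and adjust confidence."""
    if candidate and context.get("source") == "medication_list":
        score = min(score + 0.10, 1.0)
    return candidate, score

def infer_code(text, context, threshold=0.85):
    candidate, score = deterministic_match(text)
    if candidate is None:
        candidate, score = probabilistic_match(text)
        candidate, score = apply_clinical_context(candidate, score, context)
    # Below threshold the mapping stays internal only: no invented certainty.
    return {"code": candidate, "confidence": score,
            "external_ok": candidate is not None and score >= threshold}
```

The design point is that each layer only runs when the stronger one fails, so the cheapest, most certain path is always tried first.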

2. Internal vs External Truth: A Subtle but Critical Distinction

One of the most important nuances in terminology mapping is that not all inferred data are treated equally.

  • High-confidence mappings can be used for interoperability
  • Lower-confidence mappings are used internally for data improvement

For example, if a medication is inferred from a document, and the same medication appears in an HIE feed with a confirmed code, the inferred code can be used to identify duplicates and remove redundancy, which improves data quality.

The key principle is not to present low-confidence inferences as if they were source-confirmed facts. Strong systems preserve provenance, distinguish documented values from inferred normalizations, and apply confidence thresholds before using mapped concepts in external workflows. Data aggregators operate as exchangers of information, so any low-confidence clinical code that we infer is used exclusively internally for data enhancement and aggregation purposes.
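One way to picture the internal/external split is a deduplication pass that uses an inferred code to collapse a document-derived record into its HIE-confirmed twin, while only the confirmed record is ever exposed externally. The record shape and field names below are assumptions for illustration:

```python
# Illustrative sketch: a low-confidence inferred code is used internally to
# deduplicate against a source-confirmed record. Field names are assumptions.
records = [
    {"source": "hie_feed",    "code": "RX:metformin", "confirmed": True,  "confidence": 1.0},
    {"source": "scanned_pdf", "code": "RX:metformin", "confirmed": False, "confidence": 0.7},
]

def deduplicate(records):
    """Group records sharing a code; expose only source-confirmed facts
    externally, but count inferred matches internally as corroborating
    evidence that improves data quality."""
    by_code = {}
    for r in records:
        by_code.setdefault(r["code"], []).append(r)
    merged = []
    for code, group in by_code.items():
        confirmed = [r for r in group if r["confirmed"]]
        merged.append({
            "code": code,
            "external_record": confirmed[0] if confirmed else None,  # provenance preserved
            "internal_evidence": len(group),
        })
    return merged
```

Here the inferred PDF record never leaves the system as a fact in its own right; it simply tells the aggregation layer that the two entries describe the same medication.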

Terminology mapping powers data aggregation and deduplication behind the scenes — which we will address in our data synchronization capability deep dive.

3. Crosswalks: Translating Between Coding Systems

Even when codes exist, there's another challenge: different systems use different coding standards.

For medications alone, you might see NDC (FDA) or RxNorm; for labs, LOINC; and for conditions, ICD-10 or SNOMED.

Terminology mapping must handle crosswalks: translating one coding system into another and normalizing everything to a standard (e.g., RxNorm for medications).

Predoc handles this in two primary ways:

  • Leveraging direct mappings published by coding system maintainers — Many standards bodies provide official crosswalks (e.g., mappings between NDC and RxNorm). These serve as the first and most reliable layer of translation.
  • Inferring deterministic mappings based on shared properties — When direct mappings are incomplete or unavailable, codes can be translated by matching underlying attributes — such as substance name, dosage form, and strength — to identify the equivalent concept in another system.

This ensures that:

  • Data from different sources can be combined
  • Queries return complete results
  • Interoperability is supported across systems that speak different vocabularies

Without crosswalks, even "coded" data remains fragmented.
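The two crosswalk strategies can be sketched as a lookup with an attribute-matching fallback. The NDC value, concept IDs, and attribute set below are placeholders, not real codes or Predoc's actual tables:

```python
# Illustrative crosswalk sketch; the NDC, concept IDs, and attributes
# are placeholders, not real codes.
OFFICIAL_NDC_TO_RXNORM = {"00000-0000-00": "rx:metformin-500-tab"}  # published mapping

RXNORM_ATTRS = {
    "rx:metformin-500-tab": {"ingredient": "metformin",
                             "strength": "500 mg",
                             "form": "oral tablet"},
}

def crosswalk(ndc, attrs=None):
    """Translate an NDC to an RxNorm concept: try the official published
    crosswalk first, then fall back to matching shared properties
    (substance, strength, dose form)."""
    if ndc in OFFICIAL_NDC_TO_RXNORM:
        return OFFICIAL_NDC_TO_RXNORM[ndc]
    if attrs:
        for rxcui, candidate in RXNORM_ATTRS.items():
            if candidate == attrs:
                return rxcui
    return None
```

The fallback path is what keeps "coded" data usable when standards bodies haven't published a direct mapping for a particular pair of systems.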

Why This Matters Clinically

From a clinical perspective, terminology mapping is about accuracy and completeness.

Terminology mapping is the difference between "We think this patient is or was on this medication" and "We know exactly what this patient is taking, across all sources."

And when these data are used as part of a curated data set — available not just in the EHR, but also in a data warehouse — their utility to revenue cycle, care management, analytics, and other teams within a health organization is amplified.

Healthcare data doesn't just need to be extracted; it needs to be understood in a standardized clinical language.

Without terminology mapping, everything built on top — clinical care, analytics, revenue cycle — starts to break.

Terminology mapping isn't just a feature. It's the backbone of usable healthcare data.

About the Author

Raihan Nayeem

Director of Data Engineering

Want to Learn More?

See how healthcare organizations are putting these ideas into practice with Predoc.

View Customer Stories