Practical AI Starts with the Work You Only Do Once

Why data preparation, curation and harmonization are the unsung economics of enterprise AI.

When organizations think about AI cost, they think about model pricing. Pennies per million tokens, dropping every quarter, easy to compare across vendors. So why is enterprise AI still so expensive to scale? Because the model is the cheap part. The expensive part is what you make the model read.

The hidden bill

Reading a single twenty-page PDF can cost an AI workload around 80,000 input tokens. The same content, prepared as semantically structured Markdown, costs around 4,000. The information is identical. What changed is the work done before the model ever saw it.

That work—extracting, cleaning, classifying, harmonizing and enriching—is what enterprise data teams call data preparation. It is the most underpriced lever in AI economics. And when it is done inside a governed platform, the savings compound across every workload that follows.

RAW PDF
80,000 input tokens per call (4,000 per page × 20 pages)
Every call re-tokenizes every page. Every time.
Sample cost (Sonnet 4.6 input): $0.24 per call
At 10,000 calls per month: $2,400 per month

PREPARED MARKDOWN
4,000 input tokens per call (200 per page × 20 pages)
Prepared once. Called many times. Cost amortizes.
Sample cost (Sonnet 4.6 input): $0.012 per call
At 10,000 calls per month: $120 per month
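
The arithmetic behind these two figures can be reproduced in a few lines. This is a minimal sketch, not a pricing tool: the rate of $3.00 per million input tokens is an assumption inferred from the numbers above ($0.24 for 80,000 tokens), and the function names are illustrative.

```python
from decimal import Decimal

# Assumption: $3.00 per million input tokens. This is the rate implied by the
# figures above ($0.24 for 80,000 tokens); check current vendor pricing.
PRICE_PER_MILLION_INPUT_TOKENS = Decimal("3.00")


def input_cost_per_call(tokens_per_page: int, pages: int) -> Decimal:
    """Dollar cost of the input tokens for one call that reads every page."""
    tokens = tokens_per_page * pages
    return PRICE_PER_MILLION_INPUT_TOKENS * tokens / Decimal(1_000_000)


def monthly_input_cost(tokens_per_page: int, pages: int, calls: int) -> Decimal:
    """Dollar cost of input tokens for a month of identical calls."""
    return input_cost_per_call(tokens_per_page, pages) * calls


raw_call = input_cost_per_call(tokens_per_page=4_000, pages=20)
prepared_call = input_cost_per_call(tokens_per_page=200, pages=20)
raw_month = monthly_input_cost(4_000, 20, calls=10_000)
prepared_month = monthly_input_cost(200, 20, calls=10_000)

print(f"Raw PDF:           ${raw_call:.3f} per call, ${raw_month:,.2f} per month")
print(f"Prepared Markdown: ${prepared_call:.3f} per call, ${prepared_month:,.2f} per month")
```

Swapping in your own page counts, token densities and call volumes is the quickest way to see whether the same gap shows up in your workloads.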

Most enterprise AI workloads are still ingesting raw documents at runtime: (1) every chatbot query that rereads a PDF; (2) every agent that re-parses a contract; and (3) every retrieval that re-tokenizes the same source page hundreds of times a week. Each call pays the full price; none of the cost is amortized.

Multiply that across multiple AI applications per organization, multiple users per application, multiple calls per user, multiple sources per call and multiple iterations as prompts evolve. The bill is not in the model—it is in the format.
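
To see how quickly those factors stack up, here is a rough sketch of the multiplication. Every usage figure in it is a hypothetical placeholder chosen for illustration, not a number from this article; only the two per-source token counts come from the example above.

```python
from math import prod

# Hypothetical usage profile. Replace every value with your own numbers.
usage = {
    "applications": 5,
    "users_per_application": 200,
    "calls_per_user_per_month": 20,
    "sources_per_call": 3,
    "prompt_iterations": 2,
}

TOKENS_PER_RAW_SOURCE = 80_000      # the 20-page raw PDF from the example above
TOKENS_PER_PREPARED_SOURCE = 4_000  # the same document as prepared Markdown

# Each factor multiplies the number of times a source is read in a month.
source_reads_per_month = prod(usage.values())
raw_tokens = source_reads_per_month * TOKENS_PER_RAW_SOURCE
prepared_tokens = source_reads_per_month * TOKENS_PER_PREPARED_SOURCE

print(f"Source reads per month: {source_reads_per_month:,}")
print(f"Raw input tokens:       {raw_tokens:,}")
print(f"Prepared input tokens:  {prepared_tokens:,}")
```

The factors multiply rather than add, which is why a modest per-read difference in format turns into a large monthly difference in tokens.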

Three jobs, not one

Preparation, Curation, Harmonization

Three jobs sit underneath the phrase “data preparation,” and they each do different work.

  • Preparation: The Mechanical Layer

    Converting PDFs to Markdown. Stripping boilerplate. Normalizing tables. Running optical character recognition (OCR) on scanned pages. Removing the parts no model should ever spend tokens reading. This is where the raw token reduction happens.

  • Curation: The Editorial Layer

    Deciding which sources matter, which versions are canonical and which should be retired. This is what stops AI from confidently quoting last year’s policy or a deprecated SOP. It is the difference between “The model has access to everything” and “The model has access to what is true.”

  • Harmonization: The Semantic Layer

    Reconciling the same concept expressed three different ways across five different systems. Aligning vocabularies. Linking entities. Building the relationships between things that retrieval can actually traverse. This is what makes results sharp instead of scattershot.

Done together, these three jobs convert raw enterprise content into a trusted context layer—the kind of input that produces defensible answers at a fraction of the cost.
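
As a toy illustration of how the three jobs differ, the sketch below gives each one a single small function. It is an assumption-laden simplification, not how any particular platform implements these layers: the repeated-line heuristic, the `DocumentVersion` record, the `CANONICAL_TERMS` vocabulary and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Preparation: drop lines (running headers, footers) repeated on most pages."""
    counts: Counter[str] = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # A line must appear on at least two pages before it counts as boilerplate.
    threshold = max(2, min_share * len(pages))
    return [
        "\n".join(l for l in page.splitlines() if counts[l.strip()] < threshold)
        for page in pages
    ]


@dataclass(frozen=True)
class DocumentVersion:
    doc_id: str
    effective: date
    retired: bool = False


def canonical_versions(versions: list[DocumentVersion]) -> dict[str, DocumentVersion]:
    """Curation: keep only the newest non-retired version of each document."""
    latest: dict[str, DocumentVersion] = {}
    for version in versions:
        if version.retired:
            continue
        current = latest.get(version.doc_id)
        if current is None or version.effective > current.effective:
            latest[version.doc_id] = version
    return latest


# Hypothetical vocabulary: three source-system labels for one concept.
CANONICAL_TERMS = {"client": "customer", "account holder": "customer", "customer": "customer"}


def harmonize(label: str) -> str:
    """Harmonization: map a source-system label to its canonical term."""
    key = label.strip().lower()
    return CANONICAL_TERMS.get(key, key)


pages = [f"ACME Policy Manual\nSection {n} text\nConfidential" for n in (1, 2, 3)]
cleaned = strip_repeated_lines(pages)
versions = [
    DocumentVersion("travel-policy", date(2024, 1, 1)),
    DocumentVersion("travel-policy", date(2025, 1, 1)),
    DocumentVersion("old-sop", date(2023, 6, 1), retired=True),
]
canonical = canonical_versions(versions)
labels = [harmonize(x) for x in ("Client", "Account Holder", "customer")]
```

Real preparation, curation and harmonization involve far more than this, but the split holds: the first function removes tokens, the second removes stale sources, and the third makes different labels resolve to the same thing.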

“The first user pays the conversion cost; every workload after that benefits from the result.”

Convert once. Serve many. Govern always.

Here Is the Principle That Changes the Economics

When data preparation happens once, inside the platform, every subsequent AI workload calls the prepared version—not the source.

[Diagram: the Progress Data Platform as the context layer between raw sources and AI consumers]

  • Raw sources: Documents · PDF · Structured Data · Web & Intranet
  • Progress Data Platform (The Context Layer): 1. Prepare (Convert · Clean · Normalize · OCR); 2. Curate (Canonical · Versioned · Trusted); 3. Harmonize (Semantically Enriched · Entity-linked · Governed). Components shown: Enrichment · Context Engine · Governance Rules. Done once. Shared across the enterprise.
  • Canonical context: Markdown · Semantic graph · Governed Retrieval
  • AI consumers: Chatbots (Customer · Employee); Agents (Workflow · Autonomous); Search & Q&A (Internal · External); Retrieval (RAG · Semantic); Decision Support (Underwriting · Triage); Governance & Audit (Compliance · Review)

First User Pays the Conversion Cost
Every workload after that reuses the prepared context. Cost amortizes. Trust travels with the data.

Linear Cost → Amortized Cost
Per-call expense becomes a platform investment. The math flips at scale.

The first time a contract is ingested, classified, semantically enriched and exposed via governed retrieval, the work is real. After that, every chatbot, agent, search, summary or compliance check that needs that contract calls the cheap, semantically rich version. The PDF is read once. But the answers it powers are produced thousands of times.

This is the difference between treating AI as a per-call expense and treating AI as a platform investment. One scales linearly with usage. The other amortizes. At enterprise volume, the gap between the two curves becomes the difference between an AI program that scales and one that stalls.
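
The two curves can be sketched directly. The per-call costs below are the figures from the example in this article; the one-time preparation cost is a hypothetical placeholder, since the real figure depends on the document and the pipeline.

```python
from math import ceil

RAW_COST_PER_CALL = 0.24        # article figure: raw 20-page PDF, input tokens only
PREPARED_COST_PER_CALL = 0.012  # article figure: prepared Markdown, input tokens only
ONE_TIME_PREPARATION_COST = 50.0  # hypothetical: cost to prepare the document once


def linear_cost(calls: int) -> float:
    """Per-call model: every call pays to reread the raw source."""
    return RAW_COST_PER_CALL * calls


def amortized_cost(calls: int, preparation_cost: float = ONE_TIME_PREPARATION_COST) -> float:
    """Platform model: pay once to prepare, then pay the cheaper per-call rate."""
    return preparation_cost + PREPARED_COST_PER_CALL * calls


def break_even_calls(preparation_cost: float = ONE_TIME_PREPARATION_COST) -> int:
    """Smallest call count at which the amortized model is no more expensive."""
    saving_per_call = RAW_COST_PER_CALL - PREPARED_COST_PER_CALL
    return ceil(preparation_cost / saving_per_call)


n = break_even_calls()
for calls in (100, n, 10_000):
    print(f"{calls:>6} calls: linear ${linear_cost(calls):,.2f}, "
          f"amortized ${amortized_cost(calls):,.2f}")
```

Under these assumptions the preparation cost is recovered within a few hundred calls, and every call after that widens the gap.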

Example

A Concrete Example

Take a single 20-page regulatory policy document, the kind of source that a half-dozen workloads need to read every day: a customer-facing chatbot, an internal Q&A agent, a compliance audit assistant, a legal review tool, an analyst summarizer and an onboarding guide.

800M Tokens

Naïve

Reread the PDF on every call.
 

  1. Each call ingests the raw PDF

    No reusable preparation layer · No governed cache

  2. ~80,000 input tokens/call

    4,000 tokens per page × 20 pages

  3. Cost scales linearly with usage

    No amortization · Every call pays full price

$2,400/month

40M Tokens

Progress Data Platform-Prepared

Convert once. Every call reads the prepared context.

  1. Single ingestion through Progress Data Platform

    Prepared once · Governed, reusable context layer

  2. ~4,000 input tokens/call

    200 tokens per page × 20 pages

  3. Reused by every downstream workload

    Cost amortizes · The first user pays the conversion cost

$120/month

At 10,000 calls per month across the organization, the naïve path consumes roughly 800 million input tokens and costs around $2,400 per month. The Progress® Data Platform-prepared path consumes 40 million tokens and costs around $120 per month. That is a 20X reduction on input alone, on a single document, before you have considered the further savings from sharper retrieval, lower remediation or less human review.

“A 20X lower input bill on a single document...and that is before sharper retrieval, lower remediation or less human review even enter the picture!”

That is one document. Now multiply by the catalog of policies, contracts, manuals, articles and records that any enterprise actually runs against AI.

The Problem

Why This Is a Platform Problem, Not a Pipeline Problem

You can prepare data with scripts. Many organizations do. But scripts produce brittle outputs that one team uses and three teams duplicate. The savings stay local. The trust does not travel. The next AI use case rebuilds the pipeline from scratch—sometimes in the same week, by a different team, against the same source.

A platform changes that. One pipeline produces canonical outputs that every AI workload can rely on. Governance, lineage and access controls travel with the data. Updates propagate automatically, with no stale copies and no version drift. Semantic enrichment is reused, not re-derived, every time a new use case ships.

That is what the Progress Data Platform is built to do. Progress® MarkLogic® software provides the multi-model context engine. The Progress® Semaphore™ platform handles classification and semantic enrichment at ingestion. The Progress® Corticon® decision management system governs the rules around access, quality and routing. Orchestration Studio operationalizes the pipeline. And together, they convert preparation from a per-team script into a per-enterprise asset.

The economics are only the first reason this matters. The deeper one is that the same prepared, governed, semantically enriched context is what makes AI outputs defensible: accurate enough, explainable enough and audit-ready enough to act on. Cheaper tokens are a side effect. Better answers are the point.

Three Questions for Your Next AI Economics Review
  1. How Many Times Do We Re-ingest the Same Content?

    If the answer is “we don’t know,” your AI bill is bigger than it needs to be.

  2. Who Owns the Prepared Data Layer?

    If it sits inside individual AI applications, you are paying for preparation many times. If it sits inside a platform, you are paying once.

  3. What Is Our Cost Per Defensible Answer, Not Per Call?

    The number that matters is not tokens-in. It is whether the output is trusted enough to act on and what it costs to produce.
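
One way to make the third question concrete is sketched below. It is an illustrative assumption, not the full formula referenced later in this article: it simply sums the four cost components named there (compute, retrieval, remediation and human review) and divides by the number of answers judged trusted enough to act on. All dollar amounts and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkloadCosts:
    """Monthly cost components for one AI workload, in dollars."""
    compute: float
    retrieval: float
    remediation: float
    human_review: float

    def total(self) -> float:
        return self.compute + self.retrieval + self.remediation + self.human_review


def cost_per_call(costs: WorkloadCosts, calls: int) -> float:
    """The conventional metric: total cost spread over every call made."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    return costs.total() / calls


def cost_per_defensible_answer(costs: WorkloadCosts, defensible_answers: int) -> float:
    """Assumed metric: total cost spread only over answers trusted enough to act on."""
    if defensible_answers <= 0:
        raise ValueError("defensible_answers must be positive")
    return costs.total() / defensible_answers


# Hypothetical month: 10,000 calls, of which 8,000 produced a defensible answer.
costs = WorkloadCosts(compute=120.0, retrieval=30.0, remediation=200.0, human_review=650.0)
per_call = cost_per_call(costs, calls=10_000)
per_defensible = cost_per_defensible_answer(costs, defensible_answers=8_000)

print(f"Cost per call:              ${per_call:.3f}")
print(f"Cost per defensible answer: ${per_defensible:.3f}")
```

In this made-up month, token spend is the smallest line item; remediation and human review dominate, which is why the share of answers that can be trusted moves the metric more than the price of a call.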

What to do next

Is Your AI Program Scaling—or Stalling?

The lever is rarely the model; it is the work done before the model. Open the calculator and run your own numbers. Continue the deep dive with The cost-per-defensible-answer formula for the full economic model. Or speak to the Progress Data Platform team about a workshop on your specific data preparation surface area.

FAQs

Cost-Per-Defensible-Answer = The Full Formula

Compute, retrieval, remediation and human review across seven enterprise use cases and ten current models. The strategic case for end-to-end trusted context.