“Pick one LLM and build everything around it.” That mantra describes how most agentic Retrieval-Augmented Generation (RAG) deployments get started. A single vendor relationship simplifies everything. It feels like a reasonable shortcut until the model your pipeline depends on gets deprecated on a Tuesday. (It’s always a Tuesday.) If that model is your embedding model, the replacement’s vectors are mathematically incompatible with the old ones, so re-embedding becomes a full migration project while your knowledge base sits incoherent.
Token economics in agentic pipelines behave like compound interest in reverse. The proof-of-concept bill is a rounding error; production, with multi-step reasoning loops at real query volume, has its own invoice line. Agentic RAG is iterative. Agents plan, retrieve, grade documents, decide whether the context is sufficient, loop back when it isn’t, then synthesize across sources and check their own outputs. That workflow can generate multiple discrete model calls per user query.
The less obvious tax is running a frontier model on every step when most don’t need one. Query decomposition is pattern matching. Document grading and hallucination checking are structured comparisons against retrieved context. Each of those steps costs the same per-token rate as your final synthesis call, and that’s where single-model pipelines quietly drain budget without improving quality.
LLM flexibility changes this directly. Progress Agentic RAG lets you assign a different model to each experience and each component, so final synthesis can run on a frontier model while decomposition, grading and verification run on cheaper ones.
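To make that concrete, here is a minimal sketch of per-component model assignment. The model names and the `call_llm` stub are hypothetical placeholders, not Progress Agentic RAG's actual API:

```python
# Hypothetical per-component model assignment; only synthesis needs
# frontier-tier quality.
MODEL_FOR_STEP = {
    "decompose": "efficient-small",  # query decomposition: pattern matching
    "grade": "efficient-small",      # document grading vs. retrieved context
    "verify": "efficient-small",     # hallucination check vs. retrieved context
    "synthesize": "frontier-large",  # the one user-visible answer
}

def call_llm(model: str, prompt: str) -> str:
    # Stand-in for an OpenAI-compatible chat completion call.
    return f"[{model}] {prompt[:40]}"

def answer_query(query: str, retrieve) -> str:
    plan = call_llm(MODEL_FOR_STEP["decompose"], query)
    docs = retrieve(plan)
    grades = call_llm(MODEL_FOR_STEP["grade"], f"query: {query}\ndocs: {docs}")
    draft = call_llm(MODEL_FOR_STEP["synthesize"], f"{query}\ncontext: {grades}")
    call_llm(MODEL_FOR_STEP["verify"], f"draft: {draft}\ndocs: {docs}")
    # A real agent loops back to retrieval when grading or verification
    # fails, so four calls per query is the floor, not the ceiling.
    return draft
```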
Here’s a rough per-query cost comparison (illustrative, based on current provider pricing from OpenAI, Anthropic and Google):
| Experience | Single Frontier Model | Paired Configuration |
| --- | --- | --- |
| External portal (summary + results) | ~$0.09/query | ~$0.04/query |
| Internal knowledge portal | ~$0.09/query | ~$0.01/query |
That’s a nine-times difference on the internal portal, with a quality difference your employees won’t notice.
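The arithmetic behind numbers like these is easy to reproduce. The sketch below uses hypothetical prices and token counts (not the exact inputs behind the table) to show how the per-step model mix drives per-query cost:

```python
# Back-of-envelope per-query cost: prices are ($/1M input, $/1M output)
# tokens, and both prices and token counts are illustrative.
PRICE = {"frontier": (3.00, 15.00), "efficient": (0.15, 0.60)}

# (step, input_tokens, output_tokens) for one agentic query
STEPS = [
    ("decompose", 800, 100),
    ("grade", 1500, 50),
    ("verify", 1200, 80),
    ("synthesize", 3000, 600),
]

def query_cost(model_for_step: dict) -> float:
    total = 0.0
    for step, tok_in, tok_out in STEPS:
        p_in, p_out = PRICE[model_for_step[step]]
        total += tok_in / 1e6 * p_in + tok_out / 1e6 * p_out
    return total

single = {step: "frontier" for step, _, _ in STEPS}
paired = dict(single, decompose="efficient", grade="efficient", verify="efficient")
internal = {step: "efficient" for step, _, _ in STEPS}

for name, cfg in [("single frontier", single), ("paired", paired), ("all efficient", internal)]:
    print(f"{name}: ${query_cost(cfg):.4f}/query")
```

Under these assumptions the paired configuration roughly halves the single-model cost and the all-efficient configuration cuts it by more than an order of magnitude; your ratios will depend on your own token counts.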
Reality Check: Model flexibility, in the sense that matters right now, means choosing the right model per experience and per component. Automatic per-query routing that dynamically picks a model based on sensitivity inside a single pipeline is a different, more complex capability. Progress Agentic RAG supports model choice per experience today. Build for that; position your architecture for what comes next.
Cost gets the budget conversation. Quality is what keeps agentic systems running in production.
No single model dominates every benchmark. Rankings at the LMArena leaderboard reorder with every major release. A model that leads on creative synthesis may underperform on strict structured output. One fine-tuned for clinical notes may outperform general-purpose models on healthcare retrieval but struggle with code generation.
In a hardcoded single-provider pipeline, you get whatever your vendor decided their model should excel at this quarter. That’s fine until your structured data extraction step returns prose when your downstream application expected JSON. The failure doesn’t throw an exception. The generator returns conversational text that your application happily processes as valid until something breaks visibly at 2 a.m., long after a batch job has produced malformed outputs for hours. (Your on-call rotation definitely doesn’t include the model provider’s support line.)
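A cheap guardrail makes that failure loud instead of silent. This sketch, assuming the extraction step should always return a JSON object, rejects prose before it reaches downstream code:

```python
import json

def parse_extraction(raw: str) -> dict:
    """Fail fast when the model returns prose instead of JSON,
    rather than letting malformed output flow downstream."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned non-JSON output: {raw[:80]!r}") from exc
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed
```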
Progress Agentic RAG supports dozens of LLMs via OpenAI-compatible APIs, across cloud-hosted, open-source, self-hosted and region-pinned deployments. Match model strengths to task requirements rather than forcing task requirements to fit whichever provider you onboarded first.
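Because the endpoints speak the same OpenAI-compatible protocol, provider choice collapses to configuration. A sketch using the openai Python SDK, with placeholder URLs and model names:

```python
import os

from openai import OpenAI  # pip install openai

# The same client code works against any OpenAI-compatible endpoint, so
# swapping providers is a config change. URLs and model names are
# placeholders, not specific products.
ENDPOINTS = {
    "cloud": "https://api.example-provider.com/v1",
    "self_hosted": "http://llm.internal:8000/v1",  # e.g., a vLLM server
}

def client_for(deployment: str) -> OpenAI:
    return OpenAI(base_url=ENDPOINTS[deployment],
                  api_key=os.environ["LLM_API_KEY"])

resp = client_for("self_hosted").chat.completions.create(
    model="local-efficient-model",
    messages=[{"role": "user", "content": "Grade this chunk for relevance."}],
)
print(resp.choices[0].message.content)
```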
Most teams think about model lock-in at the application layer. The deeper trap is in the data layer.
Your embedding vectors, the numerical representations that power semantic retrieval, only work with the model that generated them. When OpenAI superseded text-embedding-ada-002 with text-embedding-3-small (one-fifth the price, better accuracy), teams that depended on the legacy model faced a full re-embedding migration. New vectors are not mathematically compatible with old ones. You cannot partially migrate. Every document re-embeds from scratch. Teams that kept only the vectors, not the source documents, found out what “crisis” means in this context.
Four practices prevent this from becoming your problem: keep the source documents, not just the vectors; treat the embedding model as configuration rather than hardcoded logic; record which model produced each vector; and rehearse the parallel-store cutover before a deprecation forces it.
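The metadata practice is the one most teams skip. A minimal sketch, with illustrative field names, that makes mixing embedding spaces structurally impossible:

```python
# Store the embedding model's identity alongside every vector and filter
# on it at query time, so two embedding spaces are never compared.
from dataclasses import dataclass

@dataclass
class StoredChunk:
    doc_id: str
    text: str             # keep the source text: re-embedding needs it
    vector: list[float]
    embedding_model: str  # e.g., "text-embedding-3-small"

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def search(store: list[StoredChunk], query_vec: list[float], model: str):
    # Only compare vectors produced by the same embedding model.
    candidates = [c for c in store if c.embedding_model == model]
    return sorted(candidates, key=lambda c: -dot(c.vector, query_vec))
```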
Key Insight: The cutover window is the real cost of embedding lock-in. It isn’t the re-embedding compute; it’s running parallel vector stores during migration and validating retrieval quality against the new embedding space without degrading production.
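One way to manage that window is a shadow read: production keeps serving from the old store while every query also runs against the new one and the overlap is measured. The retrieval interfaces and the `log_for_review` hook below are assumptions, not a specific vector database's API:

```python
# Cutover check: serve production from the old embedding space while
# shadow-querying the new one and flagging queries whose results diverge.

def overlap_at_k(old_ids: list[str], new_ids: list[str], k: int = 10) -> float:
    return len(set(old_ids[:k]) & set(new_ids[:k])) / k

def log_for_review(query: str, score: float) -> None:
    print(f"low overlap ({score:.0%}) for: {query!r}")

def shadow_check(query: str, old_store, new_store, threshold: float = 0.6):
    old_ids = [hit.doc_id for hit in old_store.search(query)]
    new_ids = [hit.doc_id for hit in new_store.search(query)]
    score = overlap_at_k(old_ids, new_ids)
    if score < threshold:
        log_for_review(query, score)
    return old_ids  # production traffic keeps using the old space
```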
In regulated industries, model flexibility stops being a preference and becomes a legal requirement.
The Health Insurance Portability and Accountability Act (HIPAA) requires any service processing Protected Health Information (PHI) to sign a Business Associate Agreement (BAA). Several consumer AI platforms explicitly prohibit PHI in their terms of service. If your agentic RAG pipeline retrieves documents containing PHI, and at scale that’s essentially guaranteed, the model handling inference must come from a BAA-covered provider. The non-obvious failure mode: PHI lives in the retrieved chunks, not just the user’s query. Classify both sides.
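Classifying both sides can be as simple as running the same detector over the query and the retrieved chunks before picking an endpoint. The sketch below uses a toy regex as a stand-in; in practice `detect_phi` would be a real classifier or DLP service:

```python
import re

# SSN-shaped strings only; a real PHI detector covers far more than this.
PHI_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]

def detect_phi(text: str) -> bool:
    return any(re.search(p, text) for p in PHI_PATTERNS)

def choose_endpoint(query: str, chunks: list[str]) -> str:
    # PHI can arrive in the retrieved context even when the query is clean,
    # so classify both sides before routing.
    if detect_phi(query) or any(detect_phi(c) for c in chunks):
        return "baa-covered-endpoint"  # BAA-covered or self-hosted deployment
    return "general-endpoint"
```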
Financial services teams face parallel constraints under SEC Rule 17a-4 and GDPR Articles 44–49. An EU customer’s query routed to a US-hosted model may be a non-compliant cross-border transfer even if the underlying data never crosses the Atlantic, because the model itself processes personal data during inference. The auditor knows this. Your architecture should too.
Progress Agentic RAG lets you pin a given experience to a self-hosted or region-specific model through configuration: route EU user traffic to an EU-hosted endpoint, route PHI-bearing knowledge bases to a self-hosted deployment. Choose each experience’s model based on who uses it and what data it touches, then pick providers whose infrastructure and data processing agreements fit your compliance posture. Choosing flexibility at design time is a sprint; retrofitting after an audit finding is a career event.
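Expressed as configuration, that pinning might look like the sketch below; the experience names, endpoints and regions are illustrative, not product syntax:

```python
# Per-experience model pinning as declarative configuration.
EXPERIENCES = {
    "eu_customer_portal": {"endpoint": "https://eu.llm.example.com/v1",
                           "region": "eu-west"},   # EU-hosted inference
    "phi_knowledge_base": {"endpoint": "http://llm.internal:8000/v1",
                           "region": "on-prem"},   # self-hosted for PHI
    "internal_wiki":      {"endpoint": "https://api.example-provider.com/v1",
                           "region": "us-east"},
}

def endpoint_for(experience: str) -> str:
    return EXPERIENCES[experience]["endpoint"]
```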
The fastest way to find where model flexibility pays off is to look at your current spend by experience. Which surfaces are billing you for frontier quality your users can’t perceive? Which components are running pattern-matching tasks at synthesis prices?
Start with the Progress Agentic RAG model reference and map each experience to the model tier its traffic justifies. Build for optionality now, or rehearse the meeting you’ll have the next time a deprecation notice lands.
**What does “LLM flexibility” actually mean?**
It means you can swap LLM providers or models without rewriting your retrieval and orchestration logic, and you can choose different models for different parts of the experience (for example, a higher-quality model for final synthesis and a cheaper model for lightweight steps like snippets, grading or classification). In practice, this is usually implemented via OpenAI-compatible endpoints and/or an abstraction layer (LangChain, LlamaIndex or a custom wrapper) so that model choice is configuration, not code surgery.
**Why does changing embedding models force a full re-embed?**
Because your vector store is only comparable within a single embedding space. If you change embedding models, you generally can’t mix old and new vectors and expect retrieval to work, so you’re facing a full re-embed plus a migration and validation window. The real risk isn’t the compute cost; it’s the operational cutover: running parallel stores, validating retrieval quality and avoiding production degradation while you migrate.
**When is a frontier model worth paying for?**
Pay for it when the user-visible output depends on nuanced synthesis (external, customer-facing answers; sensitive or high-stakes outputs). It’s often wasteful for “structured” or repetitive steps inside the agent loop (query decomposition, document grading, link descriptions, hallucination checks) where cheaper models perform adequately. Splitting models by experience (external vs. internal portal) and by component (summary vs. snippets) is usually the fastest way to reduce spend without hurting perceived quality.
Adam Bertram is a 25+ year IT veteran and an experienced online business professional. He’s a successful blogger, consultant, 6x Microsoft MVP, trainer, published author and freelance writer for dozens of publications. For how-to tech tutorials, catch up with Adam at adamtheautomator.com, connect on LinkedIn or follow him on X at @adbertram.