Tell your team to find the exact day the quality of your AI assistant’s answers started declining. Not “feels off lately,” but the day Context Relevance (the score for whether the right source content is being retrieved) started dropping. No Slack archaeology, no re-running last quarter’s test queries, no asking Marcus what he changed last Tuesday, no diff-grepping the staging configs. Five minutes. We’ll wait.
Now ask them to pinpoint the day Groundedness (the score for whether answers are actually supported by the retrieved content) started looking suspiciously high while Context Relevance was already underwater — the symptom that means your AI is confidently citing the wrong sources. Same clock. The CIO and the enterprise architect both need that answer. So does the platform engineer who has to fix it. Progress Agentic RAG, the retrieval-augmented generation platform with multi-step reasoning, makes both questions answerable in minutes — but only if your team is using its labs and evaluation tooling.
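Agentic RAG introduces more configuration levers than a basic vector-search-plus-LLM setup, and those levers interact in unexpected ways. You’re not tuning one variable. You’re operating at the intersection of several, including the embedding model, the hybrid search configuration, context expansion, the reranking approach, the LLM and the system prompt.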
Each can improve one metric while silently degrading another.
The failure mode worth planning for isn’t the obvious one. It’s high Groundedness with low Context Relevance: a system that answers confidently from the wrong source content. The retrieved text blocks (the chunked passages an Agentic RAG system pulls from your Knowledge Box and feeds to the LLM) are faithfully quoted; they just aren’t the blocks the user actually needed. Your LLM doesn’t know the difference. Neither does the user. That misconfiguration can run for weeks undetected.
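To make that failure mode concrete, here is a minimal sketch of a filter that flags it. The `Scores` class, the 0-to-1 score scale and both thresholds are assumptions for illustration; they are not part of REMi or any Progress API.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-interaction scores, assumed normalized to 0..1."""
    context_relevance: float
    groundedness: float
    answer_relevance: float


def is_confidently_wrong(s: Scores, low: float = 0.4, high: float = 0.8) -> bool:
    """True when the answer is well supported by blocks that were the wrong blocks."""
    return s.groundedness >= high and s.context_relevance <= low


interactions = [
    Scores(0.90, 0.90, 0.90),  # healthy
    Scores(0.20, 0.95, 0.70),  # faithfully quoting the wrong source
    Scores(0.30, 0.30, 0.20),  # bad, but visibly bad
]

# Indexes of interactions showing the silent failure pattern.
flagged = [i for i, s in enumerate(interactions) if is_confidently_wrong(s)]
print(flagged)
```

Only the second interaction is flagged. The third one scores poorly everywhere, which is the kind of failure people notice on their own.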
Reality Check: “It tested fine” isn’t a QA strategy when what you tested is a different configuration than what shipped.
Without a lab, a typical iteration loop looks like this. Marcus on the platform team swaps the embedding model, and only two test queries run in staging before the Jira ticket closes. Four days later the CIO is asking why the knowledge assistant gave a customer the wrong contractual terms. The branch-test-redeploy cycle takes days and leaves no record of what worked.
That’s what the Prompt Lab and RAG Lab address, less as convenience and more as the mechanism for knowing what you’re actually deploying.
The generation half of your RAG pipeline has its own failure surface, separate from retrieval. A prompt that works on one model will often fail on another with no change to the underlying instructions. Models differ in their tokenizers, post-training data, reinforcement-learning histories and default behaviors, so the same prompt can produce materially different output. The gap between two models on the same prompt cannot be predicted on paper. It has to be measured by running both side by side.
The Prompt Lab gives teams an isolated environment to make that comparison explicit. Switch between Google Gemini, OpenAI, Anthropic, Mistral, Llama and DeepSeek in one click, with real data from your Knowledge Box, using the same retrieved context and your configured system prompt. Anything else your team already runs connects through Progress’s OpenAI-compatible LLM bridge. Teams often discover the gap too late, after deployment: a prompt calibrated for one frontier model produces evasive, over-hedged output on another. The Prompt Lab surfaces that before it becomes a platform-team emergency.
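The shape of a side-by-side comparison can be sketched outside the product as well. The version below is a rough illustration, not the Prompt Lab's implementation: `compare_models`, the `Generator` signature and both stub models are hypothetical stand-ins for real model clients.

```python
from typing import Callable, Dict

# Assumed signature: (system_prompt, retrieved_context, question) -> answer
Generator = Callable[[str, str, str], str]


def compare_models(models: Dict[str, Generator], system_prompt: str,
                   context: str, question: str) -> Dict[str, str]:
    """Run every model on identical inputs so only the model varies."""
    return {name: gen(system_prompt, context, question)
            for name, gen in models.items()}


# Stub generators standing in for two real models.
def model_a(system_prompt: str, context: str, question: str) -> str:
    return f"Per the contract: {context}"


def model_b(system_prompt: str, context: str, question: str) -> str:
    return "I can't be certain. Please consult the source document."


results = compare_models(
    {"model_a": model_a, "model_b": model_b},
    system_prompt="Answer only from the provided context.",
    context="Either party may terminate with 30 days' notice.",
    question="What is the termination notice period?",
)
for name, answer in results.items():
    print(f"{name}: {answer}")
```

Holding the system prompt and retrieved context fixed is the point: any difference in the answers is then attributable to the model.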
Configurations that pass in the lab deploy directly. What you tested is what ships. No drift between the validated version and the live one, which sounds like a low bar until you count how many production AI incidents trace back to exactly that gap.
Retrieval is the upstream constraint on everything else. If it surfaces the wrong text blocks, even slightly wrong, no amount of prompt tuning will fix the output. You’re decorating a bad foundation.
The RAG Lab exposes your retrieval pipeline as a set of testable variables you can tune side by side: hybrid search configurations combining semantic and keyword matching, context expansion (pulling in neighboring text blocks around each match), metadata enhancement parameters and reranking approaches. It runs multiple configurations in parallel without rebuilding infrastructure between tests. What once took weeks (spin up a configuration, run it, tear it down, spin up the next) now happens in hours.
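One common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below uses it purely as an illustration of what a hybrid search setting trades off; the source does not say which fusion method the RAG Lab uses, and the block IDs and `k=60` are assumptions.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each block scores the sum of 1 / (k + rank) over the lists."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, block_id in enumerate(ranking, start=1):
            scores[block_id] = scores.get(block_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by block ID for determinism.
    return sorted(scores, key=lambda b: (-scores[b], b))


semantic = ["b2", "b1", "b3"]  # hypothetical semantic-search order
keyword = ["b1", "b4", "b2"]   # hypothetical keyword-search order

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Blocks that rank well in both lists (`b1`, `b2`) rise above blocks that appear in only one, which is the behavior a hybrid configuration is meant to buy.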
Pro Tip: Context expansion sounds minor until your retrieval is consistently landing on the sentence before the answer rather than on the answer itself.
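As a small illustration of that tip, context expansion can be thought of as widening each match to include its neighbors. The function name, the `window` parameter and the sample text blocks are hypothetical.

```python
from typing import List


def expand_context(blocks: List[str], hit_index: int, window: int = 1) -> List[str]:
    """Return the matched block plus up to `window` neighbors on each side."""
    start = max(0, hit_index - window)
    end = min(len(blocks), hit_index + window + 1)
    return blocks[start:end]


blocks = [
    "Termination is covered in section 9.",              # retrieval lands here
    "Either party may terminate with 30 days' notice.",  # the actual answer
    "Notice must be in writing.",
]

# The match is the sentence before the answer; a window of 1 recovers the answer.
expanded = expand_context(blocks, hit_index=0, window=1)
print(expanded)
```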
A passing RAG Lab configuration can be saved and reused across every application sharing the same Knowledge Box. The retrieval strategy that works for your internal knowledge assistant also works for customer-facing search without re-tuning from scratch.
The labs give you a place to iterate. REMi, Progress’s RAG evaluation model, tells you which lab to return to. Without it, evaluation is qualitative: “this seems better” based on whichever test queries Priya ran last Thursday. That’s intuition dressed up as process.
REMi scores every interaction across the RAG Triad. Two metrics (Context Relevance and Answer Relevance) score the front and back of the generation step. Groundedness scores the connection between them. Each maps to a remediation path in one of the two labs.
You’re not guessing at root cause. The scores tell you where the problem lives.
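A triage rule along these lines might look like the sketch below. The mapping itself is an assumption inferred from the article's split between retrieval (RAG Lab) and generation (Prompt Lab), and the function name, 0-to-1 scale and threshold are hypothetical.

```python
from typing import Optional


def lab_to_revisit(context_relevance: float, groundedness: float,
                   answer_relevance: float, threshold: float = 0.6) -> Optional[str]:
    """Suggest which lab to return to, or None when all scores look healthy."""
    # Retrieval is checked first: wrong blocks cannot be fixed by prompt tuning.
    if context_relevance < threshold:
        return "RAG Lab"
    # Retrieval looks fine, so the problem is assumed to sit in generation.
    if groundedness < threshold or answer_relevance < threshold:
        return "Prompt Lab"
    return None


print(lab_to_revisit(0.3, 0.9, 0.7))  # low Context Relevance
print(lab_to_revisit(0.8, 0.4, 0.7))  # low Groundedness
print(lab_to_revisit(0.8, 0.9, 0.9))  # healthy
```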
Per-interaction scoring becomes practical at scale: continuous monitoring rather than a weekly audit. Your platform team sees regressions while they’re small, not four days later when a customer complains.
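With per-interaction scores stored alongside a date, finding the day a metric started dropping becomes a small calculation. This is a minimal sketch assuming scores are exported as `(day, score)` pairs on a 0-to-1 scale; the function, the baseline window and the tolerance are hypothetical choices, not REMi features.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Iterable, Optional, Tuple


def first_drop_day(scored: Iterable[Tuple[date, float]],
                   baseline_days: int = 3,
                   tolerance: float = 0.1) -> Optional[date]:
    """Return the first day whose mean score falls below baseline - tolerance."""
    by_day = defaultdict(list)
    for day, score in scored:
        by_day[day].append(score)
    daily = [(d, mean(by_day[d])) for d in sorted(by_day)]
    if len(daily) <= baseline_days:
        return None  # not enough history to compare against
    baseline = mean(m for _, m in daily[:baseline_days])
    for d, m in daily[baseline_days:]:
        if m < baseline - tolerance:
            return d
    return None


# Hypothetical Context Relevance scores, one or more per day.
scored = [
    (date(2025, 3, 1), 0.82), (date(2025, 3, 1), 0.78),
    (date(2025, 3, 2), 0.80),
    (date(2025, 3, 3), 0.81),
    (date(2025, 3, 4), 0.79),
    (date(2025, 3, 5), 0.62),
    (date(2025, 3, 6), 0.55),
]
print(first_drop_day(scored))
```

A fixed baseline window is the simplest possible choice. A rolling baseline or a statistical change-point test would be more robust on noisy data.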
Key Insight: REMi doesn’t just report a score. It provides reasoning for the Answer Relevance metric, so your engineers can read why a response missed, not just that it did.
For those building on Progress Agentic RAG, the lab-plus-evaluation loop reframes what “production-ready” means. Production-ready becomes a practice: retrieval and generation under continuous observation, with a structured path back to the lab when metrics shift.
Here’s the move worth making this week. Open the Prompt Lab against your current Knowledge Box, run your top production prompts against a second frontier model and read the score diff before your next standup. The teams whose Slack threads aren’t full of “did anyone change anything last week?” are the ones using the labs.
Labs aren’t a launch checklist. Treat them as ongoing infrastructure. Knowledge-base changes pull you back in, as does adding a new AI experience, drifting REMi scores, a quarterly LLM release or a new prompt pattern your support team wants tested. Plan for regular iteration cycles.
A passing RAG Lab configuration can be saved and deployed across any application sharing the same Knowledge Box. Your platform team tunes once and propagates, but only when the applications have similar query patterns. An internal legal search and a customer-facing product assistant may need different Top-K settings and reranking approaches even when they share data.
Model updates are among the most common causes of silent regressions: a new version ships and behavior shifts in ways nobody tested. Route the updated model through the Prompt Lab before it touches production, comparing output against the same test queries and benchmarks you used for the previous version. Catching drift in the lab is always cheaper than catching it in a support ticket.
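The comparison against the previous version can be reduced to a simple gate. The sketch below assumes you have one score per test query for each model version; the function name, the sample queries, the scores and the `margin` are all hypothetical.

```python
from typing import Dict, List


def regressions(previous: Dict[str, float], candidate: Dict[str, float],
                margin: float = 0.05) -> List[str]:
    """List test queries where the candidate scores worse than before by more than margin.

    A query missing from the candidate run is treated as a score of 0.0,
    so it is always reported.
    """
    return sorted(q for q, old in previous.items()
                  if candidate.get(q, 0.0) < old - margin)


previous = {"refund policy": 0.90, "termination notice": 0.85, "sla uptime": 0.80}
candidate = {"refund policy": 0.91, "termination notice": 0.60, "sla uptime": 0.78}

failed = regressions(previous, candidate)
print(failed)
```

The small dip on "sla uptime" stays inside the margin, so only "termination notice" blocks the rollout.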
Adam Bertram is a 25+ year IT veteran and an experienced online business professional. He’s a successful blogger, consultant, 6x Microsoft MVP, trainer, published author and freelance writer for dozens of publications. For how-to tech tutorials, catch up with Adam at adamtheautomator.com, connect on LinkedIn or follow him on X at @adbertram.