Previously published on Nuclia.com. Nuclia is now Progress Agentic RAG.
Even though REMi-v0 was the first of its kind, other models of similar size with Retrieval-Augmented Generation (RAG) evaluation capabilities have since been released. One of them is IBM Granite-Guardian 3.1, which can evaluate multiple risks, including answer relevance, context relevance and groundedness in a RAG context.
In our internal testing, the RAG evaluation provided by the Granite-Guardian models does not perform as well as REMi, with scores that depart from human judgment more often than those of our latest model. Let’s illustrate the differences between REMi-v0, Llama-REMi-v1 and Granite-Guardian at its largest size through two example RAG interactions.
The code used to run the evaluation for the Granite-Guardian 3.1 8B model is taken from the official IBM examples here. For easier comparison, we have inverted the score: we use values close to zero for not relevant or not grounded, whereas the IBM convention uses values close to one for not relevant or not grounded.
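As a minimal sketch of that convention change (the helper name below is hypothetical and stands in for the probability returned by the official IBM example code):

```python
def granite_to_remi_score(risk_probability: float) -> float:
    """Invert a Granite-Guardian risk probability into the REMi convention.

    Granite-Guardian: values close to 1.0 mean not relevant / not grounded.
    REMi:             values close to 0.0 mean not relevant / not grounded.
    """
    return 1.0 - risk_probability


# Hypothetical usage, assuming granite_risk_probability() wraps the IBM example code:
# risk_p = granite_risk_probability(question, answer, contexts, risk_name="groundedness")
# remi_style_score = granite_to_remi_score(risk_p)
```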
These interactions were created specifically for this article and are not contained in any of our training datasets. We present human evaluation scores as ranges to accommodate the inherent variability in human assessments and allow flexibility in interpretation.
Example RAG Interaction 1
|  |  |
| --- | --- |
| Question | “What equipment is essential for indoor bouldering?” |
| Answer | “The essential equipment for bouldering includes climbing shoes, a chalk bag, and, if bouldering outdoors, a bouldering pad. Additional useful items include a piece of carpet or matting to keep shoes clean, a tarp to protect the pad in wet conditions, and a guidebook for navigation.” |
| Contexts | |
In this first evaluation, Llama-REMi-v1 is the model whose scores align most closely with human judgment, ahead of both REMi-v0 and Granite-Guardian 3.1 8B.
Example RAG Interaction 2
|  |  |
| --- | --- |
| Question | “What are the specs of the OP-1 display?” |
| Answer | “The specs of the OP-1 display are: […]” |
| Contexts | |
In this second evaluation, Llama-REMi-v1 once again proves to be the model most closely aligned with human judgment, particularly in multilingual contexts. It demonstrates a clear advantage across the board, outperforming both REMi-v0 and Granite-Guardian 3.1 8B. Granite-Guardian exhibits notable weaknesses, especially in evaluating multilingual contexts and in failing to assign a high relevance score to a perfect answer presented in list form.
Comparison
When comparing the RAG evaluation of Llama-REMi-v1 with that of Granite-Guardian 3.1, key differences emerge in the training data strategies and the interpretability of results. These differences showcase the unique strengths of Llama-REMi-v1:
Production-Aligned Data vs. Standard Data:
- Granite-Guardian’s RAG training dataset relies exclusively on synthetic data generated from curated datasets such as HotPotQA, SQuADv2, MNLI, and SNLI. While these datasets are ideal for controlled experimentation, they generally do not accurately reflect the messy, diverse and unstructured nature of real-world data.
- Llama-REMi-v1’s training dataset, although it includes synthetic data, contains context pieces that have been extracted and chunked directly from real documents, including markdown-structured text (a hypothetical sketch of such a chunk follows this list). This ensures the model has seen data that mirrors the complexity of production environments.
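As a purely hypothetical illustration (the field names and content below are ours, not the actual training schema), a context piece chunked from a markdown document might look like this:

```python
# Hypothetical context piece chunked from a markdown-structured document.
# Field names and values are illustrative only, not the real training schema.
context_piece = {
    "source": "product_manual.md",  # made-up document name
    "chunk_index": 7,
    "text": (
        "## Maintenance\n"
        "- Clean the climbing shoes with a damp cloth.\n"
        "- Store the bouldering pad away from direct sunlight.\n"
    ),
}
```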
Multilingual Capability:
- With its English-only training data, IBM Granite-Guardian 3.1 has a limited ability to serve multilingual or international use cases effectively.
- Llama-REMi-v1’s multilingual training data equips it for a wider range of applications, catering to global audiences and supporting diverse linguistic environments.
Interpretability of Metrics:
- IBM Granite-Guardian 3.1 consolidates its evaluation into a binary result, with the option to obtain a score for each metric from the decoding probabilities of the risk and safe tokens (a sketch of this two-token scoring follows this list), which may lack granularity.
- Llama-REMi-v1 takes a client-centric approach by not only giving discrete scores for each of the metrics, but also providing a detailed reasoning for the answer relevance score. This makes results more interpretable and actionable, empowering clients to better understand the areas for improvement in their RAG pipelines.
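To make the two-token scoring mentioned above concrete, here is a minimal sketch under our assumptions; the token names and the exact post-processing are illustrative, not the verbatim Granite-Guardian implementation:

```python
import math

def risk_probability(logit_risk: float, logit_safe: float) -> float:
    """Illustrative two-way softmax over the decoder logits of a 'risk' token
    and a 'safe' token, collapsing the judgment into a single probability."""
    # Equivalent to softmax([logit_risk, logit_safe])[0], written as a
    # numerically stable sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(logit_safe - logit_risk))
```

A single probability per risk is compact, but it attaches no per-metric reasoning to the score, which is the granularity gap noted above.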
By embracing real-world-aligned data, multilingual support and a more interpretable evaluation, Llama-REMi-v1 by Nuclia stands out as a practical and client-focused solution. These strengths make it more adaptable and robust compared to the controlled but limited approach of IBM Granite-Guardian 3.1.