Previously published on Nuclia.com. Nuclia is now Progress Agentic RAG.
How to get the most out of your content by automating custom data extraction with Nuclia AI Agents
Purpose
Extracting structured data from unstructured text is a common need in many applications. It allows you to get insights from your resources that are not directly accessible, and to use them in your applications.
For example, let's imagine you are managing professional training courses in your knowledge box. Each course is described in a text block, and you would like to extract structured data from these descriptions, like:
- Course title
- Domain
- Prerequisites
- Duration
If the structured data you need varies with the question the user is asking, you can use the JSON Output option on the /ask endpoint. But if you need to extract the same structured data from all your resources, you can use the JSON generation AI Agent.
That way, you can extract the structured data once and for all and use it in your applications, without having to recompute it each time.
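For the per-question case, the request body sent to the /ask endpoint could be assembled as in the minimal sketch below. The field name `answer_json_schema` and the helper `build_ask_body` are assumptions made for illustration, not something stated in this article; check the API reference for the exact parameter name. The sketch only builds the body and does not send it.

```python
import json


def build_ask_body(question: str, schema: dict) -> str:
    """Serialize a request body asking for an answer shaped by `schema`."""
    body = {
        "query": question,
        # Assumed field name for the JSON Output option; verify it in the API reference.
        "answer_json_schema": schema,
    }
    return json.dumps(body)


# A per-question schema, using the same name/description/parameters layout
# as the course schema shown later in this article.
duration_schema = {
    "name": "course_duration",
    "description": "Duration of a training course",
    "parameters": {
        "type": "object",
        "properties": {
            "duration": {
                "type": "number",
                "description": "How many hours the course lasts",
            },
        },
        "required": ["duration"],
    },
}

body = build_ask_body("How long does the Python course last?", duration_schema)
print(body)
```

Because the schema travels with each question, it can change from one request to the next, which is the main difference from the agent-based approach.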
This data can be used to render search results in a more structured way, or to add relevant calls to action next to the generated answer.
It can also be used to improve the quality of your RAG pipeline by providing more context to the language model.
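As an illustration of the rendering use case, here is a small sketch that turns extracted course attributes into a text card. The `render_course` helper and the sample values are hypothetical, and the sketch assumes the extracted JSON has already been read from the resource and parsed into a Python dictionary.

```python
def render_course(data: dict) -> str:
    """Format extracted course attributes as a short text card."""
    # Title and domain are assumed to always be present.
    lines = [f"{data['title']} ({data['domain']})"]
    prerequisites = data.get("prerequisites") or []
    if prerequisites:
        lines.append("Prerequisites: " + ", ".join(prerequisites))
    if data.get("duration") is not None:
        # The :g format drops a trailing ".0" from whole numbers.
        lines.append(f"Duration: {data['duration']:g} hours")
    return "\n".join(lines)


# Hypothetical extracted data for one course.
extracted = {
    "title": "Introduction to Machine Learning",
    "domain": "Data Science",
    "prerequisites": ["Python basics", "Linear algebra"],
    "duration": 16,
}

card = render_course(extracted)
print(card)
```

Optional attributes are skipped when missing, so the card degrades gracefully when only some of the fields could be extracted.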
Usage
To generate extra text content at ingestion time, use the Generator agent from the AI Agents section of your knowledge box.
When creating a Generator agent, you will have to enable the Produce structured JSON data option and provide a JSON schema describing the data you want to extract from each resource. The result will be stored in a new field.
To extract the attributes mentioned in the training courses example, you would provide the following schema:
{
  "name": "courses",
  "description": "Information about the training course",
  "parameters": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "The title of the course"
      },
      "domain": {
        "type": "string",
        "description": "What domain the course belongs to, like 'Data Science' or 'Management', etc."
      },
      "prerequisites": {
        "type": "array",
        "items": { "type": "string" },
        "description": "The different prerequisites for the course"
      },
      "duration": {
        "type": "number",
        "description": "How many hours the course lasts"
      }
    },
    "required": ["title", "domain"]
  }
}
Remember that the descriptions are very important, because the LLM uses them to generate the structured data. If the descriptions are not accurate, the generated data might not be relevant.
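Since the output is produced by an LLM, it can be worth sanity-checking it before using it in an application. The sketch below is a deliberately minimal checker for the `parameters` part of the schema above (descriptions omitted for brevity). It only covers required fields and the basic types used in this example, it is not a full JSON Schema validator, and the `check_record` helper and sample records are hypothetical. A dedicated library such as `jsonschema` would be more thorough.

```python
# Maps JSON schema type names to the Python types produced by json.loads.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "array": list,
    "object": dict,
}

parameters = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "domain": {"type": "string"},
        "prerequisites": {"type": "array", "items": {"type": "string"}},
        "duration": {"type": "number"},
    },
    "required": ["title", "domain"],
}


def check_record(record: dict, parameters: dict) -> list:
    """Return a list of problems; the list is empty when none were found."""
    problems = []
    for name in parameters.get("required", []):
        if name not in record:
            problems.append(f"missing required field: {name}")
    for name, value in record.items():
        spec = parameters["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
            continue
        expected = JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python; this sketch has no boolean
        # type, so any bool value is treated as a mismatch.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name}: expected {spec['type']}")
        elif spec["type"] == "array":
            item_type = JSON_TYPES[spec["items"]["type"]]
            if not all(isinstance(item, item_type) for item in value):
                problems.append(f"{name}: items should be {spec['items']['type']}")
    return problems


good = {"title": "Intro to ML", "domain": "Data Science", "duration": 16}
bad = {"title": "Intro to ML", "prerequisites": "Python", "duration": "two days"}

print(check_record(good, parameters))
print(check_record(bad, parameters))
```

A record that fails such a check is often a hint that one of the descriptions in the schema is ambiguous and should be reworded.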