3 Ways to Supercharge Your Document Processing with New Extract Strategies

August 20, 2025 Agentic RAG, Data & AI

Previously published on Nuclia.com. Nuclia is now Progress Agentic RAG.

In today’s information-rich world, tapping into the most valuable knowledge within an organization can still be a challenge. It’s locked in the images of a product catalog, scattered across a multi-page table in a financial report, or split between diagrams and charts in a dense research paper. Standard extraction tools or basic Retrieval-Augmented Generation (RAG) pipelines can only get you so far, often missing the nuance and context that’s critical for your business.

What if you could give your AI a specific set of instructions for each document type? What if you could tell it not just what to extract, but how to interpret and structure it? What if you could fine-tune the way your RAG extracts and indexes your data?

That’s exactly what you can do with Nuclia RAG Extract Strategies. This feature gives you granular control to build reusable, AI-powered rule sets that transform your unstructured documents into clean, searchable, and analysis-ready data.

Let’s explore three powerful ways you can put this feature to work today.

1. Turn a Product Catalog into a Sales Assistant

The Challenge: You have a 100-page PDF catalog of your products—in this case, high-end electric guitars. While visually stunning, the images are opaque to your search system. Your sales team can’t search for “a vintage-style guitar with a rosewood fretboard” and find the right page because that information is purely visual.

Generic image description does not cut it here. We need something more customized: the eye of an expert.

The Strategy: We’ll use Visual Extraction to have a visual model analyze and describe each page of our catalog the way a guitar expert would.

The Implementation: In the Nuclia dashboard, under AI Models > Extraction, you can create a new strategy. The configuration is simple:

{
	"name": "Expert Guitar Vendor Descriptions",
	"vllm_config": {
		"llm": {
			"generative_model": "chatgpt-azure-4o"
		},
		"rules": [
			"For every image of a guitar, describe it as an expert guitar vendor would for a customer. Mention the body style, finish, color, wood type (if visible), pickup configuration, and overall aesthetic (e.g., vintage, modern, relic'd)."
		]
	}
}
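
If you would rather keep strategies in version control than type them into the dashboard, the same configuration can be assembled in code. The following is a minimal Python sketch that assumes only the field names shown above. `build_visual_strategy` is a hypothetical helper, not part of any Nuclia SDK, and dropping blank rules is a choice made for this sketch rather than documented service behavior.

```python
import json


def build_visual_strategy(name: str, model: str, rules: list[str]) -> dict:
    """Assemble a Visual Extraction strategy using the field names shown above.

    Hypothetical helper for illustration; not part of any Nuclia SDK.
    """
    if not name.strip():
        raise ValueError("a strategy needs a name")
    return {
        "name": name,
        "vllm_config": {
            "llm": {"generative_model": model},
            # Blank rules are dropped (a choice made in this sketch).
            "rules": [rule.strip() for rule in rules if rule.strip()],
        },
    }


strategy = build_visual_strategy(
    name="Expert Guitar Vendor Descriptions",
    model="chatgpt-azure-4o",
    rules=[
        "For every image of a guitar, describe it as an expert guitar "
        "vendor would for a customer.",
        "   ",
    ],
)
# Print JSON that can be pasted into the strategy editor.
print(json.dumps(strategy, indent=2))
```

Building the payload from a function makes it easy to reuse one rule set across several catalogs while changing only the name or the model.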

The Supercharged Result: When you process your catalog with this strategy, every guitar image is enriched with a detailed, expert description. The page with a photo of a Gibson Les Paul is no longer just a picture of a guitar; it is also searchable text like:

“The image shows a Gibson Les Paul-style electric guitar with a beautiful tobacco sunburst finish. The guitar features a single-cutaway solid body design with a flamed maple top that displays rich amber tones in the center gradually darkening to deep brown edges. It has the classic dual humbucker pickup configuration with cream-colored pickup covers, and four control knobs for volume and tone adjustments. The guitar is equipped with a Tune-o-matic bridge and stopbar tailpiece in gold hardware. The neck has a bound ebony or rosewood fretboard with distinctive trapezoid inlays and appears to have a set-neck construction. The headstock displays the “Gibson” logo and has gold tuning machines. This instrument embodies the classic, premium aesthetic that has made this style of guitar a staple in rock, blues, and jazz for decades.”

Your internal search is now a powerful sales tool.

2. Conquer Complex Financial Reports with AI Tables

The Challenge: You need to analyze the annual performance data from a dozen competitors. The data is locked in their 10-K filings, inside tables that span multiple pages, use parentheses for negative numbers, and have complex, multi-level headers. Manually copying and pasting this data into a spreadsheet is a recipe for disaster, and standard file processing does not fully cut it either.

The Strategy: We’ll use AI Tables, designed specifically to understand and reconstruct complex tabular data. We’ll tell it to merge pages and clean up the data on the fly.

The Implementation: This strategy focuses on detecting tables and applying rules to standardize them. The merge_pages parameter is key here.

{
	"name": "Annual Report Table Consolidation",
	"ai_tables": {
		"llm": {
			"generative_model": "gcp-claude-3-7-sonnet"
		},
		"merge_pages": true,
		"max_pages_to_merge": 2,
		"rules": [
			"Consolidate all rows into a single, coherent table.",
			"If a monetary value is enclosed in parentheses, convert it to a negative number (e.g., '(1,234)' becomes -1234).",
			"Ensure all figures are represented in millions."
		]
	}
}
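
To picture what merging pages means for a table, consider two Markdown fragments of the same table, where the second page repeats the header. The Python sketch below is only a conceptual illustration of the intended outcome. It is not how AI Tables works internally, `merge_markdown_tables` is a hypothetical name, and the sample rows are made up.

```python
def merge_markdown_tables(first: str, continuation: str) -> str:
    """Append the rows of a continuation fragment to the first fragment."""
    first_lines = [line.strip() for line in first.splitlines() if line.strip()]
    cont_lines = [line.strip() for line in continuation.splitlines() if line.strip()]
    header = first_lines[:2]  # header row plus the |---| separator row
    if cont_lines[:2] == header:
        cont_lines = cont_lines[2:]  # drop the header repeated on the next page
    return "\n".join(first_lines + cont_lines)


# Made-up sample fragments, one per page.
page_1 = """
| Item | FY2024 |
| --- | --- |
| Revenue | 1,200 |
"""
page_2 = """
| Item | FY2024 |
| --- | --- |
| Net loss | (1,234) |
"""
merged = merge_markdown_tables(page_1, page_2)
print(merged)
```

Real filings are messier than this (headers change wording, rows wrap, footnotes intrude), which is why the strategy delegates the job to a generative model guided by rules.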

The Supercharged Result: The AI detects the table, follows it across the page break, and automatically merges it. It correctly interprets the accounting notation for negative numbers and standardizes the figures. The result is a single, clean, well-structured Markdown table that we can query with our RAG.
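
The parentheses rule is easy to state precisely, which also makes it easy to spot-check in the extracted output. Here is a minimal Python sketch of the conversion the rule describes. In the strategy itself the generative model applies the rule, not code like this, and `parse_accounting_value` is a hypothetical name.

```python
from decimal import Decimal


def parse_accounting_value(cell: str) -> Decimal:
    """Convert an accounting-style cell such as '(1,234)' to a signed number."""
    text = cell.strip()
    # Accounting notation wraps negative amounts in parentheses.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Remove thousands separators and a currency sign before parsing.
    number = Decimal(text.replace(",", "").replace("$", "").strip())
    return -number if negative else number


for raw in ["(1,234)", "5,678", "(0.5)"]:
    print(raw, "->", parse_accounting_value(raw))
```

Using `Decimal` rather than `float` keeps monetary values exact, which matters when you later compare them against the figures in the source filing.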

3. Decode Scientific Research with a Combined Strategy

The Challenge: Your R&D team needs to stay on top of the latest breakthroughs, like the research in the paper “Reinforcement Pre-Training”. This paper contains relevant information in three different forms: the main text, charts, and tables packed with performance scores.

The Strategy: You don’t have to choose. A single strategy can combine Visual Extraction for the diagrams and AI Tables for the table results, creating a holistic, searchable, extended version of the paper. AI Tables identifies and extracts every table in Markdown format, which improves searchability and makes the data better suited to a RAG context.

The Implementation: We define both the vllm_config (Visual Extraction) and ai_tables configurations within the same strategy.

{
	"name": "AI Research Paper Analysis",
	"vllm_config": {
		"llm": {
			"generative_model": "gcp-claude-3-7-sonnet"
		},
		"rules": [
			"Analyse all charts and images and add a thorough analysis of every one of them"
		]
	},
	"ai_tables": {
		"llm": {
			"generative_model": "gcp-claude-3-7-sonnet"
		},
		"rules": []
	}
}
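
Because a combined strategy nests several blocks, it is easy to misplace a key. The Python sketch below checks a strategy document before you paste it into the dashboard. It assumes that `llm` and `rules` sit side by side inside each block, as in the first two examples, and that every block names a `generative_model`. `check_strategy` is a hypothetical helper, and the service may validate strategies differently.

```python
import json

BLOCKS = ("vllm_config", "ai_tables")


def check_strategy(raw_json: str) -> list[str]:
    """Return a list of problems found in a strategy document (empty if none)."""
    strategy = json.loads(raw_json)
    problems = []
    if not strategy.get("name"):
        problems.append("missing 'name'")
    present = [key for key in BLOCKS if key in strategy]
    if not present:
        problems.append("no extraction block (vllm_config or ai_tables)")
    for key in present:
        block = strategy[key]
        llm = block.get("llm", {})
        if "generative_model" not in llm:
            problems.append(f"{key}.llm.generative_model is missing")
        if "rules" in llm:
            # Assumed layout: rules belong next to llm, not inside it.
            problems.append(f"{key}: 'rules' is nested inside 'llm'")
        if not isinstance(block.get("rules", []), list):
            problems.append(f"{key}.rules must be a list")
    return problems


sample = """
{
  "name": "AI Research Paper Analysis",
  "vllm_config": {
    "llm": {"generative_model": "gcp-claude-3-7-sonnet"},
    "rules": ["Analyse all charts and images."]
  },
  "ai_tables": {
    "llm": {"generative_model": "gcp-claude-3-7-sonnet"},
    "rules": []
  }
}
"""
print(check_strategy(sample) or "looks consistent")
```

A check like this fits well in a pre-commit hook if you store your strategies as JSON files alongside the rest of your configuration.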

The Supercharged Result: Your processed document is now a powerhouse of information. A researcher can:

  • Search for a concept: Ask “What is reinforcement pre-training?” and get an answer with the corresponding paragraphs from the main text, as you normally would in RAG.
  • Ask about a graph: Search for “Figure 4” and get a detailed explanation and the source, thanks to the enriched description produced by the visual model.
  • Ask about data in a table: Query the figures and results contained in the paper’s tables.

Unlock Your Data, Your Way

These are just three examples of how Nuclia RAG Extract Strategies put you in the driver’s seat. By creating custom, reusable rule sets, you can finally unlock the full value of your documents, no matter how complex they are.

Ready to build your first strategy? Log in to the Nuclia dashboard or dive into our documentation to learn more.

Carmen Iniesta