From Unstructured Data to Connected Knowledge

by Carles Onielfa Posted on January 24, 2025

 

Previously published on Nuclia.com. Nuclia is now Progress Agentic RAG.

Often, questions over a knowledge base require more information than is explicitly stated in the documents. For instance, consider a complex question like:

Which lawyers have worked on a case with a senior partner and also specialize in contract law?

Answering this question would require extracting information from multiple sources and reasoning over it. To tackle such challenges, the first step is to structure the data in a way that allows efficient traversal and inference. One effective method to achieve this is by creating a knowledge graph.

A knowledge graph is a set of entities (nodes) connected by relationships (edges) defined over factual information present in a knowledge base. Having this graph enables visualizing and leveraging the connections between the entities present in our information.
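As a rough illustration of why this structure helps, the sketch below stores a graph as (source, relationship, target) triples and answers the lawyer question above with a short traversal. The names, relationship labels and tiny dataset are all hypothetical, invented for this example; it does not reflect how the product stores or queries its graph.

```python
from collections import defaultdict

# Hypothetical triples: (source entity, relationship label, target entity).
TRIPLES = [
    ("Alice", "worked on", "Case 1"),
    ("Bob", "worked on", "Case 1"),
    ("Carol", "worked on", "Case 2"),
    ("Dan", "worked on", "Case 2"),
    ("Bob", "has role", "Senior Partner"),
    ("Alice", "specializes in", "Contract Law"),
    ("Carol", "specializes in", "Contract Law"),
    ("Dan", "specializes in", "Tax Law"),
]


def build_index(triples):
    """Index triples by (source, relation) and by (relation, target)."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for source, relation, target in triples:
        outgoing[(source, relation)].add(target)
        incoming[(relation, target)].add(source)
    return outgoing, incoming


def contract_lawyers_with_senior_partner(triples):
    """Lawyers who specialize in contract law and share a case with a senior partner."""
    outgoing, incoming = build_index(triples)
    senior_partners = incoming[("has role", "Senior Partner")]
    contract_lawyers = incoming[("specializes in", "Contract Law")]
    result = set()
    for lawyer in contract_lawyers:
        for case in outgoing[(lawyer, "worked on")]:
            # Everyone else who worked on the same case.
            colleagues = incoming[("worked on", case)] - {lawyer}
            if colleagues & senior_partners:
                result.add(lawyer)
    return sorted(result)


print(contract_lawyers_with_senior_partner(TRIPLES))  # ['Alice']
```

No single triple answers the question; the answer only emerges by following edges from lawyers to cases and back to their colleagues, which is the kind of traversal a graph makes cheap.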

[Figure: knowledge graph connecting the entities City, Eiffel Tower, Paris, Museum, Louvre, Mona Lisa and Da Vinci]
Example of a Knowledge Graph.¹

Knowledge Graphs and Data Augmentation Agents

At Progress, we recognize that there is no one-size-fits-all solution to knowledge graphs and that much of their power resides in domain-specific entity and relation detection. That is why we have been working on a tool that lets our clients automate building a knowledge graph tailored to their use case and domain.

We have implemented this feature through our “Data Augmentation Agents”, a framework in which we provide several tasks that run on top of our standard processing and that can also be retroactively applied to existing resources. Clients can enable agents that add task-specific processing steps to get the most out of their information.

To define the graph generation task, we require the following information:

  1. List of desired types of entities to extract, with optional descriptions
    • Descriptions are recommended for entity types that might be ambiguous or might benefit from clarification.
    • Only relationships between the entity types listed here will be extracted.
  2. Specific examples of entity types
    • Triplets of entity text, entity type and a text snippet where the entity appears.
    • At least a couple of examples are recommended; they do not need to cover every defined entity type.
  3. Examples of relationships between entities in a text
    • Elements with source entity text, target entity text, relationship label and a text snippet where the relationship appears.
    • At least three examples are recommended. They help guide the extraction of relationships, so pay special attention to the styling of the relationship labels you provide.
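Before submitting these three inputs, it can help to sanity-check them locally. The sketch below is one possible way to do that with plain dataclasses; the class names, field names and the `check_inputs` helper are hypothetical stand-ins written for this illustration and are not part of the nuclia SDK. The checks simply mirror the recommendations in the list above.

```python
from dataclasses import dataclass


@dataclass
class EntityType:  # hypothetical stand-in for an entity type definition
    label: str
    description: str = ""


@dataclass
class EntitySample:  # hypothetical stand-in for an entity example
    text: str
    label: str
    snippet: str


@dataclass
class RelationSample:  # hypothetical stand-in for a relationship example
    source: str
    target: str
    label: str
    snippet: str


def check_inputs(entity_types, entity_samples, relation_samples):
    """Return a list of warnings; an empty list means no issue was found."""
    warnings = []
    known_labels = {t.label for t in entity_types}
    for sample in entity_samples:
        if sample.label not in known_labels:
            warnings.append(f"entity example {sample.text!r} uses undefined type {sample.label!r}")
        if sample.text not in sample.snippet:
            warnings.append(f"entity example {sample.text!r} does not appear in its snippet")
    for sample in relation_samples:
        for entity_text in (sample.source, sample.target):
            if entity_text not in sample.snippet:
                warnings.append(f"relation {sample.label!r}: {entity_text!r} does not appear in its snippet")
    if len(relation_samples) < 3:
        warnings.append("fewer than three relationship examples provided")
    return warnings


snippet = "John Doe has filed a lawsuit against ABC Corporation for breach of contract."
warnings = check_inputs(
    entity_types=[EntityType("PLAINTIFF"), EntityType("DEFENDANT")],
    entity_samples=[
        EntitySample("John Doe", "PLAINTIFF", snippet),
        EntitySample("ABC Corporation", "COMPANY", snippet),  # type not defined above
    ],
    relation_samples=[
        RelationSample("John Doe", "ABC Corporation", "Plaintiff sues Defendant", snippet),
    ],
)
for warning in warnings:
    print(warning)
```

In this run the helper reports two warnings: the undefined `COMPANY` type and the short list of relationship examples.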

A Practical Example

To showcase how one would configure the Data Augmentation Agent for graph extraction, we will assume that our knowledge base contains documents from the legal domain.

Using the Nuclia.py Library

To install the nuclia.py library, follow the instructions here: https://docs.rag.progress.cloud/docs/develop/python-sdk/README

You must then be logged in, which you can do with the CLI command below (more information here: https://docs.rag.progress.cloud/docs/develop/python-sdk/auth):

nuclia auth login

Then, we start by importing the required libraries and accessing our default knowledge base.

from nuclia import sdk
from nuclia_models.worker.tasks import TaskName, ApplyOptions, DataAugmentation
from nuclia_models.worker.proto import (
    GraphOperation,
    EntityDefinition,
    EntityExample,
    RelationExample,
    LLMConfig,
    Filter,
    ApplyTo,
    Operation,
)

kb = sdk.NucliaKB()

Now, we define the required data for this task.

  1. List of entities: entity_defs
    entity_defs = [
        EntityDefinition(
            label="PLAINTIFF",
            description="The person or entity that initiates a lawsuit",
        ),
        EntityDefinition(
            label="DEFENDANT",
            description="The person or entity against whom a lawsuit is filed",
        ),
        EntityDefinition(
            label="CONTRACT",
            description="A legally binding agreement between two or more parties",
        ),
        EntityDefinition(
            label="CLAUSE", description="A specific provision or section of a contract"
        ),
        EntityDefinition(label="STATUTE"),
        EntityDefinition(label="DATE"),
        EntityDefinition(
            label="DEFENSE ATTORNEY",
            description="The lawyer who represents the defendant in a lawsuit",
        ),
        EntityDefinition(
            label="JUDGE", description="The presiding officer in a court of law"
        ),
        EntityDefinition(
            label="PLAINTIFF ATTORNEY",
            description="The lawyer who represents the plaintiff in a lawsuit",
        ),
        EntityDefinition(label="COURT"),
    ]

    Here we included the desired entity types with descriptions for some of them. We also included the entity type “DATE” even though it is already extracted by the default processing because we want to capture relationships between our new entity types and DATE.

  2. Entity examples: entity_examples
    entity_examples = [
        EntityExample(
            name="John Doe",
            label="PLAINTIFF",
            example="John Doe has filed a lawsuit against ABC Corporation for breach of contract.",
        ),
        EntityExample(
            name="ABC Corporation",
            label="DEFENDANT",
            example="John Doe has filed a lawsuit against ABC Corporation for breach of contract.",
        ),
        EntityExample(
            name="Service Agreement",
            label="CONTRACT",
            example="The Service Agreement contains a termination clause which allows either party to end the contract with 30 days notice.",
        ),
    ]

    We provide three entity examples. Two of them reuse the same text, each with a different extracted entity.

  3. Relationship examples: relation_examples
    relation_examples = [
        RelationExample(
            source="John Doe",
            target="ABC Corporation",
            label="Plaintiff sues Defendant",
            example="John Doe has filed a lawsuit against ABC Corporation for breach of contract.",
        ),
        RelationExample(
            source="Service Agreement",
            target="termination clause",
            label="Contract contains Clause",
            example="The Service Agreement contains a termination clause which allows either party to end the contract with 30 days notice.",
        ),
        RelationExample(
            source="Consumer Protection Act 2022",
            target="Consumer Rights Act 2015",
            label="Statute amends Previous Statute",
            example="The Consumer Protection Act 2022 amends the Consumer Rights Act 2015, especially regarding digital goods.",
        ),
    ]

    Lastly, we use our relationship examples to define how we want to structure the relationship labels. Here we are providing examples with a “noun verb noun” pattern, so the extraction will be guided towards that style of relationship labeling.
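Since the label style of the examples steers the extraction, it may be worth checking that all example labels follow the same pattern before starting the task. The sketch below is a simple heuristic of our own for the “noun verb noun” style: one or more capitalized words, a lowercase verb, then one or more capitalized words. The regular expression and the `inconsistent_labels` helper are assumptions made for this illustration and are not part of the SDK; the last label in the list is a deliberately off-style example.

```python
import re

# Heuristic: capitalized word(s), one lowercase verb, capitalized word(s).
LABEL_PATTERN = re.compile(
    r"^[A-Z][a-z]+(?: [A-Z][a-z]+)* [a-z]+ [A-Z][a-z]+(?: [A-Z][a-z]+)*$"
)


def inconsistent_labels(labels):
    """Return the labels that do not follow the "Noun verb Noun" heuristic."""
    return [label for label in labels if not LABEL_PATTERN.match(label)]


labels = [
    "Plaintiff sues Defendant",
    "Contract contains Clause",
    "Statute amends Previous Statute",
    "SUED_BY",  # hypothetical off-style label, added to show a mismatch
]
print(inconsistent_labels(labels))  # ['SUED_BY']
```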

    With these definitions done, we can start the task. We enable it for both existing resources and any resources uploaded from this point onwards with ApplyOptions.ALL, and we select ChatGPT 4o Mini from OpenAI as the LLM to perform the task:

    graph = GraphOperation(
        ident="legal-graph-operation",
        entity_defs=entity_defs,
        entity_examples=entity_examples,
        relation_examples=relation_examples,
    )
    response = kb.task.start(
        task_name=TaskName.LLM_GRAPH,
        apply=ApplyOptions.ALL,
        parameters=DataAugmentation(
            name="Generate Legal Graph",
            on=ApplyTo.FIELD,
            operations=[Operation(graph=graph)],
            filter=Filter(contains=[], resource_type=[]),
            llm=LLMConfig(
                model="chatgpt4o-mini",
                provider="openai",
            ),
        ),
    )
    print(response)

    With this, our data augmentation agent is now active. You can check the task status and stop, resume or delete the task with the following commands:

    task_id = response.id
    
    # Get all tasks
    task_list = kb.task.list()
    
    # Get tasks that are running on existing data
    running_tasks = kb.task.list().running
    
    # Get tasks that are configured to run on uploaded data
    config_tasks = kb.task.list().configs
    
    # Get tasks that are done running on existing data
    finished_tasks = kb.task.list().done
    
    # Check task status
    response = kb.task.get(task_id=task_id)
    
    # Stop a task
    response = kb.task.stop(task_id=task_id)
    
    # Resume a task
    response = kb.task.restart(task_id=task_id)
    
    # Delete a task
    response = kb.task.delete(task_id=task_id)
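If you want a script to wait for the run on existing resources to finish, one option is a small polling loop around the status calls above. The sketch below is a generic helper of our own, not part of the SDK. Because this article does not show the shape of the objects returned by `kb.task.get` or `kb.task.list().done`, the completion check is passed in as a callable that you would write yourself; the `fake_is_done` function here is only a stand-in so the example can run without a knowledge base.

```python
import time


def wait_until(is_done, timeout=600.0, interval=10.0):
    """Call is_done() every `interval` seconds until it returns True.

    Returns True if is_done() succeeded, or False once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in check that succeeds on the third call. In practice, replace it with
# your own logic built on kb.task.get(task_id=...) or kb.task.list().done.
calls = {"count": 0}


def fake_is_done():
    calls["count"] += 1
    return calls["count"] >= 3


finished = wait_until(fake_is_done, timeout=5.0, interval=0.01)
print(finished)  # True
```

Keeping the check as an injected callable means the loop does not need to change if the status payload does.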

For more information on how to use the Nuclia SDK to manage data augmentation agent tasks, please check the Nuclia Python CLI/SDK documentation: https://github.com/nuclia/nuclia.py/blob/main/docs/11-agents.md

Through the Nuclia Dashboard

To create the Graph Extraction Agent, we go to the “Agents” tab on the sidebar and then click on “New Task” in the “Graph Extraction” section.

[Screenshot: creating a new task in the Graph Extraction section]

Afterward we fill out the required fields following our desired extraction in the legal domain. We use the same entity definition and examples as in the previous example (not all are shown in the screenshots).

[Screenshot: the NER types interface]

Finally, we select our model, in this case “OpenAI ChatGPT-4o-mini”, and click on “Activate extraction”. Once our task is created, it will appear in the Agents tab.

[Screenshot: the Agents tab in Progress Agentic RAG, showing agents such as Labeler, Generator and Graph Extraction]

Since we chose to apply the task to both existing and new resources, two tasks are listed. The task applied to existing resources will change its status to “Completed” once the existing documents are processed. The task on new resources will show up as “Watching for updates” until we cancel it. To stop or delete a task, we can click the three dots under “Actions” for that specific task.

1 Bourli, S., & Pitoura, E. (2020, December). Bias in knowledge graph embeddings. In 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 6-10). IEEE.

