Previously published on Nuclia.com. Nuclia is now Progress Agentic RAG.
A knowledge graph is a set of interconnected data. Exploring a knowledge graph and finding new connections across your data can be an exciting experience. The Nuclia platform extracts entities and relationships from your unstructured data and provides a graph API to query them. You can also upload your own graphs and query them, or use them to extend your Nuclia searches in yet another dimension.
Let’s see how we can leverage the power of graphs on top of your unstructured data!
A graph is a mathematical structure used to model pairwise relationships between objects. In other words, it is a set of concepts connected in pairs through relationships. For example, “Nuclia is a company” can be thought of as a concept Nuclia and a concept company connected by a “to be” relationship. These concepts are called “entities”.
A knowledge graph is a graph representing real-world entities and relationships, as we’ve just seen in the example. Sometimes it is really useful to think about data as a graph. Imagine, for example, a rental agreement: the landlord and the tenant have a rental agreement for an apartment. All of them have properties like name, phone number or address that can also be modeled as connections. Now imagine you had a thousand rental agreements and could ask for the names of all tenants who have an agreement with a specific person. That’s the power of graphs.
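To make the rental example concrete, here is a minimal sketch in plain Python, independent of the Nuclia API. The names, the relation labels (is_landlord_of, is_tenant_of) and the tenants_of helper are all made up for illustration.

```python
# Each fact is a (source, relation, destination) triplet.
# All names and relation labels are illustrative assumptions.
triplets = [
    ("Alice", "is_landlord_of", "Agreement 1"),
    ("Bob", "is_tenant_of", "Agreement 1"),
    ("Alice", "is_landlord_of", "Agreement 2"),
    ("Carol", "is_tenant_of", "Agreement 2"),
    ("Dave", "is_landlord_of", "Agreement 3"),
    ("Erin", "is_tenant_of", "Agreement 3"),
]


def tenants_of(landlord, triplets):
    """Return the sorted names of tenants sharing an agreement with `landlord`."""
    # First collect the agreements the landlord takes part in...
    agreements = {
        dst for src, rel, dst in triplets
        if src == landlord and rel == "is_landlord_of"
    }
    # ...then collect every tenant attached to one of those agreements.
    return sorted(
        src for src, rel, dst in triplets
        if rel == "is_tenant_of" and dst in agreements
    )


print(tenants_of("Alice", triplets))  # ['Bob', 'Carol']
```

Once the agreements are stored as triplets, the question becomes a short traversal instead of a scan over a thousand documents.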
Nuclia can extract entities and relationships from the documents you upload. For each document, the platform automatically extracts a knowledge graph, and all those graphs are merged into a bigger graph corresponding to your whole Knowledge Box. You can then search inside your Knowledge Box to find connections across or within documents.
Usually, graph explorers are interested in one of three things: nodes, relations or paths.
Node and relation exploration are usually the first entry point to the graph. In both cases, we want to know which nodes or relations in the graph exactly match, or are similar to, a given user query, for example, Erin, Nuclia or philosophy.
Path exploration dives deeper into the knowledge graph and can answer questions about how specific nodes are connected.
Any node in the graph is composed of three parts: a value, a type and an optional group.
A value is just the textual representation of the node. For example, the name of a person, of a company or a concept. Erin, Nuclia and philosophy would all be valid node values. In the rental agreement example, this could be: landlord, your name or the address of the property.
Each node has a type, usually it’s an entity, although others like resource or user are also valid (see the API reference for the full list).
Optionally, nodes can also have a group. Groups are arbitrary categories one can use to cluster nodes. For example: person, company or concept; or in the rental agreement example: person, apartment or contract. Groups are an optional way to further categorize data; how useful they are depends on the use case.
The Nuclia platform provides the /graph/nodes endpoint to search nodes in the graph. When querying nodes, you can use any of the node parts to find matches. Node values can be searched with different strategies, while types are limited to a set of built-in types and groups are matched exactly.
Let’s see some examples.
A simple example would be checking whether an exact node exists. Does a person named Erin exist in the graph?
{
"query": {
"prop": "node",
"value": "Erin",
"type": "entity",
"group": "person"
}
}
And the response could be:
{
"nodes": [
{
"value": "Erin",
"type": "entity",
"group": "person"
}
]
}
What if we don’t know Erin’s type or group? We can omit them to match any node with the value Erin, which returns the same result:
{
"query": {
"prop": "node",
"value": "Erin"
}
}
We can also omit the value and search only for a given type or group:
{
"query": {
"prop": "node",
"group": "person"
}
}
and, for example, find more results:
{
"nodes": [
{
"value": "Anna",
"type": "entity",
"group": "person"
},
{
"value": "Erin",
"type": "entity",
"group": "person"
}
]
}
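The request bodies above differ only in which node parts are present. As a rough sketch, a small helper can build them from whatever is known. The node_query function and its parameter names are hypothetical and not part of any Nuclia SDK; the JSON keys follow the examples in this article.

```python
import json


def node_query(value=None, node_type=None, group=None, match=None):
    """Build a /graph/nodes request body, leaving out any part that is unknown.

    Hypothetical helper: only the JSON keys come from the article's examples.
    """
    query = {"prop": "node"}
    if value is not None:
        query["value"] = value
    if node_type is not None:
        query["type"] = node_type
    if group is not None:
        query["group"] = group
    if match is not None:
        query["match"] = match
    return {"query": query}


# A fully specified node, and a search by group only.
exact_body = node_query(value="Erin", node_type="entity", group="person")
group_body = node_query(group="person")

print(json.dumps(exact_body, indent=2))
print(json.dumps(group_body, indent=2))
```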
As mentioned, values can be searched using different strategies. Until now, we’ve used the implicit exact match:
{
"query": {
"prop": "node",
"value": "Erin",
"match": "exact"
}
}
But we can become typo tolerant with fuzzy search:
{
"query": {
"prop": "node",
"value": "Arin",
"match": "fuzzy"
}
}
This returns the same Erin node, plus any other nodes with similar values.
Fuzzy search is a useful tool, but can quickly lead to an excess of results, so we recommend using it carefully.
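To get an intuition for why fuzzy matching can return too much, here is a sketch based on edit distance. This is only a stand-in: the article does not say which algorithm or tolerance the Nuclia platform uses, so the fuzzy_match helper and its max_distance parameter are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, values, max_distance=1):
    """Keep the values within `max_distance` edits of the query (case-insensitive)."""
    return [
        value for value in values
        if edit_distance(query.lower(), value.lower()) <= max_distance
    ]


values = ["Erin", "Anna", "Eric", "Nuclia"]
print(fuzzy_match("Arin", values))                  # ['Erin']
print(fuzzy_match("Arin", values, max_distance=2))  # ['Erin', 'Eric']
```

Loosening the tolerance by a single edit already pulls in an unrelated name, which is the kind of excess the warning above refers to.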
Relations are composed of two parts: a label and a type.
A label is the textual representation of the relationship. For example: friendship, knowledge about…
Each relationship has a type that classifies different relations. Usually, that will be an ENTITY relation, but other types like SYNONYM or ABOUT are also available (see the API reference for the full list).
Relations can be queried using the /graph/relations endpoint. The relation API is more limited than the nodes API, as relations without nodes lose context. Let’s see some examples as we did before!
Is there any relation named live_in?
{
"query": {
"prop": "relation",
"label": "live_in"
}
}
The response could be:
{
"relations": [
{
"label": "live_in",
"type": "ENTITY"
}
]
}
Get 50 synonym relationships:
{
"query": {
"prop": "relation",
"type": "SYNONYM"
},
"top_k": 50
}
And get a response like:
{
"relations": [
{
"label": "live_in",
"type": "SYNONYM"
},
{
"label": "inhabit",
"type": "SYNONYM"
},
{
"label": "populate",
"type": "SYNONYM"
}
]
}
Once we know any combination of source and destination nodes and/or relation, we can actually explore paths between nodes.
A path is a triplet composed of a source node, a relation and a destination node. Path queries are built from any of those known parts, and the response is a set of triplets satisfying the query.
A path query where we know some information about every part would look like:
{
"query": {
"prop": "path",
"source": {
"group": "person"
},
"relation": {
"label": "born_in"
},
"destination": {
"value": "UK",
"group": "place"
}
}
}
And the results could be:
{
"paths": [
{
"source": {
"value": "Erin",
"type": "entity",
"group": "person"
},
"relation": {
"label": "born_in",
"type": "ENTITY"
},
"destination": {
"value": "UK",
"type": "entity",
"group": "place"
}
}
]
}
However, we may not know some part of it. We can skip nodes and relations:
{
"query": {
"prop": "path",
"destination": {
"value": "UK",
"group": "place"
}
}
}
And find more paths:
{
"paths": [
{
"source": {
"value": "Erin",
"type": "entity",
"group": "person"
},
"relation": {
"label": "born_in",
"type": "ENTITY"
},
"destination": {
"value": "UK",
"type": "entity",
"group": "place"
}
},
{
"source": {
"value": "Tom",
"type": "entity",
"group": "person"
},
"relation": {
"label": "live_in",
"type": "ENTITY"
},
"destination": {
"value": "UK",
"type": "entity",
"group": "place"
}
}
]
}
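The matching rule behind these two queries can be sketched in a few lines of Python: a triplet is returned when every part given in the query matches, and parts that are left out match anything. This is an in-memory illustration only, not the platform’s implementation; search_paths and part_matches are hypothetical names.

```python
uk = {"value": "UK", "type": "entity", "group": "place"}
paths = [
    {"source": {"value": "Erin", "type": "entity", "group": "person"},
     "relation": {"label": "born_in", "type": "ENTITY"},
     "destination": uk},
    {"source": {"value": "Tom", "type": "entity", "group": "person"},
     "relation": {"label": "live_in", "type": "ENTITY"},
     "destination": uk},
]


def part_matches(pattern, part):
    """A part matches when every key given in the pattern has the same value."""
    return all(part.get(key) == wanted for key, wanted in pattern.items())


def search_paths(query, paths):
    """Return the triplets whose source, relation and destination all match.

    Parts missing from the query use an empty pattern, which matches anything.
    """
    return [
        path for path in paths
        if all(part_matches(query.get(name, {}), path[name])
               for name in ("source", "relation", "destination"))
    ]


full_query = {
    "prop": "path",
    "source": {"group": "person"},
    "relation": {"label": "born_in"},
    "destination": {"value": "UK", "group": "place"},
}
partial_query = {"prop": "path", "destination": {"value": "UK", "group": "place"}}

print([p["source"]["value"] for p in search_paths(full_query, paths)])     # ['Erin']
print([p["source"]["value"] for p in search_paths(partial_query, paths)])  # ['Erin', 'Tom']
```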
For simplicity, the graph API provides shorthand props for common searches. Instead of a path with only a source or destination, we can specify a source_node or destination_node respectively. For relations, we can use the relation prop as we were using it before.
Therefore, the previous query can be rewritten as:
{
"query": {
"prop": "destination_node",
"value": "UK",
"group": "place"
}
}
Fuzzy search can also be used as before, defining the type of match:
{
"query": {
"prop": "destination_node",
"value": "France",
"group": "place",
"match": "fuzzy"
}
}
Sometimes, we know about two nodes being connected by a relation but we don’t know the direction of the relation. Path queries have a special field called undirected that can be set to search for paths in any direction.
List all friendship relations between people:
{
"query": {
"prop": "path",
"source": {
"group": "person"
},
"relation": {
"label": "friend"
},
"destination": {
"group": "person"
},
"undirected": true
}
}
Or get all triplets related to the UK:
{
"query": {
"prop": "path",
"source": {
"value": "UK",
"group": "place"
},
"undirected": true
}
}
As before, there is a shorthand for undirected paths where we only know a node but not its position. An equivalent to the query above would be:
{
"query": {
"prop": "node",
"value": "UK",
"group": "place"
}
}
The response here could be:
{
"paths": [
{
"source": {
"value": "UK",
"type": "entity",
"group": "place"
},
"relation": {
"label": "is",
"type": "ENTITY"
},
"destination": {
"value": "country",
"type": "entity",
"group": "region"
}
},
{
"source": {
"value": "Erin",
"type": "entity",
"group": "person"
},
"relation": {
"label": "born_in",
"type": "ENTITY"
},
"destination": {
"value": "UK",
"type": "entity",
"group": "place"
}
},
...
]
}
Here, UK is found in both positions: as a source and as a destination.
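One way to picture the undirected flag is as a second matching attempt with source and destination swapped. The sketch below models that on an in-memory list of triplets; it is an assumption about the semantics described above rather than the platform’s code, and path_matches is a hypothetical helper.

```python
paths = [
    {"source": {"value": "UK", "group": "place"},
     "relation": {"label": "is"},
     "destination": {"value": "country", "group": "region"}},
    {"source": {"value": "Erin", "group": "person"},
     "relation": {"label": "born_in"},
     "destination": {"value": "UK", "group": "place"}},
    {"source": {"value": "Tom", "group": "person"},
     "relation": {"label": "live_in"},
     "destination": {"value": "France", "group": "place"}},
]


def part_matches(pattern, part):
    """A part matches when every key given in the pattern has the same value."""
    return all(part.get(key) == wanted for key, wanted in pattern.items())


def path_matches(query, path):
    """Match a triplet against a path query, honouring the `undirected` flag."""
    source = query.get("source", {})
    relation = query.get("relation", {})
    destination = query.get("destination", {})
    if not part_matches(relation, path["relation"]):
        return False
    forward = (part_matches(source, path["source"])
               and part_matches(destination, path["destination"]))
    if forward or not query.get("undirected", False):
        return forward
    # Undirected: also accept the triplet with source and destination swapped.
    return (part_matches(source, path["destination"])
            and part_matches(destination, path["source"]))


query = {"prop": "path", "source": {"value": "UK", "group": "place"}, "undirected": True}
found = [p["relation"]["label"] for p in paths if path_matches(query, p)]
print(found)  # ['is', 'born_in']
```

Without the flag, only the triplet where UK is the source would be returned.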
The queries explained so far are powerful for starting to explore the graph, but they leave little room for expressivity. That’s why the graph API also offers boolean expressions. All three endpoints accept and, or and not expressions in their query, and these can be nested as deeply as needed.
Let’s see a more complex example that finds any person who was born or lives in any place other than the UK:
{
"query": {
"and": [
{
"prop": "source_node",
"group": "person"
},
{
"or": [
{
"prop": "relation",
"label": "born_in"
},
{
"prop": "relation",
"label": "live_in"
}
]
},
{
"prop": "destination_node",
"group": "place"
},
{
"not": {
"prop": "destination_node",
"value": "UK"
}
}
]
}
}
Although boolean expressions give us great power, remember that paths are built from triplets of source, relation and destination and multi-hop queries are not supported at the moment.
Therefore, even if we have a triplet for Erin born in UK, and Erin lives in UK, querying:
{
"query": {
"and": [
{
"prop": "relation",
"label": "born_in"
},
{
"prop": "relation",
"label": "live_in"
}
]
}
}
won’t give us any result, as there’s no triplet satisfying this condition (a triplet has a single relation).
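The single-triplet rule is easy to check with a small evaluator that applies a boolean expression to one triplet at a time. This is a simplified in-memory model and not the platform’s implementation; evaluate is a hypothetical helper and the sample triplets are made up.

```python
def triplet(source, source_group, label, destination, destination_group):
    """Build one sample triplet in the shape used by path responses."""
    return {
        "source": {"value": source, "group": source_group},
        "relation": {"label": label},
        "destination": {"value": destination, "group": destination_group},
    }


paths = [
    triplet("Erin", "person", "born_in", "UK", "place"),
    triplet("Erin", "person", "live_in", "UK", "place"),
    triplet("Tom", "person", "live_in", "France", "place"),
    triplet("Anna", "person", "born_in", "Spain", "place"),
]

# Which part of a triplet each prop refers to.
PART_FOR_PROP = {
    "source_node": "source",
    "destination_node": "destination",
    "relation": "relation",
}


def evaluate(expr, path):
    """Evaluate a nested and/or/not expression against a single triplet."""
    if "and" in expr:
        return all(evaluate(sub, path) for sub in expr["and"])
    if "or" in expr:
        return any(evaluate(sub, path) for sub in expr["or"])
    if "not" in expr:
        return not evaluate(expr["not"], path)
    part = path[PART_FOR_PROP[expr["prop"]]]
    return all(part.get(key) == wanted
               for key, wanted in expr.items() if key != "prop")


outside_uk = {"and": [
    {"prop": "source_node", "group": "person"},
    {"or": [{"prop": "relation", "label": "born_in"},
            {"prop": "relation", "label": "live_in"}]},
    {"prop": "destination_node", "group": "place"},
    {"not": {"prop": "destination_node", "value": "UK"}},
]}
both_relations = {"and": [
    {"prop": "relation", "label": "born_in"},
    {"prop": "relation", "label": "live_in"},
]}

print([p["source"]["value"] for p in paths if evaluate(outside_uk, p)])  # ['Tom', 'Anna']
print([p for p in paths if evaluate(both_relations, p)])                 # []
```

The second query is empty because each triplet is tested on its own, and no single triplet carries two relation labels.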
As in other search endpoints, graph API results are limited to the best K. To change the default number of results returned, specify top_k:
{
"query": {
...
},
"top_k": 100
}
The current maximum K is 500, but this value can change in the future.
Querying the whole knowledge graph is nice, but sometimes we get too many results or we want to search only a subset of the knowledge graph. As in other search endpoints, the graph API supports filter_expression, a boolean expression of filters that restricts which fields are searched.
As a simple example, let’s see a filter to search only in sweet recipes written in English:
{
"query": {
...
},
"filter_expression": {
"field": {
"and": [
{
"prop": "label",
"labelset": "recipes",
"label": "sweet"
},
{
"prop": "language",
"language": "en"
}
]
}
}
}
(See filtering docs for more examples)
In addition to a field filter expression, security and show_hidden are also supported, giving you the ability to include or exclude results with certain security requirements, or results that are hidden.
There is also a special filter that can be combined with graph queries: generated. This is really useful to query the graph generated by users, processors or data augmentation tasks:
{
"query": {
"and": [
{
"prop": "relation",
"label": "live_in"
},
{
"prop": "generated",
"by": "data-augmentation"
}
]
}
}
As the filter is also a prop, it can be used in boolean expressions like any other property.
/find Endpoint
Exploring graphs can give really interesting insights into your data, but many use cases are powered by text blocks extracted from your documents using different techniques (commonly keyword and semantic search). Our team thought that combining those techniques with a knowledge graph was a powerful idea, and that’s why we integrated graph search into the /find endpoint.
Some graph paths (entity-relation-entity) are extracted from a specific paragraph. We can retrieve those in a /find request and merge the graph results with keyword and semantic ones, leveraging yet another way to find answers in your unstructured data.
Let’s see our first example:
{
"query": "Who is Alice?",
"features": ["keyword", "semantic", "graph"],
"graph_query": {
"prop": "path",
"source": {
"match": "exact",
"value": "Alice",
"group": "person"
},
"undirected": true
},
"top_k": 20
}
First of all, we are using query for the keyword/semantic question. features includes keyword, semantic and graph. Including graph forces us to also define graph_query, which is a graph path query (identical to the ones explained before). Finally, top_k allows you to select the best 20 results.
If you are more confident in the quality of the results from one of the retrieval methods, you can leverage a customized reciprocal rank fusion (RRF) algorithm and set weights for each method:
{
"rank_fusion": {
"name": "rrf",
"boosting": {
"keyword": 1,
"semantic": 2,
"graph": 0.5
}
},
...
}
Here, the RRF boosting parameter states explicitly that semantic results should count twice as much as keyword results, and graph results half as much as keyword results.
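To see how such weights can change the final order, here is a sketch of a weighted reciprocal rank fusion. It assumes the common formulation where each method contributes weight / (k + rank) with k = 60; the exact formula and constant used by the platform are not stated in this article, and the paragraph ids are made up.

```python
from collections import defaultdict


def weighted_rrf(rankings, boosting, k=60):
    """Fuse ranked lists: every method adds weight / (k + rank) to an item's score.

    `k=60` is the usual RRF constant, assumed here for illustration.
    """
    scores = defaultdict(float)
    for method, ranked_ids in rankings.items():
        weight = boosting.get(method, 1.0)
        for rank, item in enumerate(ranked_ids, start=1):
            scores[item] += weight / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Hypothetical paragraph ids, each list ordered best-first by one retrieval method.
rankings = {
    "keyword": ["p1", "p2", "p3"],
    "semantic": ["p3", "p4", "p1"],
    "graph": ["p2"],
}
boosting = {"keyword": 1, "semantic": 2, "graph": 0.5}

fused = weighted_rrf(rankings, boosting)
print(fused)  # ['p3', 'p1', 'p4', 'p2']
```

With these weights, p3 overtakes p1 because it ranks first in the semantic list, while p2 gains little from being the top graph result.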
RRF boosting is yet another powerful feature to customize your search experience. If you want to know more about this kind of tweaking, you can read this amazing article about it.
A knowledge graph is automatically extracted from all your documents, but you can also upload your own graphs to the Nuclia platform. Building a custom knowledge graph can improve the performance of your search engine by providing a more structured and relevant representation of your data. This can be particularly useful for complex queries or when dealing with large datasets.
There are several ways to build a custom Nuclia knowledge graph:
Graph extraction agents use an LLM to automatically extract named entities and identify relationships between them, based on a short description of the expected entities plus examples of such entities and their relationships.
Check out this amazing article for a more in-depth explanation.
Nuclia Knowledge Graphs are stored in resources. You can use this fact to add small knowledge graphs for specific resources or to create your Knowledge Box graph and manage it as a known resource.
A knowledge graph is a set of entities and relationships stored in the usermetadata.relations attribute of one or several resources. A graph involving some entities will have the following format:
[
{
"from": {
"value": "Alice",
"type": "entity",
"group": "PERSON"
},
"label": "speaks",
"relation": "ENTITY",
"to": {
"value": "Italian",
"type": "entity",
"group": "LANGUAGE"
}
},
{
"from": {
"value": "bc218b49700b4a5c9d5ea8a7cfcc8b6f",
"type": "resource"
},
"to": {
"value": "Kiswahili",
"type": "entity",
"group": "LANGUAGE"
},
"relation": "ABOUT"
},
{
"from": {
"value": "United Kingdom",
"type": "entity",
"group": "COUNTRY"
},
"to": {
"value": "UK",
"type": "entity",
"group": "COUNTRY"
},
"relation": "SYNONYM"
}
]
As you can see, we can define entity-to-entity relationships, but also resource-to-entity relationships or even synonym relationships. The possibilities are endless.
Storing the graph can be done directly through the API. For example:
POST {kb-path}/resources
{
"slug": "my-custom-graph",
"usermetadata": {
"relations": [
// your knowledge graph
]
}
}
As the data format is a bit verbose, when creating entity-to-entity relations only, you can use the add_graph method of the Nuclia CLI/SDK, which uses a simpler format:
from nuclia import sdk
kb = sdk.NucliaKB()
kb.add_graph(slug="my-custom-graph", graph=[
{
"source": {"group": "People", "value": "Alice"},
"destination": {"group": "People", "value": "Bob"},
"label": "is friend of",
}
])
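To relate the two formats, the sketch below expands the simpler source/destination/label items into the verbose relations format shown earlier. It only illustrates how the fields correspond: to_relations is a hypothetical helper, and whether the SDK performs exactly this conversion internally is an assumption.

```python
def to_relations(simple_graph):
    """Expand simple entity-to-entity items into the verbose relations format.

    Hypothetical helper: every node becomes an "entity" node and every
    relation an "ENTITY" relation, as in the first verbose example above.
    """
    return [
        {
            "from": {
                "value": item["source"]["value"],
                "type": "entity",
                "group": item["source"]["group"],
            },
            "label": item["label"],
            "relation": "ENTITY",
            "to": {
                "value": item["destination"]["value"],
                "type": "entity",
                "group": item["destination"]["group"],
            },
        }
        for item in simple_graph
    ]


simple_graph = [
    {
        "source": {"group": "People", "value": "Alice"},
        "destination": {"group": "People", "value": "Bob"},
        "label": "is friend of",
    }
]

relations = to_relations(simple_graph)
print(relations)
```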
For more examples and information, please refer to the official Nuclia documentation.
Knowledge graphs are a powerful tool for modeling the knowledge contained in your unstructured data. With Nuclia, you can create your own graphs, or let the platform extract them from your unstructured data, either automatically or through agents.
Leveraging the knowledge graph during retrieval can help you find information in yet another dimension and improve your search results.
What are you waiting for? Come and try it yourself!