The “Whys” and “Hows” of Nuclia and NucliaDB

October 21, 2025 Agentic RAG, Data & AI

Previously published on Nuclia.com. Nuclia is now Progress Agentic RAG.

Hello, this is Ramon Navarro Bosch, CTO and Co-Founder at Nuclia. Together with Eudald Camprubí, my Co-Founder and companion in battles for many years, we’ve been building something for the last two years. Our vision is to deliver an engine that lets engineers and builders search any domain-specific set of data, focusing on unstructured data like video, text, PDFs, links, conversations, layouts and many other sources.

For many years, we fought with general-purpose indexing tools to answer questions on top of a set of files, text, web pages and so on, but kept failing: general-purpose indexing tools and AI models do not share the same interface, goals, speed or scaling story. That’s why we decided to build an end-to-end solution and deliver it as an API.

How We Do It

First, we need to define what it means to answer a question on top of a set of data (i.e., information retrieval, or IR, for those in the know).

It’s a four-step process:

  • The first stage is to crunch the data: tokenize, vectorize, pre-process, etc. From this, we generate previews to visualize a representation of the data and extract all possible text from images, frames and audio, along with the text of the resource itself. The data is then cleaned and split into paragraphs and sentences.
  • Next, we use state-of-the-art multilingual Natural Language Processing (NLP) techniques to extract all possible information from that data: summarizing, detecting entities (generic and custom domain-defined), computing embedded vector representations, labeling with user-defined labels and anonymizing if required. We end up with a normalized model that can represent all this information.
  • Then, we store all this information in our scalable open source database, NucliaDB. When we started, the concept of a vector database did not yet exist, so we implemented our own Hierarchical Navigable Small World (HNSW) index in Rust, along with our own knowledge graph component, also in Rust. These two components, plus Tantivy (a Lucene-like search library written in Rust), form a NucliaDB Node: an indexing service exposed over gRPC (Remote Procedure Call) that stores text, paragraphs, vectors and relations and provides a powerful search interface. On top of this indexing engine, we layer a Python-based Representational State Transfer (REST) API for Create, Read, Update and Delete (CRUD) operations, plus Search and Dataset APIs.
  • Finally, once the data is in NucliaDB, we generate the models used in what we call “the understanding layer” to generate new data or handle queries to our API: predicting intent, classifying text, and extracting entities or embeddings. Anyone can generate this data with their own domain information by training a custom model or extracting from their own database.
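The four steps above can be sketched, very loosely, as a pipeline. This is a toy illustration only: the function names, the bag-of-letters "embedding" and the in-memory index are all hypothetical stand-ins, not the NucliaDB API.

```python
from dataclasses import dataclass, field

@dataclass
class Paragraph:
    text: str
    vector: list[float] = field(default_factory=list)
    labels: list[str] = field(default_factory=list)

def extract(raw: str) -> list[Paragraph]:
    # Step 1: clean and split the resource into paragraphs.
    return [Paragraph(p.strip()) for p in raw.split("\n\n") if p.strip()]

def enrich(paragraphs: list[Paragraph]) -> list[Paragraph]:
    # Step 2: stand-in for NLP enrichment; a trivial vowel-count "embedding"
    # and a keyword label take the place of real models.
    for p in paragraphs:
        p.vector = [p.text.lower().count(c) for c in "aeiou"]
        if "invoice" in p.text.lower():
            p.labels.append("finance")
    return paragraphs

class TinyIndex:
    # Step 3: stand-in for NucliaDB's full-text + vector + graph indexes.
    def __init__(self) -> None:
        self.paragraphs: list[Paragraph] = []

    def store(self, paragraphs: list[Paragraph]) -> None:
        self.paragraphs.extend(paragraphs)

    def search(self, term: str) -> list[Paragraph]:
        # Step 4 would use learned models at query time; here, substring match.
        return [p for p in self.paragraphs if term.lower() in p.text.lower()]

index = TinyIndex()
index.store(enrich(extract("Invoice #42 is due.\n\nThe meeting moved to 10am.")))
print(len(index.search("invoice")))  # 1
```

The real system swaps each stand-in for a service: extraction runs on uploaded files, enrichment calls the NLP models, and storage lands in a NucliaDB Node.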

Cloud & OSS

We want to make this technology available for everybody, so we made a big effort to create an easy-to-use workflow for self-hosting, NucliaDB-hosting or using our Cloud.

  • For our Cloud, you just need to go to rag.progress.cloud and sign up, and you will get a KnowledgeBox (our concept of a container of data), where you can upload any kind of data via the desktop application, the Dashboard, our Nuclia software development kits (SDKs) for TypeScript and Python, or our HTTP REST API.
  • For self-hosting, you can install NucliaDB with pip or run it with Docker. To search the unstructured data (e.g., files, links, conversations, layouts, etc.) you need a Nuclia API key from rag.progress.cloud. We then analyze your data and return it normalized without storing anything on our servers. Once you have the service running, you can upload data using any of the same methods as for the Cloud. NucliaDB open source software (OSS) can be installed on your laptop or deployed fully distributed on a Kubernetes cluster with our Helm charts, thanks to NATS JetStream streaming and TiKV.

Once the data has been processed, you can search via our custom widget generator, the Nuclia SDKs or the HTTP REST API. To do semantic search or relation search, you will need a Nuclia API key so NucliaDB can call the Predict API and generate the embeddings at query time. The Nuclia Predict API is designed to be fast (5 to 10 ms of computation time) to add as little overhead as possible to the search query, and there's no need to deploy a graphics processing unit (GPU) or a specific prediction architecture.
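Conceptually, the semantic part of a query works like this: the query text is turned into an embedding (this is what the Predict API does server-side) and compared against the stored paragraph vectors by similarity. A toy sketch with made-up three-dimensional vectors, not the real models:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend these vectors came from an embedding model at indexing time.
index = {
    "reset your password in settings": [0.9, 0.1, 0.0],
    "quarterly revenue grew 12%":      [0.1, 0.8, 0.3],
}

# Pretend this vector came from the Predict API at query time,
# for a query like "how do I change my password?".
query_vector = [0.85, 0.15, 0.05]

ranked = sorted(index, key=lambda text: cosine(index[text], query_vector),
                reverse=True)
print(ranked[0])  # reset your password in settings
```

Because only the query embedding is computed at search time, keeping that call in the 5 to 10 ms range is what keeps the whole query fast.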

If you want to generate new models, you can play with our Nuclia Learning API:

  • On rag.progress.cloud, you can trigger a training of intent detection, text classification or entity recognition with your own annotated data via the Nuclia Dashboard or the HTTP REST API.
  • If you’re self-hosting, you can create a training of intent detection, text classification or entity recognition on your cloud account, then use the nucliadb_dataset Python package to export annotated data (via the API) and push it to train the new model.

Once the model is computed, you can predict using the Cloud Predict API or download it in TensorFlow.js format (only for intent detection right now) and use it with the TypeScript SDK.

Create Your Own NLP Model

With the nucliadb_dataset Python package, you can create your own stream of data for any of the supported tasks and produce a partitioned set of Apache Arrow files that can be used with any of the common NLP dataset libraries to train, evaluate and compute new models.

Roadmap

Ranking is our main goal. You can already use classical ranking for full-text search (e.g., boosting and BM25) and good generic multilingual semantic search. We know we can do better, so we will add the option to train your own bi-encoder with your domain data, languages and annotated queries.
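For reference, BM25, the classical side of that ranking story, scores a document per query term by damped term frequency weighted by inverse document frequency and normalized by document length. A minimal, self-contained sketch (not NucliaDB's actual implementation, which lives in Tantivy):

```python
import math

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    # BM25: rewards term frequency with diminishing returns (k1) and
    # penalizes long documents relative to the average length (b).
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # rarity weight
        tf = doc.count(term)
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "nucliadb indexes vectors and text".split(),
    "tantivy is a rust search library".split(),
]
scores = [bm25_score(["rust", "search"], d, corpus) for d in corpus]
print(scores[1] > scores[0])  # True
```

A trained bi-encoder complements this: BM25 catches exact-term matches, while the bi-encoder's embeddings catch paraphrases in your domain's own vocabulary.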

We are working to deliver the best experience in our Cloud and self-hosted OSS environments, and to provide tooling for moving from one to the other easily with a dump and a restore of a KnowledgeBox.

Building “ChatGPT” for Internal Data

We believe Nuclia and NucliaDB are the perfect match to train a model that can act like ChatGPT, but with real, up-to-date data from your own environment. The bi-encoder lets us use your custom data to provide context for our approach to generative AI (GenAI), so we can deliver a well-qualified answer. No bias, only proper information.

Our goal is to make it feasible to create a model small enough to run on multiple edge devices and push question answering to the client side.

We Need Your Help!

Integrations via the Nuclia desktop applications are easy to develop and we invite everybody to help by adding integrations that would be useful to you!

We would like to invite all data scientists to test our nucliadb-sdk and nucliadb-dataset to create crazy models, give feedback and test the first NLP-focused open source DB (PyPI/Docker).

Any use case you have in mind, please come share on our Discord server.

We will continue to publish more articles explaining how to use all the power of Nuclia and NucliaDB on our Cloud and OSS.

Ramon Navarro
