Sachin Tah

The Rise Of Vector Databases - Part 1



There has been quite a fuss around vector databases in recent months, and it looks like they managed to gate-crash the party right on time, along with their cousin, generative AI.


I was curious about who holds the patent for the vector database, but unfortunately I hit multiple roadblocks. At least I was able to figure out that the need for something like a vector DB was first recognized in the late 1970s, when storing DNA chain data was becoming difficult and cumbersome. Stanford was one of the pioneers in vector DB research. It is also difficult to pin down the first commercially available vector database; we have quite a few now, which I will cover in the second part of this article. In this series I would like to touch on the following topics, covering the first two in part one: what vector databases are, why we need a vector database at all, why it is the new buzzword and how it relates to generative AI, and which use cases call for a vector database.


Like every other student, I used to question the contents of the syllabus. We always complained that parts of it were irrelevant and would have no practical use in the future and, of course, we were wrong. Mathematics was one such subject. Today, in order to understand and work on any solution powered by artificial intelligence, you need a clear understanding of mathematical concepts such as linear algebra, calculus, probability, and statistics.


Database


In order to understand vector DBs, it is important to know how traditional databases work, how information is stored in and retrieved from them, and which missing functionality triggered the invention of a new kind of database.


Relational & Document Database

Typical relational databases, managed by an RDBMS, store structured data that is relational in nature, for example a customer and their orders. To use data inside an RDBMS, the storage structures (schemas) need to be defined in advance, and the data is then organized into these predefined structures, called tables and fields.
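To make the idea of a predefined schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made up for illustration; they do not come from any particular system.

```python
import sqlite3

# In-memory database; all table and column names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The schema has to exist before any data can be stored.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE customer_order ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"
    " item TEXT NOT NULL)"
)

# One customer with two related orders.
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Asha')")
conn.execute("INSERT INTO customer_order (customer_id, item) VALUES (1, 'Laptop')")
conn.execute("INSERT INTO customer_order (customer_id, item) VALUES (1, 'Mouse')")

# The relation between the two tables is followed with a join.
rows = conn.execute(
    "SELECT c.name, o.item FROM customer c"
    " JOIN customer_order o ON o.customer_id = c.id"
    " ORDER BY o.id"
).fetchall()
print(rows)
```

The point of the sketch is only that the tables and fields are declared first and the data must fit them afterwards.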


NoSQL databases are used to store non-relational data that is not strictly driven by predefined structures or schemas. They are mostly schema-free; however, you do have the option to define schemas if you want to. The way data is stored depends on the type of database being used. NoSQL databases are commonly categorized into four types: document, graph, key-value, and column.



I will not go deep into what each of these databases is used for and how it stores data. But before we talk about vector databases, it is important to understand a concept that each of us uses daily, often without realizing it, whenever we retrieve information from a database: search.


Database Search


A database search is an operation performed to retrieve information from a database. To understand vector DBs better, it is important to understand the two broad search techniques available today: lexical and semantic.


Lexical search, also termed keyword search, retrieves information from a database by matching the exact words or phrases in the query against those stored in the database. Semantic search, also termed similarity search, uses natural language processing (NLP) techniques to return results that match the meaning of the query rather than its exact wording.



Let's take an example of how lexical and semantic search handle a search performed by a user. First, let's look at how keyword-based search retrieves content from the database.

In this example, search keywords provided by the user, such as "big applications" and "microservice", matched some of the content stored in the database. However, the search failed to find equivalent content such as "large applications" and "macro services", which a human would have identified very easily.
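The limitation of keyword matching is easy to reproduce. The following is a minimal sketch, assuming a tiny in-memory list of documents and a hypothetical keyword_search helper that does a case-insensitive exact substring match; real databases use indexes and tokenizers, but the blind spot is the same.

```python
# Sample content is made up for illustration.
documents = [
    "Microservices help teams split big applications into small services.",
    "Large applications are often broken down into smaller services.",
    "Macro services let teams deploy parts of a system independently.",
]


def keyword_search(query, docs):
    """Return the documents that contain the query as an exact,
    case-insensitive substring."""
    needle = query.lower()
    return [doc for doc in docs if needle in doc.lower()]


# Only the first document is found by either query; the documents about
# "large applications" and "macro services" are missed.
print(keyword_search("big applications", documents))
print(keyword_search("microservice", documents))
```

Nothing in the matching step knows that "large" means roughly the same as "big", which is the gap semantic search is meant to close.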


When we run the same example with the content stored inside a vector database, the results are more human-like. The first two examples are obvious; however, in the last two examples, the query recognized that "large" is similar to "big", and that "macro" is a spelling mistake in this context and "micro" should be used to provide search results.

Now that we know what sort of problems vector databases are here to solve, let us understand what it takes for one to produce such results: how the data is stored and processed under the hood to come up with such intelligent answers.


Vector Database


A vector database relies on artificial intelligence models to store, process, and retrieve data. To store data in a vector database, you first convert the content into a numerical representation that the database can work with. These representations are called embeddings.


Embeddings are nothing but data converted into vectors with values like [0.1, 0.56, 0.67] and so on. There are ready-to-use models available that take your data as input and provide vector values as output. The entire process of data storage in a vector database may look like this:


Data is first passed through embedding models, which are responsible for converting text into vectors: ordered lists of numbers with many dimensions. For example, the vector representation of "Hello World" could be [0.1, 0.53]. Another important point is that words and sentences with similar meanings end up close to each other in the vector space, and vector databases typically index them, often with clustering-based techniques, so that nearby vectors can be found quickly.
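To show the shape of this step, here is a toy sketch of a function that turns text into a fixed-length vector. It is an assumption-heavy stand-in, not a real embedding model: toy_embed is a hypothetical helper that hashes character trigrams into buckets, so it captures overlap in spelling rather than meaning. A real system would call a trained embedding model instead.

```python
import math
import zlib


def toy_embed(text, dims=8):
    """Map text to a unit-length vector of `dims` numbers by hashing
    character trigrams into buckets. Illustrative stand-in only: it
    reflects spelling overlap, not semantic meaning."""
    vector = [0.0] * dims
    cleaned = " ".join(text.lower().split())  # normalize case and whitespace
    for i in range(len(cleaned) - 2):
        trigram = cleaned[i:i + 3]
        bucket = zlib.crc32(trigram.encode("utf-8")) % dims
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


vector = toy_embed("Hello World")
print(len(vector), [round(v, 2) for v in vector])
```

Whatever the input length, the output always has the same number of dimensions, which is what lets the database compare any two pieces of content.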

When retrieving information from a vector database, the user query is also converted into a vector, and a vector-to-vector search is performed using a nearest-neighbor algorithm, often with cosine similarity as the measure of closeness.
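The search step can be sketched as a brute-force nearest-neighbor lookup that uses cosine similarity as its measure of closeness. The three-dimensional vectors below are hand-written for illustration and were not produced by any model; production vector databases use approximate indexes rather than scanning every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for stored embeddings.
stored = {
    "big applications": [0.90, 0.10, 0.05],
    "large applications": [0.85, 0.15, 0.10],
    "cooking recipes": [0.05, 0.10, 0.95],
}


def nearest(query_vector, vectors, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


query = [0.88, 0.12, 0.07]  # assumed embedding of the user's query
print(nearest(query, stored))
```

Both application-related entries rank ahead of the unrelated one because their vectors point in nearly the same direction as the query vector.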


The embedding model used to store data in a vector database should be the same one used to convert queries when searching that database, because vectors produced by different models are not comparable.


Content Chunking


In theory, this seems like a silver bullet for all search problems. However, practical, enterprise-grade implementations come with a good number of nuances. How do you store large content in a vector database? How do you ensure search results that are contextual and domain-driven? To achieve all of this, technology alone may not suffice; you also need a robust application design and a solid underlying architecture to support the desired business outcomes.


To address the first problem, we need to understand how to break large content down into small, meaningful units, each stored as its own vector and acting as an individual unit for your search. This is done primarily to preserve the relevance of a passage within a document. If you store an entire 100-page document as a single vector row, every match will return all 100 pages as the search result. This may satisfy the semantic search criteria, but it fails to point to the relevant areas or capture contextual meaning within the document.


Therefore, if you want only the specific paragraphs that are relevant and contextual, you need to break your 100-page document into smaller units that are stored as multiple rows in the vector database. This process of breaking large text into smaller, meaningful units is called content chunking. Ultimately, what you store inside your vector database are chunks converted into vectors.
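As a rough sketch of the idea, the snippet below splits a document on blank lines so that each paragraph becomes one row. The chunk_by_paragraph helper, the row layout, and the sample text are all assumptions for illustration; in a real pipeline each chunk would also be converted into a vector before being stored.

```python
def chunk_by_paragraph(text):
    """Split text on blank lines; each non-empty paragraph becomes one chunk."""
    chunks = []
    for block in text.split("\n\n"):
        paragraph = " ".join(block.split())  # collapse line breaks and extra spaces
        if paragraph:
            chunks.append(paragraph)
    return chunks


# Sample document made up for illustration.
document = (
    "Vector databases store embeddings.\nThey support similarity search.\n\n"
    "Chunking splits large documents into smaller units."
)

# One row per chunk, as it might be laid out before embedding and storage.
rows = [
    {"chunk_id": i, "text": chunk}
    for i, chunk in enumerate(chunk_by_paragraph(document), start=1)
]
for row in rows:
    print(row)
```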


Let us take a very simple example. In the example below, a large text is divided into two smaller chunks and stored inside a vector database.

If you perform a vector search on the above content, it may return Chunk#1, Chunk#2, or both, depending on your search criteria. The above example may not be the right fit for all use cases. Common chunking strategies include fixed-size, content-aware, recursive, and specialized chunking, and you can choose among them depending on the scenario, use case, and nature of the content.
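Of the strategies listed, fixed-size chunking is the simplest to sketch. The version below counts words rather than characters or tokens and repeats a few words between neighboring chunks; both the word-based sizing and the overlap are assumptions added for illustration, and chunk_fixed_size is a hypothetical helper rather than any library's API.

```python
def chunk_fixed_size(text, chunk_size=5, overlap=2):
    """Split text into chunks of `chunk_size` words, repeating `overlap`
    words between neighboring chunks so context is not cut off abruptly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = chunk_fixed_size(text, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Fixed-size chunks are easy to produce but can split a sentence in the middle, which is one reason content-aware and recursive strategies exist.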


In my next article, I will cover use cases for vector databases, the role they play in generative AI, and how to integrate the two technologies to solve enterprise use cases. Watch out for part II, and thank you for reading.





