Vectors are not new, but they’re having a surge in popularity these days. As mentioned earlier, this is due to the newfound accessibility of AI/ML systems and the fact that the output of these systems is a vector. A common use case is to build a model on stored data (text, sound, video), convert that data to vector format, and then use it for “semantic search.” In this case, semantic search is performed by taking a new input, converting it to its corresponding vector, and finding the most similar results in the database. Similarity is measured using a distance function, such as Euclidean or cosine distance, and the results are often capped at the “k nearest neighbors” (K-NN), i.e., the k most similar objects. It can take a lot of time to encode the “training set” of vectors, so it makes sense to “cache” them in a permanent data storage system, such as a database, and perform K-NN queries there. Having a set of vectors that are ready to be queried for semantic searches makes for a generally better experience for users, which has given rise to the notion of needing a “vector database.”
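As a quick refresher on those distance functions: for two vectors $a$ and $b$ of dimension $n$ (the notation here is just for illustration, not tied to any particular library),

$$\text{Euclidean}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}, \qquad \text{Cosine}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert\,\lVert b \rVert}.$$

Both return smaller values for more similar vectors, which is why a K-NN query can simply take the k rows with the smallest distances.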
This gives us pgvector, an open source PostgreSQL extension that provides an indexable vector data type. In a nutshell, pgvector lets you store vectors in PostgreSQL and perform K-NN queries with an assortment of distance metrics: Euclidean, cosine, and inner product. As of today, pgvector comes with one index, ivfflat, which implements the IVF FLAT method of vector indexing.
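As a minimal sketch of what that looks like in practice (the table, column, and values below are made up for illustration; the distance operators are the ones pgvector documents: `<->` for Euclidean distance, `<=>` for cosine distance, and `<#>` for negative inner product):

```sql
-- Enable the extension (installed separately from core PostgreSQL)
CREATE EXTENSION vector;

-- A toy table with a 3-dimensional vector column
CREATE TABLE items (
    id bigserial PRIMARY KEY,
    embedding vector(3)
);

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- K-NN query: the 5 rows closest to the query vector by Euclidean distance
SELECT id
FROM items
ORDER BY embedding <-> '[2,3,4]'
LIMIT 5;
```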
What happens when you query indexed vector data may be a bit different from how you’re used to querying data in PostgreSQL. Due to the computational expense of performing nearest-neighbor searches over high-dimensional vectors, many vector indexing methods look for “approximate” answers that are “close enough” to the correct answer. This has led to the field of “Approximate Nearest Neighbor” (ANN) searches. The key tradeoff people evaluate for ANN queries is between performance and “recall,” where “recall” is the percentage of relevant results returned.
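One way to get a feel for that tradeoff is to compare the approximate answer against the exact one. A rough sketch, reusing the toy `items` table above and assuming that turning off index scans forces PostgreSQL back to an exact sequential scan:

```sql
-- Approximate: uses the ivfflat index if one exists
SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 10;

-- Exact: disable index scans for one transaction so the planner
-- falls back to scanning every row
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 10;
COMMIT;

-- Recall here would be (number of ids appearing in both result sets) / 10
```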
Let’s look at the ivfflat method as an example. When building an ivfflat index, you decide how many lists you want it to have. Each list represents a “center”; these centers are calculated using a k-means algorithm. Once all of the centers are determined, ivfflat determines which center each vector is closest to and adds it to the index. When it’s time to query your vector data, you then decide how many centers to check, which is controlled by the ivfflat.probes parameter. This is where you see the ANN performance/recall tradeoff: the more centers you visit, the more accurate your results, but at the expense of performance.
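Putting the two knobs together in a short sketch (the values `lists = 100` and `ivfflat.probes = 10` are placeholders; reasonable settings depend on the size of your data set):

```sql
-- Build an ivfflat index with 100 lists (k-means centers),
-- clustering and searching by Euclidean distance
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- At query time, check 10 of the 100 centers: more probes means
-- better recall, fewer probes means faster queries
SET ivfflat.probes = 10;

SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 5;
```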