ANN) algorithms are used. ANN search approximates the true nearest neighbor: it may not find the absolute closest point, but it finds one that is close enough, with lower latency and fewer resources.
In the literature, the fraction of the true nearest neighbors (as found by exhaustive search) that an ANNS algorithm returns is called the recall rate. The higher the recall rate, the better the results.
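To make the recall rate concrete, here is a minimal Python sketch (an illustration, not Upstash's implementation): recall is measured by comparing an ANN result set against the exact top-k neighbors from brute-force search. The `recall_at_k` helper and the simulated ANN result are hypothetical.

```python
import numpy as np

def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the ANN search returned."""
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

# Toy data: 1000 random 64-dimensional vectors and one query vector.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

# Exhaustive (exact) search: rank every point by Euclidean distance.
dists = np.linalg.norm(vectors - query, axis=1)
exact_top10 = np.argsort(dists)[:10]

# Simulate an ANN result that found 9 of the 10 true neighbors
# (the 10th slot holds some id that is not among the exact top 10).
missing = next(i for i in range(1000) if i not in set(exact_top10))
ann_top10 = list(exact_top10[:9]) + [missing]

print(recall_at_k(ann_top10, exact_top10, k=10))  # 0.9
```

An exhaustive search always has a recall of 1.0 but pays for it in latency; ANN algorithms trade a little recall for much faster queries.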
Several ANNS algorithms, such as HNSW [1], NSG [2], and DiskANN [3], are available, each with its own characteristics. One of the hard problems in ANN algorithms is that indexing and querying vectors may require keeping the whole dataset in memory. When the dataset is huge, the memory required for indexing may exceed what is available. The DiskANN algorithm addresses this problem by using disk as the main storage for indexes and performing queries directly on disk. The DiskANN paper acknowledges that if you store your vectors on disk and use HNSW or NSG, you may still end up with very high latencies. DiskANN is designed to serve queries from disk with low latency and a good recall rate.
This helps Upstash Vector stay cost-effective, and therefore cheaper than alternatives.
Even though DiskANN has its advantages, it requires more work to be practical. The main problem is that you cannot insert into or update an existing index without reindexing all the vectors. This problem is addressed by a follow-up paper, FreshDiskANN [4]. FreshDiskANN improves on DiskANN by introducing a temporary in-memory index for up-to-date data. Queries are served from both the temporary (up-to-date) index and the disk index, and the temporary indexes are periodically merged into the disk index behind the scenes.
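The two-tier scheme described above can be sketched as follows. This is a simplified toy illustration of the FreshDiskANN idea (fresh writes land in a small in-memory index, queries consult both tiers, and a merge step folds fresh data into the on-disk index); it is not Upstash's actual code, and every name in it is hypothetical.

```python
class TwoTierIndex:
    """Toy sketch of the FreshDiskANN serving scheme: a temporary
    in-memory index for fresh vectors on top of a (simulated) disk index."""

    def __init__(self, memory_limit=3):
        self.disk = {}    # id -> vector; stands in for the on-disk graph index
        self.memory = {}  # id -> vector; temporary index for fresh writes
        self.memory_limit = memory_limit

    def insert(self, vid, vec):
        # New vectors go to the in-memory tier; no full reindexing needed.
        self.memory[vid] = vec
        if len(self.memory) >= self.memory_limit:
            self.merge()

    def merge(self):
        # Periodic background merge: fold fresh data into the disk tier.
        self.disk.update(self.memory)
        self.memory.clear()

    def query(self, vec, k=1):
        # Serve from BOTH tiers, then keep the overall closest results.
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        candidates = list(self.disk.items()) + list(self.memory.items())
        candidates.sort(key=lambda item: dist(item[1], vec))
        return [vid for vid, _ in candidates[:k]]

idx = TwoTierIndex(memory_limit=3)
idx.insert("a", (0.0, 0.0))
idx.insert("b", (5.0, 5.0))
print(idx.query((0.2, 0.1)))  # ['a']  -- served from the in-memory tier
idx.insert("c", (9.0, 9.0))   # reaching the limit triggers a merge to disk
print(idx.query((5.1, 5.0)))  # ['b']  -- now served from the disk tier
```

A real implementation would of course use graph-based indexes and delete handling on both tiers; the point here is only the query path (union of both tiers) and the periodic merge.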
Upstash Vector is based on DiskANN and FreshDiskANN, with further improvements based on our own tests and observations.