EP07 - Efficient Neural Search: Rethinking Inverted Indexes for Learned Sparse Representations. With Dr. Franco Maria Nardini
On other platforms: Web, Apple Podcast, YouTube.
This is the transcript of the episode polished with AI for readability
Luca: Hello and welcome to a new episode of targz. Today we talk about text retrieval and how to use inverted indexes with sparse vector representations of queries and documents. The topic is quite technical, so I will let Franco Maria Nardini, research director at ISTI-CNR, explain it while we discuss his paper titled "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations".
Hello Franco and welcome.
Franco: Hello Luca, thanks for the invitation.
Luca: Yeah, it's a pleasure to have you here. And would you like to introduce yourself?
Franco: My name is Franco Nardini, I'm a research director at ISTI-CNR, which stands for Istituto di Scienza e Tecnologie dell'Informazione of the National Research Council of Italy. I'm based in Pisa, and starting from my PhD, my research has mainly focused on information retrieval. For the last decade or so, it has also involved AI applied to retrieval, efficient AI, and a whole set of topics that revolve around those keywords.
Luca: Cool. And so talking about your research, let's jump into the paper. Would you like to give us some context?
Franco: So the paper is a paper we published at SIGIR 2024. In this paper, as the title suggests, we tackle approximate retrieval, and we do it with an efficient solution involving inverted indexes. What we use here are learned sparse representations, and to understand what learned sparse representations are, we need to go back to 2018. We all know 2018: BERT, a large language model, came on the scene. And BERT showed that large language models had, at that time, the potential to heavily impact both natural language processing and information retrieval.
From 2018 on, specifically, one line of research focuses on representations. What does it mean, representations? Instead of using these big models in production, we use the model offline to build representations of objects. What is a representation? A representation is basically a vector. And the representation is very interesting because many papers show that those large language models are able to transfer the semantic information of the original domain into the latent space of the representation. So in information retrieval, computing the similarity between queries and documents becomes computing something in the latent space: an L2 distance, an inner product, or a cosine. That's the interesting part. Representations coming from large language models allow you to turn similarity search into mathematical operations in the latent space.
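To make this concrete, here is a tiny sketch, with made-up numbers, of how similarity between two such representations reduces to simple latent-space math:

```python
import math

# Two toy dense representations. In practice these would be produced
# offline by an encoder such as BERT; the numbers here are invented.
query_vec = [0.2, 0.9, 0.1]
doc_vec = [0.1, 0.8, 0.3]

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    return inner_product(u, v) / (norm_u * norm_v)

print(inner_product(query_vec, doc_vec))  # similarity as a dot product
print(cosine(query_vec, doc_vec))         # similarity as an angle
```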
I told you in this paper we focus on learned sparse representations. Why sparse? Because from that time up to today, the literature focuses on basically three kinds of representations. Sparse is the last one. The first two are dense — so the representation is a dense vector of real numbers — and multi-vector. What does it mean, multi-vector? So instead of encoding the query or the document into a single vector, we go down in the granularity, so we encode each term into a vector. You may imagine a lot of vectors in that kind of, let's say, family of encoders.
The last one is sparse, and sparse is very interesting because now the target is a sparse vector whose dimensions are grounded in the vocabulary. What does it mean? I have a query, I have a document, and an encoder in the middle. My encoder encodes queries and documents into a sparse vector of terms. So the question here is: why do you need to encode something which is text into something which is text again? Because there is an interesting property of sparse encoders: you can learn to expand. You have a query, which is generally short, and the encoder expands it into a bigger one that includes terms that are semantically related to the original query, to naturally fight vocabulary mismatch, which is a very well-known problem in information retrieval. Queries and documents go through the encoder, and you encode each into a sparse vector which is now expanded. This is super interesting. Indeed, the literature shows that by doing this, in many different flavors, you can learn encoders that are very powerful in terms of effectiveness. What is effectiveness? Basically the retrieval quality, measured with some metric, from NDCG to MRR, depending on the scenario.
Luca: Sorry, a question. When you say "expand the query", you mean that I issue a query and then you can, let's say, retrieve results without really looking at the exact word that I'm saying. So you look at what I mean with my query, and so you can find basically all the answers to semantically similar questions.
Franco: Let me try to be very, very clear with an example for our friends at home. The example is the following: "What shoes do most NBA players wear?" Okay, super. A basic query that involves players, NBA, shoes. If you run it through the SPLADE-v3 encoder, which is a state-of-the-art sparse encoder learned on text, the representation you get is: NBA with a very high score, because that is indeed what is very important; shoes, which is very important; players, which is also very important. Those are the three keywords I expect to be there. But the encoder also adds "basketball", which is semantically aligned with the original query but not in it. And for that encoder, "basketball" is much more important than "players", for example, which sounds right. I expect that, because it is somehow a synonym of NBA: it's the sport the NBA is about. That's the concept of expansion.
The process of expansion involves both queries and documents. So you basically have a representation which is still sparse, like the original text, but includes many more terms, precisely to fight the vocabulary mismatch.
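As a toy sketch of what these expanded representations look like (the terms follow the NBA example above, but the weights are invented for illustration, not actual SPLADE-v3 outputs):

```python
# Hypothetical expanded sparse vectors: term -> weight.
# "basketball" appears in both, even though the raw query never says it.
query = {"nba": 2.1, "shoes": 1.8, "players": 0.9, "basketball": 1.2, "wear": 0.6}
doc = {"nba": 1.5, "basketball": 1.9, "sneakers": 1.1, "shoes": 1.4, "brand": 0.7}

def sparse_dot(q, d):
    # Only terms present in both vectors contribute to the score.
    return sum(w * d[t] for t, w in q.items() if t in d)

print(sparse_dot(query, doc))
```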
Now, there is one problem. Why sparse? Because in information retrieval, people were used to sparse indexing techniques. What are sparse indexing techniques? Basically, applying inverted indexes on top of text. More than 50 years of research went into super efficient solutions that fetch the important results in the blink of an eye. But things changed from 2018 on. Why? Because original textual collections were characterized mostly by two properties. First, in web search, queries were short: a query was 2, 3, 4 terms at most. Second, the terms in document collections were distributed following a Zipf's law. With neural encoders, neither of those properties holds anymore: queries are longer, also because of expansion, and documents no longer follow a Zipf's law. And inverted indexes were designed precisely to exploit those properties. So what was working very well now doesn't work as it did before.
Luca: Can you just quickly explain what is an inverted index?
Franco: An inverted index is a data structure that allows you to search over a sparse representation, as text is, because you have a structure that links each term to all the documents containing that specific term. When you have a query, you look at the so-called posting lists, that is, the lists of all the documents linked to each query term, and focus only on those, disregarding everything else. That's the main idea behind the standard inverted index. Then there is a ton of literature on how to optimize it, a lot of tricks to make that kind of traversal super efficient.
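A minimal sketch of this idea, with a toy collection and made-up weights:

```python
from collections import defaultdict

# Toy documents as sparse term -> weight vectors (weights are invented).
docs = {
    0: {"nba": 1.5, "shoes": 1.4},
    1: {"pasta": 2.0, "pisa": 0.8},
    2: {"basketball": 1.9, "nba": 1.1},
}

# Build: for every term, a posting list of (doc_id, weight) pairs.
inverted = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        inverted[term].append((doc_id, weight))

# Query time: only the posting lists of the query terms are touched;
# everything else in the collection is disregarded.
def search(query):
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in inverted.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search({"nba": 2.0, "shoes": 1.0}))
```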
And that was working like a charm, but with neural representations, that kind of machinery doesn't work as it used to. So we needed to find new ways to have efficient inverted indexes, efficient retrieval, with learned sparse representations. That's what we did here. In this paper we introduced Seismic. Seismic is an acronym, and it names a data structure composed of two main ingredients: the inverted index I was telling you about before, and a forward index, which I will go through in a while.
The inverted index, as I told you, links terms to documents. The forward index you can think of as a lookup table: you go with the document ID, the forward index gives you the full representation, the full vector of that document.
Luca: Okay.
Franco: So the whole Seismic idea is grounded on one property we discovered on top of the most important sparse encoders in the literature. We call it "concentration of importance". What does it mean? You have a sparse vector, which is the result of applying an encoder to queries and documents. Only a few components of that vector are really important in determining the final inner product between the query and the document. You don't need to keep everything; you can keep just a few. And this is very important because, to give you some numbers, you can recover about 90% of the real, actual dot product of a query-document pair just by looking at the most important 15% of the components of the representation. Put together the most important 15% of the query and document vectors, and you get almost 90% of the overall inner product.
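A toy illustration of the mechanics (the 15% / 90% figures above are measured on real encoders; the skewed vector below is purely synthetic):

```python
# Sketch of "concentration of importance": keep only the largest
# components of each vector and see how much of the full dot product
# survives. The vectors are synthetic, with a few large components and
# many tiny ones.
dims = 100
doc = {i: 10.0 / (i + 1) for i in range(dims)}
query = {i: 10.0 / (i + 1) for i in range(dims)}

def dot(q, d):
    return sum(w * d[t] for t, w in q.items() if t in d)

full = dot(query, doc)

def top_components(vec, fraction):
    # Keep only the heaviest `fraction` of components by weight.
    k = max(1, int(len(vec) * fraction))
    kept = sorted(vec.items(), key=lambda kv: -kv[1])[:k]
    return dict(kept)

approx = dot(top_components(query, 0.15), top_components(doc, 0.15))
print(f"approximate / full = {approx / full:.2f}")
```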
What you can do is now try to redesign the overall data structure with this property in mind.
Luca: And that's what you did, right?
Franco: That's exactly what we did. Because what we did is to use the inverted index for filtering and looking at very promising documents, and then we use the forward index on top of this set returned by the inverted index to understand if we were right or not with our estimation of the dot product. The inverted index filters, the forward index allows to compute the exact dot product between queries and documents.
Now, the inverted index should do the filtering, and what we need to do in order to be very efficient with the retrieval is to exploit the full potential of the property we discovered. What we did, without going into too much detail, is to apply several families of techniques on top of the inverted index.
One is static pruning, the other one is dynamic pruning. Static pruning: super simple and intuitive. Instead of keeping all the documents for every term, why do I need to keep documents that have low importance for that specific term? It's not interesting, because I know that that term would not be useful in the computation of the inner product with that specific document. So static pruning says: remove low-importance documents from the posting list for this specific term. And that's it.
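A sketch of static pruning on a toy posting list (the IDs and weights are invented):

```python
# Static pruning: at index-build time, keep only the highest-weight
# documents in each posting list and drop the low-importance ones.
posting_list = [(0, 1.5), (1, 0.1), (2, 1.1), (3, 0.05), (4, 0.9)]

def prune_posting_list(postings, keep):
    # Keep the `keep` entries with the largest weights; drop the rest.
    return sorted(postings, key=lambda p: -p[1])[:keep]

print(prune_posting_list(posting_list, 3))  # docs 1 and 3 are gone
```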
Dynamic pruning: what does it mean? Instead of considering all the components of the query, just focus on the top 15% of the components of the query and use only those against the inverted index.
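And a sketch of dynamic pruning on a toy query vector (weights invented):

```python
# Dynamic pruning: at query time, keep only the heaviest fraction of the
# query's components and use just those against the inverted index.
query = {"nba": 2.1, "shoes": 1.8, "basketball": 1.2, "players": 0.9,
         "wear": 0.6, "most": 0.2, "what": 0.1}

def prune_query(q, fraction):
    k = max(1, int(len(q) * fraction))
    kept = sorted(q.items(), key=lambda kv: -kv[1])[:k]
    return dict(kept)

print(prune_query(query, fraction=0.3))  # only the heaviest terms survive
```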
Then there is also an interesting part, I would say the most novel one. What if, intuitively, inside one single posting list we put together documents that are similar and group them into a single block? Then we represent that block with what we call a "summary": basically a single vector that gives an upper bound on the best inner product I could get by visiting the documents inside the block.
Luca: Okay.
Franco: So building the summary — introducing the concept of summary — allows me, instead of computing all dot products with all the documents in the block, to just compute the upper bound I have with the summary and ask: is this upper bound interesting for me or not? If not, I can just skip the whole block. So summaries, introduced by us in this paper, are an upper bound approximation of the possible inner product achieved inside the block. I go scoring my query with the summary. If the upper bound is not, let's say, interesting — skip, and we go to the next block.
Now, block sizes are set by hyperparameter tuning, but every block contains more than ten documents. So if we skip a block, we pay only one dot product, the one with the summary; if we visit it, we pay that one plus the dot products with the documents inside the block. But when we skip, we skip a lot.
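The summary trick can be sketched like this. One simple way to build an upper-bounding summary is the coordinate-wise maximum over the block's vectors; the actual paper also compresses and quantizes summaries, which this toy version omits, and all the weights here are invented:

```python
# A block groups similar documents inside a posting list; its summary is
# the coordinate-wise max of the block's vectors, which upper-bounds the
# inner product of the query with any document in the block.
block = [
    {"nba": 1.5, "shoes": 1.4},
    {"nba": 1.1, "basketball": 1.9},
]

def build_summary(docs):
    summary = {}
    for vec in docs:
        for term, w in vec.items():
            summary[term] = max(summary.get(term, 0.0), w)
    return summary

def dot(q, d):
    return sum(w * d[t] for t, w in q.items() if t in d)

query = {"nba": 2.0, "shoes": 1.0}
summary = build_summary(block)
threshold = 5.0  # e.g. the score of the current k-th best result

# One dot product with the summary decides whether to visit the block.
if dot(query, summary) < threshold:
    print("skip the whole block")
else:
    print("visit the block and score its documents")
```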
Okay Luca, are you following me?
Luca: Yeah, of course. I mean, you're basically reducing a lot of the surface where you do the search by grouping similar documents together and understanding if the query is relevant for the group of documents.
Franco: Now, using this kind of machinery — so going through summaries and, if interesting, also visiting blocks — we are able to reduce a lot the number of dot products computed for every query arriving to my system. Then I save the interesting documents from this process, which is a query processing algorithm we introduce in the paper. And for the interesting ones, I also need the full inner product, because this is an approximated, pruning-based process. So for the ones that I consider interesting, I go to the forward index and I compute the full inner product of the query-document pair to build the final list of results.
This is Seismic. It's a two-level inverted index plus a forward index. For every query, we traverse the inverted index using skipping over blocks to identify an approximate list of documents that we consider relevant. And then with the forward index, we compute the full inner product. That's indeed what we introduced.
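Putting the pieces together, the filter-then-rescore flow can be sketched as follows. All the data here is toy data, and the real Seismic implementation adds blocking, summaries, pruning, and compression on top of this skeleton:

```python
# Phase 1: the inverted index proposes candidates cheaply.
# Phase 2: the forward index returns full vectors for exact rescoring.
from collections import defaultdict

forward_index = {  # doc_id -> full sparse vector (invented weights)
    0: {"nba": 1.5, "shoes": 1.4, "brand": 0.3},
    1: {"pasta": 2.0, "pisa": 0.8},
    2: {"basketball": 1.9, "nba": 1.1, "shoes": 0.2},
}

inverted_index = defaultdict(list)
for doc_id, vec in forward_index.items():
    for term, w in vec.items():
        inverted_index[term].append((doc_id, w))

def dot(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def search(query, k=2):
    # Filter: touch only the query terms' posting lists to collect
    # candidate documents.
    candidates = set()
    for term in query:
        for doc_id, _ in inverted_index.get(term, []):
            candidates.add(doc_id)
    # Rescore: fetch full vectors from the forward index and compute the
    # exact inner product for the final ranking.
    scored = [(dot(query, forward_index[d]), d) for d in candidates]
    return sorted(scored, reverse=True)[:k]

print(search({"nba": 2.0, "shoes": 1.0}))
```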
Now, the question we were asking at the time is: given all of this, what is the performance with respect to the state of the art?
Luca: Yeah, exactly. That's what I was gonna ask. What results did you get out of this?
Franco: At that time we were following with a lot of, let's say, curiosity the results of the Big-ANN 2023 challenge at NeurIPS, which introduced a sparse challenge. What is the sparse challenge? Faster retrieval over sparse representations. And we were very interested because the two winning solutions of the challenge were two graph-based solutions — solutions that use HNSW, Hierarchical Navigable Small World algorithm, very well known in KNN search, which works with a super different approach: a graph-based search.
But we have been, let's say, the efficiency guys in IR for many years. So we said: how is this possible? How is it possible that inverted indexes are not on the scene among the winning solutions? So those were the two natural competitors of our new strategy, and we knew they were the state of the art at that time. That's why we started a comprehensive experimentation over four different encoders, which were state of the art at that time. They are not anymore, because there are new encoders now, but at that time they were the state of the art.
And indeed, what we discovered is that if you introduce a double-index machinery like ours, you are able to bring inverted indexes back to the state of the art on top of learned sparse representations. We show that, for example, if you consider SPLADE, a popular encoder, over MS MARCO, a popular collection, we achieve from 2.6 up to 3.5 times faster retrieval than the winner of the Big-ANN challenge.
Luca: I would say a very good improvement.
Franco: Yes, indeed. With this machinery we are basically able to achieve 98% of recall on top of MS MARCO v1 in almost 500 microseconds, which is a super interesting result. While in the same scenario, PINN requires more or less two milliseconds. So we are indeed 3.5–4 times faster than the state of the art at that time.
So in terms of speed we were happy, because we were now the state of the art. But what about the memory, meaning the space required by the index? We also tested the space required to store the indexes, and we show that it is in the same ballpark as PINN, the best method from the Big-ANN 2023 challenge. Specifically, over MS MARCO v1, PINN requires more or less 5.5 gigabytes, while we require 6.5, so one gigabyte more. But the interesting story here is also the index building time: Seismic, our method, requires five minutes, while PINN requires 137 minutes on the same collection. So we are super fast at building the index, which is expected, because an inverted index is much faster to build than a super complex graph structure like PINN's, with comparable space.
Luca: Oh yeah, sounds super interesting, and a very good, let's say, improvement over the state of the art. And so my next question is: what were the kind of next steps here? Because yeah, it sounds like very promising research. So I guess you have several follow-ups on this.
Franco: We investigated a lot of ideas and we are still investigating several, let's say, branches of research starting from this one.
The first is inference-free. I told you queries go through an encoder online. So when you use the infrastructure, the user types the query, you encode the query, and then you do retrieval. With this infrastructure, encoding time is no longer negligible. The question is: is it possible to remove that encoding-time cost? The answer is yes, by introducing inference-free encoders, a way to obtain query weights without going through the encoder at query time. I won't go into the details, but the idea is basically to learn term weights without context at training time, so that at query time you can directly look up the weight of every term of the query. That's a super interesting research direction, investigated not only by us: at the last SIGIR, which is our premier conference, three or four contributions focused on inference-free encoders and inference-free retrieval architectures.
Second point: why are graphs so good at retrieval? We were quite surprised by this. Is it possible to define mixed retrieval strategies? What if my inverted index exploited the good part of graphs, which is global relations between documents, instead of focusing only on a per-term basis, which is what this data structure does? We have another paper, at another conference, CIKM, that shows that with an interplay of the two you can do even better than Seismic. You reduce the query latency without hurting effectiveness, at the price of a bit more space to store the neighborhood of each document. If I touch one document, I also explore its neighborhood, which is given by a graph structure. If you do that, you are able to reduce time further, because you have more information from the neighborhood every document shares.
Then there is a third research direction, to close, which is exploitation on resource-constrained devices. We investigated how to do retrieval on resource-constrained devices. In our tests: a Raspberry Pi. Super limited, a Raspberry Pi 3B with one gigabyte of RAM. And we show that with this machinery you can do retrieval on that kind of architecture at 200-300 queries per second with super limited resources, which is something super interesting to me. Because those are devices that cost 20, 30, 100 dollars, and even in their extremely limited editions they allow you to serve hundreds of queries per second, which is, let's say, not bad at all, at least in my view.
Luca: Well, Franco, this is super interesting, really. And thank you very much for coming to this podcast and sharing this research. Also, the next steps sound very, very, very promising and interesting. So thank you very much.
Franco: Thank you again for the invitation. A pleasure to be here with you and all the friends at home.