Join Chander Dhall for an in-depth discussion in this video Inverted index, part of Azure Search for Developers.
- [Instructor] The inverted index is an important concept. Any time we talk about a search engine, we cannot just ignore the inverted index. Now what's an inverted index? We all have read books, and anytime we read a book, if you notice, pretty much every book has an index towards the end of it. And how does that index work? For example, we're looking for a word, and the word happens to be "electronics." If I were to go back to the index, it tells me that the word "electronics" is used on page number six, page number 24, and page number 40, and this makes my life, as a reader, very simple.
'Cause all I have to do now is go to those three pages, and remember what I was reading at some point of time. Even if I read the book six months later, one year later, all I have to do is remember what I was reading, go to the index, and now get to the exact page where I was. Now that same concept is used in the case of search engines. And the way it works is, we take the data we need to index, and then create an inverted index out of it. And when we do the search, we actually use the index.
We do not use the data as it is. Searching the raw, unindexed data is a very tedious and a very expensive operation. We might end up taking too long to do it, and at the same time we're going to use a lot of CPU. And then we will get the result, but it is not the best way to get to that result, especially if we have verbose data. For example, we have a blog post, or we have specifications about a product, and that's a lot of data. What we don't want to do is continuously go search for that word all over our database and then provide the results.
So it makes it easy for us, if we have the data indexed, to get the exact places where the data was, and then return that data back to the client. Let's take an example of an inverted index. As you can see, we just have one line here that says, "Cazton.com has the best developers in the world." And that happens to be one of the sentences we want to index. Let's take another example. Another sentence we want to index is, "Best developers are passionate about code." And then we have, "Software development is super cool." And number four is, "Development or creation of anything is a cool feeling." Now how are we going to create the inverted index for this particular example? So once we have the sentences, we are now going to create the inverted indices.
One thing to keep in mind is when we create inverted indices, we don't want to create that for every single thing, we just want to do that for the data that's important. So let's say in this case we pick "developers" as one of the important words. Forget about the lowercase or uppercase or anything like that because that's something that an analyzer can handle, which we will talk about in a minute. But for now, let's focus on the word. Let's say we have "developers," and "developers" happens to be in two of the four sentences, as you can see in sentence number one.
So if you look at the JSON object that I have over there, it isn't really a JSON object, it's just a description of the inverted index. So in this case, I can see I have one comma four, and what does that mean? That means "developers" is in sentence number one, and it happens to be the fifth word in that sentence. Now, since the first word starts from zero, because it's a zero-based index, "developers" happens to be number four, because "cazton.com" is zero, "has" is one, "the" is two, and then we have "best" as three, and "developers" happens to be the fifth word and number four in the index.
So now, what about sentence number two? As you can see, it's two comma one. Two means the sentence number, which is two, and one means the zero-based index for the word "developers" in sentence number two. So now we have one comma four and two comma one. So out of four sentences, all we need to know about "developers" is these two values. Next we have the word "is," so if you want to go all the way to taking something like "is," it is interesting that you can index "is" too: it's in the third sentence, and it's the third word in the third sentence.
And then we have the fourth sentence, where "is" sits at index five, which makes it the sixth word in the fourth sentence. And we have the word "cool," it's in sentence number three, and it's also in sentence number four. And we also have the indices that represent the positioning. Well then we have "development," and "development" is used in sentence number three and four, at index number one and zero respectively.
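The index described so far can be sketched in a few lines of Python. This is only an illustration, not how Azure Search builds its index internally: the names `SENTENCES`, `tokenize`, and `build_index` are made up for this example, and the lowercasing and punctuation stripping in `tokenize` is a crude stand-in for what a real analyzer would do. Every word is indexed here for simplicity, even though in practice you would pick only the important ones.

```python
from collections import defaultdict

# The four example sentences, keyed by sentence number (1-based).
SENTENCES = {
    1: "Cazton.com has the best developers in the world.",
    2: "Best developers are passionate about code.",
    3: "Software development is super cool.",
    4: "Development or creation of anything is a cool feeling.",
}


def tokenize(text):
    # Hypothetical stand-in for an analyzer: split on whitespace,
    # strip punctuation at the edges of each word, and lowercase it.
    return [word.strip(".,!?").lower() for word in text.split()]


def build_index(sentences):
    # Map each term to a list of (sentence number, zero-based word position).
    index = defaultdict(list)
    for sentence_id, text in sentences.items():
        for position, term in enumerate(tokenize(text)):
            index[term].append((sentence_id, position))
    return index


index = build_index(SENTENCES)
for term in ("developers", "is", "cool", "development"):
    print(term, index[term])
```

Printing the four terms from the walkthrough gives the same pairs as the slide: "developers" at (1, 4) and (2, 1), "is" at (3, 2) and (4, 5), "cool" at (3, 4) and (4, 7), and "development" at (3, 1) and (4, 0).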
Now that we have all these four indices, what can we do? Let's say one of the users searches for "development is cool." So in this case, we'll have two records that need to be returned, line number three and line number four. And it's very simple, all we need to do is go and look into the index we created. We look up each word, "development," "is," and "cool," and since each entry is a key-value pair, it gives you the sentence number and it also gives you the index number for that word. Now did you notice one thing? That it does not actually mean that the sentence has "development is cool." All it means is that we're taking these three words, and we're giving you the results.
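As a rough sketch of that lookup, the Python below returns every sentence that contains all of the query words, in any order. It is an assumption-laden illustration rather than the engine's actual query logic: `SENTENCES`, `tokenize`, `build_index`, and `search_all_terms` are hypothetical names, and requiring every word to be present is just one possible matching rule.

```python
from collections import defaultdict

SENTENCES = {
    1: "Cazton.com has the best developers in the world.",
    2: "Best developers are passionate about code.",
    3: "Software development is super cool.",
    4: "Development or creation of anything is a cool feeling.",
}


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, trim punctuation.
    return [word.strip(".,!?").lower() for word in text.split()]


def build_index(sentences):
    # Map each term to the set of sentence numbers that contain it.
    index = defaultdict(set)
    for sentence_id, text in sentences.items():
        for term in tokenize(text):
            index[term].add(sentence_id)
    return index


def search_all_terms(index, query):
    # Look up each query word, then keep the sentences common to all of them.
    # Word order in the query plays no role.
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_index(SENTENCES)
print(search_all_terms(index, "development is cool"))  # [3, 4]
print(search_all_terms(index, "cool is development"))  # [3, 4]
```

Both queries return sentences three and four, even though neither sentence contains the words in that exact order, because only the index entries are consulted and the sentences themselves are never scanned.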
And these three words could be completely out of order in this case, and that's why we call it an unstructured search. This is one of the benefits, as well as one of the downsides, of an unstructured search. If you had to specifically search for a phrase, you would have had to index that phrase as it is. So for example, if you had to search for a phrase that's "development is super," well in that case we would have had only one index, which would have been "development is super," and since it exists only in line number three, that's the only thing that would be returned.
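Indexing the phrase as it is, as described above, is one way to do it. A different, commonly used technique, shown here only as a sketch, reuses the word positions already stored in the index and checks that the query words appear in consecutive positions. The names `SENTENCES`, `tokenize`, `build_index`, and `search_phrase` are hypothetical, and nothing here claims to be how Azure Search handles phrases internally.

```python
from collections import defaultdict

SENTENCES = {
    1: "Cazton.com has the best developers in the world.",
    2: "Best developers are passionate about code.",
    3: "Software development is super cool.",
    4: "Development or creation of anything is a cool feeling.",
}


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, trim punctuation.
    return [word.strip(".,!?").lower() for word in text.split()]


def build_index(sentences):
    # Map each term to a set of (sentence number, zero-based position) pairs.
    index = defaultdict(set)
    for sentence_id, text in sentences.items():
        for position, term in enumerate(tokenize(text)):
            index[term].add((sentence_id, position))
    return index


def search_phrase(index, phrase):
    # A sentence matches when the first word occurs at some position p and
    # every following word occurs at p + 1, p + 2, ... in the same sentence.
    terms = tokenize(phrase)
    if not terms:
        return []
    matches = set()
    for sentence_id, start in index.get(terms[0], set()):
        if all(
            (sentence_id, start + offset) in index.get(term, set())
            for offset, term in enumerate(terms)
        ):
            matches.add(sentence_id)
    return sorted(matches)


index = build_index(SENTENCES)
print(search_phrase(index, "development is super"))  # [3]
print(search_phrase(index, "development is cool"))   # []
```

With positions taken into account, "development is super" matches only sentence three, while "development is cool" matches nothing as an exact phrase, since no sentence contains those three words back to back.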
- Querying and indexing
- Creating a search service
- Using APIs during searching
- Importing JSON data
- Handling synonyms
- Working with suggesters and facets