- [Instructor] In the example we just looked at, we had a document, and the document had a bunch of fields, which is going to be the case any time we index anything. So when we start indexing any document, the most important things to keep in mind are the field options. They are analysis, which is what the Index option controls, then storing, and then the term vectors. Indexing and storing are easy to understand.
An index is nothing more than a map or a dictionary, from terms to documents. The analysis option controls how the field will be indexed, and therefore how it will be searched. Will it break the text into tokens and then search the tokens? Or is it going to search it as a whole value and just do a point query? For example, searching using an ID. When we search using an ID, we have to make sure we have the entire value as it is.
The storing option decides whether we are going to store the actual data in the index or not. Let's look into the field options for analysis. The first one is Index.ANALYZED. In this case, the field will be indexed and it'll be analyzed. What that means is it's going to be saved as tokens that are going to be searchable later on. Index.NOT_ANALYZED means the field will be indexed, but it's going to be indexed in its original form.
So let's say you have an ID and you want to keep the ID intact. In that case, we make sure that the analyzer does not perform any kind of analysis on it, because we want to keep the value as it is. You can also do Index.NO, which means it won't be indexed, so it won't be searchable. And then we also have variants of analyzed and not analyzed that skip norms, Index.ANALYZED_NO_NORMS and Index.NOT_ANALYZED_NO_NORMS. Norms are something used to support boosting, which we'll cover in a minute.
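To make the three indexing choices concrete, here is a minimal sketch in plain Python. It is a toy model, not Lucene's API: the `tokenize` analyzer, the `index_field` helper, and the mode strings `"analyzed"`, `"not_analyzed"`, and `"no"` are all hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Toy analyzer (an assumption): lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def index_field(index, doc_id, value, mode):
    """Add one field value to a term -> set-of-doc-ids map.

    mode is a hypothetical stand-in for the Index option.
    """
    if mode == "analyzed":
        terms = tokenize(value)   # broken into searchable tokens
    elif mode == "not_analyzed":
        terms = [value]           # the whole value is a single term
    elif mode == "no":
        terms = []                # not indexed, so never searchable
    else:
        raise ValueError(f"unknown mode: {mode}")
    for term in terms:
        index[term].add(doc_id)

body_index = defaultdict(set)
id_index = defaultdict(set)
notes_index = defaultdict(set)

index_field(body_index, 1, "Field options in Lucene", "analyzed")
index_field(id_index, 1, "DOC-001", "not_analyzed")
index_field(notes_index, 1, "internal only", "no")
```

In this sketch the body can be found by any one of its tokens, the ID can only be found by its exact value, and the notes field cannot be found at all.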
Then we have the field options on storing. Store.YES, as the name suggests, means the value is stored in the index, so it can be retrieved later using an index reader. And if we say Store.NO, the value isn't stored. Let's talk about the field options for term vectors. The term vector, just like a stored value, can be kept in the index, but it holds metadata that is generated by indexing.
As we've discussed earlier, the index is a map from terms to documents. Similarly, the term vector is also a map or a dictionary, but from the terms to the positions, offsets, and frequency information in the document that the term belongs to. The term vector helps us find the position of a particular word or phrase in the entire blob of data that we have. Using the positions and the offsets, the term vector helps us get to the exact data that is being requested.
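As a rough illustration of that map, the sketch below builds a term vector for a single piece of text in plain Python. The `term_vector` function and the layout of its entries are assumptions made for this example and do not reflect how Lucene stores term vectors internally.

```python
import re
from collections import defaultdict

def term_vector(text):
    """Map each term to its frequency, token positions, and character offsets."""
    vector = defaultdict(lambda: {"freq": 0, "positions": [], "offsets": []})
    for position, match in enumerate(re.finditer(r"\w+", text)):
        entry = vector[match.group().lower()]
        entry["freq"] += 1
        entry["positions"].append(position)                     # token number
        entry["offsets"].append((match.start(), match.end()))   # character span
    return dict(vector)

tv = term_vector("Search engines index terms, and terms point to documents")
```

Here the position is the token's ordinal in the text, and the offset is the start and end character index, which is what lets us slice the original text to recover the matched word.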
And then we also have an option to say TermVector equals no. Most of the time you're going to see field options used in combination with each other. The first one is Store.YES and Index.ANALYZED. In this case, we index the field and analyze it, but we also make sure that we store the data as it is. Now this is not the best thing to do when we have a lot of big content, like a blog post or anything with a lot of verbose text, because that will increase the amount of data we have in our index.
We should always try to keep in mind that if we're storing something, it should be a property or a field that we don't want to change at all. So this makes sense for an ID and some other values, for example, a city or a state in the case of an address. The next one is Store.NO and Index.ANALYZED. This means we're not going to store the data in our index, and we want to keep it in the source of record, which is outside our search engine.
But at the same time we still want it to be analyzed and searchable. So this makes a good case for something like a blog post. And the third is Store.YES and Index.NOT_ANALYZED, where we are going to store this data, but we are not going to analyze it. Now this is best for data that we want to keep in its original form, like an ID. One of the good things about a regular search engine is that not only is it good for unstructured search, it can also be very good for structured search.
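The three combinations can be sketched with a small toy class in plain Python. `ToyIndex` and its `store` and `analyzed` parameters are hypothetical stand-ins for the Store and Index options, not Lucene's API, and the sample field values are made up for illustration.

```python
import re
from collections import defaultdict

class ToyIndex:
    """Hypothetical model of per-field store/index choices."""

    def __init__(self):
        self.postings = defaultdict(set)   # (field, term) -> doc ids
        self.stored = defaultdict(dict)    # doc id -> {field: original value}

    def add(self, doc_id, field, value, store, analyzed):
        if store:
            self.stored[doc_id][field] = value          # keep the original value
        terms = re.findall(r"\w+", value.lower()) if analyzed else [value]
        for term in terms:
            self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        return sorted(self.postings.get((field, term), set()))

    def retrieve(self, doc_id, field):
        return self.stored.get(doc_id, {}).get(field)

idx = ToyIndex()
# Stored and analyzed: a short value such as a city.
idx.add(7, "city", "New York", store=True, analyzed=True)
# Not stored but analyzed: verbose text that lives in the source of record.
idx.add(7, "body", "A long blog post about search", store=False, analyzed=True)
# Stored and not analyzed: an ID kept in its original form.
idx.add(7, "id", "POST-7", store=True, analyzed=False)
```

In this sketch the blog body is searchable by its tokens but cannot be read back from the index, while the ID can be read back and only matches on its exact value.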
And there are times when we may actually need the un-inverted index. Let's say we have a document and all we want to do is find all the terms that indexing produced for that document. So in this case we can have something like Index.ANALYZED and then use a TermVector, which has positions, offsets, and also the frequency, meaning how many times each term occurs in that particular document.
Now the index will tell us which document matched our query, but at the same time, the term vector will tell us how and where exactly it matched. So you can think of term vectors more like a miniature inverted index for just that one particular document.
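A small sketch of that division of labor, in plain Python: the inverted map answers which documents match, and a per-document map of offsets answers where. The two sample documents and the `find` helper are made up for illustration and are not part of any real library.

```python
import re
from collections import defaultdict

docs = {
    1: "Lucene stores an inverted index",
    2: "A term vector is a small index for one document",
}

inverted = defaultdict(set)                       # term -> doc ids
vectors = defaultdict(lambda: defaultdict(list))  # doc id -> term -> offsets

for doc_id, text in docs.items():
    for m in re.finditer(r"\w+", text):
        term = m.group().lower()
        inverted[term].add(doc_id)
        vectors[doc_id][term].append((m.start(), m.end()))

def find(term):
    """Return {doc_id: [(start, end), ...]} for every document containing term."""
    return {d: vectors[d][term] for d in sorted(inverted.get(term, ()))}

hits = find("index")
```

Slicing each document with the returned offsets, for example `docs[1][start:end]`, gives back the matched text, which is the kind of lookup a highlighter would need.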