- [Instructor] A typical search process has the following components: first is acquiring content. Usually, you have an RDBMS or a document database that is the actual source of record. Treating a search engine as the source of record is usually a bad idea. Not that it can't be done, but it's not the most robust option, especially if the data is important to you or your company. Typically, you would only use a search engine to index the actual data.
Second is building the document. A document is the unit of search. Most of the time, we need to ask ourselves questions like: how will a typical user search this particular document? Let's say you have a Twitter feed, and you're searching through a lot of Twitter posts. What exactly in a post needs to be searched? Is it the hashtag, or is it some other words inside that particular document? Is it a reference to a user? These are all different kinds of searches.
Understanding how a document needs to be searched helps us build and index the document in the right fashion. After we've categorized our data into documents, we need to analyze the data and then, of course, create the indices for the documents that need to be searched. Every document has certain properties and fields. Fields can have different data types such as numbers, strings, locations, date-times, and many others.
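As a concrete illustration, here is what one such document might look like. Every field name and value below is hypothetical, not tied to any particular search engine:

```python
# A hypothetical tweet modeled as a search document.
# Field names, values, and types are all illustrative.
tweet_doc = {
    "id": 12345,                                  # number
    "text": "Loving the new release! #devlife",   # string (full text)
    "hashtags": ["devlife"],                      # exact-match keywords
    "author": "@some_user",                       # reference to a user
    "location": {"lat": 40.7, "lon": -74.0},      # geo location
    "posted_at": "2021-06-01T12:00:00Z",          # date-time
}

# Deciding which of these fields users will actually query
# determines how each one should be analyzed and indexed.
print(sorted(tweet_doc))
```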
When it comes to analyzing the document, a tokenizer is what splits the text into tokens. Different kinds of analyzers use different kinds of tokenizers. Let's take an example. One analyzer is the keyword analyzer, which does not split the text at all and treats the entire field as a single token. Another is the standard analyzer.
A standard analyzer creates split points on spaces as well as on punctuation marks. If you have the sentence "good developers are very passionate", it will split it into five tokens: good, developers, are, very, passionate. That's a standard analyzer. Stemmers are used to identify similar words: a stemmer looks for the base or root form of a word and then matches similar words.
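To make the difference concrete, here is a rough sketch in Python. The function names are made up, and the standard analyzer here is only a crude approximation of what real engines do:

```python
import re

def keyword_analyze(text):
    # Keyword analyzer: the whole field is one token, unmodified.
    return [text]

def standard_analyze(text):
    # Rough sketch of a standard analyzer: split on whitespace and
    # punctuation, drop empty pieces, and lowercase each token.
    return [t.lower() for t in re.split(r"\W+", text) if t]

sentence = "good developers are very passionate"
print(keyword_analyze(sentence))   # one token: the entire field
print(standard_analyze(sentence))  # five tokens
```

Notice that the keyword analyzer is useful for fields like hashtags or usernames that should match exactly, while the standard analyzer suits free text.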
Now, keyword analyzers cannot actually use stemmers, as they pass the entire field through unmodified. If you're going to search for words in English text, it's probably not a good idea to use the keyword analyzer. We can also use a stop analyzer, which is a very common analyzer. What it does for us is remove the most frequent and least useful words in a sentence. When I say useless words, I mean words like a, an, and the, which are the articles, and sometimes it's okay to remove the pronouns too.
You might also want to remove words like be, have, and others, just to reduce the footprint of your index. That's a very common approach. Finally, we index the document and store it in files. A lot of search engines will move their indexes into RAM so that queries are a lot faster. Now that we've seen how indexing works, next is the workflow for querying a document.
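Putting these pieces together, here is a minimal sketch of stop-word filtering feeding an inverted index. The stop list and function names are my own; real analyzers ship much larger stop lists:

```python
# Illustrative stop list; real analyzers use larger, tuned lists.
STOP_WORDS = {"a", "an", "the", "and", "be", "have", "are", "very"}

def analyze(text):
    # Lowercase, split on whitespace, and drop stop words.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_index(docs):
    # Inverted index: each surviving token maps to the ids of the
    # documents that contain it.
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index

docs = {
    1: "good developers are very passionate",
    2: "passionate users have good ideas",
}
index = build_index(docs)
print(index["passionate"])  # both documents contain this token
```

The stop words never reach the index at all, which is exactly how removing them shrinks its footprint.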
First, we need to take the user's query and build it in a form that our search engine understands. Then we run the query against the index, retrieve the results that come back, and pass them back to the browser or any other client. If you use an MVC kind of architecture, which is Model View Controller, you'll notice that the controller takes in the request, runs the query against the index, takes the results, and maps them into the model. Then the model is sent back to the client via the controller. A lot of the time you'll notice that the model might have a lot more fields or properties than the actual view model that the controller sends back to the UI.
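The querying workflow can be sketched the same way. This is a minimal sketch under my own assumptions: build_query, run_query, and controller are hypothetical names, the index is a dict-based inverted index, and the stored model deliberately carries more fields than the view model returned to the UI:

```python
def build_query(user_input):
    # Build the query in a form the engine understands:
    # here, just lowercase terms.
    return [t.lower() for t in user_input.split()]

def run_query(terms, index):
    # Intersect the posting lists of all query terms.
    results = None
    for term in terms:
        ids = index.get(term, set())
        results = ids if results is None else results & ids
    return results or set()

def controller(user_input, index, store):
    terms = build_query(user_input)
    hits = run_query(terms, index)
    # The stored model has extra fields (e.g. internal_score);
    # the view model sent back to the UI exposes only a subset.
    return [{"id": doc_id, "text": store[doc_id]["text"]}
            for doc_id in sorted(hits)]

# Hypothetical index and document store for illustration.
store = {1: {"text": "good developers", "internal_score": 0.9}}
index = {"good": {1}, "developers": {1}}
print(controller("good developers", index, store))
```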