In this video, Jeff Winesett introduces the Amazon CloudSearch service. CloudSearch is a search engine that provides fast, relevant results when searching large amounts of data.
- [Instructor] Search engines are built for the purpose of taking big piles of data that need to be searched and getting that data into an optimal searchable format. The results of searches need to make sense to the users performing them. And a search engine helps extract the best answers when asking questions about data. The primary purpose of a search engine is to be able to store searchable data in such a way as to provide the best answers to search queries in the most performant manner.
Searches need to be fast, reliable, and provide the best results based on relevancy to the query. This is one big distinction between something like a database and a search engine. Aside from also being slow, database queries return all the answers to a query. A search engine is designed to instead return the best, most relevant answers and do so quickly. Amazon CloudSearch is just such a search engine. At a high level, here's how it works.
The inputs to the search engine are the data, structured and formatted for the search engine, and search queries written in a query language specified by the search engine. The outputs are highly relevant results, which can lead to happy customers and increased engagement. And here's how this is done using CloudSearch. First, a search domain is created that will be used to capture all of the structured data to make searchable. This domain has a couple of endpoints for access, including a document endpoint and a search endpoint.
The document endpoint is where data is sent to populate the domain. And the search endpoint is used to make query requests and retrieve the results of searching the data. The search fields are defined and configured along with the search engine parameters. This is like a schema for the search index. The names of the fields and the type of data in those search fields are defined, similar to designing a database table to store data. Once these fields are defined, data can be uploaded to the search domain.
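To make the schema comparison concrete, here is a minimal sketch in Python. The movie fields are hypothetical, and the check below is only an illustration of the idea of named, typed fields; it is not how CloudSearch itself validates documents.

```python
# Hypothetical index schema for a movie catalog: field name -> field type.
INDEX_FIELDS = {
    "title": "text",      # full-text searchable
    "genre": "literal",   # exact-match value
    "year": "int",
    "rating": "double",
}

# Python types accepted for each field type in this sketch
# (an assumption for illustration, not CloudSearch's own rules).
PYTHON_TYPES = {
    "text": str,
    "literal": str,
    "int": int,
    "double": (int, float),
}


def validate(fields: dict) -> list[str]:
    """Return a list of problems found when checking `fields` against INDEX_FIELDS."""
    problems = []
    for name, value in fields.items():
        field_type = INDEX_FIELDS.get(name)
        if field_type is None:
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, PYTHON_TYPES[field_type]):
            problems.append(f"{name} should be {field_type}")
    return problems
```

Calling `validate({"title": "Star Wars", "year": 1977})` returns an empty list, while a document with a string in `year` or a field missing from the schema is reported.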
And once data is uploaded, search queries can be run using the search endpoint. CloudSearch consists of three primary services: the configuration service, the document service, and the search service. The configuration service is what is used to create and configure the search domain. First, a unique name is given to the domain when it is initially created. Then, indexing options are configured, which specify the names and types of fields to index.
CloudSearch helps with this by providing tools to scan the data, which suggest some options based on the data being searched. The configuration service is also used to specify text analysis options. These control things like language-specific stopwords. These are typically the most commonly used words in a language that should be ignored when indexing. For example, in English, words like a, and, and the should be ignored, as matching on these won't help the relevancy of the results.
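The effect of stopwords can be sketched in a few lines of Python. The word list here is a tiny hypothetical one, not CloudSearch's default list, and real text analysis does more than split on whitespace.

```python
# A tiny, hypothetical English stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of"}


def index_terms(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


terms = index_terms("The Old Man and the Sea")
```

Only the words that carry meaning for the search (`old`, `man`, `sea`) are left to be indexed.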
The text analysis options also specify synonyms that should be considered when searching, as well as words or terms that need to be mapped to common stems. For example, stemming lets a search for fish also return results for fishing. CloudSearch provides some defaults for these options, but they can be customized for specific use cases. The configuration service is used to configure availability options, which enable the specification of multiple availability zones for deployment.
This ensures the service will survive a single zone disruption. It's also used to set scaling options to specify desired instance types and counts to support the search domain. Suggesters can also be configured with this service. Google is often credited with making this feature famous. Suggesters, as the name implies, suggest possible matches for an incomplete search query. For example, as a user is typing keywords into a search box.
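As a rough illustration of what a suggester does, the sketch below returns titles that start with whatever the user has typed so far. It is a toy prefix matcher over a hypothetical title list, not CloudSearch's implementation.

```python
from bisect import bisect_left


def suggest(prefix: str, sorted_titles: list[str], size: int = 5) -> list[str]:
    """Return up to `size` titles that start with `prefix` (case-sensitive)."""
    # Binary search for the first title that is >= the prefix.
    start = bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:]:
        if not title.startswith(prefix):
            break  # titles are sorted, so no later title can match
        matches.append(title)
        if len(matches) == size:
            break
    return matches


# Hypothetical data; the list must be sorted for the binary search to work.
TITLES = sorted(["star trek", "star wars", "stardust", "starship troopers", "the sting"])
```

With this data, `suggest("star w", TITLES)` narrows the list down to the single title `star wars`, much like a search box narrowing its suggestions as more characters are typed.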
And expressions can be set. Expressions can be used to customize the relevancy ranking of the searches. By default, documents in CloudSearch are ranked according to the frequency of the terms within the document. Expressions can be used to include other factors in this ranking. For example, take an application that has content that is voted on or rated by the users of the application. These votes and ratings could be included in the relevancy score, allowing the most popular content to rank higher in the search results.
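The ratings example can be sketched as follows. The hits, scores, and the formula are all hypothetical; the point is only that combining a text relevance score with a rating can change the order of the results.

```python
# Hypothetical search hits: a text relevance score plus an average user rating (0-5).
hits = [
    {"id": "doc1", "text_score": 2.0, "rating": 1.0},
    {"id": "doc2", "text_score": 1.5, "rating": 4.5},
    {"id": "doc3", "text_score": 1.8, "rating": 3.0},
]


def popularity_score(hit: dict) -> float:
    """Hypothetical expression: boost the text relevance by the user rating."""
    return hit["text_score"] * (1 + hit["rating"])


# Ranking by text relevance alone versus ranking by the combined expression.
by_text = [h["id"] for h in sorted(hits, key=lambda h: h["text_score"], reverse=True)]
by_popularity = [h["id"] for h in sorted(hits, key=popularity_score, reverse=True)]
```

Ranked by text score alone, `doc1` comes first; once the rating is factored in, the highly rated `doc2` moves to the top.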
The document service is used to make changes to the data stored in the search domain. Each domain has a unique document service HTTP endpoint that is used to add, update, and delete documents in the domain. Here is how data is added and updated. First, the data to be made searchable is retrieved from a data store, which is often a database. Then, the data is converted to a format accepted by CloudSearch.
This often involves removing bad characters and ensuring all characters are valid Unicode encoded as UTF-8, in either XML or JSON format. Then, something called a document is created. A document represents an item to be returned as a search result. These documents are then uploaded to the search index either in batches or one at a time. An HTTP endpoint is also what is used to interact with the search service.
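The preparation steps described here (clean the text, build documents, encode as UTF-8 JSON) might look something like this sketch. The row data and helper names are hypothetical; the batch layout with `type`, `id`, and `fields` follows the JSON document batch format, and the resulting payload is what would be sent to the domain's document endpoint.

```python
import json
import re

# Matches characters that are not valid in XML 1.0 (for example, control characters).
INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean(text: str) -> str:
    """Strip characters that would be rejected as invalid."""
    return INVALID_XML_CHARS.sub("", text)


def build_batch(rows: list[dict]) -> bytes:
    """Turn database-style rows into a UTF-8 encoded JSON document batch."""
    batch = []
    for row in rows:
        # Every column except the id becomes a document field; strings are cleaned.
        fields = {
            key: clean(value) if isinstance(value, str) else value
            for key, value in row.items()
            if key != "id"
        }
        batch.append({"type": "add", "id": str(row["id"]), "fields": fields})
    return json.dumps(batch).encode("utf-8")


# Hypothetical row with a stray NUL character in the title.
rows = [{"id": 1, "title": "Star\x00 Wars", "year": 1977}]
payload = build_batch(rows)
```

Building the whole batch in memory keeps the sketch short; a real pipeline would likely split large data sets into several batches.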
The search service is what is responsible for handling the search queries and suggestion queries against the data. The results will be returned as a list of documents ranked by relevance in either JSON or XML format. CloudSearch has a robust query language that allows searching within particular fields. It supports and, or, and not operators, which can be combined to perform complex Boolean searches.
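A Boolean search like the one just described can be sketched as a request URL. The endpoint host and the field names are hypothetical, and the helper does not escape quotes inside the values; the `q`, `q.parser`, and `size` parameters and the structured query form are used here as an illustration of a field-specific and/not query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical search endpoint; a real one is assigned to each search domain.
SEARCH_ENDPOINT = "https://search-movies-example.us-east-1.cloudsearch.amazonaws.com"


def build_search_url(title_term: str, excluded_genre: str, size: int = 10) -> str:
    """Build a search URL: title must match, genre must not (values are not escaped)."""
    query = f"(and title:'{title_term}' (not genre:'{excluded_genre}'))"
    params = {"q": query, "q.parser": "structured", "size": size}
    return f"{SEARCH_ENDPOINT}/2013-01-01/search?{urlencode(params)}"


url = build_search_url("star", "horror")

# Decode the query string again to show what the search service would receive.
decoded = parse_qs(urlsplit(url).query)
```

The decoded `q` parameter is the structured query `(and title:'star' (not genre:'horror'))`, which asks for documents whose title matches star but whose genre is not horror.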
Facet information is also retrieved, which categorizes the search results. Which specific data to return in the results can also be specified and there are options to control how query terms are processed. So, optimize for performance, lesson number three. When search is part of the feature requirements for an application, a dedicated search engine should be used. CloudSearch can help improve the user experience and maximize engagement.
- Benefits of cloud services
- Making architectures scalable
- Examining cloud constraints
- Virtual servers, EC2, and Elastic IP
- Using the Amazon machine image
- Elastic load balancing
- Using CloudWatch for monitoring
- Security Models
- Elastic block storage
- S3, CloudFront, and Elastic Beanstalk
- Handling queues, workflows, and notifications
- Caching options and services
- Identity and access management
- Creating a custom server image
- Application deployment strategies
- Serverless architectures