HDE BLOG

Code and servers alike, above the clouds

Elasticsearch: the One Stop Shop for All Your Search Needs

Here at HDE, we started with a homemade indexing algorithm that built indices of the data our users search. After several years of running the service, we realized that this algorithm, and the system built around it, could not keep up with our growing userbase. The problem was most visible with users whose data arrives at a scale of terabytes per day and who demand peak performance. That finally prompted us to look for an alternative for indexing and searching emails, and that's how we met Elasticsearch.

Elasticsearch is an open-source, distributed, RESTful search engine by Elastic (elastic.co). It can also act as an analytics engine, which is a common use case for many of its users. It has a very broad range of capabilities, from simple text phrase search to complicated aggregations and queries. It can also be integrated with other components from the same creator, such as Logstash, Beats, and Kibana, to create what is commonly called the ELK stack (Elasticsearch, Logstash, Kibana; known as the Elastic Stack once Beats is included). This setup is commonly used to visualize and interpret "big data".

What can you do with Elasticsearch?

To give a basic picture of how Elasticsearch is usually used, let's look at some basic questions that are meant to be answered by Elasticsearch:

  • Give me articles that are relevant to "Donald Trump is banning immigrants from entering the country"
  • I want all emails sent from "test.example.com" between January and March 2017 that contain the word "パスワード" ("password") and have attachments
  • How many errors happened in Servers A to C in May 2015? Group them by container name

These are some sample problems that can be easily answered by utilizing Elasticsearch's many features, including language analysis, keyword matching, date range filtering, and many others.
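To make the second and third questions a bit more concrete, here is a rough sketch of what the request bodies might look like in Elasticsearch's query DSL, built as plain Python dictionaries. The field names (`body`, `sender_domain`, `sent_at`, `attachments`, `level`, `server`, `timestamp`, `container_name`) are made up for illustration and depend entirely on how your own index is mapped.

```python
import json


def email_search_body(sender_domain, keyword, start, end):
    """Emails from a domain, containing a keyword, within a date range,
    that have attachments. All field names are hypothetical."""
    return {
        "query": {
            "bool": {
                # Scored full-text match against the analyzed message body.
                "must": [{"match": {"body": keyword}}],
                # Filters narrow the result set without affecting the score.
                "filter": [
                    {"term": {"sender_domain": sender_domain}},
                    {"range": {"sent_at": {"gte": start, "lt": end}}},
                    {"exists": {"field": "attachments"}},
                ],
            }
        }
    }


def error_count_body(servers, start, end):
    """Count errors on the given servers, grouped by container name.
    All field names are hypothetical."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": "error"}},
                    {"terms": {"server": servers}},
                    {"range": {"timestamp": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {"by_container": {"terms": {"field": "container_name"}}},
    }


email_body = email_search_body(
    "test.example.com", "パスワード", "2017-01-01", "2017-04-01"
)
print(json.dumps(email_body, ensure_ascii=False, indent=2))
```

Either dictionary would be sent as the JSON body of a search request. The `bool`, `match`, `term`, `terms`, `range`, and `exists` clauses and the `terms` aggregation are standard query DSL; everything else here is an assumption about the data.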

What makes Elasticsearch "different"?

There are several features that I love about Elasticsearch. One of them is that Elasticsearch is highly scalable. From testing/dev to production, Elasticsearch requires very little change in configuration. Of course, if you want to tweak it deeply, there are configuration settings available. But for those of you who don't like messing with a lot of knobs and switches, Elasticsearch is ready to deploy anywhere and will just work.

The scalability of Elasticsearch also makes growing and shrinking the cluster nearly effortless, with no config tweaking or special commands necessary. Just spin up a node, and it will automatically join other nodes with the same cluster name. After that, the cluster will rebalance itself by moving shards between nodes so that every node carries roughly the same amount of load. From running on a laptop to running on a thousand servers in different datacenters, there is practically no difference from the developer side!

Elasticsearch is also highly available. It has a robust recovery system built in and ready to kick in in the event of a network or node failure, and failures will very likely happen in a distributed computing environment. This makes the index data safe with a very low chance of complete loss.

If a node inside a cluster fails, replica shards held on the other nodes still carry redundant copies of its data. If more than one node fails and all copies of a shard are lost at the same time, there is still another backup mechanism: the snapshot, which is stored in separate durable storage (such as Amazon S3) and can be restored whenever needed.
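The recovery idea can be sketched with a toy model. This is not how Elasticsearch implements recovery internally; it only illustrates the rule that a surviving replica takes over when its primary is lost, and that a shard with no surviving copy has to come back from a snapshot. The cluster layout and the function name `fail_node` are invented for the example.

```python
def fail_node(cluster, failed):
    """Return (surviving cluster, shard ids with no surviving copy).

    `cluster` maps a node name to a list of (shard_id, role) tuples.
    """
    survivors = {n: list(s) for n, s in cluster.items() if n != failed}
    # Primaries that lived on the failed node need a replacement.
    orphaned = {sid for sid, role in cluster[failed] if role == "primary"}
    for shards in survivors.values():
        for i, (sid, role) in enumerate(shards):
            if sid in orphaned and role == "replica":
                shards[i] = (sid, "primary")  # promote the replica
                orphaned.discard(sid)
    # Whatever is still orphaned would have to be restored from a snapshot.
    return survivors, orphaned


cluster = {
    "node-1": [(0, "primary"), (1, "replica")],
    "node-2": [(1, "primary"), (2, "replica")],
    "node-3": [(2, "primary"), (0, "replica")],
}
survivors, needs_snapshot = fail_node(cluster, "node-1")
print(survivors)
print(needs_snapshot)
```

In this layout every shard has one replica on a different node, so losing any single node leaves nothing to restore from a snapshot.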

Elasticsearch Cluster: Physical View

[Figure: physical view of a three-node Elasticsearch cluster]

The diagram above shows a simple cluster consisting of three nodes. Here, a "node" is a computer that can run Java applications. Nodes can be anything from bare metal rack servers and cloud virtual machines to simple Raspberry Pis (although that is probably not recommended).

Each node holds several shards, which are "containers" of index data. Each shard is either a primary shard or a replica shard. A primary shard has two roles: accepting new data to be indexed and serving search requests. A replica shard does not accept writes directly; it receives changes from its primary and can serve search requests. A primary shard can have zero or more replicas, which provide redundancy to increase the durability and availability of the index.

Elasticsearch has an intelligent feature built in that will automatically balance the number of shards in each node so that each node has approximately the same amount of load. Elasticsearch will also distribute primary and replica shards such that any replica shard will not be in the same node as the primary or other replicas of the same shard.
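As a rough illustration of those two rules (even load, and no two copies of the same shard on one node), here is a toy allocator. Elasticsearch's real allocator weighs many more factors; the greedy "least-loaded node" strategy below is my own simplification, and `allocate` is a made-up name.

```python
def allocate(nodes, num_primaries, num_replicas):
    """Greedily place each shard copy on the least-loaded node that does
    not already hold a copy of the same shard."""
    placement = {node: [] for node in nodes}
    for shard in range(num_primaries):
        holders = set()  # nodes that already hold a copy of this shard
        for copy in range(1 + num_replicas):
            role = "primary" if copy == 0 else "replica"
            candidates = [n for n in nodes if n not in holders]
            if not candidates:
                break  # more copies than nodes: the copy stays unassigned
            target = min(candidates, key=lambda n: len(placement[n]))
            placement[target].append((shard, role))
            holders.add(target)
    return placement


placement = allocate(["node-1", "node-2", "node-3"], num_primaries=3, num_replicas=1)
for node, shards in placement.items():
    print(node, shards)
```

With three nodes, three primaries, and one replica each, every node ends up with two shards, and no node holds both copies of the same shard.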

Elasticsearch Cluster: Logical View

[Figure: logical view of the same cluster, showing indices split into shards]

In the diagram above, we can see the same cluster viewed from an index ("logical") perspective. The same cluster consists of several indices, and each index is split into several shards. Those shards are replicated zero or more times to achieve extra durability. These indices contain the actual data indexed. For example, each index could be a month of logs, or could be a collection of employees' personal info in a company, or a collection of products from a specific store.

Internally, each shard is actually an Apache Lucene "index" (which has a different meaning from an Elasticsearch index). Lucene is a search library that runs inside the Elasticsearch process, and these Lucene indices hold the actual index data and power Elasticsearch's indexing and searching features.

You might be wondering, why split the index into several shards? A cluster can be grown by scaling vertically, that is, by increasing the capacity of each node, but sooner or later that hits a limit. Beyond that point, the only way to grow a cluster in a distributed environment is to scale horizontally, that is, to increase the number of nodes. When nodes are added to the cluster, there must be a way to split the data in the indices among them. Shards are the answer, as the shards of one index can be spread across different nodes.
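How does a document end up in a particular shard? Conceptually, Elasticsearch hashes a routing value (by default the document ID) and takes the remainder modulo the number of primary shards. The sketch below shows the idea; note that `zlib.crc32` is only a stand-in I chose so the example runs with the standard library, not the hash function Elasticsearch actually uses, and the document IDs are invented.

```python
import zlib
from collections import defaultdict


def route(doc_id, num_primary_shards):
    """Pick a shard for a document: hash(routing) % number of primaries.
    crc32 is a stand-in hash used only for this illustration."""
    digest = zlib.crc32(doc_id.encode("utf-8"))
    return digest % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they would be routed to."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[route(doc_id, num_primary_shards)].append(doc_id)
    return dict(shards)


doc_ids = ["mail-%d" % i for i in range(10)]
for shard, ids in sorted(distribute(doc_ids, 3).items()):
    print("shard", shard, ids)
```

Because the shard is derived from the hash and the shard count, the same ID always lands on the same shard, so a lookup by ID only needs to ask one shard.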

Contents of an Index

Each index in Elasticsearch stores "documents", which are individual units of data. A document is a JSON object and looks something like this:

{
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
    "married": true
}

As can be seen in the example document above, Elasticsearch can index several different types of data, such as strings, integers/floats, arrays, and boolean values. These can all be searched later by the user. Besides that, there are also sentences like "I love to go rock climbing". These sentences can be analyzed by Elasticsearch to improve searchability, using what are called "text analyzers".
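If you don't define field types yourself, Elasticsearch guesses them from the first document it sees (this is called dynamic mapping). The sketch below approximates that guess for the example document. It is a simplification written for this post, not Elasticsearch's actual code, and it leaves out cases such as date detection, nested objects, and empty arrays.

```python
def infer_field_type(value):
    """Approximate the field type dynamic mapping would pick for a value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # Strings become analyzed text plus an exact-match keyword sub-field.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, list) and value:
        # Arrays have no type of their own; the elements decide.
        return infer_field_type(value[0])
    raise TypeError("unsupported value in this sketch: %r" % (value,))


document = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
    "married": True,
}
mapping = {"properties": {k: infer_field_type(v) for k, v in document.items()}}
for field, spec in mapping["properties"].items():
    print(field, "->", spec["type"])
```

In practice you would usually declare the mapping explicitly for fields you care about, for example to choose which analyzer a text field uses.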

Text Analysis

Text analysis is performed by Elasticsearch on any text field configured to be analyzed. Text analysis basically consists of these two steps:

  1. Tokenizing a block of text into individual terms suitable for use in an inverted index
  2. Normalizing these terms into a standard form to improve their “searchability,” or recall
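The two steps above can be sketched in a few lines of Python. This is a deliberately tiny stand-in for a real analyzer: the regex tokenizer, the stopword list, and the sample texts are all my own choices for illustration, and there is no stemming, so "climbing" is not reduced to "climb" the way a language analyzer would likely do.

```python
import re
from collections import defaultdict

# A tiny, hand-picked stopword list for this example only.
STOPWORDS = {"a", "an", "and", "i", "the", "to"}


def analyze(text):
    """Step 1: tokenize into words. Step 2: normalize (lowercase, drop stopwords)."""
    tokens = re.findall(r"\w+", text)
    normalized = [t.lower() for t in tokens]
    return [t for t in normalized if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {
    1: "I love to go rock climbing",
    2: "The Rock is climbing the charts",
    3: "Go and love music",
}
index = build_inverted_index(docs)
print(analyze(docs[1]))
print(search(index, "Rock Climbing"))
```

Because the query goes through the same `analyze` function as the documents, a search for "Rock Climbing" matches documents containing "rock climbing" regardless of capitalization.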

There are several types of analyzers built into Elasticsearch, and analyzers can also be custom-tailored to users' specific needs. Among the built-in analyzers are the language analyzers, which analyze a document or string based on a specific language's rules (including word-splitting/tokenizing, stemming, detecting stopwords and synonyms, and the like). This enables smart searching based on natural language processing. This is the list of languages that the latest version of Elasticsearch (5.6 as of time of writing) supports out of the box:

arabic, armenian, basque, brazilian, bulgarian, catalan, cjk (chinese, japanese, and korean), czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

That list is enough to give a picture of how Elasticsearch is batteries included, plug and play. Much of the natural language analysis that users are most likely to need already comes with Elasticsearch; users just need to add some configuration, and their data is ready to be served.

Further Reading

  • Elasticsearch Reference: the one reference for all things related to Elasticsearch, from setup to configuration to querying
  • Elasticsearch: The Definitive Guide: a recommended read for an in-depth explanation of how Elasticsearch works and how to make the most of it, albeit a little old (written when Elasticsearch was still 2.x)