We are living in an advanced stage of the information age. The emergence of the web, mobiles, social networks, blogs, and photo sharing has created a massive amount of data in recent years. These new data sources create information that cannot be handled using traditional data storage technology, typically relational databases. As an application developer or business intelligence developer, your job is to fulfill the search and analytics needs of the application.
A number of big data scale data stores have emerged in the last few years. This includes Hadoop ecosystem projects, several NoSQL databases, and search and analytics engines such as Elasticsearch. Hadoop and each NoSQL database have their own strengths and use cases.
Elastic Stack is a rich ecosystem of components serving as a full search and analytics stack. The main components of Elastic Stack are Kibana, Logstash, Beats, X-Pack, and Elasticsearch. Elasticsearch is at the heart of Elastic Stack, providing storage, search, and analytics capabilities. Kibana, which is also called a window into Elastic Stack, is a great visualization and user interface for Elastic Stack. Logstash and Beats help in getting the data into Elastic Stack. X-Pack provides powerful features including monitoring, alerting, and security to make your system production ready. Since Elasticsearch is at the heart of Elastic Stack, we will cover the stack inside-out, starting from the heart and moving on to the surrounding components.
We will look at what Elasticsearch is and why you should consider it as your data store. Once you know the key strengths of Elasticsearch, we will look at the history of Elasticsearch and its underlying technology, Apache Lucene. We will then look at some use cases of Elastic Stack, and we will provide an overview of the Elastic Stack components.
What is Elasticsearch, and why use it?
Since you are reading this article, you probably already know what Elasticsearch is. For the sake of completeness, let us define Elasticsearch.
Elasticsearch is a realtime, distributed search and analytics engine that is horizontally scalable and capable of solving a wide variety of use cases. At the heart of Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.
Elasticsearch is at the core of Elastic Stack, playing the central role of a search and analytics engine. Elasticsearch is built on a radically different technology, Apache Lucene. This fundamentally different technology in Elasticsearch sets it apart from traditional relational databases and other NoSQL solutions. Let us look at the key benefits of using Elasticsearch as your data store:
- Schemaless, document-oriented
- Rich client library support and the REST API
- Easy to operate and easy to scale
- Near real time
- Lightning fast
- Fault tolerant
Let us look at each benefit one by one.
Elasticsearch does not impose a strict structure on your data; you can store any JSON documents. JSON documents are first class citizens in Elasticsearch as opposed to rows and columns in a relational database. A document is roughly equivalent to a record in a relational database table. Traditional relational databases require a schema to be defined beforehand to specify a fixed set of columns and their datatypes and sizes. Often the nature of data is very dynamic, requiring support for new or dynamic columns. The JSON documents naturally support this type of data. For example, take a look at the following document:
"name": "John Smith",
"address": "121 John Street, NY, 10010",
This document may represent a customer's record. Here the record has the name, address, and age of the customer. Another record may look like the following one:
"name": "John Doe",
Note that the second customer doesn't have the address field, but instead has an email address. In fact, other customer documents may have completely different sets of fields. This provides a tremendous amount of flexibility in terms of what can be stored.
The core strength of Elasticsearch lies in its text processing capabilities. Elasticsearch is great at searching, especially a full-text search. Let us understand what a full-text search is.
Full-text search means searching through all the terms of all the documents available in the database. This requires the entire contents of all documents to be parsed and stored beforehand. When you hear full-text search, think of Google Search. You can enter any search term and Google looks through all of the web pages on the internet to find the best matching web pages. This is quite different from simple SQL queries run against columns of type string in relational databases. Normal SQL queries with a WHERE clause and an equals (=) or LIKE clause try to do an exact or wild-card match with underlying data. SQL queries can, at best, just match the search term to a sub-string within the text column.
When you want to perform a search similar to Google search on your own data, Elasticsearch is your best bet. You can index emails, text documents, PDF files, web pages, or practically any unstructured text documents and search across all your documents with search terms.
At a high level, Elasticsearch breaks up text data into terms and makes every term searchable by building Lucene indexes. You can build your own Google-like search for your application which is very fast and flexible.
In addition to supporting text data, Elasticsearch also supports other data types such as numbers, dates, geolocations, IP addresses, and many more.
Apart from search, the second most important functional strength of Elasticsearch is analytics. Yes, what was originally known just as a full-text search engine is now used as an analytics engine in a variety of use cases. Many organizations are running analytics solutions powered by Elasticsearch in production.
Search is like zooming in and finding a needle in a haystack. Search helps zoom in on precisely what is needed in huge amounts of data. Analytics is exactly the opposite of search; it is about zooming out and taking a look at the bigger picture. For example, you may want to know how many visitors on your website are from the United States as opposed to every other country, or you may want to know how many of your websites visitors use macOS, Windows, or Linux.
Elasticsearch supports a wide variety of aggregations for analytics. Elasticsearch aggregations are quite powerful and can be applied to various datatypes.
Elasticsearch can run on a single node and easily scale out to hundreds of nodes. It is very easy to start a single node instance of Elasticsearch; it works out of the box without any configuration changes and scales to hundreds of nodes.
Horizontal scalability is the ability to scale a system horizontally by starting up multiple instances of the same type rather than making one instance more and more powerful. Vertical scaling is about upgrading a single instance by adding more processing power (by increasing the number of CPUs or CPU cores), memory, or storage capacity. There is a practical limit to how much a system can be scaled vertically due to cost and other factors, such as the availability of higher end hardware.
Unlike most traditional databases which only allow vertical scaling, Elasticsearch can be scaled horizontally. It can run on tens or hundreds of commodity nodes instead of one extremely expensive server. Adding a node to an existing Elasticsearch cluster is as easy as starting up a new node in the same network, with virtually no extra configuration. The client application doesn't need to change, whether it is running against a single node or a hundred node cluster.
Data is available for querying typically within a second after it has been indexed (saved). Not all big data storage systems are real-time capable. Elasticsearch allows you to index thousands to hundreds of thousands of documents per second and makes them available for searching almost immediately.
Additionally, it has a very rich REST (Representational State Transfer) API which works on an HTTP protocol. The REST API is very well documented and quite comprehensive, making all operations available over HTTP.
All this means that Elasticsearch is very easy to integrate in any application to fulfill your search and analytics needs.
Elasticsearch uses Apache Lucene as its underlying technology. By default, Elasticsearch indexes all the fields of your documents. This is extremely invaluable as you can query or search by any field in your records. You will never be in a situation in which you think if only I had chosen to create an index on this field. Elasticsearch contributors have leveraged Apache Lucene to its best advantage, and there are other optimizations which make it lightning fast.
Elasticsearch clusters can keep running even when there are hardware failures such as node failure and network failure. In the case of node failure, it replicates all the data that was on the failed node to another node in the cluster. In the case of network failure, Elasticsearch seamlessly elects master replicas to keep the cluster running. Whether it is node or network failure, you can rest assured that your data is safe.
Now that you know when and why Elasticsearch could be a great choice, let us take a high level view of the ecosystem—the Elastic Stack.
Exploring the components of Elastic Stack
The Elastic Stack components are shown in the following figure. It is not necessary to include all of them in your solution. Some components are general purpose and they can be used outside of Elastic Stack without using any of the other components.
Let us look at the purpose of each component and how they fit in the stack:
Elasticsearch is at the heart of Elastic Stack. It stores all your data and provides search and analytic capabilities in a scalable way. We have already looked at the strengths of Elasticsearch and why you would want to use it. Elasticsearch can be used without using any other components to power your application in terms of search and analytics.
Logstash helps in centralizing event data such as logs, metrics, or any other data in any format. It can perform a number of transformations before sending it to a stash of your choice. It is a key component of Elastic Stack, used to centralize the collection and transformation processes in your data pipeline.
Logstash is a server side component. Its role is to centralize the collection of data from a wide number of input sources in a scalable way, and transform and send the data to an output of your choice. Typically, the output is sent to Elasticsearch, but Logstash is capable of sending it to a wide variety of outputs. Logstash has a plugin-based, extensible architecture. It supports three types of plugin: input plugins, filter plugins, and output plugins. Logstash has a collection of 200 plus supported plugins and the count is ever increasing.
Logstash is an excellent general purpose data flow engine which helps in building real-time, scalable data pipelines.
Beats is a platform of open source lightweight data shippers. Its role is complementary to Logstash. Logstash is a server-side component, whereas Beats has a role on the client side. Beats consists of a core library, libbeat, which provides an API for shipping data from the source, configuring the input options, and implementing logging. Beats is installed on machines that are not part of server-side components such as Elasticsearch, Logstash, or Kibana. These agents reside on non-cluster nodes which may also be called edge nodes sometimes.
There are many Beat components that have already been built by the Elastic team and the open source community. The Elastic team has built Beats including, Packetbeat, Filebeat, Metricbeat, Winlogbeat, Audiobeat, and Heartbeat.
Filebeat is a single-purpose Beat built to ship log files from your servers to a centralized Logstash server or Elasticsearch server. Metricbeat is a server monitoring agent that periodically collects metrics from the operating systems and services running on your servers. There are already around 40 community Beats built for specific purposes such as monitoring Elasticsearch, Cassandra, the Apache web server, JVM performance, and so on. You can build your own beat using libbeat if you don't find one that fits your needs.
Kibana is the visualization tool of Elastic Stack which can help you gain powerful insights about your data in Elasticsearch. It is often called a window into Elastic Stack. It offers many visualizations including histograms, maps, line charts, time series, and more. You can build visualizations with just a few clicks and interactively explore the data. It lets you build beautiful dashboards by combining different visualizations, sharing with others, and exporting high quality reports.
Kibana also has management and development tools. You can manage settings and configure X‑Pack security features for the Elastic Stack. Kibana also has development tools which enable developers to build and test REST API requests.
X-Pack adds essential features to make Elastic Stack production ready. It adds security, monitoring, alerting, reporting, and graph capabilities to Elastic Stack.
The security plugin within X-Pack adds authentication and authorization capabilities to Elasticsearch and Kibana so that only authorized people have access to the data, and they see only what they are allowed to see. The security plugin works across components seamlessly, securing access to Elasticsearch and Kibana.
The security extension also lets you configure fields and document level security with the licensed version.
You can monitor your Elastic Stack components so that there is no downtime. The monitoring component in X-Pack lets you monitor your Elasticsearch clusters and Kibana.
You can monitor clusters, nodes, and index level metrics. The monitoring plugin maintains a history of performance so that you can compare the current metrics with the past metrics. It also has a capacity planning feature.
The reporting plugin within X-Pack allows for generating printable, high-quality reports from Kibana visualizations. The reports can be scheduled to run periodically or on a per event basis.
X-Pack has sophisticated alerting capabilities that can alert you in multiple possible ways when certain conditions are met. It gives tremendous flexibility in terms of when, how, and who to alert.
You may be interested in detecting security breaches, such as when someone has five login failures within an hour from different locations, or when your product is trending on social media. You can use the full power of Elasticsearch queries to check when complex conditions are met.
Alerting provides a wide variety of options in terms of how alerts are sent. It can send alerts via email, Slack, Hipchat, and PagerDuty.
Graph lets you explore relationships in your data. The data in Elasticsearch is generally perceived as a flat list of entities without connections to other entities. This relationship opens up the possibility of new use cases. Graph can surface relationships among entities which share common properties such as people, places, products, or preferences.
Graph consists of Graph API and a UI within Kibana to let you explore this relationship. Under the hood, it leverages distributed querying, indexing at scale, and the relevance capabilities of Elasticsearch.
Elastic Cloud is the cloud-based, hosted, and managed setup of Elastic Stack components. The service is provided by the company Elastic (https://www.elastic.co/). Elastic is the company behind the development of Elasticsearch and other Elastic Stack components. All Elastic Stack components are open source except X-Pack (and Elastic Cloud). The company Elastic provides services for Elastic Stack components including training, development, support, and cloud hosting.
Apart from Elastic Cloud, there are other hosted solutions available for Elasticsearch including one from Amazon Web Services (AWS). The advantage of Elastic Cloud is that it is developed and maintained by the original creators of Elasticsearch and other Elastic Stack components.