Elasticsearch is a popular open source search server for real-time distributed search and data analysis. When used for any tasks other than development, Elasticsearch should be deployed as a cluster across multiple servers for optimal performance, stability, and scalability.
This tutorial will show you how to install and configure a production Elasticsearch cluster on Ubuntu 14.04 in a cloud server environment.
Although manually setting up an Elasticsearch cluster is useful for learning, it is strongly recommended to use configuration management tools in any cluster setup.
You must have at least three Ubuntu 14.04 servers to complete this tutorial, because an Elasticsearch cluster should have a minimum of three master-eligible nodes. If you want dedicated master and data nodes, you will need at least three servers for the master nodes, plus additional servers for the data nodes.
If you prefer to use CentOS, please check this tutorial: How to set up a production Elasticsearch cluster on CentOS 7
This tutorial assumes that your servers are connected by a VPN; whatever physical network your servers use, this provides a private network between them.
If you are using a shared private network, you must use a VPN to protect Elasticsearch from unauthorized access, because Elasticsearch has no built-in security in its HTTP interface. All of the servers must be on the same private network, and you should not share that private network with any computers you do not trust.
We will refer to each server's VPN IP address as vpn_ip, and we assume that all servers have a VPN interface named "tun0" as described in the tutorial linked above.
Elasticsearch requires Java, so we will install that now. We will install the latest version of Oracle Java 8, as that is what Elasticsearch recommends. However, OpenJDK should also work if you prefer to go that route.
Complete this step on all Elasticsearch servers. First, add the Oracle Java PPA to apt:
sudo add-apt-repository -y ppa:webupd8team/java
Update your apt package database:
sudo apt-get update
Use this command to install the latest stable version of Oracle Java 8 (and accept the pop-up license agreement):
sudo apt-get -y install oracle-java8-installer
Be sure to repeat this step on all Elasticsearch servers.
Now that Java 8 is installed, let's install Elasticsearch.
By adding Elastic's package source list, Elasticsearch can be installed with the package manager. Complete this step on all Elasticsearch servers.
Run the following command to import the Elasticsearch public GPG key into apt:
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
If your prompt just hangs, it is probably waiting for your user password (to authorize the sudo command). If this is the case, enter your password.
Next, create the Elasticsearch source list:
echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
Update your apt package database:
sudo apt-get update
Install Elasticsearch using the following command:
sudo apt-get -y install elasticsearch
Be sure to repeat these steps on all Elasticsearch servers.
Elasticsearch is now installed, but you need to configure it before you can use it.
Now it's time to edit the Elasticsearch configuration. Complete these steps on all Elasticsearch servers.
Open the Elasticsearch configuration file for editing:
sudo vi /etc/elasticsearch/elasticsearch.yml
The subsequent sections will explain how to modify the configuration.
You need to restrict external access to your Elasticsearch instances so that outsiders cannot read your data or shut down your Elasticsearch cluster through the HTTP API. In other words, you must configure Elasticsearch so that it only allows access from servers on your private (VPN) network. To do this, we will configure each node to bind to its VPN IP address, vpn_ip, or to the interface "tun0".
Find the network.host line, uncomment it, and replace its value with the corresponding server's VPN IP address (for example, 10.0.0.1 for node01) or interface name. Since our VPN interface is named "tun0" on all of our servers, we can configure every server with the identical line:
network.host: [_tun0_, _local_]
Note that adding "_local_" configures Elasticsearch to also listen on all loopback devices. This allows you to use the Elasticsearch HTTP API locally, from each server, by sending requests to localhost. If you do not include it, Elasticsearch will only respond to requests sent to its VPN IP address.
**Warning: **Since Elasticsearch does not have any built-in security, it is very important not to set this to an IP address accessible by any server that you cannot control or trust. Do not bind Elasticsearch to a public or shared private network IP address!
Next, set the name of the cluster, which will allow your Elasticsearch nodes to join and form a cluster. You will need to use a unique and descriptive name (in your network).
Find the cluster.name line, uncomment it, and replace its value with the desired cluster name. In this tutorial, we will name our cluster "production":
cluster.name: production
Next, we will set the name of each node. This should be a descriptive name that is unique in the cluster.
Find the node.name line, uncomment it, and replace its value with the desired node name. In this tutorial, we will set each node name to the server's hostname by using the ${HOSTNAME} environment variable:
node.name: ${HOSTNAME}
If you prefer, you can name the nodes manually, but make sure to specify unique names. You can also comment out node.name if you don't mind your nodes being named randomly.
Next, you need to configure an initial list of nodes, which will be contacted to discover and form a cluster. This is necessary in unicast networks.
Find the discovery.zen.ping.unicast.hosts line, uncomment it, and replace its value with an array of strings containing the VPN IP addresses or hostnames (that resolve to the VPN IP addresses) of all of the other nodes.
For example, if you have three servers, node01, node02, and node03, with respective VPN IP addresses 10.0.0.1, 10.0.0.2, and 10.0.0.3, you could use this line:
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
Alternatively, if all of your servers are configured with name-based resolution of their VPN IP addresses (via DNS or /etc/hosts), you could use this line:
discovery.zen.ping.unicast.hosts: ["node01", "node02", "node03"]
Note: If you set up your VPN with the companion Ansible Playbook, it automatically creates entries in /etc/hosts on each server that resolve each VPN server's inventory hostname (as specified in the Ansible hosts file) to its VPN IP address.
Your server is now configured to form a basic Elasticsearch cluster. You need to update more settings, but we will see these settings after we verify that the cluster is working properly.
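Taken together, the settings edited so far should leave the relevant portion of elasticsearch.yml on each node looking roughly like this (a sketch using the example cluster name, node names, and addresses from this tutorial):

```yaml
# Sketch of the edited elasticsearch.yml settings (example values).
cluster.name: production
node.name: ${HOSTNAME}
network.host: [_tun0_, _local_]
discovery.zen.ping.unicast.hosts: ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
```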
Save and exit elasticsearch.yml.
Now start Elasticsearch:
sudo service elasticsearch restart
Then run this command to start Elasticsearch on boot:
sudo update-rc.d elasticsearch defaults 95 10
Be sure to repeat these steps (Configure Elasticsearch Cluster) on all Elasticsearch servers.
If everything is configured correctly, your Elasticsearch cluster should be up and running. Before proceeding, let us verify that it is working properly. You can do this by querying Elasticsearch from any Elasticsearch node.
From any Elasticsearch server, run this command to print the status of the cluster:
curl -XGET 'http://localhost:9200/_cluster/state?pretty'
You should see output that indicates that a cluster named "production" is running. It should also indicate that all of the nodes you configured are members:
{
  "cluster_name" : "production",
  "version" : 36,
  "state_uuid" : "MIkS5sk7TQCl31beb45kfQ",
  "master_node" : "k6k2UObVQ0S-IFoRLmDcvA",
  "blocks" : { },
  "nodes" : {
    "Jx_YC2sTQY6ayACU43_i3Q" : {
      "name" : "node02",
      "transport_address" : "10.0.0.2:9300",
      "attributes" : { }
    },
    "k6k2UObVQ0S-IFoRLmDcvA" : {
      "name" : "node01",
      "transport_address" : "10.0.0.1:9300",
      "attributes" : { }
    },
    "kQgZZUXATkSpduZxNwHfYQ" : {
      "name" : "node03",
      "transport_address" : "10.0.0.3:9300",
      "attributes" : { }
    }
  },
...
If you see output similar to this, your Elasticsearch cluster is running! If any nodes are missing, check the configuration of the relevant nodes before continuing.
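For a quicker summary than the full cluster state, you can also query the cluster health endpoint (this assumes the cluster set up in this tutorial is running and that you query it from one of its nodes):

```shell
# Print a short cluster health summary; "status" should be "green" and
# "number_of_nodes" should match the number of servers you configured.
curl -XGET 'http://localhost:9200/_cluster/health?pretty'
```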
Next, we will introduce some configuration settings that you should consider for your Elasticsearch cluster.
Elastic recommends avoiding swapping of the Elasticsearch process at all costs, as swapping negatively affects performance and stability. One way to avoid excessive swapping is to configure Elasticsearch to lock the memory it needs.
Complete this step on all Elasticsearch servers.
Edit Elasticsearch configuration:
sudo vi /etc/elasticsearch/elasticsearch.yml
Find the bootstrap.mlockall line, uncomment it, and set its value to true:
bootstrap.mlockall: true
Save and exit.
Next, open the /etc/default/elasticsearch
file for editing:
sudo vi /etc/default/elasticsearch
First, look for ES_HEAP_SIZE, uncomment it, and set it to about 50% of your available memory. For example, if your server has about 4 GB of RAM, you should set it to 2 GB (2g):
ES_HEAP_SIZE=2g
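As a sanity check, you can compute half of the machine's RAM from the shell. This is a sketch, not part of the original tutorial: it reads Linux's /proc/meminfo and prints a value in megabytes that you could use for ES_HEAP_SIZE.

```shell
# Compute roughly 50% of total RAM, in MB, for use as ES_HEAP_SIZE.
total_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
half_mb=$(( total_kb / 2 / 1024 ))
echo "ES_HEAP_SIZE=${half_mb}m"
```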
Next, find and uncomment MAX_LOCKED_MEMORY=unlimited. It should look like this when you are done:
MAX_LOCKED_MEMORY=unlimited
Save and exit.
Now restart Elasticsearch to put the changes in place:
sudo service elasticsearch restart
Be sure to repeat this step on all Elasticsearch servers.
To verify that mlockall is enabled on all of your Elasticsearch nodes, run this command from any node:
curl http://localhost:9200/_nodes/process?pretty
Each node should have a line that says "mlockall" : true, indicating that memory locking is enabled and working:
...
  "nodes" : {
    "kQgZZUXATkSpduZxNwHfYQ" : {
      "name" : "es03",
      "transport_address" : "10.0.0.3:9300",
      "host" : "10.0.0.3",
      "ip" : "10.0.0.3",
      "version" : "2.2.0",
      "build" : "8ff36d1",
      "http_address" : "10.0.0.3:9200",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 1650,
        "mlockall" : true
      }
...
If mlockall is false on any node, check that node's settings and restart Elasticsearch. A common reason for Elasticsearch failing to start is that ES_HEAP_SIZE is set too high.
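Rather than scanning the full JSON by eye, you can filter the verification output (a sketch; it assumes a node is running and reachable on localhost):

```shell
# Show only the mlockall fields from each node's process info.
curl -s 'http://localhost:9200/_nodes/process?pretty' | grep mlockall
```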
By default, your Elasticsearch node should have an "open file descriptor limit" of 64k. This section will show you how to verify this, and how to increase it if you want.
First, find the process ID (PID) of the Elasticsearch process. An easy way to do this is to use the ps command to list all of the processes that belong to the elasticsearch user:
ps -u elasticsearch
You should see output that looks like this. The number in the first column is the PID of the Elasticsearch (java) process:
  PID TTY          TIME CMD
11708 ?        00:00:10 java
Then run this command to display the open file limit of the Elasticsearch process (replace the highlighted number with your own PID from the previous step):
cat /proc/11708/limits | grep 'Max open files'
Max open files            65535                65535                files
The numbers in the second and third columns represent the soft limit and the hard limit, respectively; both are 64k (65535). This is fine for many setups, but you may want to raise it.
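The two steps above can also be combined into a single command (a sketch that assumes exactly one java process owned by the elasticsearch user):

```shell
# Look up the Elasticsearch PID with pgrep and print its open-file limit.
grep 'Max open files' /proc/"$(pgrep -u elasticsearch java)"/limits
```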
To increase the maximum number of open file descriptors in Elasticsearch, you only need to change a single setting.
Open the /etc/default/elasticsearch
file for editing:
sudo vi /etc/default/elasticsearch
Find MAX_OPEN_FILES, uncomment it, and set it to the limit you want. For example, if you want a limit of 128k descriptors, change it to 131070:
MAX_OPEN_FILES=131070
Save and exit.
Now restart Elasticsearch to put the changes in place:
sudo service elasticsearch restart
Then follow the instructions in the previous section to verify that the limit has been increased.
Be sure to repeat this step on any Elasticsearch server that requires a higher file descriptor limit.
There are two common types of Elasticsearch nodes: master and data. Master nodes perform cluster-wide operations, such as managing indices and deciding which data nodes should store particular shards. Data nodes hold the shards of your indexed documents and handle CRUD, search, and aggregation operations. As a general rule, data nodes consume a lot of CPU, memory, and I/O.
By default, every Elasticsearch node is configured as a "master-eligible" data node, which means that it stores data (and performs resource-intensive operations) and can be elected as the master node. For a small cluster this is usually fine; a large Elasticsearch cluster, however, should be configured with dedicated master nodes so that the master's stability is not affected by intensive data-node work.
Before configuring dedicated master nodes, make sure that your cluster will have at least three master-eligible nodes. This is important to avoid a split-brain situation, which can cause data inconsistency in the event of a network failure.
To configure a dedicated master node, edit the Elasticsearch configuration of the node:
sudo vi /etc/elasticsearch/elasticsearch.yml
Add the following two lines:
node.master: true
node.data: false
The first line, node.master: true, specifies that the node is master-eligible (this is actually the default setting). The second line, node.data: false, restricts the node from becoming a data node.
Save and exit.
Now restart the Elasticsearch node for the changes to take effect:
sudo service elasticsearch restart
Be sure to repeat this step on other dedicated master nodes.
You can query the cluster to see which nodes are configured as dedicated master nodes with this command: curl -XGET 'http://localhost:9200/_cluster/state?pretty'. Any node with data: false and master: true is a dedicated master node.
To configure a dedicated data node (a node that is not master-eligible), edit that node's Elasticsearch configuration:
sudo vi /etc/elasticsearch/elasticsearch.yml
Add the following two lines:
node.master: false
node.data: true
The first line, node.master: false, specifies that the node is not master-eligible. The second line, node.data: true, is the default setting, which allows the node to store data.
Save and exit.
Now restart the Elasticsearch node for the changes to take effect:
sudo service elasticsearch restart
Be sure to repeat this step on other dedicated data nodes.
You can query the cluster to see which nodes are configured as dedicated data nodes with this command: curl -XGET 'http://localhost:9200/_cluster/state?pretty'. Any node that lists master: false but does not list data: false is a dedicated data node.
When running an Elasticsearch cluster, it is important to set the minimum number of master-eligible nodes that need to be up for the cluster to function normally, which is sometimes referred to as quorum. This ensures data consistency in the event that one or more nodes lose connectivity to the rest of the cluster, preventing the so-called "split-brain" situation.
To calculate the minimum number of master nodes your cluster should have, compute n / 2 + 1, where n is the total number of master-eligible nodes in your healthy cluster, then round the result down to the nearest integer. For example, for a 3-node cluster, the quorum is 2.
**Note: **Be sure to include all master-eligible nodes in your quorum calculation, including any master-eligible data nodes (the default setting).
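The quorum formula above can be sketched with shell arithmetic, since integer division already rounds down (n here is the example 3-node cluster):

```shell
# Quorum = floor(n / 2) + 1, where n is the number of master-eligible nodes.
n=3
quorum=$(( n / 2 + 1 ))
echo "$quorum"   # prints 2 for a 3-node cluster
```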
The minimum master node setting can be dynamically set through the Elasticsearch HTTP API. To do this, run this command on any node (replace the highlighted number with your quorum):
curl -XPUT localhost:9200/_cluster/settings?pretty -d '{"persistent":{"discovery.zen.minimum_master_nodes":2}}'
Output:
{"acknowledged":true,"persistent":{"discovery":{"zen":{"minimum_master_nodes":"2"}}},"transient":{}}
**Note: **This command applies a "persistent" setting, which means the minimum master nodes setting will survive full cluster restarts and override the Elasticsearch configuration file. Alternatively, you can specify discovery.zen.minimum_master_nodes: 2 in /etc/elasticsearch/elasticsearch.yml if you have not already set it dynamically.
If you want to check this setting later, you can run the following command:
curl -XGET localhost:9200/_cluster/settings?pretty
You can access the Elasticsearch HTTP API by sending requests to the VPN IP address of any of the nodes or, as demonstrated in this tutorial, by sending requests to localhost from one of the nodes.
Your client servers can access the Elasticsearch cluster via the VPN IP address of any of the nodes, which means they must also be part of the VPN.
If you have other software that needs to connect to your cluster (such as Kibana or Logstash), you can usually configure the connection by providing the application with the VPN IP addresses of one or more Elasticsearch nodes.
Your Elasticsearch cluster should be running in a healthy state and configured with some basic optimizations!
Elasticsearch has many other configuration options not covered here, such as indexing, sharding, and replication settings. It is recommended that you revisit the configuration and official documentation later to ensure that your cluster configuration meets your needs.
Reference: "How To Set Up a Production Elasticsearch Cluster on Ubuntu 14.04"