Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open source project in the big data arena and is sponsored by the Apache Software Foundation.
Hadoop consists of four main layers:

- Hadoop Common, the collection of utilities and libraries that support the other Hadoop modules.
- HDFS, the Hadoop Distributed File System, which stores data across the machines in the cluster.
- YARN (Yet Another Resource Negotiator), which schedules and allocates the cluster's resources.
- MapReduce, the processing model that distributes work across the cluster.
Setting up a Hadoop cluster is relatively involved, so the project also includes a stand-alone mode that is suitable for learning about Hadoop, performing simple operations, and debugging.
In this tutorial, we will install Hadoop in stand-alone mode and run one of the example MapReduce programs it includes to verify the installation.
To follow this tutorial, you need:

- An Ubuntu 18.04 server with a non-root user that has sudo privileges.
After completing this preparation, you can install Hadoop and its dependencies.
First, we will update our package list:
sudo apt update
Next, we will install OpenJDK, the default Java Development Kit on Ubuntu 18.04:
sudo apt install default-jdk
After the installation is complete, let's check the version:
java -version
openjdk 10.0.1 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)
This output verifies that OpenJDK has been successfully installed.
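Optionally, you can also see where the package put Java on disk; we will need this location later when we set JAVA_HOME. The update-alternatives tool lists the registered java binaries:

update-alternatives --list java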
With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent stable release.
Navigate to the binary for the release you want to install. In this guide, we will install Hadoop 3.0.3.
On the next page, right-click and copy the link to the release binary file.
On the server, we will use wget to fetch it:
wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
**Note:** The Apache website will direct you dynamically to the best mirror, so your URL may not match the URL above.
To make sure that the file we downloaded hasn't been altered, we'll do a quick check using SHA-256. Return to the releases page, then right-click and copy the link to the checksum file for the release binary you downloaded:
Again, we will use wget to download the file to our server:
wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz.mds
Then run verification:
shasum -a 256 hadoop-3.0.3.tar.gz
db96e2c0d0d5352d8984892dfac4e27c0e682d98a497b7e04ee97c3e2019277a hadoop-3.0.3.tar.gz
Compare this value with the SHA-256 value in the .mds file:
cat hadoop-3.0.3.tar.gz.mds
... /build/source/target/artifacts/hadoop-3.0.3.tar.gz:
SHA256 = DB96E2C0 D0D5352D 8984892D FAC4E27C 0E682D98 A497B7E0 4EE97C3E 2019277A
...
You can safely ignore the differences in case and the spaces. The output of the command we ran against the file downloaded from the mirror should match the value in the file we downloaded from apache.org.
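If you would rather not compare the hashes by eye, a short shell sketch can normalize and compare the two values automatically. It assumes the .mds layout matches the excerpt above, with the SHA256 value on the line (or lines) following the filename:

# Pull the hex digits that follow the filename, drop spaces and newlines, and lowercase them
expected=$(grep -A2 'hadoop-3.0.3.tar.gz:' hadoop-3.0.3.tar.gz.mds | grep -v 'tar.gz:' | tr -d ' \n' | sed 's/^SHA256=//' | tr 'A-F' 'a-f')
# Recompute the checksum of the download and keep only the hash field
actual=$(shasum -a 256 hadoop-3.0.3.tar.gz | awk '{print $1}')
# Compare the normalized values
[ "$expected" = "$actual" ] && echo "Checksums match" || echo "CHECKSUM MISMATCH"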
Now that we've verified that the file wasn't corrupted or changed, we'll use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that we're extracting from a file. Use tab-completion or substitute the correct version number in the command below:
tar -xzvf hadoop-3.0.3.tar.gz
Finally, we'll move the extracted files into /usr/local, the appropriate place for locally installed software. Change the version number, if needed, to match the version you downloaded:
sudo mv hadoop-3.0.3 /usr/local/hadoop
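You can confirm the move by listing the new location; the top level of a Hadoop 3.x release should contain directories such as bin, etc, sbin, and share:

ls /usr/local/hadoop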
With the software in place, we're ready to configure its environment.
Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.

The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. We will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then, we will use sed to trim bin/java from the output to give us the correct value for JAVA_HOME.
Find the default Java path
readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/
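To see what the sed step removes, you can run readlink on its own first; its output ends in bin/java, which the sed expression then strips off to leave the directory shown above:

readlink -f /usr/bin/java
/usr/lib/jvm/java-11-openjdk-amd64/bin/java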
You can copy that directory path to set Hadoop's Java home to this specific version, which ensures that if the default Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so that Hadoop will automatically use whatever Java version is set as the system default.
First, open hadoop-env.sh:
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Then, choose one of the following options:
Option 1: Set a static value:

...
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
...

Option 2: Use readlink to set the value dynamically:

...
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
...
**Note:** With respect to Hadoop, the value of JAVA_HOME in hadoop-env.sh overrides any value set in the environment by /etc/profile or in a user profile.
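If you would rather not edit the file interactively, the following one-liner appends the dynamic setting (option 2) instead; it is merely a convenience sketch, and editing with nano works just as well. The single quotes prevent the command substitution from running immediately, so the Java path is resolved each time hadoop-env.sh is sourced:

echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")' | sudo tee -a /usr/local/hadoop/etc/hadoop/hadoop-env.sh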
Now we should be able to run Hadoop:
/usr/local/hadoop/bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
buildpaths                       attempt to add class files from build tree
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:
...
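As an additional sanity check alongside the help text above, the version subcommand should report the release we installed; the first line of its output should read Hadoop 3.0.3:

/usr/local/hadoop/bin/hadoop version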
Seeing this help output means that we've successfully configured Hadoop to run in stand-alone mode. We'll ensure that it is functioning properly by running the example MapReduce program it ships with. To do so, create a directory called input in our home directory and copy Hadoop's configuration files into it to use those files as our data:
mkdir ~/input
cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
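A quick listing confirms the copy; you should see several of Hadoop's XML configuration files (the exact set can vary slightly between releases):

ls ~/input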
Next, we can run the MapReduce hadoop-mapreduce-examples program, a Java archive with several options. We'll invoke its grep program, one of the many examples included in hadoop-mapreduce-examples, followed by the input directory, input, and the output directory, grep_example. The MapReduce grep program will count the matches of a literal word or regular expression. Finally, we'll supply the regular expression allowed[.]* to find occurrences of the word allowed within or at the end of a declarative sentence. The expression is case-sensitive, so we wouldn't find the word if it were capitalized at the beginning of a sentence:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep ~/input ~/grep_example 'allowed[.]*'
When the task completes, it provides a summary of what has been processed and the errors it has encountered, but it doesn't contain the actual results:
...
File System Counters
FILE: Number of bytes read=1330690
FILE: Number of bytes written=3128841
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=2
Map output records=2
Map output bytes=33
Map output materialized bytes=43
Input split bytes=115
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=43
Reduce input records=2
Reduce output records=2
Spilled Records=4
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=3
Total committed heap usage (bytes)=478150656
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=147
File Output Format Counters
Bytes Written=34
**Note:** If the output directory already exists, the program will fail; rather than seeing the summary, the output will look something like:
...
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at org.apache.hadoop.util.RunJar.run(RunJar.java:244)
at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
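If you run into this error, remove the previous output directory and run the job again:

rm -rf ~/grep_example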
Results are stored in the output directory and can be checked by running cat on the output directory:
cat ~/grep_example/*
19  allowed.
1   allowed
The MapReduce task found 19 occurrences of the word allowed followed by a period and one occurrence where it was not. Running the example program has verified that our stand-alone installation is working properly and that non-privileged users on the system can run Hadoop for exploration or debugging.
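The output directory itself is also worth a quick look: a successful MapReduce job writes an empty _SUCCESS marker file next to one or more part-r-* files that hold the actual results:

ls ~/grep_example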
In this tutorial, we've installed Hadoop in stand-alone mode and verified it by running an example program it provided.
For more tutorials on installing Hadoop in stand-alone mode, visit [Tencent Cloud + Community](https://cloud.tencent.com/developer?from=10680).
Reference: "How to Install Hadoop in Stand-Alone Mode on Ubuntu 18.04"