Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open source project in the big data arena and is sponsored by the Apache Software Foundation.
Hadoop consists of four main layers:

- Hadoop Common, the collection of utilities and libraries that support the other Hadoop modules.
- HDFS, the Hadoop Distributed File System, which stores data across the machines in the cluster.
- YARN (Yet Another Resource Negotiator), which schedules and allocates the cluster's resources.
- MapReduce, the processing model that distributes work across the cluster.
Setting up a Hadoop cluster is relatively involved, so the project also includes a stand-alone mode that is suitable for learning about Hadoop, performing simple operations, and debugging.
In this tutorial, we will install Hadoop in stand-alone mode and run one of the example MapReduce programs it includes to verify the installation.
To follow this tutorial, you need:

- An Ubuntu 18.04 server with a non-root user that has sudo privileges.
After completing this preparation, you can install Hadoop and its dependencies.
First, we will update our package list:
sudo apt update
Next, we will install OpenJDK, the default Java Development Kit on Ubuntu 18.04:
sudo apt install default-jdk
After the installation is complete, let's check the version:
java -version
openjdk 10.0.1 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)
This output verifies that OpenJDK has been successfully installed.
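Optionally, you can also see where the package put Java on disk; we will need this location later when we set JAVA_HOME. The update-alternatives tool lists the registered java binaries:

update-alternatives --list java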
With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent stable release.
Navigate to the binary for the release you want to install. In this guide, we will install Hadoop 3.0.3.
On the next page, right-click and copy the link to the release binary file.
On the server, we will use wget to fetch it:
wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
**Note:** The Apache website will direct you dynamically to the best mirror, so your URL may not match the URL above.
To make sure that the file we downloaded hasn't been altered, we'll do a quick check using SHA-256. Return to the releases page, then right-click and copy the link to the checksum file for the release binary you downloaded:
Again, we will use wget to download the file to our server:
wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz.mds
Then run verification:
shasum -a 256 hadoop-3.0.3.tar.gz
db96e2c0d0d5352d8984892dfac4e27c0e682d98a497b7e04ee97c3e2019277a hadoop-3.0.3.tar.gz
Compare this value with the SHA-256 value in the .mds file:
cat hadoop-3.0.3.tar.gz.mds
... /build/source/target/artifacts/hadoop-3.0.3.tar.gz:
SHA256 = DB96E2C0 D0D5352D 8984892D FAC4E27C 0E682D98 A497B7E0 4EE97C3E 2019277A
...
You can safely ignore the differences in case and the spaces. The output of the command we ran against the file downloaded from the mirror should match the value in the file we downloaded from apache.org.
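If you would rather not compare the hashes by eye, a short shell sketch can normalize and compare the two values automatically. It assumes the .mds layout matches the excerpt above, with the SHA256 value on the line (or lines) following the filename:

# Pull the hex digits that follow the filename, drop spaces and newlines, and lowercase them
expected=$(grep -A2 'hadoop-3.0.3.tar.gz:' hadoop-3.0.3.tar.gz.mds | grep -v 'tar.gz:' | tr -d ' \n' | sed 's/^SHA256=//' | tr 'A-F' 'a-f')
# Recompute the checksum of the download and keep only the hash field
actual=$(shasum -a 256 hadoop-3.0.3.tar.gz | awk '{print $1}')
# Compare the normalized values
[ "$expected" = "$actual" ] && echo "Checksums match" || echo "CHECKSUM MISMATCH"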
Now that we've verified that the file wasn't corrupted or changed, we'll use the tar command with the -x flag to extract, -z to uncompress, -v for verbose output, and -f to specify that we're extracting from a file. Use tab-completion or substitute the correct version number in the command below:
tar -xzvf hadoop-3.0.3.tar.gz
Finally, we'll move the extracted files into /usr/local, the appropriate place for locally installed software. Change the version number, if needed, to match the version you downloaded:
sudo mv hadoop-3.0.3 /usr/local/hadoop
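You can confirm the move by listing the new location; the top level of a Hadoop 3.x release should contain directories such as bin, etc, sbin, and share:

ls /usr/local/hadoop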
With the software in place, we're ready to configure its environment.
Hadoop requires that you set the path to Java, either as an environment variable or in the Hadoop configuration file.

The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java, which is in turn a symlink to the default Java binary. We will use readlink with the -f flag to follow every symlink in every part of the path, recursively. Then, we will use sed to trim bin/java from the output to give us the correct value for JAVA_HOME.
Find the default Java path
readlink -f /usr/bin/java | sed "s:bin/java::"
/usr/lib/jvm/java-11-openjdk-amd64/
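To see what the sed step removes, you can run readlink on its own first; its output ends in bin/java, which the sed expression then strips off to leave the directory shown above:

readlink -f /usr/bin/java
/usr/lib/jvm/java-11-openjdk-amd64/bin/java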
You can copy that directory path to set Hadoop's Java home to this specific version, which ensures that if the default Java changes, this value will not. Alternatively, you can use the readlink command dynamically in the file so that Hadoop will automatically use whatever Java version is set as the system default.
First, open hadoop-env.sh:
sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Then, choose one of the following options:
Option 1: Set a static value:

...
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
...

Option 2: Use readlink to set the value dynamically:

...
# export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
...
**Note:** With respect to Hadoop, the value of JAVA_HOME in hadoop-env.sh overrides any value set in the environment by /etc/profile or in a user profile.
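If you would rather not edit the file interactively, the following one-liner appends the dynamic setting (option 2) instead; it is merely a convenience sketch, and editing with nano works just as well. The single quotes prevent the command substitution from running immediately, so the Java path is resolved each time hadoop-env.sh is sourced:

echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")' | sudo tee -a /usr/local/hadoop/etc/hadoop/hadoop-env.sh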
Now we should be able to run Hadoop:
/usr/local/hadoop/bin/hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
buildpaths                       attempt to add class files from build tree
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:
...
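As an additional sanity check alongside the help text above, the version subcommand should report the release we installed; the first line of its output should read Hadoop 3.0.3:

/usr/local/hadoop/bin/hadoop version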
Seeing this help output means that we've successfully configured Hadoop to run in stand-alone mode. We'll ensure that it is functioning properly by running the example MapReduce program it ships with. To do so, create a directory called input in our home directory and copy Hadoop's configuration files into it to use those files as our data:
mkdir ~/input
cp /usr/local/hadoop/etc/hadoop/*.xml ~/input
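A quick listing confirms the copy; you should see several of Hadoop's XML configuration files (the exact set can vary slightly between releases):

ls ~/input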
Next, we can run the MapReduce hadoop-mapreduce-examples program, a Java archive with several options. We'll invoke its grep program, one of the many examples included in hadoop-mapreduce-examples, followed by the input directory, input, and the output directory, grep_example. The MapReduce grep program will count the matches of a literal word or regular expression. Finally, we'll supply the regular expression allowed[.]* to find occurrences of the word allowed within or at the end of a declarative sentence. The expression is case-sensitive, so we wouldn't find the word if it were capitalized at the beginning of a sentence:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep ~/input ~/grep_example 'allowed[.]*'
When the task completes, it provides a summary of what has been processed and the errors it has encountered, but it doesn't contain the actual results:
...
File System Counters
FILE: Number of bytes read=1330690
FILE: Number of bytes written=3128841
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=2
Map output records=2
Map output bytes=33
Map output materialized bytes=43
Input split bytes=115
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=43
Reduce input records=2
Reduce output records=2
Spilled Records=4
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=3
Total committed heap usage (bytes)=478150656
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=147
File Output Format Counters
Bytes Written=34
**Note:** If the output directory already exists, the program will fail; rather than seeing the summary, the output will look something like:
...
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at org.apache.hadoop.util.RunJar.run(RunJar.java:244)
at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
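If you run into this error, remove the previous output directory and run the job again:

rm -rf ~/grep_example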
Results are stored in the output directory and can be checked by running cat on the output directory:
cat ~/grep_example/*
19  allowed.
1   allowed
The MapReduce task found 19 occurrences of the word allowed followed by a period and one occurrence where it was not. Running the example program has verified that our stand-alone installation is working properly and that non-privileged users on the system can run Hadoop for exploration or debugging.
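The output directory itself is also worth a quick look: a successful MapReduce job writes an empty _SUCCESS marker file next to one or more part-r-* files that hold the actual results:

ls ~/grep_example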
In this tutorial, we've installed Hadoop in stand-alone mode and verified it by running an example program it provided.
For more tutorials on installing Hadoop in stand-alone mode, visit [Tencent Cloud + Community](https://cloud.tencent.com/developer?from=10680).
Reference: "How to Install Hadoop in Stand-Alone Mode on Ubuntu 18.04"