Copyright statement: This is an original article by Shaon Puppet. Please cite the original address when reprinting. Thank you very much. https://blog.csdn.net/wh211212/article/details/53171625
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is well suited to applications with very large data sets. HDFS relaxes some POSIX requirements and allows data in the file system to be accessed as a stream.
1、 How MapReduce works
The client submits the MapReduce job; the jobtracker coordinates the job's execution (the jobtracker is a Java application whose main class is JobTracker); the tasktrackers run the tasks that the job has been split into (each tasktracker is a Java application whose main class is TaskTracker).
2、 Hadoop advantages
Hadoop is a distributed computing platform that users can easily set up and use. Users can easily develop and run applications that process massive amounts of data on Hadoop. It has the following advantages:
High reliability: Hadoop's ability to store and process data bit by bit is trustworthy.
High scalability: Hadoop distributes data and completes calculation tasks among the available computer clusters. These clusters can be easily expanded to thousands of nodes.
High efficiency: Hadoop can dynamically move data between nodes and ensure the dynamic balance of each node, so the processing speed is very fast.
High fault tolerance: Hadoop can automatically save multiple copies of data and automatically redistribute failed tasks.
Low cost: compared with all-in-one machines and commercial data warehouse and data mart products such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.
Hadoop's framework is written in Java, so it is ideally suited to running on a Linux production platform. Applications on Hadoop can also be written in other languages, such as C++.
Hadoop official website: http://hadoop.apache.org/
Two, Prerequisites
Keep the environment of every node in the Hadoop cluster consistent: install Java and configure SSH on each node.
Lab environment:
Platform: Xen VM
OS: CentOS 6.8
Software: hadoop-2.7.3-src.tar.gz, jdk-8u101-linux-x64.rpm
Hostname      IP Address      OS version    Hadoop role   Node role
linux-node1   192.168.0.89    CentOS 6.8    Master        namenode
linux-node2   192.168.0.90    CentOS 6.8    Slave         datanode
linux-node3   192.168.0.91    CentOS 6.8    Slave         datanode
linux-node4   192.168.0.92    CentOS 6.8    Slave         datanode
Download the required software packages and upload them to every node in the cluster.
Three, Cluster architecture and installation
1、 Hosts file settings
[root@linux-node1 ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain linux-node1
192.168.0.89 linux-node1
192.168.0.90 linux-node2
192.168.0.91 linux-node3
192.168.0.92 linux-node4
2、 Install java
rpm -ivh jdk-8u101-linux-x64.rpm
export JAVA_HOME=/usr/java/jdk1.8.0_101/
export PATH=$JAVA_HOME/bin:$PATH
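The export commands above only affect the current shell. A common way to make them permanent on every node is a profile script; the path /etc/profile.d/java.sh below is one reasonable choice, not something specified here:
cat <<'EOF' > /etc/profile.d/java.sh
export JAVA_HOME=/usr/java/jdk1.8.0_101/
export PATH=$JAVA_HOME/bin:$PATH
EOF
source /etc/profile.d/java.sh
java -version    # should report java version "1.8.0_101"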
3、 Install hadoop
[root@linux-node1 ~]# useradd hadoop && echo hadoop | passwd --stdin hadoop
[root@linux-node1 ~]# echo "hadoop ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
[root@linux-node1 ~]# su - hadoop
[hadoop@linux-node1 ~]$ cd /usr/local/src/
Unpack the Hadoop 2.7.3 package and place the result under /home/hadoop/hadoop/, then point the environment at it:
[hadoop@linux-node1 home/hadoop]$ export HADOOP_HOME=/home/hadoop/hadoop/
[hadoop@linux-node1 home/hadoop]$ export PATH=$HADOOP_HOME/bin:$PATH
4、 Create hadoop related directories
[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/dfs/{name,data}
[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/tmp
sudo mkdir -p /data/hdfs/{name,data}
sudo chown -R hadoop:hadoop /data/
5、 SSH configuration
[hadoop@linux-node1 ~]$ ssh-keygen -t rsa
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.90
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.91
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.92
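Passwordless login can be verified from the master before continuing, for example:
for ip in 192.168.0.90 192.168.0.91 192.168.0.92; do
  ssh hadoop@$ip hostname    # should print linux-node2/3/4 with no password prompt
done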
6、 Modify the configuration file of hadoop
File location: /home/hadoop/hadoop/etc/hadoop; file names: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
(1) Configure hadoop-env.sh file
[hadoop@linux-node1 home/hadoop]$ cd hadoop/etc/hadoop/
[hadoop@linux-node1 hadoop]$ egrep JAVA_HOME hadoop-env.sh
# The only required environment variable is JAVA_HOME.  All others are
# set JAVA_HOME in this file, so that it is correctly defined on
Edit hadoop-env.sh so that the export line points at the JDK installed in step 2:
export JAVA_HOME=/usr/java/jdk1.8.0_101/
(3) Configure slaves file
Specify the DataNode servers: write the host names of all DataNode machines into this file, as follows:
[hadoop@linux-node1 hadoop]$ cat slaves
linux-node2
linux-node3
linux-node4
Hadoop's three operating modes
Local (standalone) mode: all Hadoop components, such as the NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.
Pseudo-distributed mode: each Hadoop component runs in its own Java virtual machine on a single host, and the components communicate through network sockets.
Fully distributed mode: Hadoop is distributed across multiple hosts, and different components are installed on different hosts depending on the role they play.
(4) Modify the core-site.xml file, adding the required properties between the <configuration> and </configuration> tags.
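A minimal core-site.xml sketch for this cluster is shown below; the NameNode address matches linux-node1 above, but the port 9000 and the buffer size are typical values rather than ones confirmed in this article:
<configuration>
    <!-- Assumed NameNode RPC address: 192.168.0.89 is linux-node1, port 9000 is a common choice -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.0.89:9000</value>
    </property>
    <!-- Base for temporary files, matching the directory created in step 4 -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hadoop/tmp</value>
    </property>
    <!-- Read/write buffer size in bytes (128 KB) -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>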
(5) Modify the hdfs-site.xml file, again placing the properties between the <configuration> tags.
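A typical hdfs-site.xml sketch for this layout, assuming the dfs/name and dfs/data directories created in step 4 and a replication factor matching the three DataNodes (these values are illustrative):
<configuration>
    <!-- Where the NameNode stores its metadata (directory created in step 4) -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/dfs/name</value>
    </property>
    <!-- Where each DataNode stores block data -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/dfs/data</value>
    </property>
    <!-- Number of block replicas; 3 matches the number of DataNodes in this cluster -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>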
(6) Modify mapred-site.xml
This file configures MapReduce jobs. Since Hadoop 2.x uses the YARN framework, mapreduce.framework.name must be set to yarn to get a distributed deployment. mapred.map.tasks and mapred.reduce.tasks set the number of map and reduce tasks respectively.
[hadoop@linux-node1 hadoop]$ cp mapred-site.xml.template mapred-site.xml
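The mapreduce.framework.name setting is the one required above; the JobHistory addresses are an optional sketch using the stock default ports, included here because step 11 starts the history server:
<configuration>
    <!-- Run MapReduce on the YARN framework, as required for a distributed deployment -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- JobHistory server addresses (default ports 10020 and 19888), used when step 11 starts the history server -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>192.168.0.89:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>192.168.0.89:19888</value>
    </property>
</configuration>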
(7) Configure yarn-site.xml
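A minimal yarn-site.xml sketch, assuming the ResourceManager runs on the master node (the shuffle aux-service is mandatory for running MapReduce on YARN; the hostname value is an assumption based on the cluster layout above):
<configuration>
    <!-- Required so NodeManagers can serve map output to reducers -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Assumed: ResourceManager on the master node linux-node1 -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.0.89</value>
    </property>
</configuration>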
7、 Copy hadoop to other nodes
scp -r /home/hadoop/hadoop/ 192.168.0.90:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.91:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.92:/home/hadoop/
8、 Initialize NameNode with hadoop user on linux-node1
/home/hadoop/hadoop/bin/hdfs namenode -format
9、 Start hadoop
/home/hadoop/hadoop/sbin/start-dfs.sh    # start HDFS
/home/hadoop/hadoop/sbin/stop-dfs.sh     # stop HDFS
ps aux | grep --color namenode
ps aux | grep --color datanode
10、 Start the yarn distributed computing framework
[hadoop@linux-node1 .ssh]$ /home/hadoop/hadoop/sbin/start-yarn.sh
starting yarn daemons
ps aux | grep --color resourcemanager
ps aux | grep --color nodemanager
Note: The two scripts start-dfs.sh and start-yarn.sh can be replaced by start-all.sh
/home/hadoop/hadoop/sbin/stop-all.sh
/home/hadoop/hadoop/sbin/start-all.sh
11、 Start jobhistory service, view mapreduce status
[hadoop@linux-node1 ~]$ /home/hadoop/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/hadoop/hadoop/logs/mapred-hadoop-historyserver-linux-node1.out
12、 View the status of HDFS distributed file system
/home/hadoop/hadoop/bin/hdfs dfsadmin -report
/home/hadoop/hadoop/bin/hdfs fsck / -files -blocks
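As a quick sanity check, a small file can be written into HDFS and read back (the /tmp/smoke-test path is arbitrary):
/home/hadoop/hadoop/bin/hdfs dfs -mkdir -p /tmp/smoke-test
echo "hello hadoop" > /tmp/hello.txt
/home/hadoop/hadoop/bin/hdfs dfs -put /tmp/hello.txt /tmp/smoke-test/
/home/hadoop/hadoop/bin/hdfs dfs -cat /tmp/smoke-test/hello.txt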
13、 View Hadoop cluster status on a web page
View HDFS status: http://192.168.0.89:50070/
View Hadoop cluster status: http://192.168.0.89:8088/