Copyright statement: This is an original article by Shaon Puppet. Please cite the original address when reprinting. Thank you very much. https://blog.csdn.net/wh211212/article/details/53171625
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is well suited to applications with very large data sets. HDFS relaxes some POSIX requirements and allows data in the file system to be accessed as a stream.
1、 How MapReduce works
The client submits the MapReduce job; the jobtracker coordinates the job's execution (the jobtracker is a Java application whose main class is JobTracker); the tasktrackers run the tasks that the job has been split into (each tasktracker is a Java application whose main class is TaskTracker).
2、 Hadoop advantages
Hadoop is a distributed computing platform that users can easily set up and use. Users can easily develop and run applications that process massive amounts of data on Hadoop. It has the following advantages:
High reliability: Hadoop's ability to store and process data bit by bit is trustworthy.
High scalability: Hadoop distributes data and completes calculation tasks among the available computer clusters. These clusters can be easily expanded to thousands of nodes.
High efficiency: Hadoop can dynamically move data between nodes and ensure the dynamic balance of each node, so the processing speed is very fast.
High fault tolerance: Hadoop can automatically save multiple copies of data and automatically redistribute failed tasks.
Low cost: compared with all-in-one machines and commercial data warehouse and data mart products such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.
Hadoop's framework is written in Java, so it is ideally suited to running on a Linux production platform. Applications on Hadoop can also be written in other languages, such as C++.
Hadoop official website: http://hadoop.apache.org/
Two, Prerequisites
Keep the environment of every node in the Hadoop cluster consistent: install Java and configure SSH on each node.
Lab environment:
Platform: Xen VM
OS: CentOS 6.8
Software: hadoop-2.7.3-src.tar.gz, jdk-8u101-linux-x64.rpm
Hostname      IP Address      OS version    Hadoop role   Node role
linux-node1   192.168.0.89    CentOS 6.8    Master        namenode
linux-node2   192.168.0.90    CentOS 6.8    Slave         datanode
linux-node3   192.168.0.91    CentOS 6.8    Slave         datanode
linux-node4   192.168.0.92    CentOS 6.8    Slave         datanode
Download the required software packages and upload them to every node in the cluster.
Three, Cluster architecture and installation
1、 Hosts file settings
[root@linux-node1 ~]# cat /etc/hosts
127.0.0.1    localhost localhost.localdomain linux-node1
192.168.0.89 linux-node1
192.168.0.90 linux-node2
192.168.0.91 linux-node3
192.168.0.92 linux-node4
2、 Install java
rpm -ivh jdk-8u101-linux-x64.rpm
export JAVA_HOME=/usr/java/jdk1.8.0_101/
export PATH=$JAVA_HOME/bin:$PATH
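The export commands above only affect the current shell. A common way to make them permanent on every node is a profile script; the path /etc/profile.d/java.sh below is one reasonable choice, not something specified here:
cat <<'EOF' > /etc/profile.d/java.sh
export JAVA_HOME=/usr/java/jdk1.8.0_101/
export PATH=$JAVA_HOME/bin:$PATH
EOF
source /etc/profile.d/java.sh
java -version    # should report java version "1.8.0_101"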
3、 Install hadoop
[root@linux-node1 ~]# useradd hadoop && echo hadoop | passwd --stdin hadoop
[root@linux-node1 ~]# echo "hadoop ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
[root@linux-node1 ~]# su - hadoop
[hadoop@linux-node1 ~]$ cd /usr/local/src/
Unpack the Hadoop 2.7.3 package and place the result under /home/hadoop/hadoop/, then point the environment at it:
[hadoop@linux-node1 home/hadoop]$ export HADOOP_HOME=/home/hadoop/hadoop/
[hadoop@linux-node1 home/hadoop]$ export PATH=$HADOOP_HOME/bin:$PATH
4、 Create hadoop related directories
[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/dfs/{name,data}
[hadoop@linux-node1 ~]$ mkdir -p /home/hadoop/tmp
sudo mkdir -p /data/hdfs/{name,data}
sudo chown -R hadoop:hadoop /data/
5、 SSH configuration
[hadoop@linux-node1 ~]$ ssh-keygen -t rsa
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.90
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.91
[hadoop@linux-node1 ~]$ ssh-copy-id hadoop@192.168.0.92
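Passwordless login can be verified from the master before continuing, for example:
for ip in 192.168.0.90 192.168.0.91 192.168.0.92; do
  ssh hadoop@$ip hostname    # should print linux-node2/3/4 with no password prompt
done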
6、 Modify the configuration file of hadoop
File location: /home/hadoop/hadoop/etc/hadoop; file names: hadoop-env.sh, yarn-env.sh, slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
(1) Configure hadoop-env.sh file
[hadoop@linux-node1 home/hadoop]$ cd hadoop/etc/hadoop/
[hadoop@linux-node1 hadoop]$ egrep JAVA_HOME hadoop-env.sh
# The only required environment variable is JAVA_HOME.  All others are
# set JAVA_HOME in this file, so that it is correctly defined on
Edit hadoop-env.sh so that the export line points at the JDK installed in step 2:
export JAVA_HOME=/usr/java/jdk1.8.0_101/
(3) Configure slaves file
Specify the DataNode servers: write the host names of all DataNode machines into this file, as follows:
[hadoop@linux-node1 hadoop]$ cat slaves
linux-node2
linux-node3
linux-node4
Hadoop's three operating modes
Local (standalone) mode: all Hadoop components, such as the NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.
Pseudo-distributed mode: each Hadoop component runs in its own Java virtual machine on a single host, and the components communicate through network sockets.
Fully distributed mode: Hadoop is distributed across multiple hosts, and different components are installed on different hosts depending on the role they play.
(4) Modify the core-site.xml file, adding the required properties between the <configuration> and </configuration> tags.
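A minimal core-site.xml sketch for this cluster is shown below; the NameNode address matches linux-node1 above, but the port 9000 and the buffer size are typical values rather than ones confirmed in this article:
<configuration>
    <!-- Assumed NameNode RPC address: 192.168.0.89 is linux-node1, port 9000 is a common choice -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.0.89:9000</value>
    </property>
    <!-- Base for temporary files, matching the directory created in step 4 -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/hadoop/tmp</value>
    </property>
    <!-- Read/write buffer size in bytes (128 KB) -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>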
(5) Modify the hdfs-site.xml file, again placing the properties between the <configuration> tags.
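A typical hdfs-site.xml sketch for this layout, assuming the dfs/name and dfs/data directories created in step 4 and a replication factor matching the three DataNodes (these values are illustrative):
<configuration>
    <!-- Where the NameNode stores its metadata (directory created in step 4) -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/dfs/name</value>
    </property>
    <!-- Where each DataNode stores block data -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/dfs/data</value>
    </property>
    <!-- Number of block replicas; 3 matches the number of DataNodes in this cluster -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>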
(6) Modify mapred-site.xml
This file configures MapReduce jobs. Since Hadoop 2.x uses the YARN framework, mapreduce.framework.name must be set to yarn to get a distributed deployment. mapred.map.tasks and mapred.reduce.tasks set the number of map and reduce tasks respectively.
[hadoop@linux-node1 hadoop]$ cp mapred-site.xml.template mapred-site.xml
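The mapreduce.framework.name setting is the one required above; the JobHistory addresses are an optional sketch using the stock default ports, included here because step 11 starts the history server:
<configuration>
    <!-- Run MapReduce on the YARN framework, as required for a distributed deployment -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- JobHistory server addresses (default ports 10020 and 19888), used when step 11 starts the history server -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>192.168.0.89:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>192.168.0.89:19888</value>
    </property>
</configuration>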
(7) Configure yarn-site.xml
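A minimal yarn-site.xml sketch, assuming the ResourceManager runs on the master node (the shuffle aux-service is mandatory for running MapReduce on YARN; the hostname value is an assumption based on the cluster layout above):
<configuration>
    <!-- Required so NodeManagers can serve map output to reducers -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Assumed: ResourceManager on the master node linux-node1 -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.0.89</value>
    </property>
</configuration>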
7、 Copy hadoop to other nodes
scp -r /home/hadoop/hadoop/ 192.168.0.90:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.91:/home/hadoop/
scp -r /home/hadoop/hadoop/ 192.168.0.92:/home/hadoop/
8、 Initialize NameNode with hadoop user on linux-node1
/home/hadoop/hadoop/bin/hdfs namenode -format
9、 Start hadoop
/home/hadoop/hadoop/sbin/start-dfs.sh    # start HDFS
/home/hadoop/hadoop/sbin/stop-dfs.sh     # stop HDFS
ps aux | grep --color namenode
ps aux | grep --color datanode
10、 Start the yarn distributed computing framework
[hadoop@linux-node1 .ssh]$ /home/hadoop/hadoop/sbin/start-yarn.sh
starting yarn daemons
ps aux | grep --color resourcemanager
ps aux | grep --color nodemanager
Note: The two scripts start-dfs.sh and start-yarn.sh can be replaced by start-all.sh
/home/hadoop/hadoop/sbin/stop-all.sh
/home/hadoop/hadoop/sbin/start-all.sh
11、 Start jobhistory service, view mapreduce status
[hadoop@linux-node1 ~]$ /home/hadoop/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/hadoop/hadoop/logs/mapred-hadoop-historyserver-linux-node1.out
12、 View the status of HDFS distributed file system
/home/hadoop/hadoop/bin/hdfs dfsadmin -report
/home/hadoop/hadoop/bin/hdfs fsck / -files -blocks
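As a quick sanity check, a small file can be written into HDFS and read back (the /tmp/smoke-test path is arbitrary):
/home/hadoop/hadoop/bin/hdfs dfs -mkdir -p /tmp/smoke-test
echo "hello hadoop" > /tmp/hello.txt
/home/hadoop/hadoop/bin/hdfs dfs -put /tmp/hello.txt /tmp/smoke-test/
/home/hadoop/hadoop/bin/hdfs dfs -cat /tmp/smoke-test/hello.txt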
13、 View Hadoop cluster status on a web page
View HDFS status: http://192.168.0.89:50070/
View Hadoop cluster status: http://192.168.0.89:8088/