Hadoop + Spark Cluster Deployment
This deployment uses four nodes in total: 1 master + 3 slaves.
The JDK and Hadoop installation and configuration steps below are all performed as the ordinary user hadoop, not as root.
Cluster Environment Preparation
Configure the hostname and /etc/hosts on every machine
hostnamectl set-hostname hadoop2   # run on each node with its own hostname (hadoop0-hadoop3)
echo "# Hadoop
192.100.3.254 hadoop0
192.100.3.253 hadoop1
192.100.3.252 hadoop2
192.100.3.251 hadoop3" >> /etc/hosts;
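A quick sanity check that every hostname in the table above resolves (run it on each node):
for h in hadoop0 hadoop1 hadoop2 hadoop3; do
  ping -c 1 -W 1 "$h" > /dev/null && echo "$h OK" || echo "$h unreachable"
done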
Create the hadoop user
useradd -m -s /bin/bash hadoop
Everything below is executed as the hadoop user. First change the owner of directories such as /opt and the JDK directory (/usr/lib/java below) to hadoop; a sketch of the commands follows.
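A minimal sketch of those ownership changes, run as root, assuming the JDK will be placed under /usr/lib/java as in the next section:
mkdir -p /usr/lib/java
chown -R hadoop:hadoop /opt /usr/lib/java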
su - hadoop
#master
ssh-keygen -t rsa -C "hadoop0" -P ""
#slave1
ssh-keygen -t rsa -C "hadoop1" -P ""
#slave2
ssh-keygen -t rsa -C "hadoop2" -P ""
#slave3
ssh-keygen -t rsa -C "hadoop3" -P ""
Configure passwordless login from every node to all nodes
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop0
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop3
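To verify that key-based login works without a password prompt (BatchMode makes ssh fail instead of asking for a password):
for h in hadoop0 hadoop1 hadoop2 hadoop3; do
  ssh -o BatchMode=yes "$h" hostname
done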
JDK Installation and Configuration
Install the JDK manually under /usr/lib/java
# switch to the hadoop user
su - hadoop
ln -sf /usr/lib/java/jdk1.8.0_331/ /usr/lib/java/jdk
Configure the JDK environment variables
vi /etc/profile.d/java.sh
#JDK environment
export JAVA_HOME=/usr/lib/java/jdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Reload the profile and verify the Java version
source /etc/profile
java -version
Hadoop Deployment and Configuration
Put the Hadoop package under /opt, change its owner, and create a symlink:
chown -R hadoop:hadoop /opt/hadoop-3.2.3
ln -sf /opt/hadoop-3.2.3 /opt/hadoop
Configure the log path in /opt/hadoop-3.2.3/etc/hadoop/log4j.properties, then create the log and HDFS directories:
mkdir /opt/hadoop/logs
mkdir -p /opt/hadoop/hdfs/name
mkdir -p /opt/hadoop/hdfs/data
Configure the Hadoop environment variables
nano /etc/profile.d/hadoop.sh
# Hadoop environment
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
source /etc/profile
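A quick check that the Hadoop variables took effect in the current shell:
hadoop version   # should report Hadoop 3.2.3
which hadoop     # should resolve to /opt/hadoop/bin/hadoop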
Hadoop Configuration Files
The configuration files all live under /opt/hadoop/etc/hadoop/: hadoop-env.sh, core-site.xml, and hdfs-site.xml are shown below (the contents of workers, mapred-site.xml, yarn-site.xml, etc. are omitted).
hadoop-env.sh: set the JDK path explicitly, because daemons started remotely over ssh cannot rely on ${JAVA_HOME} from the login shell
export JAVA_HOME=/usr/lib/java/jdk
core-site.xml
<configuration>
<!-- hdfs的位置 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop0:9000</value>
</property>
<!-- hadoop运行的缓冲文件位置 -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<!-- hdfs 数据副本数量 -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- hdfs namenode上存储hdfs名字空间元数据 -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/hdfs/name</value>
</property>
<!-- hdfs datanode上数据块的物理存储位置 -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/hdfs/data</value>
</property>
<!-- disable HDFS permission checks -->
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
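A quick way to confirm these settings are actually being picked up (hdfs getconf is part of the standard Hadoop CLI):
hdfs getconf -confKey fs.defaultFS     # expect hdfs://hadoop0:9000
hdfs getconf -confKey dfs.replication  # expect 1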
Hadoop Verification
First, format the NameNode.
Format it only once; if something goes wrong and a re-format is needed, run the cleanup steps below first (all data will be lost!).
hdfs namenode -format
If startup reports "Name or service not knownstname" (typically caused by stray carriage returns / Windows line endings in the workers file):
- delete the workers file and recreate it
- check the paths in core-site.xml and hdfs-site.xml, delete the corresponding HDFS directories and recreate them
- re-format HDFS (a sketch of this cleanup follows the list)
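A sketch of that cleanup, based on the paths configured above; the rm/mkdir steps must run on every node that stores HDFS data, and everything stored in HDFS is lost:
stop-yarn.sh && stop-dfs.sh
# wipe and recreate the directories referenced by hdfs-site.xml / core-site.xml
rm -rf /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data /opt/hadoop/tmp
mkdir -p /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
# re-format the NameNode (on the master only)
hdfs namenode -format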
Test starting and stopping the services
mapred --daemon start historyserver
mapred --daemon stop historyserver
start-yarn.sh && start-dfs.sh
stop-dfs.sh && stop-yarn.sh
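After the start scripts finish, jps on each node shows which daemons are running; the split below assumes hadoop0 is the master and hadoop1-hadoop3 are listed in workers:
jps
# hadoop0: NameNode, SecondaryNameNode, ResourceManager (plus JobHistoryServer if started)
# hadoop1-hadoop3: DataNode, NodeManager
The NameNode web UI should then answer at http://hadoop0:9870 (see the port table below).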
Spark on YARN Deployment and Configuration
vi /opt/spark/conf/spark-env.sh
export JAVA_HOME=/usr/lib/java/jdk
# Hadoop configuration directory
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# YARN configuration directory
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Spark home directory
export SPARK_HOME=/opt/spark
# Spark executables on PATH
export PATH=${SPARK_HOME}/bin:$PATH
export SPARK_MASTER_HOST=hadoop0
Copy the Spark directory to the other nodes
scp -r /opt/spark hadoop1:/opt/
scp -r /opt/spark hadoop2:/opt/
scp -r /opt/spark hadoop3:/opt/
- Configure the Spark environment variables
#Spark environment
export SPARK_HOME=/opt/spark
export PATH=${SPARK_HOME}/bin:$PATH
- Configure the worker list, then start the Spark daemons
hadoop@hadoop0:/opt/spark/sbin$ ./start-all.sh
hadoop0: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop0.out
hadoop2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop2.out
hadoop1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop1.out
hadoop3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop3.out
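With HDFS and YARN running, a minimal smoke test for Spark on YARN is submitting the bundled SparkPi example; the examples jar path below follows the usual Spark distribution layout and is an assumption here:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_*.jar 100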
Hive Configuration
- Resolve the guava version mismatch between Hive and Hadoop: copy the guava jar from Hadoop into Hive (removing the older one bundled with Hive; see the sketch below)
cp hadoop/share/hadoop/common/lib/guava-*.jar hive/lib/
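A sketch of the full swap, assuming Hadoop at /opt/hadoop as above and Hive unpacked at /opt/hive (the Hive path is an assumption); remove the older guava bundled with Hive first so only one version ends up on the classpath:
rm -f /opt/hive/lib/guava-*.jar
cp /opt/hadoop/share/hadoop/common/lib/guava-*.jar /opt/hive/lib/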
- Put the JDBC driver jar into hive/lib:
mssql-jdbc-7.4.1.jre8.jar
- Configure hive-site.xml and hive-env.sh
- Initialize the metastore schema
schematool -initSchema -dbType mssql --verbose
Failed to initialize pool: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. unable to find valid certification path to requested target
Fix: edit ${JAVA_HOME}/jre/lib/security/java.security and remove 3DES_EDE_CBC from the jdk.tls.disabledAlgorithms list.
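A one-liner for that edit, assuming JAVA_HOME=/usr/lib/java/jdk as configured earlier (back the file up first and restart the affected service afterwards):
cp ${JAVA_HOME}/jre/lib/security/java.security ${JAVA_HOME}/jre/lib/security/java.security.bak
sed -i -e 's/3DES_EDE_CBC, *//' -e 's/, *3DES_EDE_CBC//' ${JAVA_HOME}/jre/lib/security/java.security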
Security Configuration Notes
Default port changes since Hadoop 3.0
NameNode ports:
- HTTPS service port: 50470 → 9871
- web UI port: 50070 → 9870 (hdfs-site.xml)
- RPC port, used to access filesystem metadata: 8020 → 9820 (core-site.xml)
Secondary NameNode ports:
- HTTPS port: 50091 → 9869
- web UI port: 50090 → 9868
DataNode ports:
- IPC server port: 50020 → 9867 (hdfs-site.xml)
- data transfer / control port: 50010 → 9866 (hdfs-site.xml)
- HTTPS service port: 50475 → 9865
- HTTP server port: 50075 → 9864 (hdfs-site.xml)
Other ports that need to be opened:
- job tracker interaction port: 8021
- Hive service ports: 10000-10002
- Spark web UI port: 4040 (spark.ui.port in spark-defaults.conf)
- Spark master registration port: 7077
- Spark master UI: 8080; Spark worker UI: 8081
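If firewalld is in use, a sketch of opening the ports listed above (run on the nodes that actually host each service; 9000 matches the fs.defaultFS port configured in core-site.xml earlier):
firewall-cmd --permanent --add-port=9000/tcp                                                              # NameNode RPC as configured
firewall-cmd --permanent --add-port=9870/tcp --add-port=9871/tcp                                          # NameNode web / HTTPS
firewall-cmd --permanent --add-port=9868/tcp --add-port=9869/tcp                                          # Secondary NameNode
firewall-cmd --permanent --add-port=9864/tcp --add-port=9865/tcp --add-port=9866/tcp --add-port=9867/tcp  # DataNode
firewall-cmd --permanent --add-port=8021/tcp                                                              # job tracker
firewall-cmd --permanent --add-port=10000-10002/tcp                                                       # Hive
firewall-cmd --permanent --add-port=4040/tcp --add-port=7077/tcp --add-port=8080-8081/tcp                 # Spark
firewall-cmd --reload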