This design uses four nodes in total: 1 master + 3 slaves.
The JDK and Hadoop installation and configuration steps below are performed as the regular user hadoop, not as root.

Cluster Environment Preparation

Configure the hostname and hosts file on each machine

# set each node's own hostname (hadoop2 shown as an example)
hostnamectl set-hostname hadoop2

echo "# Hadoop
192.100.3.254    hadoop0
192.100.3.253  hadoop1
192.100.3.252  hadoop2
192.100.3.251  hadoop3" >> /etc/hosts;
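A quick optional check that every hostname in the hosts file resolves from this machine (hostnames taken from the entries above):

for h in hadoop0 hadoop1 hadoop2 hadoop3; do
  ping -c 1 "$h" > /dev/null && echo "$h OK" || echo "$h FAILED"
done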

Create the hadoop user

useradd -m -s /bin/bash hadoop

The following steps are all executed as the hadoop user.

The owner of directories such as /opt and the JDK directory (/usr/lib/java/ in this guide) must first be changed to hadoop.
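A minimal sketch of that ownership change, run as root before switching to the hadoop user (the JDK path matches the JDK section below; the Hadoop directory under /opt gets its own chown later):

chown -R hadoop:hadoop /usr/lib/java/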
su - hadoop
#master
ssh-keygen -t rsa -C "hadoop0" -P ""
#slave1
ssh-keygen -t rsa -C "hadoop1" -P ""
#slave2
ssh-keygen -t rsa -C "hadoop2" -P ""
#slave3
ssh-keygen -t rsa -C "hadoop3" -P ""

Configure passwordless SSH login (run on every node, copying its public key to all nodes)

ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop0
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop3
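To confirm passwordless login works from this node to all four nodes, the loop below can be run on each machine; BatchMode makes ssh fail instead of prompting for a password:

for h in hadoop0 hadoop1 hadoop2 hadoop3; do
  ssh -o BatchMode=yes "$h" hostname
done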

JDK Installation and Configuration

Manually install the JDK to /usr/lib/java

# switch to the hadoop user
su - hadoop
ln -sf /usr/lib/java/jdk1.8.0_331/ /usr/lib/java/jdk

Configure the JDK environment variables

vi /etc/profile.d/java.sh

#JDK environment
export JAVA_HOME=/usr/lib/java/jdk
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Reload the profile and verify the Java version

source /etc/profile
java -version
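A quick sanity check that the profile script took effect; the expected values follow from java.sh above:

echo $JAVA_HOME    # expected: /usr/lib/java/jdk
which java         # expected: /usr/lib/java/jdk/bin/java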

Hadoop Deployment and Configuration

Extract the Hadoop package to /opt, change its owner, and create a symlink:
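A sketch of the extraction step, assuming the hadoop-3.2.3.tar.gz tarball has already been downloaded to the current directory (run as root or via sudo):

tar -xzf hadoop-3.2.3.tar.gz -C /opt/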

chown -R hadoop:hadoop /opt/hadoop-3.2.3
ln -sf /opt/hadoop-3.2.3 /opt/hadoop

Configure the log path

The log settings live in /opt/hadoop-3.2.3/etc/hadoop/log4j.properties. Create the log and HDFS data directories, then add the environment script:

mkdir /opt/hadoop/logs
mkdir -p /opt/hadoop/hdfs/name
mkdir -p /opt/hadoop/hdfs/data
nano /etc/profile.d/hadoop.sh

Configure the Hadoop environment variables

# Hadoop environment
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

source /etc/profile

Hadoop configuration files

All configuration files live under /opt/hadoop/etc/hadoop/: hadoop-env.sh, core-site.xml, hdfs-site.xml (the remaining files, such as workers, mapred-site.xml, and yarn-site.xml, are omitted here; a workers sketch follows the hdfs-site.xml section below).

hadoop-env.sh: set the JDK environment variable explicitly (remote daemon startup cannot rely on the login shell's ${JAVA_HOME})

export JAVA_HOME=/usr/lib/java/jdk


core-site.xml

<configuration>

  <!-- HDFS (default file system) address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop0:9000</value>
  </property>

  <!-- directory for Hadoop's runtime temporary/buffer files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/tmp</value>
  </property>

</configuration>



hdfs-site.xml

<configuration>
  <!-- number of HDFS block replicas -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- NameNode directory for HDFS namespace metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/hdfs/name</value>
  </property>
  <!-- DataNode directory for physical block storage -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/hdfs/data</value>
  </property>
  <!-- disable HDFS permission checking -->
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
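The workers file itself is omitted in this guide; as a rough sketch, assuming the three slave nodes run the DataNode/NodeManager roles, it would simply list their hostnames, one per line:

cat > /opt/hadoop/etc/hadoop/workers << 'EOF'
hadoop1
hadoop2
hadoop3
EOF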

Hadoop Verification

First, format the NameNode

Format it only once. If something goes wrong and a reformat is needed, run the cleanup steps below first (all HDFS data will be lost!!!).

hdfs namenode -format
If startup reports "Name or service not known" for a hostname:
  1. Delete the workers file and recreate it.
  2. Check the directory paths in core-site.xml and hdfs-site.xml, delete the corresponding HDFS directories, and recreate them (see the sketch after this list).
  3. Reformat HDFS.
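A sketch of that cleanup, using the directories configured in core-site.xml and hdfs-site.xml above (this wipes all HDFS data):

rm -rf /opt/hadoop/tmp
rm -rf /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
mkdir -p /opt/hadoop/hdfs/name /opt/hadoop/hdfs/data
hdfs namenode -format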

Test starting and stopping the services

mapred --daemon start historyserver
mapred --daemon stop historyserver
start-yarn.sh && start-dfs.sh
stop-dfs.sh && stop-yarn.sh
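After the services are up, jps gives a quick health check; the expected process lists assume the role layout of this design (master daemons on hadoop0, worker daemons on hadoop1-3):

jps
# on hadoop0:   NameNode, SecondaryNameNode, ResourceManager (plus JobHistoryServer if started)
# on hadoop1-3: DataNode, NodeManager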

Spark on YARN Deployment and Configuration

vi /opt/spark/conf/spark-env.sh

export JAVA_HOME=/usr/lib/java/jdk
# Hadoop configuration directory
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# YARN configuration directory
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Spark home directory
export SPARK_HOME=/opt/spark
# Spark executables on PATH
export PATH=${SPARK_HOME}/bin:$PATH

export SPARK_MASTER_HOST=hadoop0

Copy Spark to the other nodes

scp -r /opt/spark hadoop1:/opt/
scp -r /opt/spark hadoop2:/opt/
scp -r /opt/spark hadoop3:/opt/
  • Configure the Spark environment variables
#Spark environment
export SPARK_HOME=/opt/spark
export PATH=${SPARK_HOME}/bin:$PATH
  • Configure the workers file, then start the Spark nodes
hadoop@hadoop0:/opt/spark/sbin$ ./start-all.sh
hadoop0: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop0.out
hadoop2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop2.out
hadoop1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop1.out
hadoop3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-hadoop3.out
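A quick way to verify the Spark-on-YARN side is to submit the stock SparkPi example; the examples jar path is an assumption based on a standard Spark distribution unpacked under /opt/spark:

spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_*.jar 100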

Hive Configuration

  • Resolve the guava version mismatch between Hive and Hadoop by copying Hadoop's guava jar into Hive:
cp hadoop/share/hadoop/common/lib/guava-*.jar hive/lib/
  • Place the JDBC driver mssql-jdbc-7.4.1.jre8.jar under hive/lib
  • Configure hive-site.xml and hive-env.sh
  • Initialize the metastore schema
schematool -initSchema -dbType mssql --verbose
Failed to initialize pool: The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption. unable to find valid certification path to requested target

Fix: edit jre/lib/security/java.security under the JDK used by Hive, and remove 3DES_EDE_CBC from jdk.tls.disabledAlgorithms.
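A sketch of that edit with sed, assuming JAVA_HOME points at the JDK used by Hive and that 3DES_EDE_CBC is not the first entry on the jdk.tls.disabledAlgorithms line:

# run as root (or as the owner of the JDK directory)
sed -i 's/, *3DES_EDE_CBC//' "$JAVA_HOME/jre/lib/security/java.security"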


Security Configuration Notes

Default port changes since Hadoop 3.0

NameNode ports:

HTTPS service port: 50470 → 9871
NameNode web UI port: 50070 → 9870 (configured in hdfs-site.xml)
NameNode RPC port, used to fetch file system metadata: 8020 → 9820 (configured in core-site.xml)

Secondary NameNode ports:

Secondary NameNode HTTPS port: 50091 → 9869
Secondary NameNode web UI port: 50090 → 9868

DataNode ports:

DataNode IPC server port: 50020 → 9867 (hdfs-site.xml)
DataNode data transfer port: 50010 → 9866 (hdfs-site.xml)
HTTPS service port: 50475 → 9865
DataNode HTTP server port: 50075 → 9864 (hdfs-site.xml)

Other ports that need to be opened

JobTracker interaction port: 8021
Hive service ports: 10000-10002
Spark web UI port: 4040 (spark.ui.port in spark-defaults.conf)
Spark master registration port: 7077
Spark master UI: 8080; Spark worker UI: 8081
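If firewalld is the active firewall, the ports listed in this section can be opened with a loop like the one below (run as root on every node; 9000 is the fs.defaultFS port from core-site.xml):

for p in 9000 9820 9864 9865 9866 9867 9868 9869 9870 9871 8021 4040 7077 8080 8081; do
  firewall-cmd --permanent --add-port=${p}/tcp
done
firewall-cmd --permanent --add-port=10000-10002/tcp   # Hive service ports
firewall-cmd --reload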