数据关联¶
Note
在上一节,我们学习了 Spark SQL 支持的诸多算子,其中数据关联(join)是数据分析中很常见、很重要的操作。 数据关联可分为内关联、外关联、左关联和右关联。
准备数据¶
from pyspark.sql import SparkSession
spark = (SparkSession
.builder
.appName("df operators")
.config('spark.executor.memory', '3g')
.config('spark.driver.memory', '6g')
.config("spark.executor.cores", 2)
.config('spark.cores.max', 4)
.getOrCreate())
# 创建员工信息表
seq = [(1, "Mike", 28, "Male"),
(2, "Lily", 30, "Female"),
(3, "Raymond", 26, "Male"),
(5, "Dave", 36, "Male")]
employeesDF = spark.createDataFrame(seq, ["id", "name", "age", "gender"])
employeesDF.show()
+---+-------+---+------+
| id| name|age|gender|
+---+-------+---+------+
| 1| Mike| 28| Male|
| 2| Lily| 30|Female|
| 3|Raymond| 26| Male|
| 5| Dave| 36| Male|
+---+-------+---+------+
# 创建员工薪水表
seq2 = [(1, 26000), (2, 30000), (4, 25000), (3, 20000)]
salariesDF = spark.createDataFrame(seq2, ["id", "salary"])
salariesDF.show()
+---+------+
| id|salary|
+---+------+
| 1| 26000|
| 2| 30000|
| 4| 25000|
| 3| 20000|
+---+------+
内关联¶
仅保留左右表中满足条件的数据记录。
# 左表中 name=Dave 的记录被丢弃了
# 右表中 id=4 的记录被丢弃了
employeesDF.join(salariesDF, "id", "inner").show()
+---+-------+---+------+------+
| id| name|age|gender|salary|
+---+-------+---+------+------+
| 1| Mike| 28| Male| 26000|
| 2| Lily| 30|Female| 30000|
| 3|Raymond| 26| Male| 20000|
+---+-------+---+------+------+
左关联¶
以左表为准,保留所有左表记录,尽量去匹配右表
# name=Dave 的记录被保留,salary=null
employeesDF.join(salariesDF, "id", "left").show()
+---+-------+---+------+------+
| id| name|age|gender|salary|
+---+-------+---+------+------+
| 1| Mike| 28| Male| 26000|
| 2| Lily| 30|Female| 30000|
| 3|Raymond| 26| Male| 20000|
| 5| Dave| 36| Male| null|
+---+-------+---+------+------+
右关联¶
与左关联相反,以右表为准
# id=5 的记录被保留,name=age=gender=null
employeesDF.join(salariesDF, "id", "right").show()
+---+-------+----+------+------+
| id| name| age|gender|salary|
+---+-------+----+------+------+
| 1| Mike| 28| Male| 26000|
| 2| Lily| 30|Female| 30000|
| 3|Raymond| 26| Male| 20000|
| 4| null|null| null| 25000|
+---+-------+----+------+------+
外关联¶
也被称为全关联(full),顾名思义,我全都要。
# name=Dave 和 id=5 的记录都被保留
employeesDF.join(salariesDF, "id", "outer").show()
+---+-------+----+------+------+
| id| name| age|gender|salary|
+---+-------+----+------+------+
| 1| Mike| 28| Male| 26000|
| 2| Lily| 30|Female| 30000|
| 3|Raymond| 26| Male| 20000|
| 4| null|null| null| 25000|
| 5| Dave| 36| Male| null|
+---+-------+----+------+------+