数据关联

Note

在上一节,我们学习了 Spark SQL 支持的诸多算子,其中数据关联(join)是数据分析中很常见、很重要的操作。 数据关联可分为内关联、外关联、左关联和右关联。

准备数据

from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("df operators")
         .config('spark.executor.memory', '3g')
         .config('spark.driver.memory', '6g')
         .config("spark.executor.cores", 2)
         .config('spark.cores.max', 4)
         .getOrCreate())
# 创建员工信息表
seq = [(1, "Mike", 28, "Male"), 
       (2, "Lily", 30, "Female"), 
       (3, "Raymond", 26, "Male"), 
       (5, "Dave", 36, "Male")]
employeesDF = spark.createDataFrame(seq, ["id", "name", "age", "gender"])
employeesDF.show()
+---+-------+---+------+
| id|   name|age|gender|
+---+-------+---+------+
|  1|   Mike| 28|  Male|
|  2|   Lily| 30|Female|
|  3|Raymond| 26|  Male|
|  5|   Dave| 36|  Male|
+---+-------+---+------+
# 创建员工薪水表
seq2 = [(1, 26000), (2, 30000), (4, 25000), (3, 20000)]
salariesDF = spark.createDataFrame(seq2, ["id", "salary"])
salariesDF.show()
+---+------+
| id|salary|
+---+------+
|  1| 26000|
|  2| 30000|
|  4| 25000|
|  3| 20000|
+---+------+

内关联

仅保留左右表中满足条件的数据记录。

# 左表中 name=Dave 的记录被丢弃了
# 右表中 id=4 的记录被丢弃了
employeesDF.join(salariesDF, "id", "inner").show()
+---+-------+---+------+------+
| id|   name|age|gender|salary|
+---+-------+---+------+------+
|  1|   Mike| 28|  Male| 26000|
|  2|   Lily| 30|Female| 30000|
|  3|Raymond| 26|  Male| 20000|
+---+-------+---+------+------+

左关联

以左表为准,保留所有左表记录,尽量去匹配右表

# name=Dave 的记录被保留,salary=null
employeesDF.join(salariesDF, "id", "left").show()
+---+-------+---+------+------+
| id|   name|age|gender|salary|
+---+-------+---+------+------+
|  1|   Mike| 28|  Male| 26000|
|  2|   Lily| 30|Female| 30000|
|  3|Raymond| 26|  Male| 20000|
|  5|   Dave| 36|  Male|  null|
+---+-------+---+------+------+

右关联

与左关联相反,以右表为准

# id=5 的记录被保留,name=age=gender=null
employeesDF.join(salariesDF, "id", "right").show()
+---+-------+----+------+------+
| id|   name| age|gender|salary|
+---+-------+----+------+------+
|  1|   Mike|  28|  Male| 26000|
|  2|   Lily|  30|Female| 30000|
|  3|Raymond|  26|  Male| 20000|
|  4|   null|null|  null| 25000|
+---+-------+----+------+------+

外关联

也被称为全关联(full),顾名思义,我全都要。

# name=Dave 和 id=5 的记录都被保留
employeesDF.join(salariesDF, "id", "outer").show()
+---+-------+----+------+------+
| id|   name| age|gender|salary|
+---+-------+----+------+------+
|  1|   Mike|  28|  Male| 26000|
|  2|   Lily|  30|Female| 30000|
|  3|Raymond|  26|  Male| 20000|
|  4|   null|null|  null| 25000|
|  5|   Dave|  36|  Male|  null|
+---+-------+----+------+------+