Joining Tables - Outer¶
Let us understand how to perform outer joins using Spark SQL. There are three types of outer joins: LEFT, RIGHT and FULL.
Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Joining Tables").
master("yarn").
getOrCreate
If you are going to use CLIs, you can launch Spark SQL using one of these three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
LEFT OUTER JOIN (default) - Get all the records from both datasets that satisfy the JOIN condition, along with the records that are in the left side table but not in the right side table.
RIGHT OUTER JOIN - Get all the records from both datasets that satisfy the JOIN condition, along with the records that are in the right side table but not in the left side table.
FULL OUTER JOIN - Combination of both: the union of the LEFT OUTER JOIN and RIGHT OUTER JOIN results.
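The three variants can be illustrated with a small, self-contained Python sketch. The rows and column positions below are hypothetical toy data (not the actual orders tables), and the functions are a rough model of the semantics, not how Spark implements joins:

```python
# Toy data shaped like the tables used in this lesson (hypothetical values).
orders = [(1, "COMPLETE"), (2, "CLOSED"), (3, "PENDING")]   # (order_id, status)
order_items = [(1, 299.98), (1, 199.99), (4, 49.98)]        # (order_item_order_id, subtotal)

def left_outer(left, right):
    """Keep every left row; pair unmatched left rows with None (NULL)."""
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

def right_outer(left, right):
    """Mirror image of left_outer: keep every right row."""
    return [(l, r) for r, l in left_outer(right, left)]

def full_outer(left, right):
    """Union of both: matched pairs once, plus unmatched rows from each side."""
    combined = left_outer(left, right)
    combined += [(l, r) for l, r in right_outer(left, right) if l is None]
    return combined
```

With this data, order 1 matches two items, orders 2 and 3 have no items (their right side comes back as None), and item order id 4 has no order, so the full outer join yields five rows in total.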
When we perform an outer join (say, a left outer join), this is what happens.
Get all the values from both tables where the join condition is satisfied.
If there are rows in the left side table for which there are no corresponding rows in the right side table, all the projected column values for the right side table will be null.
Here are some examples of outer joins.
Get all the orders where there are no corresponding order items.
Get all the order items where there are no corresponding orders.
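These "no corresponding" lookups are essentially anti-joins: perform the outer join, then keep only the rows whose other-side key came back null. A minimal Python sketch of the idea, using hypothetical toy keys rather than the actual tables:

```python
# Hypothetical sample keys (not the real data).
orders = {1: "COMPLETE", 2: "CLOSED", 3: "PENDING"}  # keyed by order_id
item_order_ids = {1, 4}  # order ids that appear in order_items

# Orders with no corresponding order items
# (LEFT OUTER JOIN ... WHERE oi.order_item_order_id IS NULL)
orders_without_items = [oid for oid in orders if oid not in item_order_ids]

# Order items with no corresponding orders
# (RIGHT OUTER JOIN ... WHERE o.order_id IS NULL)
items_without_orders = [oid for oid in item_order_ids if oid not in orders]
```

The SQL cells that follow express the same filters with `IS NULL` predicates on the joined result.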
%%sql
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
LIMIT 10
%%sql
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
%%sql
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
LIMIT 10
%%sql
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
%%sql
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
AND o.order_status IN ('COMPLETE', 'CLOSED')
%%sql
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
LIMIT 10
%%sql
SELECT count(1)
FROM orders o RIGHT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
%%sql
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_id IS NULL
LIMIT 10
Using Spark SQL with Python or Scala
spark.sql("""
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
""").show()
spark.sql("""
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
""").show()
spark.sql("""
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
""").show()
spark.sql("""
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
""").show()
spark.sql("""
SELECT count(1)
FROM orders o LEFT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE oi.order_item_order_id IS NULL
AND o.order_status IN ('COMPLETE', 'CLOSED')
""").show()
spark.sql("""
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
""").show()
spark.sql("""
SELECT count(1)
FROM orders o RIGHT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
""").show()
spark.sql("""
SELECT o.order_id,
o.order_date,
o.order_status,
oi.order_item_order_id,
oi.order_item_subtotal
FROM orders o RIGHT OUTER JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_id IS NULL
""").show()