Filtering Data
Let us understand how we can filter the data in Spark SQL.
Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Basic Transformations").
master("yarn").
getOrCreate
If you are going to use CLIs, you can launch Spark SQL using one of the following three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
We use the WHERE clause to filter the data. All comparison operators such as =, !=, >, <, etc. can be used to compare a column, expression, or literal with another column, expression, or literal. We can use operators such as LIKE with % as well as regexp_like for pattern matching.
Boolean OR and AND can be used when we want to apply multiple conditions.
Get all orders where order_status equals COMPLETE or CLOSED. We can also use the IN operator.
Get all orders from January 2014 where order_status equals COMPLETE or CLOSED.
We need to use IS NULL and IS NOT NULL to compare against null values.
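These operator semantics are standard SQL, so they can be illustrated outside Spark as well. Below is a minimal sketch using Python's built-in sqlite3 module with a made-up orders table (the table and rows are assumptions for illustration only; the Spark SQL cells that follow run the same style of queries against the actual itversity_retail data).

```python
import sqlite3

# Illustrative sketch (not Spark): WHERE with IN and LIKE behaves the same
# way in most SQL engines. The orders rows here are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, order_date TEXT, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2014-01-07", "COMPLETE"),
        (2, "2014-01-12", "CLOSED"),
        (3, "2014-02-03", "PENDING"),
    ],
)

# IN is shorthand for a chain of OR-ed equality comparisons.
in_rows = conn.execute(
    "SELECT order_id FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')"
).fetchall()
print(in_rows)    # → [(1,), (2,)]

# LIKE with a trailing % means "starts with 2014-01".
like_rows = conn.execute(
    "SELECT order_id FROM orders WHERE order_date LIKE '2014-01%'"
).fetchall()
print(like_rows)  # → [(1,), (2,)]
```

The IN query and the equivalent OR query return identical rows, which is why the cells below show both forms producing the same counts.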
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT * FROM orders WHERE order_status = 'COMPLETE' LIMIT 10
%%sql
SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'
%%sql
SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED') LIMIT 10
%%sql
SELECT * FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED' LIMIT 10
%%sql
SELECT count(1) FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')
%%sql
SELECT count(1) FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
%%sql
SELECT * FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND order_date LIKE '2014-01%'
LIMIT 10
%%sql
SELECT count(1) FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND order_date LIKE '2014-01%'
%%sql
SELECT * FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND date_format(order_date, 'yyyy-MM') = '2014-01'
LIMIT 10
%%sql
SELECT count(1) FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND date_format(order_date, 'yyyy-MM') = '2014-01'
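The two month filters above, order_date LIKE '2014-01%' and date_format(order_date, 'yyyy-MM') = '2014-01', select the same rows. As a sketch outside Spark, sqlite3's strftime('%Y-%m', ...) can stand in for Spark's date_format (an assumption for illustration; Spark's date_format uses Java datetime patterns, not strftime codes), and the sample rows are made up:

```python
import sqlite3

# Sketch (not Spark): comparing the two month-filter styles on sample dates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, order_date TEXT, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2014-01-07", "COMPLETE"),
        (2, "2014-01-31", "CLOSED"),
        (3, "2014-02-03", "COMPLETE"),
    ],
)

# String prefix match on the date column.
by_like = conn.execute(
    "SELECT count(1) FROM orders "
    "WHERE order_status IN ('COMPLETE', 'CLOSED') AND order_date LIKE '2014-01%'"
).fetchone()[0]

# Reformat the date to year-month, then compare for equality.
by_format = conn.execute(
    "SELECT count(1) FROM orders "
    "WHERE order_status IN ('COMPLETE', 'CLOSED') "
    "AND strftime('%Y-%m', order_date) = '2014-01'"
).fetchone()[0]

print(by_like, by_format)  # → 2 2
```

Both predicates count orders 1 and 2; the date_format form is more explicit about intent, while the LIKE form relies on the dates being stored in yyyy-MM-dd order.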
Using Spark SQL with Python or Scala
spark.sql("USE itversity_retail")
spark.sql("SHOW tables").show()
spark.sql("SELECT * FROM orders WHERE order_status = 'COMPLETE'").show()
spark.sql("SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'").show()
spark.sql("SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')").show()
spark.sql("""
SELECT * FROM orders
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()
spark.sql("""
SELECT count(1) FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
""").show()
spark.sql("""
SELECT count(1) FROM orders
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()
spark.sql("""
SELECT * FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND order_date LIKE '2014-01%'
""").show()
spark.sql("""
SELECT count(1) FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND order_date LIKE '2014-01%'
""").show()
spark.sql("""
SELECT * FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()
spark.sql("""
SELECT count(1) FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED')
AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()
Let us prepare the table to demonstrate how to deal with null values while filtering the data.
%%sql
DROP DATABASE IF EXISTS itversity_sms CASCADE
%%sql
CREATE DATABASE IF NOT EXISTS itversity_sms
%%sql
DROP TABLE IF EXISTS students
%%sql
CREATE TABLE students (
student_id INT,
student_first_name STRING,
student_last_name STRING,
student_phone_number STRING,
student_address STRING
) STORED AS avro
%%sql
INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)
%%sql
INSERT INTO students VALUES (2, 'Donald', 'Duck', '1234567890', NULL)
%%sql
INSERT INTO students VALUES
(3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
(4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')
%%sql
SELECT * FROM students
Using Spark SQL with Python or Scala
spark.sql("DROP DATABASE IF EXISTS itversity_sms CASCADE")
spark.sql("CREATE DATABASE IF NOT EXISTS itversity_sms")
spark.sql("DROP TABLE IF EXISTS students")
spark.sql("""
CREATE TABLE students (
student_id INT,
student_first_name STRING,
student_last_name STRING,
student_phone_number STRING,
student_address STRING
) STORED AS avro
""")
spark.sql("""
INSERT INTO students
VALUES (1, 'Scott', 'Tiger', NULL, NULL)
""")
spark.sql("""
INSERT INTO students
VALUES (2, 'Donald', 'Duck', '1234567890', NULL)
""")
spark.sql("""
INSERT INTO students VALUES
(3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
(4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')
""")
spark.sql("SELECT * FROM students").show()
Comparison against null can only be done with IS NULL and IS NOT NULL. The queries below will not return any rows even though we have one record with student_phone_number as null, because = NULL and != NULL evaluate to NULL rather than true or false.
spark.sql("""
SELECT * FROM students
WHERE student_phone_number = NULL
""").show()
spark.sql("""
SELECT * FROM students
WHERE student_phone_number != NULL
""").show()
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NULL
""").show()
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NOT NULL
""").show()