Filtering Data

Let us understand how we can filter the data in Spark SQL.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • We use WHERE clause to filter the data.

  • All comparison operators such as =, !=, >, <, etc can be used to compare a column or expression or literal with another column or expression or literal.

  • We can use operators such as LIKE with % and regexp_like for pattern matching.

  • Boolan OR and AND can be performed when we want to apply multiple conditions.

    • Get all orders with order_status equals to COMPLETE or CLOSED. We can also use IN operator.

    • Get all orders from month 2014 January with order_status equals to COMPLETE or CLOSED

  • We need to use IS NULL and IS NOT NULL to compare against null values.

%%sql

USE itversity_retail
%%sql

SHOW tables
%%sql

SELECT * FROM orders WHERE order_status = 'COMPLETE' LIMIT 10
%%sql

SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'
%%sql

SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED') LIMIT 10
%%sql

SELECT * FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED' LIMIT 10
%%sql

SELECT count(1) FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')
%%sql

SELECT count(1) FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
%%sql

SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
LIMIT 10
%%sql

SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
%%sql

SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
LIMIT 10
%%sql

SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
  • Using Spark SQL with Python or Scala

spark.sql("USE itversity_retail")
spark.sql("SHOW tables").show()
spark.sql("SELECT * FROM orders WHERE order_status = 'COMPLETE'").show()
spark.sql("SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'").show()
spark.sql("SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')").show()
spark.sql("""
SELECT * FROM orders 
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
""").show()
spark.sql("""
SELECT count(1) FROM orders
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()
spark.sql("""
SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
""").show()
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
""").show()
spark.sql("""
SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()
  • Let us prepare the table to demonstrate how to deal with null values while filtering the data.

%%sql

DROP DATABASE IF EXISTS itversity_sms CASCADE
%%sql

CREATE DATABASE IF NOT EXISTS itversity_sms
%%sql

DROP TABLE IF EXISTS students
%%sql

CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_number STRING,
    student_address STRING
) STORED AS avro
%%sql

INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)
%%sql

INSERT INTO students VALUES (2, 'Donald', 'Duck', '1234567890', NULL)
%%sql

INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
    (4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')
%%sql

SELECT * FROM students
  • Using Spark SQL with Python or Scala

spark.sql("DROP DATABASE IF EXISTS itversity_sms CASCADE")
spark.sql("CREATE DATABASE IF NOT EXISTS itversity_sms")
spark.sql("DROP TABLE IF EXISTS students")
spark.sql("""
CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_number STRING,
    student_address STRING
) STORED AS avro
""")
spark.sql("""
INSERT INTO students 
VALUES (1, 'Scott', 'Tiger', NULL, NULL)
""")
spark.sql("""
INSERT INTO students 
VALUES (2, 'Donald', 'Duck', '1234567890', NULL)
""")
spark.sql("""
INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
    (4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')
""")
spark.sql("SELECT * FROM students").show()
  • Comparison against null can be done with IS NULL and IS NOT NULL. Below query will not work even though we have one record with phone_numbers as null.

spark.sql("""
SELECT * FROM students 
WHERE student_phone_number = NULL
""").show()
spark.sql("""
SELECT * FROM students 
WHERE student_phone_number != NULL
""").show()
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NULL
""").show()
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NOT NULL
""").show()