Handling NULL ValuesΒΆ

Let us understand how to handle nulls using specific functions in Spark SQL.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Predefined Functions").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • By default if we try to add or concatenate null to another column or expression or literal, it will return null.

  • If we want to replace null with some default value, we can use nvl. For not null values, nvl returns the original expression value.

    • Replace commission_pct with 0 if it is null.

    • We can also use coalesce in the place of nvl.

  • coalesce returns first not null value if we pass multiple arguments to it.

  • nvl2 can be used to perform one action when the value is not null and some other action when the value is null.

    • We want to increase commission_pct by 1 if it is not null and set commission_pct to 2 if it is null.

  • We can also use CASE WHEN ELSE END for any conditional logic.

%%sql

SELECT 1 + NULL AS result
%%sql

SELECT concat('Hello', NULL) AS result
%%sql

SELECT nvl(1, 0) nvl, coalesce(1, 0) AS coalesce
%%sql

SELECT nvl(NULL, 0) nvl , coalesce(1, 0) AS coalesce
%%sql

SELECT coalesce(NULL, NULL, 2, NULL, 3) AS result
%%sql

SELECT nvl(NULL, NULL, 2, NULL, 3) AS result
%%sql

USE itversity_retail
%%sql

DROP TABLE IF EXISTS sales
%%sql

CREATE TABLE IF NOT EXISTS sales(
    sales_person_id INT,
    sales_amount FLOAT,
    commission_pct INT
)
%%sql

INSERT INTO sales VALUES
    (1, 1000, 10),
    (2, 1500, 8),
    (3, 500, NULL),
    (4, 800, 5),
    (5, 250, NULL)
%%sql

SELECT * FROM sales
%%sql

SELECT s.*, 
    nvl(commission_pct, 0) AS commission_pct
FROM sales AS s
%%sql

SELECT s.*, 
    coalesce(commission_pct, 0) AS commission_pct
FROM sales AS s
%%sql

SELECT s.*, 
    round(sales_amount * commission_pct / 100, 2) AS incorrect_commission_amount
FROM sales AS s
%%sql

SELECT s.*, 
    round(sales_amount * nvl(commission_pct, 0) / 100, 2) AS commission_amount
FROM sales AS s
%%sql

SELECT s.*, 
    round(sales_amount * coalesce(commission_pct, 0) / 100, 2) AS commission_amount
FROM sales AS s
%%sql

SELECT s.*, 
    nvl2(commission_pct, commission_pct + 1, 2) AS commission_pct
FROM sales AS s
%%sql

SELECT s.*, 
    round(sales_amount * nvl2(commission_pct, commission_pct + 1, 2) / 100, 2) AS commission_amount
FROM sales AS s
%%sql

SELECT s.*, 
    CASE WHEN commission_pct IS NULL 
        THEN 2
        ELSE commission_pct + 1
    END AS commission_pct
FROM sales AS s
%%sql

SELECT s.*, 
    CASE WHEN commission_pct IS NOT NULL 
        THEN commission_pct + 1
        ELSE 2
    END AS commission_pct
FROM sales AS s
%%sql

SELECT s.*, 
    CASE WHEN commission_pct IS NULL 
        THEN round((sales_amount * 2 / 100), 2)
        ELSE round((sales_amount * (commission_pct + 1)/ 100), 2)
    END AS commission_amount
FROM sales AS s