Query Example - Word Count

Let us see how we can perform word count using Spark SQL. Using word count as an example, we will understand how to build the solution using the predefined functions available.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Predefined Functions").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can run Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
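
If you use the Scala CLI (spark2-shell), the SQL statements in the %%sql cells below can be run by passing them to spark.sql. Here is a minimal sketch, assuming the lines table that is created in the following steps:

// Any of the %%sql cells below can be run from spark2-shell by passing the
// statement to spark.sql; show(false) prints the result without truncation
spark.sql("SHOW databases").show(false)

// For example, once the lines table below has been created
spark.sql("SELECT split(s, ' ') AS word_array FROM lines").show(false)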
Here is the approach we will follow to get the word count:

  • Create a table by the name lines.

  • Insert data into the table.

  • Split each line into an array of words.

  • Explode the array of words from each line into individual records.

  • Use GROUP BY and get the count. We cannot use GROUP BY directly on exploded records, and hence we need to use a nested subquery.

%%sql

DROP DATABASE IF EXISTS itversity_demo CASCADE

%%sql

CREATE DATABASE IF NOT EXISTS itversity_demo

%%sql

USE itversity_demo

%%sql

CREATE TABLE lines (s STRING)

%%sql

INSERT INTO lines VALUES
  ('Hello World'),
  ('How are you'),
  ('Let us perform the word count'),
  ('The definition of word count is'),
  ('to get the count of each word from this data')

%%sql

SELECT * FROM lines

%%sql

SELECT split(s, ' ') AS word_array FROM lines

We can review the functions available in Spark SQL, such as split and explode, using SHOW functions.

spark.sql("SHOW functions").show(300, false)

%%sql

SELECT explode(split(s, ' ')) AS words FROM lines

%%sql

SELECT count(1) FROM (SELECT explode(split(s, ' ')) AS words FROM lines)

We cannot use GROUP BY directly on the exploded records. The following query is expected to fail, because a generator function such as explode is not allowed in the GROUP BY clause.

%%sql

SELECT explode(split(s, ' ')) AS words, count(1) FROM lines
GROUP BY explode(split(s, ' '))

Instead, we explode in a nested subquery and then apply GROUP BY on the derived word column to get the count of each word.

%%sql

SELECT word, count(1) FROM (
  SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word

Finally, we can get the number of distinct words by wrapping the word count query in another query.

%%sql

SELECT count(1) FROM
(
    SELECT word, count(1) FROM (
        SELECT explode(split(s, ' ')) AS word FROM lines
    ) q
    GROUP BY word
)
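
For comparison, here is a minimal sketch of the same word count using the Data Frame API rather than SQL. It assumes the lines table created above and uses the predefined functions split and explode from org.apache.spark.sql.functions.

import org.apache.spark.sql.functions.{col, explode, split}

// Explode each line into individual words, then count occurrences per word
val wordCounts = spark.table("lines").
    select(explode(split(col("s"), " ")).alias("word")).
    groupBy("word").
    count

// Word-level counts
wordCounts.show(50, false)

// Number of distinct words
println(wordCounts.count)

Both approaches rely on the same predefined functions; only the way the query is expressed differs.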