Query Example - Word CountΒΆ
Let us see how we can perform word count using Spark SQL. Using word count as an example we will understand how we can come up with the solution using pre-defined functions available.
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.
val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Predefined Functions").
master("yarn").
getOrCreate
If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Create table by name lines.
Insert data into the table.
Split lines into array of words.
Explode array of words from each line into individual records.
Use group by and get the count. We cannot use
GROUP BY
directly on exploded records and hence we need to use nested sub query.
%%sql
DROP DATABASE IF EXISTS itversity_demo CASCADE
%%sql
CREATE DATABASE IF NOT EXISTS itversity_demo
%%sql
USE itversity_demo
%%sql
CREATE TABLE lines (s STRING)
%%sql
INSERT INTO lines VALUES
('Hello World'),
('How are you'),
('Let us perform the word count'),
('The definition of word count is'),
('to get the count of each word from this data')
%%sql
SELECT * FROM lines
%%sql
SELECT split(s, ' ') AS word_array FROM lines
spark.sql("SHOW functions").show(300, false)
%%sql
SELECT explode(split(s, ' ')) AS words FROM lines
%%sql
SELECT count(1) FROM (SELECT explode(split(s, ' ')) AS words FROM lines)
%%sql
SELECT explode(split(s, ' ')) AS words, count(1) FROM lines
GROUP BY explode(split(s, ' '))
%%sql
SELECT word, count(1) FROM (
SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word
%%sql
SELECT count(1) FROM
(
SELECT word, count(1) FROM (
SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word
)