String Manipulation Functions¶
We use string manipulation functions quite extensively. Here are some of the important functions which we typically use.
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.
val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Predefined Functions").
master("yarn").
getOrCreate
If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Case Conversion -
lower
,upper
,initcap
Getting size of the column value -
length
Extracting Data -
substr
andsplit
Trimming and Padding functions -
trim
,rtrim
,ltrim
,rpad
andlpad
Reversing strings -
reverse
Concatenating multiple strings
concat
andconcat_ws
Case Conversion and Length¶
Let us understand how to perform case conversion functions of a string and also length of a string.
Case Conversion Functions -
lower
,upper
,initcap
%%sql
SELECT lower('hEllo wOrlD') AS lower_result,
upper('hEllo wOrlD') AS upper_result,
initcap('hEllo wOrlD') AS initcap_result
Getting length -
length
%%sql
SELECT length('hEllo wOrlD') AS result
Let us see how to use these functions on top of the table. We will use orders table which was loaded as part of last section.
order_status for some of the orders is in lower case and we will convert every thing to upper case.
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT * FROM orders LIMIT 10
%%sql
SELECT order_id, order_date, order_customer_id,
lower(order_status) AS order_status,
length(order_status) AS order_status_length
FROM orders LIMIT 10
Extracting Data - substr and split¶
Let us understand how to extract data from strings using substr
/substring
and split
.
We can get syntax and symantecs of the functions using
DESCRIBE FUNCTION
We can extract first four characters from string using substr or substring.
%%sql
DESCRIBE FUNCTION substr
%%sql
DESCRIBE FUNCTION substring
spark.sql("DESCRIBE FUNCTION substring").show(false)
%%sql
SELECT substr('2013-07-25 00:00:00.0', 1, 4) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 6, 2) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 9, 2) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 12) AS result
Let us see how we can extract date part from order_date of orders.
%%sql
SELECT * FROM orders LIMIT 10
%%sql
SELECT order_id,
substr(order_date, 1, 10) AS order_date,
order_customer_id,
order_status
FROM orders
Let us understand how to extract the information from the string where there is a delimiter.
split
converts delimited string into array.
%%sql
SELECT split('2013-07-25', '-') AS result
%%sql
SELECT split('2013-07-25', '-')[1] AS result
We can use explode to convert an array into records.
%%sql
SELECT explode(split('2013-07-25', '-')) AS result
Trimming and Padding Functions¶
Let us understand how to trim or remove leading and/or trailing spaces in a string.
ltrim
is used to remove the spaces on the left side of the string.rtrim
is used to remove the spaces on the right side of the string.trim
is used to remove the spaces on both sides of the string.
%%sql
SELECT ltrim(' Hello World') AS result
%%sql
SELECT rtrim(' Hello World ') AS result
%%sql
SELECT length(trim(' Hello World ')) AS result
Let us understand how to use padding to pad characters to a string.
Let us assume that there are 3 fields - year, month and date which are of type integer.
If we have to concatenate all the 3 fields and create a date, we might have to pad month and date with 0.
lpad
is used more often thanrpad
especially when we try to build the date from separate columns.
%%sql
SELECT 2013 AS year, 7 AS month, 25 AS myDate
%%sql
SELECT lpad(7, 2, 0) AS result
%%sql
SELECT lpad(10, 2, 0) AS result
%%sql
SELECT lpad(100, 2, 0) AS result
Reverse and Concatenating multiple strings¶
Let us understand how to reverse a string as well as concatenate multiple strings.
We can use
reverse
to reverse a string.We can concatenate multiple strings using
concat
andconcat_ws
.concat_ws
is typically used if we want to have the same string between all the strings that are being concatenated.
%%sql
SELECT reverse('Hello World') AS result
%%sql
SELECT concat('Hello ', 'World') AS result
%%sql
SELECT concat('Order Status is ', order_status) AS result
FROM orders LIMIT 10
spark.sql("""
SELECT concat('Order Status is ', order_status) AS result
FROM orders_part LIMIT 10
""").show(false)
%%sql
SELECT * FROM (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql
SELECT concat(year, '-', lpad(month, 2, 0), '-',
lpad(myDate, 2, 0)) AS order_date
FROM
(SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql
SELECT concat_ws('-', year, lpad(month, 2, 0),
lpad(myDate, 2, 0)) AS order_date
FROM
(SELECT 2013 AS year, 7 AS month, 25 AS myDate) q