String Manipulation Functions

We use string manipulation functions quite extensively. Here are some of the important functions which we typically use.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Predefined Functions").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • Case Conversion - lower, upper, initcap

  • Getting size of the column value - length

  • Extracting Data - substr and split

  • Trimming and Padding functions - trim, rtrim, ltrim, rpad and lpad

  • Reversing strings - reverse

  • Concatenating multiple strings concat and concat_ws

Case Conversion and Length

Let us understand how to perform case conversion functions of a string and also length of a string.

  • Case Conversion Functions - lower, upper, initcap

%%sql

SELECT lower('hEllo wOrlD') AS lower_result,
    upper('hEllo wOrlD') AS upper_result,
    initcap('hEllo wOrlD') AS initcap_result
  • Getting length - length

%%sql

SELECT length('hEllo wOrlD') AS result

Let us see how to use these functions on top of the table. We will use orders table which was loaded as part of last section.

  • order_status for some of the orders is in lower case and we will convert every thing to upper case.

%%sql

USE itversity_retail
%%sql 

SHOW tables
%%sql

SELECT * FROM orders LIMIT 10
%%sql

SELECT order_id, order_date, order_customer_id,
    lower(order_status) AS order_status,
    length(order_status) AS order_status_length
FROM orders LIMIT 10

Extracting Data - substr and split

Let us understand how to extract data from strings using substr/substring and split.

  • We can get syntax and symantecs of the functions using DESCRIBE FUNCTION

  • We can extract first four characters from string using substr or substring.

%%sql

DESCRIBE FUNCTION substr
%%sql

DESCRIBE FUNCTION substring
spark.sql("DESCRIBE FUNCTION substring").show(false)
%%sql

SELECT substr('2013-07-25 00:00:00.0', 1, 4) AS result
%%sql

SELECT substr('2013-07-25 00:00:00.0', 6, 2) AS result
%%sql

SELECT substr('2013-07-25 00:00:00.0', 9, 2) AS result
%%sql

SELECT substr('2013-07-25 00:00:00.0', 12) AS result
  • Let us see how we can extract date part from order_date of orders.

%%sql

SELECT * FROM orders LIMIT 10
%%sql

SELECT order_id,
  substr(order_date, 1, 10) AS order_date,
  order_customer_id,
  order_status
FROM orders

Let us understand how to extract the information from the string where there is a delimiter.

  • split converts delimited string into array.

%%sql

SELECT split('2013-07-25', '-') AS result
%%sql

SELECT split('2013-07-25', '-')[1] AS result
  • We can use explode to convert an array into records.

%%sql

SELECT explode(split('2013-07-25', '-')) AS result

Trimming and Padding Functions

Let us understand how to trim or remove leading and/or trailing spaces in a string.

  • ltrim is used to remove the spaces on the left side of the string.

  • rtrim is used to remove the spaces on the right side of the string.

  • trim is used to remove the spaces on both sides of the string.

%%sql

SELECT ltrim('     Hello World') AS result
%%sql

SELECT rtrim('     Hello World       ') AS result
%%sql

SELECT length(trim('     Hello World       ')) AS result

Let us understand how to use padding to pad characters to a string.

  • Let us assume that there are 3 fields - year, month and date which are of type integer.

  • If we have to concatenate all the 3 fields and create a date, we might have to pad month and date with 0.

  • lpad is used more often than rpad especially when we try to build the date from separate columns.

%%sql

SELECT 2013 AS year, 7 AS month, 25 AS myDate
%%sql

SELECT lpad(7, 2, 0) AS result
%%sql

SELECT lpad(10, 2, 0) AS result
%%sql

SELECT lpad(100, 2, 0) AS result

Reverse and Concatenating multiple strings

Let us understand how to reverse a string as well as concatenate multiple strings.

  • We can use reverse to reverse a string.

  • We can concatenate multiple strings using concat and concat_ws.

  • concat_ws is typically used if we want to have the same string between all the strings that are being concatenated.

%%sql

SELECT reverse('Hello World') AS result
%%sql

SELECT concat('Hello ', 'World') AS result
%%sql

SELECT concat('Order Status is ', order_status) AS result
FROM orders LIMIT 10
spark.sql("""
    SELECT concat('Order Status is ', order_status) AS result
    FROM orders_part LIMIT 10
""").show(false)
%%sql

SELECT * FROM (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql

SELECT concat(year, '-', lpad(month, 2, 0), '-',
              lpad(myDate, 2, 0)) AS order_date
FROM
    (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql

SELECT concat_ws('-', year, lpad(month, 2, 0),
              lpad(myDate, 2, 0)) AS order_date
FROM
    (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q