Date Manipulation Functions

Let us go through some of the important date manipulation functions.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Predefined Functions").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • Getting Current Date and Timestamp

  • Date Arithmetic such as date_add

  • Getting beginning date or time using trunc or date_trunc

  • Extracting information using date_format as well as calendar functions.

  • Dealing with unix timestamp using from_unixtime, to_unix_timestamp

Getting Current Date and Timestamp

Let us understand how to get the details about current or today’s date as well as current timestamp.

  • current_date is the function or operator which will return today’s date.

  • current_timestamp is the function or operator which will return current time up to milliseconds.

  • These are not like other functions and do not use () at the end.

  • These are not listed as part of SHOW functions and we can get help using DESCRIBE.

  • There is a format associated with date and timestamp.

    • Date - yyyy-MM-dd

    • Timestamp - yyyy-MM-dd HH:mm:ss.SSS

  • Keep in mind that a date or timestamp in Spark SQL are nothing but special strings containing values using above specified formats. We can apply all string manipulation functions on date or timestamp.

%%sql

SELECT current_date AS current_date
%%sql

SELECT current_date() AS current_date
%%sql

SELECT current_timestamp AS current_timestamp
spark.sql("SELECT current_timestamp AS current_timestamp").show(false)

Date Arithmetic

Let us understand how to perform arithmetic on dates or timestamps.

  • date_add can be used to add or subtract days.

  • date_sub can be used to subtract or add days.

  • datediff can be used to get difference between 2 dates

  • add_months can be used add months to a date

%%sql

SELECT date_add(current_date, 32) AS result
%%sql

SELECT date_add('2018-04-15', 730) AS result
%%sql

SELECT date_add('2018-04-15', -730) AS result
%%sql

SELECT date_sub(current_date, 30) AS result
%%sql

SELECT datediff('2019-03-30', '2017-12-31') AS result
%%sql

SELECT datediff('2017-12-31', '2019-03-30') AS result
%%sql

SELECT add_months(current_date, 3) AS result
%%sql

SELECT add_months('2019-01-31', 1) AS result
%%sql

SELECT add_months('2019-05-31', 1) AS result
%%sql

SELECT add_months(current_timestamp, 3) AS result
%%sql

SELECT date_add(current_timestamp, -730) AS result

Beginning Date or Time - trunc and date_trunc

Let us understand how to use trunc and date_trunc on dates or timestamps and get beginning date of the period.

  • We can use MM to get beginning date of the month.

  • YY can be used to get begining date of the year.

  • We can apply trunc either on date or timestamp, however we cannot apply it other than month or year (such an hour or day).

%%sql

DESCRIBE FUNCTION trunc
spark.sql("DESCRIBE FUNCTION trunc").show(false)
%%sql

SELECT trunc(current_date, 'MM') AS beginning_date_month
%%sql

SELECT trunc('2019-01-23', 'MM') AS beginning_date_month
%%sql

SELECT trunc(current_date, 'YY') AS beginning_date_year 
  • This will not work

%%sql

SELECT trunc(current_timestamp, 'HH') AS doesnt_work
  • While trunc can be used to get beginning time of a given month or year, we can get the beginning time up to Second using date_trunc.

spark.sql("DESCRIBE FUNCTION date_trunc").show(false)
%%sql

SELECT date_trunc('HOUR', current_timestamp) AS hour_beginning

Extracting information using date_format

Let us understand how to use date_format to extract information from date or timestamp.

Here is how we can get date related information such as year, month, day etc from date or timestamp.

spark.sql("DESCRIBE FUNCTION date_format").show(false)
%%sql

SELECT current_timestamp AS current_timestamp
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'yyyy') AS year
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'yy') AS year
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'MM') AS month
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'dd') AS day_of_month
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'DD') AS day_of_year
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'MMM') AS month_name
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'MMMM') AS month_name
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'EE') AS dayname
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'EEEE') AS dayname
  • Here is how we can get time related information such as hour, minute, seconds, milliseconds etc from timestamp.

%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'HH') AS hour24
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'hh') AS hour12
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'mm') AS minutes
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'ss') AS seconds
%%sql

SELECT current_timestamp AS current_timestamp, 
    date_format(current_timestamp, 'SS') AS millis
  • Here is how we can get the information from date or timestamp in the format we require.

%%sql

SELECT date_format(current_timestamp, 'yyyyMM') AS current_month
%%sql

SELECT date_format(current_timestamp, 'yyyyMMdd') AS current_date
%%sql

SELECT date_format(current_timestamp, 'yyyy/MM/dd') AS current_date

Extracting information - Calendar functions

We can get year, month, day etc from date or timestamp using functions. There are functions such as day, dayofmonth, month, weekofyear, year etc available for us.

spark.sql("DESCRIBE FUNCTION day").show(false)
spark.sql("DESCRIBE FUNCTION dayofmonth").show(false)
spark.sql("DESCRIBE FUNCTION month").show(false)
spark.sql("DESCRIBE FUNCTION weekofyear").show(false)
spark.sql("DESCRIBE FUNCTION year").show(false)
  • Let us see the usage of the functions such as day, dayofmonth, month, weekofyear, year etc.

%%sql

SELECT year(current_date) AS year
%%sql

SELECT month(current_date) AS month
%%sql

SELECT weekofyear(current_date) AS weekofyear
%%sql

SELECT day(current_date) AS day
%%sql

SELECT dayofmonth(current_date) AS dayofmonth

Dealing with Unix Timestamp

Let us go through the functions that can be used to deal with Unix Timestamp.

  • from_unixtime can be used to convert Unix epoch to regular timestamp.

  • unix_timestamp or to_unix_timestamp can be used to convert timestamp to Unix epoch.

  • We can get Unix epoch or Unix timestamp by running date '+%s' in Unix/Linux terminal

  • We can DESCRIBE on the above functions to get details about them.

Let us sww how we can use functions such as from_unixtime, unix_timestamp or to_unix_timestamp to convert between timestamp and Unix timestamp or epoch.

  • We can unix epoch in Unix/Linux terminal using date '+%s'

%%sql

SELECT from_unixtime(1556662731) AS timestamp
%%sql

SELECT to_unix_timestamp('2019-04-30 18:18:51') AS unixtime
%%sql

SELECT from_unixtime(1556662731, 'yyyyMM') AS month
%%sql

SELECT from_unixtime(1556662731, 'yyyy-MM-dd') AS date
%%sql

SELECT from_unixtime(1556662731, 'yyyy-MM-dd HH:mm') AS timestamp
%%sql

SELECT from_unixtime(1556662731, 'yyyy-MM-dd hh:mm') AS timestamp
%%sql

SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd') AS date
%%sql

SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd HH:mm:ss') AS timestamp