Date Manipulation Functions¶
Let us go through some of the important date manipulation functions.
Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.
val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Predefined Functions").
master("yarn").
getOrCreate
If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Getting Current Date and Timestamp
Date Arithmetic such as
date_add
Getting beginning date or time using
trunc
ordate_trunc
Extracting information using
date_format
as well as calendar functions.Dealing with unix timestamp using
from_unixtime
,to_unix_timestamp
Getting Current Date and Timestamp¶
Let us understand how to get the details about current or today’s date as well as current timestamp.
current_date
is the function or operator which will return today’s date.current_timestamp
is the function or operator which will return current time up to milliseconds.These are not like other functions and do not use () at the end.
These are not listed as part of
SHOW functions
and we can get help usingDESCRIBE
.There is a format associated with date and timestamp.
Date -
yyyy-MM-dd
Timestamp -
yyyy-MM-dd HH:mm:ss.SSS
Keep in mind that a date or timestamp in Spark SQL are nothing but special strings containing values using above specified formats. We can apply all string manipulation functions on date or timestamp.
%%sql
SELECT current_date AS current_date
%%sql
SELECT current_date() AS current_date
%%sql
SELECT current_timestamp AS current_timestamp
spark.sql("SELECT current_timestamp AS current_timestamp").show(false)
Date Arithmetic¶
Let us understand how to perform arithmetic on dates or timestamps.
date_add
can be used to add or subtract days.date_sub
can be used to subtract or add days.datediff
can be used to get difference between 2 datesadd_months
can be used add months to a date
%%sql
SELECT date_add(current_date, 32) AS result
%%sql
SELECT date_add('2018-04-15', 730) AS result
%%sql
SELECT date_add('2018-04-15', -730) AS result
%%sql
SELECT date_sub(current_date, 30) AS result
%%sql
SELECT datediff('2019-03-30', '2017-12-31') AS result
%%sql
SELECT datediff('2017-12-31', '2019-03-30') AS result
%%sql
SELECT add_months(current_date, 3) AS result
%%sql
SELECT add_months('2019-01-31', 1) AS result
%%sql
SELECT add_months('2019-05-31', 1) AS result
%%sql
SELECT add_months(current_timestamp, 3) AS result
%%sql
SELECT date_add(current_timestamp, -730) AS result
Beginning Date or Time - trunc and date_trunc¶
Let us understand how to use trunc
and date_trunc
on dates or timestamps and get beginning date of the period.
We can use MM to get beginning date of the month.
YY can be used to get begining date of the year.
We can apply trunc either on date or timestamp, however we cannot apply it other than month or year (such an hour or day).
%%sql
DESCRIBE FUNCTION trunc
spark.sql("DESCRIBE FUNCTION trunc").show(false)
%%sql
SELECT trunc(current_date, 'MM') AS beginning_date_month
%%sql
SELECT trunc('2019-01-23', 'MM') AS beginning_date_month
%%sql
SELECT trunc(current_date, 'YY') AS beginning_date_year
This will not work
%%sql
SELECT trunc(current_timestamp, 'HH') AS doesnt_work
While
trunc
can be used to get beginning time of a given month or year, we can get the beginning time up to Second usingdate_trunc
.
spark.sql("DESCRIBE FUNCTION date_trunc").show(false)
%%sql
SELECT date_trunc('HOUR', current_timestamp) AS hour_beginning
Extracting information using date_format¶
Let us understand how to use date_format
to extract information from date or timestamp.
Here is how we can get date related information such as year, month, day etc from date or timestamp.
spark.sql("DESCRIBE FUNCTION date_format").show(false)
%%sql
SELECT current_timestamp AS current_timestamp
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'yyyy') AS year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'yy') AS year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MM') AS month
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'dd') AS day_of_month
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'DD') AS day_of_year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MMM') AS month_name
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MMMM') AS month_name
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'EE') AS dayname
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'EEEE') AS dayname
Here is how we can get time related information such as hour, minute, seconds, milliseconds etc from timestamp.
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'HH') AS hour24
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'hh') AS hour12
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'mm') AS minutes
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'ss') AS seconds
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'SS') AS millis
Here is how we can get the information from date or timestamp in the format we require.
%%sql
SELECT date_format(current_timestamp, 'yyyyMM') AS current_month
%%sql
SELECT date_format(current_timestamp, 'yyyyMMdd') AS current_date
%%sql
SELECT date_format(current_timestamp, 'yyyy/MM/dd') AS current_date
Extracting information - Calendar functions¶
We can get year, month, day etc from date or timestamp using functions. There are functions such as day
, dayofmonth
, month
, weekofyear
, year
etc available for us.
spark.sql("DESCRIBE FUNCTION day").show(false)
spark.sql("DESCRIBE FUNCTION dayofmonth").show(false)
spark.sql("DESCRIBE FUNCTION month").show(false)
spark.sql("DESCRIBE FUNCTION weekofyear").show(false)
spark.sql("DESCRIBE FUNCTION year").show(false)
Let us see the usage of the functions such as day, dayofmonth, month, weekofyear, year etc.
%%sql
SELECT year(current_date) AS year
%%sql
SELECT month(current_date) AS month
%%sql
SELECT weekofyear(current_date) AS weekofyear
%%sql
SELECT day(current_date) AS day
%%sql
SELECT dayofmonth(current_date) AS dayofmonth
Dealing with Unix Timestamp¶
Let us go through the functions that can be used to deal with Unix Timestamp.
from_unixtime
can be used to convert Unix epoch to regular timestamp.unix_timestamp
orto_unix_timestamp
can be used to convert timestamp to Unix epoch.We can get Unix epoch or Unix timestamp by running
date '+%s'
in Unix/Linux terminalWe can DESCRIBE on the above functions to get details about them.
Let us sww how we can use functions such as from_unixtime
, unix_timestamp
or to_unix_timestamp
to convert between timestamp and Unix timestamp or epoch.
We can unix epoch in Unix/Linux terminal using
date '+%s'
%%sql
SELECT from_unixtime(1556662731) AS timestamp
%%sql
SELECT to_unix_timestamp('2019-04-30 18:18:51') AS unixtime
%%sql
SELECT from_unixtime(1556662731, 'yyyyMM') AS month
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd') AS date
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd HH:mm') AS timestamp
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd hh:mm') AS timestamp
%%sql
SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd') AS date
%%sql
SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd HH:mm:ss') AS timestamp