Overview of Data Types

Let us get an overview of Data Types.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • Syntactically Hive and Spark SQL are almost same.

  • Go to this hive page and review supported data types.

  • Spark Metastore supports all standard data types.

    • Numeric - INT, BIGINT, FLOAT etc

    • Alpha Numeric or String - CHAR, VARCHAR, STRING

    • Date and Timestamp - DATE, TIMESTAMP

    • Special Data Types - ARRAY, STRUCT etc

    • Boolean - BOOLEAN

  • If the file format is text file with special types, then we need to consider other clauses under DELIMITED ROW FORMAT (if we don’t want to use default delimiters).

%%sql

DROP DATABASE IF EXISTS itversity_sms CASCADE
%%sql

CREATE DATABASE IF NOT EXISTS itversity_sms
%%sql

USE itversity_sms
%%sql

CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_numbers ARRAY<STRING>,
    student_address STRUCT<street:STRING, city:STRING, state:STRING, zip:STRING>
) STORED AS TEXTFILE
ROW FORMAT
    DELIMITED FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
%%sql

DESCRIBE students
%%sql

INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)
%%sql

SELECT * FROM students
%%sql

INSERT INTO students VALUES (2, 'Donald', 'Duck', ARRAY('1234567890', '2345678901'), NULL)
%%sql

SELECT * FROM students
%%sql

INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', ARRAY('1234567890', '2345678901'), STRUCT('A Street', 'One City', 'Some State', '12345')),
    (4, 'Bubble', 'Guppy', ARRAY('5678901234', '6789012345'), STRUCT('Bubbly Street', 'Guppy', 'La la land', '45678'))
%%sql

SELECT * FROM students
val username = System.getProperty("user.name")
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_sms.db/students"!
s"hdfs dfs -cat /user/${username}/warehouse/${username}_sms.db/students/*"!
  • Using Spark SQL with Python or Scala

spark.sql("DROP DATABASE IF EXISTS itversity_sms CASCADE")
spark.sql("CREATE DATABASE IF NOT EXISTS itversity_sms")
spark.sql("USE itversity_sms")
spark.sql("DROP TABLE IF EXISTS students")
spark.sql("""
CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_numbers ARRAY<STRING>,
    student_address STRUCT<street:STRING, city:STRING, state:STRING, zip:STRING>
) STORED AS TEXTFILE
ROW FORMAT
    DELIMITED FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
""")
spark.sql("INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)")
spark.sql("INSERT INTO students VALUES (2, 'Donald', 'Duck', ARRAY('1234567890', '2345678901'), NULL)")
spark.sql("""
INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', ARRAY('1234567890', '2345678901'), STRUCT('A Street', 'One City', 'Some State', '12345')),
    (4, 'Bubble', 'Guppy', ARRAY('5678901234', '6789012345'), STRUCT('Bubbly Street', 'Guppy', 'La la land', '45678'))
""")
spark.sql("SELECT * FROM students")