Managed Tables - Exercise

Let us use NYSE data and see how we can create tables in Spark Metastore.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can launch Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Here are the details of the exercise.

  • Duration: 30 Minutes

  • Data Location (Local): /data/nyse_all/nyse_data

  • Create a database with the name YOUR_OS_USER_NAME_nyse

  • Table Name: nyse_eod

  • File Format: TEXTFILE (default)

  • Review the files by running Linux commands before using the data sets. The data is compressed, and the files can be loaded as is.

  • Copy one of the zip files to your home directory and preview the data. There should be 7 fields; you need to determine the delimiter.

  • Field Names: stockticker, tradedate, openprice, highprice, lowprice, closeprice, volume

  • Determine the correct data types based on the values. For example, you need to use BIGINT for volume, not INT.

  • Create a managed table with the default delimiter. As the delimiter in the data files and the delimiter of the table are not the same, you need to figure out how to get the data into the target table (see the sketch after this list).

  • Make sure the data is copied into the table as per the defined structure, and validate it.
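
If you get stuck on the last two steps, below is a minimal sketch of one possible approach, not the official solution. It assumes the files turn out to be comma-separated and that the column types shown match the values you find; the staging table name nyse_eod_stage is an illustrative choice. The idea is to load the compressed files as is into a staging table whose delimiter matches the files, then rewrite the data into the target table so that it ends up stored with the default delimiter.

import sys.process._

val username = System.getProperty("user.name")

// Review the files before loading; sys.process lets you run Linux commands
// from the Scala REPL.
println("ls -ltr /data/nyse_all/nyse_data".!!)

spark.sql(s"CREATE DATABASE IF NOT EXISTS ${username}_nyse")
spark.sql(s"USE ${username}_nyse")

// Target table: TEXTFILE with no ROW FORMAT clause, so the default
// field delimiter (\u0001) is used.
spark.sql("""
    CREATE TABLE IF NOT EXISTS nyse_eod (
        stockticker STRING,
        tradedate STRING,
        openprice FLOAT,
        highprice FLOAT,
        lowprice FLOAT,
        closeprice FLOAT,
        volume BIGINT
    ) STORED AS TEXTFILE
""")

// Staging table whose delimiter matches the data files, so the compressed
// files can be loaded as is. Both the table name and the ',' delimiter are
// assumptions for this sketch.
spark.sql("""
    CREATE TABLE IF NOT EXISTS nyse_eod_stage (
        stockticker STRING,
        tradedate STRING,
        openprice FLOAT,
        highprice FLOAT,
        lowprice FLOAT,
        closeprice FLOAT,
        volume BIGINT
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

// Load the files as is into the staging table, then rewrite them into the
// target table so the data is stored with the table's default delimiter.
spark.sql("LOAD DATA LOCAL INPATH '/data/nyse_all/nyse_data' INTO TABLE nyse_eod_stage")
spark.sql("INSERT INTO nyse_eod SELECT * FROM nyse_eod_stage")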


Validation

Run the following queries to ensure that you will be able to read the data.

DESCRIBE FORMATTED YOUR_OS_USER_NAME_nyse.nyse_eod;
SELECT * FROM YOUR_OS_USER_NAME_nyse.nyse_eod LIMIT 10;
SELECT count(1) FROM YOUR_OS_USER_NAME_nyse.nyse_eod;
// There should not be a field delimiter in the output, as the requirement
// is to use the default delimiter.
spark.sql("DESCRIBE FORMATTED itversity_nyse.nyse_eod").show(200, false)

%%sql

SELECT * FROM itversity_nyse.nyse_eod LIMIT 10

%%sql

SELECT count(1) FROM itversity_nyse.nyse_eod
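
As an extra sanity check, you could compare the target table count with the count in the staging table used for the load; the two should match if the copy was complete. This assumes the hypothetical nyse_eod_stage table from the sketch earlier in this section.

// Counts should match if all the staged records made it into the managed table.
// nyse_eod_stage is the hypothetical staging table from the earlier sketch.
spark.sql("SELECT count(1) FROM itversity_nyse.nyse_eod_stage").show()
spark.sql("SELECT count(1) FROM itversity_nyse.nyse_eod").show()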