Managed Tables - Exercise¶
Let us use NYSE data and see how we can create tables in Spark Metastore.
Let us start the Spark context for this notebook so that we can execute the code provided.
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
master("yarn").
getOrCreate
If you are going to use CLIs, you can access Spark SQL using one of these three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Duration: 30 Minutes
Data Location (Local): /data/nyse_all/nyse_data
Create a database with the name - YOUR_OS_USER_NAME_nyse
Table Name: nyse_eod
File Format: TEXTFILE (default)
Review the files by running Linux commands before using the data sets. The data is compressed, and we can load the files as is.
Copy one of the zip files to your home directory and preview the data. There should be 7 fields; you need to determine the delimiter.
Field Names: stockticker, tradedate, openprice, highprice, lowprice, closeprice, volume. Determine the correct data types based on the values; for example, you need to use BIGINT (not INT) for volume.
Create a managed table with the default delimiter.
As the delimiter in the data and the delimiter of the table are not the same, you need to figure out how to get the data into the target table.
Make sure the data is copied into the table as per the structure defined and validate.
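The review step above can be sketched with standard Linux commands. The block below is self-contained (it fabricates a one-line gzip file so it can run anywhere); on the lab, point the same commands at a real file under /data/nyse_all/nyse_data, and use unzip instead of gunzip if the archives turn out to be .zip rather than .gz. The sample record and the comma delimiter are assumptions to be verified against the actual files.

```shell
# Simulate one compressed data file (replace with a real file from
# /data/nyse_all/nyse_data on the lab; the record below is made up)
printf 'A,20100101,1.0,1.1,0.9,1.05,1000\n' > nyse_sample.txt
gzip -f nyse_sample.txt

# Preview the data without extracting the archive
gunzip -c nyse_sample.txt.gz | head -5

# Count the fields on the first record, assuming ',' as the delimiter
gunzip -c nyse_sample.txt.gz | head -1 | awk -F',' '{print NF}'   # prints 7
```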
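Putting the steps together, here is one possible sketch in SQL, assuming the preview shows comma-delimited fields (verify this yourself) and using an illustrative staging table, nyse_eod_stage, to bridge the delimiter mismatch; the column types shown are also assumptions to validate against the data.

```sql
-- Database named after your OS user (substitute YOUR_OS_USER_NAME)
CREATE DATABASE IF NOT EXISTS YOUR_OS_USER_NAME_nyse;
USE YOUR_OS_USER_NAME_nyse;

-- Managed target table; no ROW FORMAT clause, so the default delimiter is used
CREATE TABLE nyse_eod (
  stockticker STRING,
  tradedate INT,
  openprice FLOAT,
  highprice FLOAT,
  lowprice FLOAT,
  closeprice FLOAT,
  volume BIGINT
);

-- Staging table matching the delimiter found in the files (',' assumed here)
CREATE TABLE nyse_eod_stage (
  stockticker STRING,
  tradedate INT,
  openprice FLOAT,
  highprice FLOAT,
  lowprice FLOAT,
  closeprice FLOAT,
  volume BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Compressed files can be loaded as is
LOAD DATA LOCAL INPATH '/data/nyse_all/nyse_data' INTO TABLE nyse_eod_stage;

-- Copy into the target; data is rewritten with the table's default delimiter
INSERT INTO nyse_eod SELECT * FROM nyse_eod_stage;
```

Inserting via a SELECT from the staging table is what converts the records to the target table's default delimiter; a plain LOAD DATA into nyse_eod would leave the original delimiter in place and the fields would not parse.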
Validation¶
Run the following queries to ensure that you will be able to read the data.
DESCRIBE FORMATTED YOUR_OS_USER_NAME_nyse.nyse_eod;
SELECT * FROM YOUR_OS_USER_NAME_nyse.nyse_eod LIMIT 10;
SELECT count(1) FROM YOUR_OS_USER_NAME_nyse.nyse_eod;
// There should not be a field delimiter in the table details, as the requirement is to use the default delimiter
spark.sql("DESCRIBE FORMATTED itversity_nyse.nyse_eod").show(200, false)
%%sql
SELECT * FROM itversity_nyse.nyse_eod LIMIT 10
%%sql
SELECT count(1) FROM itversity_nyse.nyse_eod