Creating Tables using ParquetΒΆ

Let us create order_items table using Parquet file format. By default, the files of table using Parquet file format are compressed using Snappy algorithm.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • A table with parquet file format can be external.

  • In our case we will create managed table with file format as parquet in STORED AS clause.

  • We will explore INSERT to insert query results into this table of type parquet.

%%sql

USE itversity_retail
%%sql

SHOW tables
  • Drop order_items, if it already exists

%%sql

DROP TABLE IF EXISTS order_items
%%sql

CREATE TABLE order_items (
  order_item_id INT,
  order_item_order_id INT,
  order_item_product_id INT,
  order_item_quantity INT,
  order_item_subtotal FLOAT,
  order_item_product_price FLOAT
) STORED AS parquet
  • To get complete output run the below command using spark-sql. Here is the output look like.

image.png

%%sql

DESCRIBE FORMATTED order_items
spark.sql("DESCRIBE FORMATTED order_items").show(200, false)
val username = System.getProperty("user.name")
import sys.process._

s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/order_items" !