Creating Partitioned Tables

Let us understand how to create a partitioned table and get data into that table.

import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can launch Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • Earlier we created the orders table. We will use it as a reference and create a partitioned table.

  • We can use the PARTITIONED BY clause to define the partition column along with its data type. In our case we will use order_month as the partition column.

  • We will not be able to load our original orders data directly into the partitioned table, as that data is not in sync with the partitioned structure.
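Since the original data carries an order_date rather than an order_month, the partition key has to be derived before the data can be loaded. A minimal sketch of that derivation in plain Scala (the function name orderMonth is illustrative, assuming order_date strings in the "yyyy-MM-dd HH:mm:ss.S" format used by the orders data set):

```scala
// Derive an integer partition key such as 201307 from a date string
// like "2013-07-25 00:00:00.0" by keeping the year and month digits.
def orderMonth(orderDate: String): Int =
  orderDate.take(7).replace("-", "").toInt

// Example: orderMonth("2013-07-25 00:00:00.0") returns 201307
```

The same logic can be expressed as a column expression when inserting into the table, for example with date_format(order_date, 'yyyyMM').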

Here is an example of creating a partitioned table in the Spark Metastore.

%%sql

USE itversity_retail

%%sql

SHOW tables

  • Drop orders_part if it already exists

%%sql

DROP TABLE IF EXISTS orders_part

%%sql

CREATE TABLE orders_part (
  order_id INT,
  order_date STRING,
  order_customer_id INT,
  order_status STRING
) PARTITIONED BY (order_month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

%%sql

DESCRIBE orders_part

spark.sql("DESCRIBE FORMATTED orders_part").show(200, false)

import sys.process._

s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders_part" !