Creating Partitioned Tables
Let us understand how to create a partitioned table and get data into it.
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
master("yarn").
getOrCreate
If you are going to use CLIs, you can access Spark SQL using one of the following three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Earlier we created the orders table. We will use it as a reference and create a partitioned table. We can use the PARTITIONED BY clause to define the partition column along with its data type. In our case we will use order_month as the partition column. We will not be able to load our original orders data directly into the partitioned table, as that data is not in sync with the partitioned table's structure (it has no order_month column).
Here is an example of creating a partitioned table in Spark Metastore.
%%sql
USE itversity_retail
%%sql
SHOW TABLES
Drop orders_part if it already exists.
%%sql
DROP TABLE IF EXISTS orders_part
%%sql
CREATE TABLE orders_part (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) PARTITIONED BY (order_month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
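Once the partitioned table exists, partitions can be added explicitly before any data is loaded. Here is a minimal sketch; the month value 201307 is purely illustrative, and `IF NOT EXISTS` keeps the statement idempotent. Adding a partition creates a corresponding `order_month=201307` subdirectory under the table's warehouse location.

```sql
%%sql
ALTER TABLE orders_part ADD IF NOT EXISTS PARTITION (order_month=201307)
```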
%%sql
DESCRIBE orders_part
spark.sql("DESCRIBE FORMATTED orders_part").show(200, false)
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders_part" !
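One way to get the orders data into orders_part, despite the missing order_month column, is a dynamic partition insert that derives order_month from order_date. The sketch below assumes the orders table exists in the same database with order_date stored as a 'yyyy-MM-dd ...' string; for Hive tables, dynamic partitioning typically requires the nonstrict mode setting shown first.

```sql
%%sql
SET hive.exec.dynamic.partition.mode=nonstrict
```

```sql
%%sql
INSERT INTO TABLE orders_part PARTITION (order_month)
SELECT o.*, CAST(date_format(o.order_date, 'yyyyMM') AS INT) AS order_month
FROM orders AS o
```

We can then validate with `SHOW PARTITIONS orders_part`, which should list one partition per distinct month in the data.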