Loading into Partitions

Let us understand how to use load command to load data into partitioned tables.

Let us start the Spark context for this Notebook so that we can execute the code provided.

import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can run Spark SQL using one of these three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • We need to make sure that the file format of the file being loaded into the table matches the file format used while creating the table.

  • For text file format, we also need to make sure that the delimiters are consistent between the files and the table.

  • The data should also match the criteria of the partition into which it is loaded.

  • Our /data/retail_db/orders data covers the whole year, so we should not load it directly into a single partition.

  • We need to split the data into files matching the partition criteria and then load them into the table.

To use the load command to load files into partitions, we need to pre-partition the data based on the partitioning logic.

Here is an example of using simple shell commands to partition the data. Use the command prompt to run these commands.

rm -rf ~/orders
mkdir -p ~/orders

grep 2013-07 /data/retail_db/orders/part-00000 > ~/orders/orders_201307
grep 2013-08 /data/retail_db/orders/part-00000 > ~/orders/orders_201308
grep 2013-09 /data/retail_db/orders/part-00000 > ~/orders/orders_201309
grep 2013-10 /data/retail_db/orders/part-00000 > ~/orders/orders_201310
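The grep commands above filter records by the month embedded in the order date. The same split can be sketched in plain Scala; this is a minimal illustration with inline sample records (the `SplitOrders` object and its helper are hypothetical, not part of the lab), whereas the lab reads from /data/retail_db/orders/part-00000.

```scala
// Minimal sketch: split order records by the yyyy-MM prefix of the order date.
// Inline sample data keeps the snippet self-contained.
object SplitOrders {
  // Keep only the lines whose order_date contains the given yyyy-MM month.
  def forMonth(lines: Seq[String], month: String): Seq[String] =
    lines.filter(_.contains(month))

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "1,2013-07-25 00:00:00.0,11599,CLOSED",
      "2,2013-08-01 00:00:00.0,256,PENDING_PAYMENT",
      "3,2013-07-30 00:00:00.0,12111,COMPLETE"
    )
    // Records matching 2013-07 would go to ~/orders/orders_201307.
    SplitOrders.forMonth(sample, "2013-07").foreach(println)
  }
}
```

Each filtered sequence corresponds to one of the month files written by grep above.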

Let us see how we can load data into the corresponding partitions. The data has to be pre-partitioned based on the partition column.

%%sql

USE itversity_retail
%%sql

LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201307'
  INTO TABLE orders_part PARTITION (order_month=201307)
import sys.process._

s"hdfs dfs -ls -R /user/${username}/warehouse/${username}_retail.db/orders_part" !
%%sql

LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201308'
  INTO TABLE orders_part PARTITION (order_month=201308)
%%sql

LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201309'
  INTO TABLE orders_part PARTITION (order_month=201309)
%%sql

LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201310'
  INTO TABLE orders_part PARTITION (order_month=201310)
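The four LOAD DATA statements above differ only in the month, so they can be generated in a loop. A sketch of the idea (the `LoadStatements` object is an assumption; in the notebook, each generated statement would be passed to spark.sql):

```scala
// Sketch: build one LOAD DATA statement per month instead of repeating
// near-identical SQL cells. Each result would be run via spark.sql(stmt).
object LoadStatements {
  def loadStmt(month: String): String =
    s"""LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_$month'
       |  INTO TABLE orders_part PARTITION (order_month=$month)""".stripMargin

  def main(args: Array[String]): Unit = {
    val months = Seq("201307", "201308", "201309", "201310")
    // In the notebook: months.map(loadStmt).foreach(spark.sql)
    months.map(loadStmt).foreach(println)
  }
}
```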
import sys.process._

s"hdfs dfs -ls -R /user/${username}/warehouse/${username}_retail.db/orders_part" !
import sys.process._

s"hdfs dfs -tail /user/${username}/warehouse/${username}_retail.db/orders_part/order_month=201310/orders_201310" !
%%sql

SELECT * FROM orders_part LIMIT 10
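After loading, a per-partition row count is a quick sanity check (for example, SELECT order_month, count(1) FROM orders_part GROUP BY order_month). The same grouping can be sketched on raw records in plain Scala; the `PartitionCounts` object and sample data below are illustrative assumptions.

```scala
// Sketch: derive the yyyyMM partition key from each record's date column
// and count records per key, mirroring a GROUP BY on order_month.
object PartitionCounts {
  // "1,2013-07-25 00:00:00.0,..." -> "201307"
  def monthKey(line: String): String =
    line.split(",")(1).take(7).replace("-", "")

  def counts(lines: Seq[String]): Map[String, Int] =
    lines.groupBy(monthKey).map { case (k, v) => (k, v.size) }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "1,2013-07-25 00:00:00.0,11599,CLOSED",
      "2,2013-08-01 00:00:00.0,256,PENDING_PAYMENT",
      "3,2013-07-30 00:00:00.0,12111,COMPLETE"
    )
    println(PartitionCounts.counts(sample)) // counts per partition key
  }
}
```

These counts should match the number of records in each month file loaded above.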