Loading into Partitions
Let us understand how to use the LOAD command to load data into partitioned tables.
Let us start the Spark context for this Notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.
import org.apache.spark.sql.SparkSession
val username = System.getProperty("user.name")
val spark = SparkSession.
builder.
config("spark.ui.port", "0").
config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
enableHiveSupport.
appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
master("yarn").
getOrCreate
If you are going to use CLIs, you can launch Spark SQL using one of the following three approaches.
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
We need to make sure that the file format of the file being loaded into the table is the same as the file format used while creating the table.
We also need to make sure that the delimiters are consistent between the files and the table for the text file format.
Also, the data should match the criteria for the partition into which it is loaded.
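The checks above can be sketched in a few lines of code. This is a hedged illustration, not part of the course material: the field layout (comma-delimited, with `order_date` as the second field) matches the retail orders data, but the helper name and sample records are assumptions.

```python
def validate_for_partition(lines, delimiter=",", num_fields=4, month="2013-07"):
    """Return only the records that are safe to load into the given month's
    partition: delimiter count must be consistent and the order_date
    (assumed to be the second field) must fall within the target month."""
    good = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Delimiter consistency: splitting must yield the expected field count
        if len(fields) != num_fields:
            continue
        # Partition criteria: the record must belong to the target month
        if fields[1].startswith(month):
            good.append(line)
    return good

# Hypothetical sample records in the orders layout
sample = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-08-01 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0|broken-delimiter",
]
print(len(validate_for_partition(sample, month="2013-07")))  # → 1
```

Records that fail either check are silently dropped here; in practice you would route them to a rejects file for inspection.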
Our /data/retail_db/orders has data for the whole year, and hence we should not load the data directly into a partition. We need to split the data into files matching the partition criteria and then load them into the table.
To use the LOAD command to load files into partitions, we need to pre-partition the data based on the partition logic.
Here is an example of using simple shell commands to partition the data. Use the command prompt to run these commands.
rm -rf ~/orders
mkdir -p ~/orders
grep 2013-07 /data/retail_db/orders/part-00000 > ~/orders/orders_201307
grep 2013-08 /data/retail_db/orders/part-00000 > ~/orders/orders_201308
grep 2013-09 /data/retail_db/orders/part-00000 > ~/orders/orders_201309
grep 2013-10 /data/retail_db/orders/part-00000 > ~/orders/orders_201310
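For reference, the same pre-partitioning can be done in a single pass over the input file instead of one `grep` per month. This is a hedged sketch: the paths mirror the shell commands above, and the comma-delimited layout with `order_date` as the second field is an assumption about the data.

```python
import os

def split_by_month(input_path, output_dir, months):
    """Split an orders file into one file per month (orders_YYYYMM),
    mirroring the grep commands above in a single pass."""
    os.makedirs(output_dir, exist_ok=True)
    # One open output handle per target month, e.g. "2013-07" -> orders_201307
    handles = {
        m: open(os.path.join(output_dir, "orders_" + m.replace("-", "")), "w")
        for m in months
    }
    try:
        with open(input_path) as src:
            for line in src:
                # order_date is assumed to be the second comma-delimited field;
                # its first 7 characters are the YYYY-MM month key
                month = line.split(",")[1][:7]
                if month in handles:
                    handles[month].write(line)
    finally:
        for h in handles.values():
            h.close()

# Usage mirroring the shell example:
# split_by_month("/data/retail_db/orders/part-00000",
#                os.path.expanduser("~/orders"),
#                ["2013-07", "2013-08", "2013-09", "2013-10"])
```

Unlike `grep 2013-07`, which matches the pattern anywhere in the line, this sketch keys strictly on the order date field, which avoids accidental matches in other columns.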
Let us see how we can load data into the corresponding partitions. The data has to be pre-partitioned based on the partition column.
%%sql
USE itversity_retail
%%sql
LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201307'
INTO TABLE orders_part PARTITION (order_month=201307)
import sys.process._
s"hdfs dfs -ls -R /user/${username}/warehouse/${username}_retail.db/orders_part" !
%%sql
LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201308'
INTO TABLE orders_part PARTITION (order_month=201308)
%%sql
LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201309'
INTO TABLE orders_part PARTITION (order_month=201309)
%%sql
LOAD DATA LOCAL INPATH '/home/itversity/orders/orders_201310'
INTO TABLE orders_part PARTITION (order_month=201310)
import sys.process._
s"hdfs dfs -ls -R /user/${username}/warehouse/${username}_retail.db/orders_part" !
import sys.process._
s"hdfs dfs -tail /user/${username}/warehouse/${username}_retail.db/orders_part/order_month=201310/orders_201310" !
%%sql
SELECT * FROM orders_part LIMIT 10