Loading Data into Tables - HDFS

Let us understand how we can load data from HDFS location into Spark Metastore table.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our 10-node state-of-the-art cluster/labs to learn Spark SQL using our unique integrated LMS.

import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can run Spark SQL using one of the following three approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

  • We can use the LOAD command without the LOCAL keyword to get data from an HDFS location into a Spark Metastore table.

  • The user running the LOAD command against an HDFS location needs to have write permissions on the source location, as the data will be moved (deleted from the source and placed under the Spark Metastore table's directory).

  • Make sure the user has write permissions on the source location (a quick check is sketched right after this list).

  • First we need to copy the data into an HDFS location where the user has write permissions.
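
Before copying, it can help to confirm that you own a location in HDFS where the data can be staged. Here is a minimal sketch, assuming the data will be staged under the user's HDFS home directory as in the commands that follow; it simply lists the directory entry so you can inspect its owner and permission bits.

import sys.process._

val username = System.getProperty("user.name")
// Show the directory entry itself (-d) so the owner and permission bits are visible
s"hadoop fs -ls -d /user/${username}" !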

import sys.process._
val username = System.getProperty("user.name")
s"hadoop fs -rm -R /user/${username}/retail_db/orders" !
s"hadoop fs -mkdir /user/${username}/retail_db" !
s"hadoop fs -put -f /data/retail_db/orders /user/${username}/retail_db" !
s"hadoop fs -ls /user/${username}/retail_db/orders" !
  • Here is the script which will truncate the table and then load the data from the HDFS location into the Hive table.

%%sql

USE itversity_retail
%%sql

TRUNCATE TABLE orders
%%sql

LOAD DATA INPATH '/user/itversity/retail_db/orders' 
  INTO TABLE orders
s"hadoop fs -ls /user/${username}/warehouse/${username}_retail.db/orders" !
s"hadoop fs -ls /user/${username}/retail_db/orders" !
%%sql

SELECT * FROM orders LIMIT 10
%%sql

SELECT count(1) FROM orders
  • Using Spark SQL with Python or Scala

spark.sql("USE itversity_retail")
spark.sql("TRUNCATE TABLE orders")
spark.sql("""
LOAD DATA INPATH '/user/itversity/retail_db/orders' 
  INTO TABLE orders""")
s"hadoop fs -ls /user/${username}/retail_db/orders" !
spark.sql("SELECT * FROM orders LIMIT 10")
spark.sql("SELECT count(1) FROM orders")
  • If you look at the source location (/user/${username}/retail_db in this case), you will see that the orders directory has been deleted.

  • Moving files within HDFS is a metadata-only namespace operation, so it is much faster than copying the underlying blocks; hence the Hive LOAD command will always move the files when the source is an HDFS location.
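
As a quick sanity check of the move semantics, here is a sketch (assuming the warehouse location configured for the session above) that confirms the source directory is gone while the files now sit under the table's warehouse directory.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Source location should no longer exist after LOAD DATA INPATH (expected: false)
println(fs.exists(new Path(s"/user/${username}/retail_db/orders")))
// Table directory under the warehouse should now hold the moved files (expected: true)
println(fs.exists(new Path(s"/user/${username}/warehouse/${username}_retail.db/orders")))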