Loading Data - Append and OverwriteΒΆ

Let us understand different approaches to load the data into Spark Metastore table.

Let us start spark context for this Notebook so that we can execute the code provided. You can sign up for our 10 node state of the art cluster/labs to learn Spark SQL using our unique integrated LMS.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

If you are going to use CLIs, you can use Spark SQL using one of the 3 approaches.

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • INTO TABLE will append in the existing table

  • If we want to overwrite we have to specify OVERWRITE INTO TABLE

%%sql

USE itversity_retail
%%sql

SELECT count(1) FROM orders
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders" !
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/orders' 
  INTO TABLE orders
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders" !
%%sql

SELECT count(1) FROM orders
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/orders' 
  OVERWRITE INTO TABLE orders
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders" !
%%sql

SELECT count(1) FROM orders
  • Using Spark SQL with Python or Scala

spark.sql("USE itversity_retail")
spark.sql("SELECT count(1) FROM orders").show()
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders" !
spark.sql("""
LOAD DATA LOCAL INPATH '/data/retail_db/orders' 
  INTO TABLE orders
""")
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders" !
spark.sql("SELECT count(1) FROM orders").show()
spark.sql("""
LOAD DATA LOCAL INPATH '/data/retail_db/orders' 
  OVERWRITE INTO TABLE orders
""")
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/orders" !
spark.sql("SELECT count(1) FROM orders").show()