Understanding Warehouse Directory

Let us go through the details of the Spark Metastore warehouse directory.

  • A database in Spark SQL is nothing but a directory in the underlying file system, such as HDFS.

  • A Spark Metastore table is nothing but a directory in the underlying file system, such as HDFS.

  • A partition of a Spark Metastore table is nothing but a directory under the table's directory in the underlying file system.

  • The warehouse directory is the base directory under which the directories for databases and tables are created by default.

  • It is controlled by spark.sql.warehouse.dir. You can get the value by running SET spark.sql.warehouse.dir;

Do not override this property from within the Spark SQL CLI. It will not have any effect.

  • The underlying directory for a database has a .db extension.
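
The layout described in the points above can be sketched in plain Scala. This is only an illustration of how the paths are typically composed, not Spark's own code: the object WarehouseLayout, its methods, and the sample names (retail, orders, order_month) are hypothetical. It also assumes the common defaults that the built-in default database maps to the warehouse directory itself and that partition directories are named column=value.

```scala
// Hypothetical helper that mirrors the default directory layout
object WarehouseLayout {
  // Database directory: <warehouse>/<database>.db
  // Assumption: the built-in `default` database uses the warehouse directory itself
  def databaseDir(warehouseDir: String, database: String): String =
    if (database == "default") warehouseDir else s"$warehouseDir/$database.db"

  // Table directory: <database directory>/<table>
  def tableDir(warehouseDir: String, database: String, table: String): String =
    s"${databaseDir(warehouseDir, database)}/$table"

  // Partition directory: <table directory>/<column>=<value>[/<column>=<value>...]
  def partitionDir(
      warehouseDir: String,
      database: String,
      table: String,
      partition: Seq[(String, String)]
  ): String = {
    val suffix = partition.map { case (column, value) => s"$column=$value" }.mkString("/")
    s"${tableDir(warehouseDir, database, table)}/$suffix"
  }
}

// Hypothetical warehouse location, used only for illustration
val exampleWarehouseDir = "/user/itversity/warehouse"

println(WarehouseLayout.databaseDir(exampleWarehouseDir, "retail"))
// /user/itversity/warehouse/retail.db
println(WarehouseLayout.tableDir(exampleWarehouseDir, "retail", "orders"))
// /user/itversity/warehouse/retail.db/orders
println(WarehouseLayout.partitionDir(exampleWarehouseDir, "retail", "orders", Seq("order_month" -> "2014-01")))
// /user/itversity/warehouse/retail.db/orders/order_month=2014-01
```

Databases or tables created with an explicit LOCATION do not follow this layout; the sketch only covers the default behaviour.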

import org.apache.spark.sql.SparkSession

// Use the OS user name to build a per-user warehouse directory
val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    // Port 0 lets Spark pick any free port for the UI
    config("spark.ui.port", "0").
    // Must be set before the session is created
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    master("yarn").
    appName(s"${username} | Spark SQL - Getting Started").
    getOrCreate

%%sql

SET spark.sql.warehouse.dir
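
The same check can be done from Scala instead of the %%sql cell. The following is a minimal sketch, assuming the session created above is still active, Hive support is enabled, and the current user can read the warehouse directory. The names session, warehouseDir, warehousePath and fs are introduced here only for illustration.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Pick up the session that was already created in this notebook
val session = SparkSession.builder.getOrCreate()

// Same value that SET spark.sql.warehouse.dir returns
val warehouseDir = session.conf.get("spark.sql.warehouse.dir")
println(warehouseDir)

// Location recorded in the metastore for the default database
println(session.catalog.getDatabase("default").locationUri)

// List the directories directly under the warehouse directory
// (database directories typically end with .db)
val warehousePath = new Path(warehouseDir)
val fs = warehousePath.getFileSystem(session.sparkContext.hadoopConfiguration)
if (fs.exists(warehousePath)) {
  fs.listStatus(warehousePath).
    filter(_.isDirectory).
    foreach(status => println(status.getPath))
}
```

The directory listing is empty, or skipped entirely, until at least one database or table has been created under the warehouse directory.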