Overview of Spark SQL Properties

Let us understand the details of Spark SQL properties, which control the Spark SQL runtime environment.

  • Spark SQL inherits the properties defined for Spark. There are some Spark SQL specific properties as well, and these apply to Data Frames too.

  • We can review these properties using management tools such as the Ambari or Cloudera Manager Web UI.

  • In clusters where Spark is integrated with Hadoop and Hive, Spark runtime behavior is also controlled by the HDFS properties files, YARN properties files, Hive properties files, etc.

  • We can get all the properties using SET; in the Spark SQL CLI (see the sketch after this list for a programmatic alternative).

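As a programmatic alternative to the CLI, the same information is available from a SparkSession. Below is a minimal sketch, assuming an already created session named spark; spark.conf.getAll and spark.sql("SET") are standard Spark APIs.

// Inspect Spark SQL properties programmatically,
// assuming an existing SparkSession named spark
spark.conf.getAll.foreach { case (key, value) =>
  println(s"$key = $value")
}

// Equivalent of the SQL SET command - returns a Data Frame of key/value pairs
spark.sql("SET").show(false)
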
Let us review some important properties in Spark SQL.

  • spark.sql.warehouse.dir
  • spark.sql.catalogImplementation
  • We can review the current value using SET spark.sql.warehouse.dir;

import org.apache.spark.sql.SparkSession

// Build a session with a user specific warehouse directory and Hive support
val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    master("yarn").
    appName(s"${username} | Spark SQL - Getting Started").
    getOrCreate
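
Once the session is created, we can verify the configured values using spark.conf.get, the standard accessor on a SparkSession. A minimal sketch against the session built above:

// Verify the properties configured while building the session
println(spark.conf.get("spark.sql.warehouse.dir"))
// Returns "hive" because enableHiveSupport was used while building the session
println(spark.conf.get("spark.sql.catalogImplementation"))
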
%%sql

SET
%%sql

SET spark.sql.warehouse.dir
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|spark.sql.warehou...|/user/itversity/w...|
+--------------------+--------------------+
  • Properties with default values do not show up as part of the SET command. But we can still check and override their values - for example:

%%sql

SET spark.sql.shuffle.partitions
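
Even though a bare SET does not list it, the property is defined with a default value. A minimal sketch checking it from Scala using spark.conf.get:

// spark.sql.shuffle.partitions controls the number of partitions used for
// shuffles triggered by joins and aggregations; the default is 200
println(spark.conf.get("spark.sql.shuffle.partitions"))
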
  • We can override a property by setting its value using the same SET command, e.g.:

%%sql

SET spark.sql.shuffle.partitions=2
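
The same override can also be applied programmatically. A minimal sketch using spark.conf.set, the standard setter on a SparkSession:

// Equivalent programmatic override from Scala
spark.conf.set("spark.sql.shuffle.partitions", "2")
// Confirm the new value
println(spark.conf.get("spark.sql.shuffle.partitions"))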