Spark SQL – Overview

Let us get an overview of Spark SQL.

Here are the standard operations we typically perform while processing data. In Spark, we can perform each of these using either the Data Frame APIs or Spark SQL.

  • Selection or Projection – select clause

    • These are also called row-level transformations.

    • Apply standardization rules (convert names and addresses to upper case).

    • Mask parts of sensitive data (e.g., SSNs and dates of birth).
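
  For example, a projection with row-level transformations might look like this in Spark SQL. The customers table and its columns here are hypothetical, used only for illustration:

  ```sql
  -- Standardize names to upper case and mask all but the last 4 digits of the SSN
  SELECT customer_id,
         upper(customer_name)                         AS customer_name,
         upper(customer_address)                      AS customer_address,
         concat('XXX-XX-', substr(customer_ssn, -4))  AS customer_ssn
  FROM customers
  ```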

  • Filtering data – where clause

    • Get orders based on date or product or category.
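
  A filtering query of that kind might be sketched as follows; the orders table and its columns are assumed for the example:

  ```sql
  -- Keep only completed or closed orders placed on a given date
  SELECT *
  FROM orders
  WHERE order_date = '2014-01-01'
    AND order_status IN ('COMPLETE', 'CLOSED')
  ```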

  • Joins – join clause (outer joins are supported as well)

    • Join multiple data sets.
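
  A typical join between two hypothetical data sets, orders and order_items, could look like this. A LEFT OUTER JOIN would additionally retain orders that have no matching items:

  ```sql
  -- Combine order details with their line items
  SELECT o.order_id,
         o.order_date,
         oi.order_item_subtotal
  FROM orders AS o
  JOIN order_items AS oi
    ON o.order_id = oi.order_item_order_id
  ```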

  • Aggregations – group by clause with aggregate functions such as sum, avg, min, max, etc.

    • Get revenue for a given order

    • Get revenue for each order

    • Get daily revenue
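
  Daily revenue, for instance, can be computed by grouping on the order date; the orders and order_items tables below are hypothetical:

  ```sql
  -- Sum line-item subtotals per day to get daily revenue
  SELECT o.order_date,
         round(sum(oi.order_item_subtotal), 2) AS daily_revenue
  FROM orders AS o
  JOIN order_items AS oi
    ON o.order_id = oi.order_item_order_id
  GROUP BY o.order_date
  ```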

  • Sorting – order by

    • Sort the final output by date.

    • Sort the final output by date, then by revenue in descending order.

    • Sort the final output by state or province, then by revenue in descending order.
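
  Sorting composes with the operations above. Extending the hypothetical daily-revenue query, we can order by date ascending and then revenue descending:

  ```sql
  -- Sort daily revenue by date, then by revenue in descending order
  SELECT o.order_date,
         round(sum(oi.order_item_subtotal), 2) AS daily_revenue
  FROM orders AS o
  JOIN order_items AS oi
    ON o.order_id = oi.order_item_order_id
  GROUP BY o.order_date
  ORDER BY o.order_date ASC, daily_revenue DESC
  ```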

  • Analytic Functions – aggregate, ranking, and windowing functions

    • Get top 5 stores by revenue for each state.

    • Get top 5 products by revenue in each category.
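
  "Top N per group" problems like these are typically solved with a ranking function over a window partitioned by the group. As a sketch, assuming a hypothetical store_revenue table with state, store_id, and revenue columns:

  ```sql
  -- Rank stores by revenue within each state and keep the top 5 per state
  SELECT state, store_id, revenue
  FROM (
    SELECT state,
           store_id,
           revenue,
           dense_rank() OVER (PARTITION BY state ORDER BY revenue DESC) AS rnk
    FROM store_revenue
  ) ranked
  WHERE rnk <= 5
  ```

  Using dense_rank rather than row_number keeps all stores that tie for a top-5 position; either choice is valid depending on how ties should be handled.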