Sep 14, 2024

Guide to Optimizing Databricks and Spark-Based Workloads

Introduction

Optimizing query performance in Databricks is essential for efficient data processing, especially when dealing with large datasets and complex operations like joins. Techniques such as Broadcast Hash Joins (BHJ), data skew handling, and the prevention of data spilling are critical for faster, more reliable workflows. By fine-tuning these operations, you can drastically improve job execution times and resource utilization, making your Databricks jobs run more smoothly.

Broadcast Hash Joins

  • In Databricks, Broadcast Hash Join (BHJ) is a powerful technique that minimizes data shuffling by broadcasting smaller tables across worker nodes.

How to use it?

  1. Tuning Broadcast Thresholds:

    • Increase broadcast thresholds for larger tables using Databricks configurations:

      set spark.sql.autoBroadcastJoinThreshold = 209715200; -- 200MB 
      set spark.databricks.adaptive.autoBroadcastJoinThreshold = 209715200; -- 200MB

  2. Explicitly Broadcasting Small Tables:

    • Tip: Use hints to broadcast the small side of a join and avoid unnecessary shuffling (a PySpark equivalent is sketched below).

      SELECT /*+ BROADCAST(small) */ * FROM <large-table> l JOIN <small-table> small ON l.key = small.key;
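
Below is a minimal PySpark sketch of the same pattern using the DataFrame API. The table names come from the real-life example later in this section, while the customer_id join key and the 200MB threshold are illustrative assumptions rather than part of the original example.

    # Minimal PySpark sketch of a broadcast hash join; table names and the join key are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Mirror the SQL settings above: raise the auto-broadcast threshold to 200MB.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 209715200)

    # Hypothetical tables: a large fact table and a small dimension table.
    transactions = spark.table("transactions")
    customer_details = spark.table("customer_details")

    # Explicitly broadcast the small table so the large table is never shuffled for this join.
    result = transactions.join(broadcast(customer_details), on="customer_id", how="inner")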

How it helps:

  • Boost query performance by eliminating data shuffling.

  • Optimize resource usage with explicit broadcast controls in Databricks.

Real-Life Example:

For a sales dashboard, broadcasting a 5MB customer_details table while joining with a large transactions table can cut query time from 5 minutes to under 30 seconds.

Data Skewness

  • Data skewness occurs when data is unevenly distributed across partitions, leaving a few CPU cores to process disproportionately large chunks of data, which slows down or even fails Databricks jobs.
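
A quick way to confirm skew before picking a fix is to count rows per join key; keys whose counts dwarf the rest are the culprits. A minimal PySpark sketch, assuming an illustrative transactions table and customer_id key:

    # Count rows per join key to surface skewed values; table and column names are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    (spark.table("transactions")
        .groupBy("customer_id")
        .count()
        .orderBy(F.col("count").desc())
        .show(20))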

How to avoid it?

  1. Filter Skewed Values: Filter out skewed values (e.g., nulls) to eliminate the skew.

  2. Skew Hints: Identify skewed values and explicitly tell Spark to handle them:

    SELECT /*+ SKEW('table', 'column_name', (value1, value2)) */ * FROM table;

  3. AQE Skew Optimization: Use Adaptive Query Execution (AQE) to automatically optimize data skew:

    set spark.sql.adaptive.skewJoin.enabled = true;

  4. Salting: Add random suffixes to skewed values, breaking large partitions into smaller ones (see the sketch after this list).
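
Salting has no one-line SQL equivalent, so here is a minimal PySpark sketch of the idea. The table names, the customer_id join key, and the salt factor of 10 are illustrative assumptions:

    # Salting sketch: spread a skewed join key across N buckets; names and N=10 are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    N = 10  # number of salt buckets

    large_df = spark.table("transactions")      # large, skewed side
    small_df = spark.table("customer_details")  # small side, replicated once per salt value

    # Add a random salt (0..N-1) to every row of the skewed side.
    salted_large = large_df.withColumn("salt", (F.rand() * N).cast("int"))

    # Replicate the small side for each salt value so every salted key still finds its match.
    salts = spark.range(N).select(F.col("id").cast("int").alias("salt"))
    salted_small = small_df.crossJoin(salts)

    # Join on the original key plus the salt, then drop the helper column.
    result = (salted_large
        .join(salted_small, on=["customer_id", "salt"], how="inner")
        .drop("salt"))

Each heavy key is now split across up to N smaller partitions, at the cost of duplicating the small side N times, so keep N modest.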

How it helps:

  • Faster job execution with balanced data distribution.

  • Reduced failures by managing heavy partitions efficiently.

Data Spilling

  • Data spilling happens when Spark tasks run out of memory and have to write intermediate data to disk, slowing down your job.

  • It often occurs during wide transformations like joins and aggregations, especially when data is too large to fit in memory.

How to avoid it?

  1. AQE Auto-Tuning: Automatically tune the number of shuffle partitions to avoid data spills (an open-source-Spark equivalent is sketched after this list).

    set spark.sql.shuffle.partitions = auto;

  2. Manually Fine-Tune Shuffle Partitions: Manually set the correct number of partitions when auto-tuning doesn't work.

    spark.conf.set("spark.sql.shuffle.partitions", 2 * sc.defaultParallelism)
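
Note that the auto value for spark.sql.shuffle.partitions is a Databricks convenience; on open-source Spark the closest equivalent is AQE partition coalescing. A minimal sketch under that assumption, where the 128MB advisory partition size is illustrative:

    # AQE shuffle-partition tuning for open-source Spark; the 128m target size is an assumption.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Let AQE merge small shuffle partitions after each stage instead of hand-tuning the count.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")

    # Fallback: size the partition count from the cluster's total cores when AQE alone is not enough.
    spark.conf.set("spark.sql.shuffle.partitions", 2 * spark.sparkContext.defaultParallelism)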

How it helps:

  • Reduced Disk I/O: By tuning shuffle partitions, less data is spilled to disk.

  • Simplified Tuning: AQE automates partition optimization, reducing the need for manual adjustments.

Conclusion

Incorporating techniques like Broadcast Hash Joins, managing data skewness, and avoiding data spilling are key strategies to maximize performance and reliability in Databricks. Through careful tuning, the use of hints, and leveraging adaptive query execution (AQE), you can minimize costly data shuffling and prevent bottlenecks. The result is faster query times, more balanced resource usage, and a more scalable data platform.

Make your data engineering process efficient and cost-effective. Feel free to reach out for a data infrastructure audit.

How WTD Can Help

- Data experts for implementing projects

- On-demand data team for support
