Sep 14, 2024
A Guide to Optimizing Databricks and Spark-Based Workloads
Introduction
Optimizing query performance in Databricks is essential for efficient data processing, especially when dealing with large datasets and complex operations like joins. Techniques such as Broadcast Hash Joins (BHJ), data skew handling, and the prevention of data spilling are critical for faster, more reliable workflows. By fine-tuning these operations, you can drastically improve job execution times and resource utilization, making your Databricks jobs run more smoothly.
Broadcast Hash Joins
In Databricks, Broadcast Hash Join (BHJ) is a powerful technique that minimizes data shuffling by broadcasting smaller tables across worker nodes.
How to apply it?
Tuning Broadcast Thresholds:
Increase broadcast thresholds for larger tables using Databricks configurations:
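A minimal sketch of raising the threshold from a notebook; the 100 MB value is purely illustrative (the Spark default is 10 MB), and the right value depends on your driver and executor memory:

```python
# Broadcast any table up to ~100 MB (default threshold is 10 MB).
# Treat 100 MB as an example, not a recommendation.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

# Setting -1 disables automatic broadcasting altogether.
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```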
Explicitly Broadcasting Small Tables:
Tip: Use hints to broadcast small tables and avoid unnecessary shuffling.
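A minimal PySpark sketch of an explicit broadcast; the customer_details and transactions tables and the customer_id key are illustrative names, not from a real schema:

```python
from pyspark.sql.functions import broadcast

# Small dimension table (~5 MB) and large fact table -- names are illustrative.
customer_details = spark.table("customer_details")
transactions = spark.table("transactions")

# The broadcast() hint ships customer_details to every executor,
# so the large transactions table is joined without a shuffle.
enriched = transactions.join(broadcast(customer_details), on="customer_id")
```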
How it helps:
Faster queries by eliminating the shuffle for the join.
Better resource usage through explicit control over which tables Databricks broadcasts.
Real-Life Example:
For a sales dashboard, broadcasting a 5 MB customer_details table while joining it with a large transactions table can cut query time from 5 minutes to under 30 seconds.
Data Skewness
Data skewness occurs when data is unevenly distributed across partitions, leaving a few CPU cores to process disproportionately large chunks of data, which slows performance or causes failures in Databricks jobs.
How to avoid it?
Filter Skewed Values: Filter out skewed values (e.g., nulls) to eliminate the skew.
Skew Hints: Identify skewed values and explicitly tell Spark to handle them:
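For example, using the Databricks-specific SKEW hint in SQL; the table and column names are illustrative, and on recent runtimes AQE skew handling (next item) is usually preferred:

```python
# Tell the optimizer that the 'transactions' relation is skewed on customer_id.
# The SKEW hint is Databricks-specific; table and column names are illustrative.
result = spark.sql("""
    SELECT /*+ SKEW('transactions', 'customer_id') */
           transactions.customer_id, SUM(transactions.amount) AS total_amount
    FROM transactions
    JOIN customer_details
      ON transactions.customer_id = customer_details.customer_id
    GROUP BY transactions.customer_id
""")
```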
AQE Skew Optimization: Use Adaptive Query Execution (AQE) to automatically optimize data skew:
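A sketch of the relevant AQE settings; AQE and skew-join handling are enabled by default on recent Databricks runtimes, and the threshold values below are the Spark defaults, shown only for illustration:

```python
# Enable AQE and its automatic skew-join handling.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Knobs that decide when a partition counts as skewed (Spark defaults shown).
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```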
Salting: Add random suffixes to skewed values, breaking large partitions into smaller ones.
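A rough salting sketch in PySpark; the tables, the customer_id key, and the bucket count of 10 are all assumptions for illustration:

```python
from pyspark.sql import functions as F

NUM_BUCKETS = 10  # how many sub-keys each skewed key is split into (tuning assumption)

# Add a random salt to the large, skewed side so rows for a hot key
# spread across NUM_BUCKETS partitions instead of one.
salted_tx = transactions.withColumn("salt", (F.rand() * NUM_BUCKETS).cast("int"))

# Replicate the small side once per salt value so every salted row finds a match.
salted_dim = customer_details.crossJoin(
    spark.range(NUM_BUCKETS).withColumnRenamed("id", "salt")
)

# Join on the original key plus the salt, then continue as usual.
joined = salted_tx.join(salted_dim, on=["customer_id", "salt"])
```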
How it helps:
Faster job execution with balanced data distribution.
Reduced failures by managing heavy partitions efficiently.
Data Spilling
Data spilling happens when Spark tasks run out of memory and have to write intermediate data to disk, slowing down your job.
It often occurs during wide transformations like joins and aggregations, especially when data is too large to fit in memory.
How to avoid it?
AQE Auto-Tuning: Automatically tune the number of shuffle partitions to avoid data spills.
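A sketch of the settings involved; note that setting spark.sql.shuffle.partitions to "auto" is a Databricks-specific option, while on open-source Spark you would rely on the coalesce setting alone:

```python
# Let AQE merge small shuffle partitions into reasonably sized ones at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# On Databricks, "auto" additionally lets the engine pick the initial
# shuffle partition count (auto-optimized shuffle).
spark.conf.set("spark.sql.shuffle.partitions", "auto")
```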
Manually Fine-Tune Shuffle Partitions: Set an appropriate number of shuffle partitions yourself when auto-tuning doesn’t give good results.
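A minimal sketch of sizing partitions by hand; the observed shuffle size and the ~200 MB-per-partition target are illustrative rules of thumb, not fixed recommendations:

```python
# Estimate from the Spark UI how much data the stage shuffles, then size
# partitions so each holds roughly 100-200 MB.
total_shuffle_bytes = 200 * 1024**3        # e.g. ~200 GB observed shuffle (illustrative)
target_partition_bytes = 200 * 1024**2     # aim for ~200 MB per partition
num_partitions = max(1, total_shuffle_bytes // target_partition_bytes)

spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```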
How it helps:
Reduced Disk I/O: By tuning shuffle partitions, less data is spilled to disk.
Simplified Tuning: AQE automates partition optimization, reducing the need for manual adjustments.
Conclusion
Incorporating techniques like Broadcast Hash Joins, managing data skewness, and avoiding data spilling are key strategies to maximize performance and reliability in Databricks. Through careful tuning, the use of hints, and leveraging adaptive query execution (AQE), you can minimize costly data shuffling and prevent bottlenecks. The result is faster query times, more balanced resource usage, and a more scalable data platform.