Top Essential SQL Functions in Databricks for 2024: A Comprehensive Guide

WTD Analytics

Services

Industries

About Us

Blog

Get in touch

WTD Analytics

Get in touch

WTD

Analytics

Get in touch

Jun 23, 2024

Top Essential SQL Functions in Databricks for 2024: A Comprehensive Guide

We delved into features of Databricks SQL which can provide Quality-of-life improvements, each designed to streamline your workflows and enhance data processing efficiency. From handling semi-structured data with ease to leveraging powerful machine learning capabilities directly in SQL, these features are set to revolutionize how you interact with your data. Let's recap the key insights and practical applications we've covered.

1. Variant Data Type in Databricks

What is the Variant Data Type?

The Variant data type in Databricks provides a unified way to store and query semi-structured data. It allows you to work with different data formats such as JSON, Parquet, and Avro without having to convert them to a specific schema upfront.

Why Use Variant Data Type?

Flexibility: Handles diverse data formats seamlessly.
Simplified Schema Management: No need for upfront schema definition.
Enhanced Query Performance: Optimized for querying semi-structured data.

Example Use Cases with Code:

1. Storing JSON Data:

-- Creating a Delta table with Variant data type
CREATE TABLE events (data VARIANT);

-- Inserting JSON data
INSERT INTO events VALUES ('{"event_type": "click", 
                           "event_details": {
                              "page": "home", 
                              "timestamp": "2024-01-01T12:00:00Z"}
                            }');

2. Querying Variant Data:

-- Querying specific fields from JSON data
SELECT
    data:event_type AS eventType,
    data:event_details:page AS page,
    data:event_details:timestamp AS eventTimestamp
FROM

Output:

+----------+------+------------------------+
|eventType | page | eventTimestamp         |
+----------+------+------------------------+
| click    | home | 2024-01-01T12:00:00Z

How Variant Data Type Standardizes Semi-Structured Data

The Variant data type simplifies the process of managing semi-structured data by providing a consistent way to store and query different formats.

It eliminates the need for complex transformations and schema management, making your data pipelines more efficient and easier to maintain.

Results

Using the Variant data type leads to:

Faster Query Performance: Querying JSON as Variant is 8x faster than querying JSON as String format.
Reduced Complexity: Simplifies data ingestion and transformation processes.
Enhanced Flexibility: Easily integrates diverse data formats without extensive preprocessing.

2. Named Parameters in Databricks SQL

What are Named Parameters?

Named parameters in Databricks SQL allow you to assign specific names to the parameters in your SQL functions. This feature makes your SQL queries more readable and easier to maintain by clearly indicating the purpose of each parameter.

Why Use Named Parameters?

Clarity: Improves readability by clearly identifying the purpose of each parameter.
Maintainability: Simplifies query updates and modifications.
Flexibility: Allows you to specify parameters in any order, enhancing ease of use.

Example Use Cases:

Updating Records Efficiently

Imagine you are managing a sales database and need to update the discount rate for a specific product based on certain conditions. Named parameters make this task straightforward and clear.

Code Explanation:

Table Creation:

CREATE TABLE sales (
  sale_id INT,
  product STRING,
  amount DOUBLE,
  discount DOUBLE
)

This command creates a table named sales with four columns: sale_id, product, amount, and discount.

Inserting Sample Data:

INSERT INTO sales VALUES
(1, 'Laptop', 1200.50, 0.10),
(2, 'Smartphone', 800.75, 0.15)

Here, two rows of sample data are inserted into the sales table.

Updating Records with Named Parameters:
UPDATE sales SET discount = (CASE WHEN product = :product_name THEN :new_discount ELSE discount END) WHERE sale_id = :
- This SQL command updates the discount column in the sales table. It uses named parameters (:product_name, :new_discount, :sale_id) to make the query more readable and maintainable.
- The CASE statement checks if the product matches the :product_name parameter and sets the discount to :new_discount if true; otherwise, it retains the existing discount.
Setting and Using Named Parameters:
SET @sale_id = 1; SET @product_name = 'Laptop'; SET @new_discount = 0.20; EXECUTE IMMEDIATE 'UPDATE sales SET discount = :new_discount WHERE sale_id = :sale_id' USING @new_discount,
- Named parameters @sale_id, @product_name, and @new_discount are set with specific values.
- The EXECUTE IMMEDIATE statement executes the UPDATE command using the named parameters, updating the discount for the specified sale.

Output:

+--------+-----------+--------+---------+
| sale_id| product   | amount | discount|
+--------+-----------+--------+---------+
|      1 | Laptop    | 1200.5 | 0.20    |
|      2 | Smartphone| 800.75 | 0.15

How Named Parameters Enhance SQL Queries

Named parameters significantly enhance the readability and maintainability of SQL queries by allowing you to specify parameters with clear, descriptive names. This reduces the potential for errors and makes it easier for others to understand your queries.

Key Benefits:

Improved Readability: Queries are easier to understand with descriptive parameter names.
Simplified Debugging: Easier to identify and fix issues with clear parameter naming.
Flexible Parameter Order: Specify parameters in any order, making query writing more intuitive.

3. INSERT BY NAME in Databricks SQL

What is INSERT BY NAME?

INSERT BY NAME allows you to insert data into a table without worrying about the exact order of columns in your SELECT statement. This feature automatically matches the columns from your SELECT statement to the target table columns by name, making the process more intuitive and less error-prone.

Before the INSERT BY NAME feature was introduced, inserting data into tables with many columns required meticulous attention to the order of columns. This could be cumbersome and error-prone, especially with wide tables. Let's look at a comparison between the traditional method and the new method using INSERT BY NAME.

Solving Daily Problems with INSERT BY NAME

Consider a scenario where you are working with a large and complex data pipeline, often dealing with tables having dozens of columns. Here’s how INSERT BY NAME can simplify your daily work:

Example Problem

You have a table SalesData with many columns, and the order of columns in your source data frequently changes. Manually matching these columns every time you insert data is time-consuming and error-prone.

Creating the Table:

CREATE TABLE SalesData (
    sale_id INT,
    product_id INT,
    product_name STRING,
    quantity INT,
    price DOUBLE,
    sale_date DATE,
    customer_id INT,
    region STRING
)

Inserting Data with Traditional Method:

With traditional SQL, you must ensure the column order in your INSERT statement matches the table:

INSERT INTO SalesData (sale_id, product_id, product_name, quantity, price, sale_date, customer_id, region)
SELECT 1001, 2002, 'Laptop', 5, 1200.50, '2023-06-18', 3003, 'North America'

Inserting Data with INSERT BY NAME:

Now, you can focus on the data rather than the order of columns:

INSERT INTO SalesData BY NAME
SELECT 1001 AS sale_id, 2002 AS product_id, 'Laptop' AS product_name, 5 AS quantity, 1200.50 AS price, '2023-06-18' AS sale_date, 3003 AS customer_id, 'North America' AS

Or even in any order:

INSERT INTO SalesData BY NAME
SELECT 'North America' AS region, 3003 AS customer_id, '2023-06-18' AS sale_date, 1200.50 AS price, 5 AS quantity, 'Laptop' AS product_name, 2002 AS product_id, 1001 AS

Benefits in Daily Use

Reduced Complexity: No need to manually order columns, reducing mental overhead and potential for errors.
Improved Readability: SQL statements become more readable and maintainable.
Time Savings: Spend less time checking and rechecking column orders, speeding up development.

By using INSERT BY NAME, you can simplify your SQL data insertion tasks, making your daily work more efficient and less error-prone.

Link to Databricks Docs

4. Unnamed Parameter Markers in Databricks SQL

What are Unnamed Parameter Markers?

Unnamed Parameter Markers (?) are placeholders in SQL statements, allowing you to dynamically pass values during execution. This feature enhances query flexibility and security by preventing SQL injection attacks.

Why Use Unnamed Parameter Markers?

Dynamic Query Execution: Easily execute queries with varying parameters without changing the query structure.
Enhanced Security: Protect against SQL injection by using parameterized queries.
Simplified Code Maintenance: Reduce code complexity by reusing the same query structure with different values.

With Unnamed Parameter Markers, you can efficiently handle dynamic queries, improving both performance and security in Databricks SQL.

5. Run Federated Queries on Snowflake from Databricks

Running federated queries on Snowflake from Databricks is now in public preview! This powerful feature allows you to seamlessly query Snowflake data directly from Databricks, enabling a unified data analysis experience without moving data across platforms.

Setting Up Lakehouse Federation: Step-by-Step Guide

To run federated queries on Snowflake data, follow these steps to set up Lakehouse Federation in your Databricks environment:

Workspace Requirements:
- Ensure your workspace is enabled for Unity Catalog.
Compute Requirements:
- Ensure network connectivity from your Databricks Runtime cluster or SQL warehouse to Snowflake.
- Use Databricks Runtime 13.3 LTS or above in shared or single-user access mode.
- SQL warehouses must be Pro or Serverless.
Permissions:
- To create a connection, you need to be a metastore admin or have the CREATE CONNECTION privilege.
- To create a foreign catalog, you need CREATE CATALOG permission on the metastore and ownership or CREATE FOREIGN CATALOG privilege on the connection.

Execution Guide

Create a Security Integration in Snowflake (Optional)

If you prefer using single sign-on (SSO) for authentication, create a security integration in the Snowflake console:

CREATE SECURITY INTEGRATION <integration-name>
TYPE = oauth
ENABLED = true
OAUTH_CLIENT = custom
OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
OAUTH_REDIRECT_URI = 'https://<workspace-url>/login/oauth/snowflake.html'
OAUTH_ISSUE_REFRESH_TOKENS = TRUE
OAUTH_REFRESH_TOKEN_VALIDITY = <duration-in-seconds>
OAUTH_ENFORCE_PKCE = TRUE

Replace <integration-name>, <workspace-url>, and <duration-in-seconds> with your specific details.

Create a Connection

A connection specifies the path and credentials for accessing Snowflake. Use the following SQL commands in a Databricks notebook or SQL query editor:

CREATE CONNECTION <connection-name> TYPE snowflake
OPTIONS (
  host '<hostname>',
  port '<port>',
  sfWarehouse '<warehouse-name>',
  user secret ('<secret-scope>','<secret-key-user>'),
  password secret ('<secret-scope>','<secret-key-password>')
)

Replace placeholder values with actual credentials and secure them using Databricks secrets.

Create a Foreign Catalog

A foreign catalog mirrors your Snowflake database in Unity Catalog, allowing you to manage and query Snowflake data:

CREATE FOREIGN CATALOG [IF NOT EXISTS] <catalog-name> USING CONNECTION <connection-name>
OPTIONS (database '<database-name>')

Replace <catalog-name>, <connection-name>, and <database-name> with your specific details.

Supported Pushdowns

Federated queries on Snowflake support the following operations:

Filters
Projections
Limit
Joins
Aggregates (e.g., AVG, COUNT, MAX, MIN, SUM)
Functions (e.g., string, mathematical, date, time, and miscellaneous functions)
Window functions (e.g., DenseRank, Rank, RowNumber)
Sorting

Data Type Mappings

When querying Snowflake from Databricks, data types map as follows:

Snowflake decimal, number, numeric → DecimalType
Snowflake bigint, byteint, int, integer, smallint, tinyint → IntegerType
Snowflake float, float4, float8 → FloatType
Snowflake double, double precision, real → DoubleType
Snowflake char, character, string, text, time, varchar → StringType
Snowflake binary → BinaryType
Snowflake boolean → BooleanType
Snowflake date → DateType
Snowflake datetime, timestamp, timestamp_ltz, timestamp_ntz, timestamp_tz → TimestampType

With Lakehouse Federation, you can leverage the power of Databricks and Snowflake together, providing a seamless and efficient way to query and analyze your data across platforms.

6. The `ai_classify` Function in Databricks SQL

What is ai_classify?

The ai_classify function leverages Databricks’ machine learning capabilities to classify text data directly within your SQL queries. It uses pre-trained models to identify and categorize text into predefined classes, streamlining the process of text classification without the need for external tools.

Why Use ai_classify?

Seamless Integration: Incorporate advanced machine learning directly into your SQL workflows.
Efficiency: Quickly classify large volumes of text data.
Flexibility: Apply pre-trained models to various classification tasks.

Example Use Case:

Suppose you have a table of product descriptions, and you want to classify each product into categories such as electronics, clothing, and home goods.

Explanation:

Creating the Table:

-- Create a sample products table
CREATE TABLE product_catalog (
  product_id INT,
  product_description STRING
)

The code creates a table customer_reviews to store sample reviews.
Each review has an review_id and review_text.

Inserting Data:

-- Insert sample data into the products table
INSERT INTO product_catalog VALUES
(1, 'Latest model smartphone with 128GB storage.'),
(2, 'Comfortable cotton t-shirt in various sizes.'),
(3, 'Stainless steel kitchen knife set.')

The table is populated with four different reviews.

Classifying Reviews:

-- Use the ai_classify function to categorize the products
SELECT
  product_id,
  product_description,
  ai_classify(product_description, 'product_classification_model') AS category
FROM

The ai_classify function is used to classify the review text into categories using a specified machine learning model (feedback_model in this case).

Output:

+------------+---------------------------------------------------+-----------------+
| product_id | product_description                               | category        |
+------------+---------------------------------------------------+-----------------+
| 1          | Latest model smartphone with 128GB storage.       | electronics     |
| 2          | Comfortable cotton t-shirt in various sizes.      | clothing        |
| 3

Results:

Using the ai_classify function, you can easily categorize product descriptions, enhancing your ability to organize and analyze product data. The function can be applied to various use cases such as inventory management, marketing campaigns, and more.

More Useful Example:

Let’s consider a scenario where you need to classify incoming customer emails into different support categories such as technical support, billing, and general inquiries.

-- Create a sample emails table
CREATE TABLE customer_emails (
  email_id INT,
  email_content STRING
);

-- Insert sample data into the emails table
INSERT INTO customer_emails VALUES
(1, 'I need help with setting up my new router.'),
(2, 'I have a question about my last bill.'),
(3, 'Can you provide information on your return policy?');

-- Use the ai_classify function to categorize the emails
SELECT
  email_id,
  email_content,
  ai_classify(email_content, 'email_classification_model') AS category
FROM

Output:

+----------+-----------------------------------------------------+-------------------+
| email_id | email_content                                       | category          |
+----------+-----------------------------------------------------+-------------------+
| 1        | I need help with setting up my new router.          | technical support |
| 2        | I have a question about my last bill.               | billing           |
| 3        | Can you provide information on your return policy?  | general inquiry

Results:

By applying the ai_classify function to customer emails, you can automatically route emails to the appropriate support teams for more efficient handling and quicker resolution times. This leads to improved customer satisfaction and operational efficiency.

Conclusion

We’ve highlighted several Databricks SQL features that offer significant quality-of-life improvements for data engineers and analysts. From the flexibility of the Variant data type to the efficiency of named parameters and the intuitive nature of INSERT BY NAME, each feature brings a unique advantage to your data workflows. Leveraging these tools not only streamlines your processes but also enhances the performance and maintainability of your SQL queries, empowering you to achieve more with your data.