Oct 6, 2024

Managing Data Quality with Delta Live Tables in Databricks

Maintaining data quality is essential for any data-driven organization. Poor data quality can lead to incorrect conclusions, wasted resources, and a loss of trust in data systems. Databricks offers a powerful tool to help solve this challenge: Delta Live Tables (DLT). Delta Live Tables allow you to build robust, scalable data pipelines that enforce data quality through built-in expectations.

In this blog, we'll explore how to use Delta Live Tables to manage data quality effectively. We’ll cover how expectations work, their role in maintaining data integrity, and how to implement them using Python and SQL. This guide aims to simplify these concepts so that even those new to Delta Live Tables can start using them confidently.

What are Delta Live Tables Expectations?

Expectations in Delta Live Tables are optional rules that act as data quality checks on the records passing through a data pipeline. Each expectation evaluates a condition against every record, and the action you attach to it determines whether failing records are retained, dropped, or cause the update to fail.

For example, you might want to ensure that a column containing dates always has values after a certain date, or that an important identifier field is never null. Expectations allow you to define these constraints, monitor your data for issues, and take action when data quality standards aren’t met.

Expectations consist of three parts, illustrated in the example after this list:

  1. Description: A unique identifier for the expectation. This helps in tracking metrics related to the data quality check.

  2. Condition: A boolean expression that evaluates whether the data meets the expectation. This is typically a SQL expression that returns true or false.

  3. Action: What to do when a record fails the expectation. You can choose to log a warning, drop the invalid record, or stop the pipeline altogether.
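
Here is how those three parts map onto a Python expectation. This is a minimal sketch; the table and column names (orders_raw, orders_clean, order_id) are illustrative, not from a real pipeline.

import dlt

# "valid_order_id" is the description, the SQL string is the condition,
# and choosing expect_or_drop sets the action: drop records that fail the check.
@dlt.table
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_clean():
  return dlt.read("orders_raw")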

Why Use Expectations?

Expectations are critical for maintaining data quality in large-scale data pipelines. They help ensure:

  • Consistency: Expectations enforce rules that ensure data is consistent, which is crucial when integrating data from multiple sources.

  • Reliability: By catching issues early, you can avoid errors that might cause problems further down the line.

  • Transparency: You get clear visibility into data quality metrics, making it easier to diagnose problems and improve your pipeline.

Setting Up Expectations in Delta Live Tables

Delta Live Tables allow you to apply expectations using either Python decorators or SQL constraint clauses. Let’s break down how to use these expectations to manage data quality.

Retain Invalid Records

In some cases, you might want to keep records that violate certain expectations but still allow them into the target dataset. You can do this by using the expect operator, which lets you define a condition but doesn’t stop or drop records that fail it.

For example, let’s say you want to ensure that a timestamp column contains dates after 2012. You can define an expectation like this in Python:

@dlt.expect("valid timestamp", "timestamp > '2012-01-01'")

With this expectation, any record whose timestamp is not after 2012-01-01 is recorded as a failed expectation in the pipeline's data quality metrics, but it is still written to the target table.
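
In a pipeline, this decorator sits directly on top of the function that defines the table or view. A minimal sketch, assuming a source dataset named events_raw with a timestamp column (both names are illustrative):

import dlt

@dlt.table(name="events")
@dlt.expect("valid timestamp", "timestamp > '2012-01-01'")
def events():
  # Failing records are counted in the quality metrics but still written to the table.
  return dlt.read("events_raw")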

Drop Invalid Records

When data quality is more critical, you might not want invalid records in your dataset. In this case, you can use the expect_or_drop operator, which drops records that don’t meet the specified condition.

For example, to ensure that certain important fields are not null, you could write:

@dlt.expect_or_drop(
  "valid_current_page",
  "current_page_id IS NOT NULL AND current_page_title IS NOT NULL"
)

In this case, any records where the current_page_id or current_page_title fields are null will be dropped from the dataset.

Fail on Invalid Records

Sometimes, invalid records are unacceptable and should stop the pipeline entirely. This might be the case when working with highly sensitive or critical data. For these situations, you can use the expect_or_fail operator.

Here’s an example of an expectation that stops the pipeline if the count column is less than or equal to zero:

@dlt.expect_or_fail("valid_count", "count > 0")

If a record with an invalid count is found, the pipeline will fail, preventing any further processing. You will need to manually intervene to correct the issue before re-running the pipeline.

Multiple Expectations in Python Pipelines

In some cases, you may want to check multiple conditions at once. In Python, Delta Live Tables supports defining multiple expectations with a dictionary, where each key is an expectation name and each value is its condition.

For example, to check that both the count is greater than zero and current_page_id is not null, you can use the expect_all decorator:

@dlt.expect_all(
  {"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL"}
)

With expect_all, records that fail any of these expectations are retained and the failures are recorded in the data quality metrics; the expect_all_or_drop and expect_all_or_fail variants instead drop failing records or fail the update.

Here’s an example of how to use expect_all_or_drop to drop records that fail any expectation:

@dlt.expect_all_or_drop(
  {"valid_count": "count > 0", "valid_current_page": "current_page_id IS NOT NULL"}
)
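
Because the rules are just a Python dictionary, you can also define them once and pass the same dictionary to whichever enforcement decorator fits the dataset. A minimal sketch, with illustrative dataset names (clickstream_raw, clickstream_clean):

import dlt

# Shared rule set: keys are expectation names, values are SQL conditions.
clickstream_rules = {
  "valid_count": "count > 0",
  "valid_current_page": "current_page_id IS NOT NULL",
}

@dlt.table
@dlt.expect_all_or_drop(clickstream_rules)
def clickstream_clean():
  # Only records that pass every rule reach this table.
  return dlt.read("clickstream_raw")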

Quarantine Invalid Data

Sometimes, you might want to separate invalid records for further inspection without dropping or failing the pipeline. You can do this by creating a quarantine table or view.

Here’s an example using a Python function to quarantine invalid records:

import dlt
from pyspark.sql.functions import expr

# Quality rules (illustrative) and an expression flagging records that violate any rule.
rules = {
  "valid_website": "(Website IS NOT NULL)",
  "valid_location": "(Location IS NOT NULL)",
}
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
  name="farmers_market_quarantine",
  temporary=True,
  partition_cols=["is_quarantined"]
)
@dlt.expect_all(rules)
def farmers_market_quarantine():
  return (
    dlt.read("raw_farmers_market")
      .select("MarketName", "Website", "Location", "State")
      .withColumn("is_quarantined", expr(quarantine_rules))
  )

In this example, records that fail the expectation are flagged as quarantined, allowing you to inspect and handle them separately.
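
Downstream datasets can then read the quarantine table and split it on the flag. A minimal sketch following the same pattern (the view names are illustrative):

import dlt

@dlt.view(name="valid_farmers_market")
def get_valid_farmers_market():
  # Records that passed every rule.
  return dlt.read("farmers_market_quarantine").filter("is_quarantined = false")

@dlt.view(name="invalid_farmers_market")
def get_invalid_farmers_market():
  # Records held back for inspection.
  return dlt.read("farmers_market_quarantine").filter("is_quarantined = true")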

Advanced Validation with Delta Live Tables

For more complex validation, Delta Live Tables supports advanced use cases like comparing row counts across tables or checking that primary keys are unique.

For instance, you can validate that the row counts between two tables are equal using a materialized view:

CREATE OR REFRESH MATERIALIZED VIEW count_verification(
  CONSTRAINT no_rows_dropped EXPECT (a_count == b_count)
) AS SELECT * FROM
  (SELECT COUNT(*) AS a_count FROM LIVE.tbla),
  (SELECT COUNT(*) AS b_count FROM LIVE.tblb)

This materialized view compares the row counts of the two tables; if they differ, the no_rows_dropped expectation is flagged in the pipeline's data quality metrics.
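
The same pattern covers the primary key check mentioned above. Here is a hedged Python sketch, assuming an upstream dataset named report with a pk column (both names are illustrative): group by the key, count the rows per key, and attach an expectation that each key appears exactly once.

import dlt
from pyspark.sql.functions import count

@dlt.table(name="report_pk_check", temporary=True)
@dlt.expect_or_fail("unique_pk", "num_entries = 1")
def report_pk_check():
  # One row per key; num_entries > 1 means the primary key is duplicated,
  # and expect_or_fail stops the update so the duplicates can be investigated.
  return (
    dlt.read("report")
      .groupBy("pk")
      .agg(count("*").alias("num_entries"))
  )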

Making Expectations Portable and Reusable

Delta Live Tables allow you to define data quality rules separately from your pipeline code. You can store these rules in a Delta table or even in a Python module to make them reusable across multiple pipelines.

Here’s an example of how to store rules in a Delta table:

CREATE OR REPLACE TABLE rules AS 
SELECT "valid_count" AS name, "count > 0" AS constraint, "validation" AS tag

You can then load these rules into your pipeline and apply them to datasets:

from pyspark.sql.functions import col

def get_rules(tag):
  # Return a {name: constraint} dictionary for every rule matching the given tag
  df = spark.read.table("rules")
  return {row["name"]: row["constraint"] for row in df.filter(col("tag") == tag).collect()}

@dlt.expect_all_or_drop(get_rules("validation"))
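
The decorator is then stacked on a dataset definition just like an inline dictionary; a hypothetical example, reusing the illustrative raw_farmers_market dataset from earlier:

@dlt.table
@dlt.expect_all_or_drop(get_rules("validation"))
def validated_farmers_market():
  # Records violating any rule tagged "validation" are dropped.
  return dlt.read("raw_farmers_market")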

This makes it easy to manage data quality rules in one place and apply them to multiple pipelines, ensuring consistency and ease of maintenance.

Conclusion

Delta Live Tables is a powerful tool for managing data quality in Databricks. By using expectations, you can ensure that your data meets your quality standards, whether you're catching minor issues or enforcing strict rules that halt your pipeline on invalid data. With the ability to define and reuse expectations, quarantine invalid data, and perform advanced validation, Delta Live Tables give you control over the quality of your data in a way that scales with your pipeline.

Managing data quality doesn’t have to be complicated. With Delta Live Tables, it’s possible to automate data validation and maintain high-quality data pipelines with minimal manual intervention, ensuring that your organization can trust its data every step of the way.

Want to hear about our latest Datalakehouse and Databricks learnings?

Subscribe to get notified.

Make your data engineering process efficient and cost-effective. Feel free to reach out for a data infrastructure audit.

How WTD Can help

- Data experts for implementing projects

- On-demand data team for support
