Nov 10, 2024

Setting Up and Managing Data Quality Monitors in Databricks

Introduction

Monitoring data quality is like having a “health check” for our data. With Databricks, we can set up monitors to track trends, flag unusual changes, and send alerts if there are issues—all without a ton of manual work. In this guide, we’ll walk through creating monitors on Delta tables registered in Unity Catalog, using easy-to-follow steps.

1. Setting Up the Databricks SDK

To get started, we’ll install and set up the Databricks SDK. This gives us access to the tools we’ll need for data quality monitoring.

  • Step 1: Install the Databricks SDK if you haven’t already.

    %pip install "databricks-sdk>=0.28.0"
  • Step 2: Authenticate to your Databricks workspace. The SDK uses standard Databricks authentication, such as a configuration profile created with the Databricks CLI or the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.
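
For example, here's a minimal sketch of authenticating with a personal access token via environment variables; the host and token values are placeholders you'd replace with your own:

import os

# Placeholder credentials; prefer a Databricks CLI configuration profile
# or a secrets manager over hard-coding values in a notebook.
os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"

from databricks.sdk import WorkspaceClient

# With no explicit arguments, WorkspaceClient picks up the host and token
# from the environment.
w = WorkspaceClient()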

2. Choosing a Monitoring Profile

Databricks offers three types of monitoring profiles. Each one has a different purpose, so let's find the one that matches our needs:

  • TimeSeries: Tracks data changes over time. Great for trend monitoring.

  • InferenceLog: Adds model performance metrics—ideal for AI projects.

  • Snapshot: Takes a full “snapshot” of the table for each refresh, giving us a complete picture.

3. Setting Up a TimeSeries Monitor

Let’s start with a TimeSeries monitor to track data over time. This type of monitor is ideal for keeping an eye on changes in key data points or metrics.

Try it Yourself:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorTimeSeries

w = WorkspaceClient()

# catalog, schema, table_name, and user_email are placeholders; substitute
# your own Unity Catalog location and workspace user.
w.quality_monitors.create(
  table_name=f"{catalog}.{schema}.{table_name}",
  assets_dir=f"/Workspace/Users/{user_email}/databricks_lakehouse_monitoring/{catalog}.{schema}.{table_name}",
  output_schema_name=f"{catalog}.{schema}",
  time_series=MonitorTimeSeries(timestamp_col="event_time", granularities=["30 minutes"])
)
  • timestamp_col: Specify which column holds the time information.

  • granularities: Define the time windows that metrics are aggregated over, such as “30 minutes” or “1 day.”

4. Setting Up an InferenceLog Monitor

Need to track model performance? The InferenceLog monitor includes metrics such as model accuracy, making it useful for AI projects.

Code Example:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorInferenceLog, MonitorInferenceLogProblemType

w = WorkspaceClient()
w.quality_monitors.create(
  table_name=f"{catalog}.{schema}.{table_name}",
  assets_dir=f"/Workspace/Users/{user_email}/databricks_lakehouse_monitoring/{catalog}.{schema}.{table_name}",
  output_schema_name=f"{catalog}.{schema}",
  inference_log=MonitorInferenceLog(
        problem_type=MonitorInferenceLogProblemType.PROBLEM_TYPE_CLASSIFICATION,  # or PROBLEM_TYPE_REGRESSION
        prediction_col="preds",                 # column holding the model's predictions
        timestamp_col="ts",                     # when each prediction was logged
        granularities=["30 minutes", "1 day"],  # windows that metrics are aggregated over
        model_id_col="model_version",           # distinguishes predictions from different model versions
        label_col="label"                       # ground-truth column; enables accuracy metrics
  )
)

With this setup, we can track model predictions and see how they’re performing over time.
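
Once a refresh has completed, the computed metrics land in tables in the output schema. Here's a quick sketch of peeking at the results from a notebook, assuming the _profile_metrics naming that monitoring uses for its output table:

# The profile metrics table is created in the output schema, named after
# the monitored table.
profile_table = f"{catalog}.{schema}.{table_name}_profile_metrics"
display(spark.sql(f"SELECT * FROM {profile_table} LIMIT 10"))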

5. Setting Up a Snapshot Monitor

For cases where we want to track the state of the entire dataset, a Snapshot monitor is a good fit. It computes metrics over the full table on every refresh, rather than over time windows.

Code Example:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()
w.quality_monitors.create(
  table_name=f"{catalog}.{schema}.{table_name}",
  assets_dir=f"/Workspace/Users/{user_email}/databricks_lakehouse_monitoring/{catalog}.{schema}.{table_name}",
  output_schema_name=f"{catalog}.{schema}",
  snapshot=MonitorSnapshot()
)

6. Refreshing Monitors: Keeping Data Current

Once monitors are set up, they need to be refreshed so the metric tables reflect the latest data. Monitors created with a schedule refresh automatically; we can also trigger a refresh on demand.

Refresh Example:

w.quality_monitors.run_refresh(
    table_name=f"{catalog}.{schema}.{table_name}"
)
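
run_refresh starts the refresh asynchronously and returns immediately. If we want to block until it finishes, we can poll its state; here's a small sketch, assuming the refresh info object exposes refresh_id and state as in recent SDK versions:

import time
from databricks.sdk.service.catalog import MonitorRefreshInfoState

run = w.quality_monitors.run_refresh(
    table_name=f"{catalog}.{schema}.{table_name}"
)

# Poll until the refresh leaves the pending/running states.
while run.state in (MonitorRefreshInfoState.PENDING, MonitorRefreshInfoState.RUNNING):
    time.sleep(30)
    run = w.quality_monitors.get_refresh(
        table_name=f"{catalog}.{schema}.{table_name}",
        refresh_id=run.refresh_id,
    )

print(run.state)  # e.g. SUCCESS or FAILED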

7. Viewing Monitoring Results

Want to see your monitor’s activity history? Databricks makes it easy to check refresh records.

Example:

w.quality_monitors.list_refreshes(
    table_name=f"{catalog}.{schema}.{table_name}"
)
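
Each entry in the history records the refresh id, its state, and what triggered it. A short sketch of printing them, assuming the response object carries a refreshes list as in recent SDK versions:

history = w.quality_monitors.list_refreshes(
    table_name=f"{catalog}.{schema}.{table_name}"
)

# One line per refresh: id, final state, and trigger (manual or scheduled).
for r in history.refreshes:
    print(r.refresh_id, r.state, r.trigger)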

8. Automating Monitor Refreshes

Let’s make life easier by scheduling automatic refreshes. We can use cron expressions to set a specific refresh frequency.

Code Example:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorTimeSeries, MonitorCronSchedule

w = WorkspaceClient()
w.quality_monitors.create(
  table_name=f"{catalog}.{schema}.{table_name}",
  assets_dir=f"/Workspace/Users/{user_email}/databricks_lakehouse_monitoring/{catalog}.{schema}.{table_name}",
  output_schema_name=f"{catalog}.{schema}",
  time_series=MonitorTimeSeries(timestamp_col="ts", granularities=["1 day"]),
  schedule=MonitorCronSchedule(
        quartz_cron_expression="0 0 12 * * ?", # every day at noon
        timezone_id="PST",
    )
)
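
If the monitor already exists, its schedule can be changed with update instead of recreating it. A sketch, with the caveat that update expects the full configuration you want the monitor to end up with, not just the changed field:

w.quality_monitors.update(
  table_name=f"{catalog}.{schema}.{table_name}",
  output_schema_name=f"{catalog}.{schema}",
  time_series=MonitorTimeSeries(timestamp_col="ts", granularities=["1 day"]),
  schedule=MonitorCronSchedule(
        quartz_cron_expression="0 0 6 * * ?", # move the daily refresh to 6 AM
        timezone_id="PST",
    )
)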

9. Setting Up Alerts for Monitor Failures

No one likes surprises, especially when it comes to data quality. Setting up alerts helps us stay informed if a monitor fails.

Code Example:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorTimeSeries, MonitorNotifications, MonitorDestination

w = WorkspaceClient()
w.quality_monitors.create(
  table_name=f"{catalog}.{schema}.{table_name}",
  assets_dir=f"/Workspace/Users/{user_email}/databricks_lakehouse_monitoring/{catalog}.{schema}.{table_name}",
  output_schema_name=f"{catalog}.{schema}",
  time_series=MonitorTimeSeries(timestamp_col="ts", granularities=["30 minutes"]),
  notifications=MonitorNotifications(
        on_failure=MonitorDestination(
            email_addresses=["your_email@domain.com"]
        )
    )
)

10. Managing Access and Deleting Monitors

Need to control access? A monitor's metric tables are regular Unity Catalog tables, so we can govern who sees them with standard permissions. And when it's time to remove a monitor, it's just a quick command; note that deleting a monitor does not drop its metric tables, so clean those up separately if they're no longer needed.
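
For example, here's a minimal sketch of granting read access on the generated metric tables from a notebook. The _profile_metrics and _drift_metrics suffixes are the names monitoring uses for its output tables, and `analysts` is a hypothetical group:

# The monitor writes a profile table and a drift table into the output
# schema; grant on them like any other Unity Catalog table.
for suffix in ("_profile_metrics", "_drift_metrics"):
    spark.sql(
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table_name}{suffix} TO `analysts`"
    )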

Delete Example:

w.quality_monitors.delete(table_name=f"{catalog}.{schema}.{table_name}")

Conclusion

With these tools in place, we're ready to monitor data quality automatically. Whether we're tracking trends, logging model performance, or wiring up failure alerts, Databricks provides a comprehensive solution for keeping data reliable and consistent. Automated monitoring lets us catch and resolve issues before they grow into bigger problems, ensuring a dependable data environment for our team.
