Nov 10, 2024

Smart Tagging for Cloud Data Management with Unity Catalog

Introduction

As data volumes and complexities grow, tagging data becomes essential for effective cloud data management. It helps our teams with data discovery, tracking, compliance, and cost control. Unity Catalog in Databricks offers an advanced tagging system, making it easier for organizations to apply structured tags across data assets like tables, views, schemas, models, and more. This functionality is invaluable for IT, data management, and finance leaders who need streamlined, transparent access to their data’s metadata.

In this blog, we’ll explore Unity Catalog’s tagging features, its application across various data objects, how to retrieve tag information programmatically, and how tagging can support data management strategies for CIOs, CDOs, and data teams.

Understanding Unity Catalog Tags and Their Benefits

Unity Catalog tags enable us to assign key-value pairs to data assets, making it easy to categorize, search, and manage data. This allows stakeholders to answer questions such as:

  • Which department owns this dataset?

  • What project or cost center is responsible for this data?

  • Does this data need special handling or security due to sensitivity?

Key Benefits of Using Tags in Unity Catalog:

  • Improved Searchability: Quickly locate data assets by tag, especially for large, distributed teams.

  • Enhanced Organization: Organize datasets by criteria like function or compliance needs.

  • Better Governance: Tags help meet regulatory and security requirements by providing additional metadata on sensitive data.

  • Cost Management: Tagging for project or usage tracking enables teams to monitor and optimize data-related expenses.

Where Tags Can Be Applied in Unity Catalog

Unity Catalog’s versatile tagging system spans a range of data assets:

  • Catalogs: Top-level containers organizing collections of schemas and tables.

  • Schemas: Logical groupings of tables and views within a catalog.

  • Tables and Views: Core data structures storing rows and query results.

  • Volumes: Unstructured data storage containers.

  • Registered Models and Model Versions: ML models managed within Databricks.

  • Columns: Individual fields within tables for precise data management.

Unity Catalog also allows us to search for tags across tables, views, and columns, making it easier to locate datasets based on tag terms.

Prerequisites and Constraints for Tagging

To add or modify tags, we need the APPLY TAG privilege on the object, along with USE SCHEMA on its parent schema and USE CATALOG on its parent catalog. Key constraints include:

  • Limit on Tags: Each object supports up to 20 tags.

  • Character Limits: Tag keys are limited to 255 characters; values can be up to 1000 characters.

  • Exact Term Matching for Searches: Tag search matches exact terms only, so a partial tag key or value will not return results.
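The constraints above can be checked before any tags are submitted. The following is a minimal sketch in Python, assuming the limits quoted in this post (20 tags per object, 255-character keys, 1000-character values); the function name `validate_tags` and the constants are illustrative and not part of any Databricks API, and the limits should be confirmed against the current documentation.

```python
# Limits as quoted in this post; treat them as assumptions and confirm
# them against the current Databricks documentation.
MAX_TAGS_PER_OBJECT = 20
MAX_KEY_LENGTH = 255
MAX_VALUE_LENGTH = 1000


def validate_tags(tags):
    """Return a list of problems found in a {key: value} tag dict.

    An empty list means the tags fit within the limits above.
    """
    problems = []
    if len(tags) > MAX_TAGS_PER_OBJECT:
        problems.append(
            f"{len(tags)} tags exceed the limit of {MAX_TAGS_PER_OBJECT}"
        )
    for key, value in tags.items():
        if not key:
            problems.append("tag key must not be empty")
        if len(key) > MAX_KEY_LENGTH:
            problems.append(
                f"key starting with {key[:20]!r} exceeds {MAX_KEY_LENGTH} characters"
            )
        if len(value) > MAX_VALUE_LENGTH:
            problems.append(
                f"value for key {key[:20]!r} exceeds {MAX_VALUE_LENGTH} characters"
            )
    return problems


# Hypothetical usage: an empty list means the tags are within the limits.
print(validate_tags({"project": "Finance", "sensitivity": "High"}))
```

Running a check like this in a deployment script surfaces limit violations before a tagging statement fails at execution time.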

Applying Tags in Unity Catalog: Step-by-Step Guide

1. Using the Catalog Explorer Interface

The Databricks Catalog Explorer is a user-friendly interface for applying and managing tags:

  • Navigate to the Catalog Explorer: Select the Catalog icon.

  • Choose the Object: Select the catalog, schema, table, or view we wish to tag.

  • Add or Edit Tags: Click Add Tags for new tags or Edit for existing ones.

  • Apply Tags to Columns: Tags can also be applied directly to table columns.

2. Using SQL Commands (Databricks Runtime 13.3+)

For more granular or automated tagging, SQL commands allow us to set tags across multiple objects:

ALTER TABLE my_catalog.my_schema.my_table SET TAGS ('project' = 'Finance', 'sensitivity' = 'High')
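When the same tags need to go onto many tables, the statement above can be generated rather than typed by hand. The sketch below is one possible approach in Python; the helper names (`quote_tag_text`, `set_tags_statement`, `set_column_tags_statement`) and the second table name are hypothetical, and the helpers only build SQL text. It assumes object names come from a trusted source, and it rejects quotes and backslashes in tag text instead of trying to escape them.

```python
def quote_tag_text(text):
    """Wrap tag text in single quotes, refusing characters that need escaping."""
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported character in tag text: {text!r}")
    return f"'{text}'"


def set_tags_statement(table_name, tags):
    """Build an ALTER TABLE ... SET TAGS statement for one table."""
    if not tags:
        raise ValueError("at least one tag is required")
    pairs = ", ".join(
        f"{quote_tag_text(key)} = {quote_tag_text(value)}"
        for key, value in tags.items()
    )
    return f"ALTER TABLE {table_name} SET TAGS ({pairs})"


def set_column_tags_statement(table_name, column_name, tags):
    """Build an ALTER TABLE ... ALTER COLUMN ... SET TAGS statement."""
    if not tags:
        raise ValueError("at least one tag is required")
    pairs = ", ".join(
        f"{quote_tag_text(key)} = {quote_tag_text(value)}"
        for key, value in tags.items()
    )
    return f"ALTER TABLE {table_name} ALTER COLUMN {column_name} SET TAGS ({pairs})"


# Hypothetical table names; in a Databricks notebook each statement could
# be passed to spark.sql(...) instead of being printed.
shared_tags = {"project": "Finance", "sensitivity": "High"}
for table in ["my_catalog.my_schema.my_table", "my_catalog.my_schema.other_table"]:
    print(set_tags_statement(table, shared_tags))
```

Keeping statement generation separate from execution makes it easy to review the exact SQL before it runs.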

Example Use Cases of Tagging

Cost Center Tracking

Tags help allocate storage and processing costs to specific departments:

  • Tag Key: cost_center

  • Tag Value: Marketing

Data Sensitivity Classification

For sensitive data, tags help enforce security and compliance requirements:

  • Tag Key: classification

  • Tag Value: PII

Environment Segmentation

Tags differentiate environments, making it easier to manage resources in Dev, QA, or Production:

  • Tag Key: environment

  • Tag Value: production

Searching for Tagged Objects in Unity Catalog

Unity Catalog supports tag-based search in the Databricks workspace:

  • How It Works: Input a tag key or key-value pair to filter search results.

  • Permission-Based Results: Only assets users have access to are shown, maintaining secure access control.

For example, a search query like tag:classification = PII will display only the accessible datasets whose classification tag is set to PII.

Retrieving Tag Information with INFORMATION_SCHEMA Queries

For programmatic tag retrieval, Unity Catalog includes INFORMATION_SCHEMA tables that store metadata across catalogs, schemas, tables, volumes, and columns.

Sample Queries:

  1. Catalog Tags

    SELECT catalog_name, tag_name, tag_value FROM

  2. Schema Tags

    SELECT catalog_name, schema_name, tag_name, tag_value FROM

  3. Table and View Tags

    SELECT catalog_name, schema_name, table_name, tag_name, tag_value FROM

  4. Volume Tags

    SELECT catalog_name, volume_name, tag_name, tag_value FROM

  5. Column Tags

    SELECT catalog_name, schema_name, table_name, column_name, tag_name, tag_value FROM

These queries enable detailed reports for governance and compliance audits across Databricks resources.
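The sample queries above end at FROM without naming a source. As a hedged illustration only, the Python sketch below assembles complete queries under the assumption that the tag views live under system.information_schema and are named catalog_tags, schema_tags, table_tags, volume_tags, and column_tags. Those view names are an assumption that this post does not state, and the `TAG_VIEWS` mapping and `tag_query` helper are hypothetical; the column lists are the ones shown above.

```python
# Assumed view names (not given in this post); adjust to your environment.
TAG_VIEWS = {
    "catalog": ("system.information_schema.catalog_tags",
                ["catalog_name"]),
    "schema": ("system.information_schema.schema_tags",
               ["catalog_name", "schema_name"]),
    "table": ("system.information_schema.table_tags",
              ["catalog_name", "schema_name", "table_name"]),
    "volume": ("system.information_schema.volume_tags",
               ["catalog_name", "volume_name"]),
    "column": ("system.information_schema.column_tags",
               ["catalog_name", "schema_name", "table_name", "column_name"]),
}


def tag_query(object_type, tag_name=None):
    """Build a SELECT over the assumed tag view for one object type.

    If tag_name is given, the query is filtered to that tag key.
    """
    view, id_columns = TAG_VIEWS[object_type]
    columns = ", ".join(id_columns + ["tag_name", "tag_value"])
    query = f"SELECT {columns} FROM {view}"
    if tag_name is not None:
        # Refuse characters that would need escaping inside the literal.
        if "'" in tag_name or "\\" in tag_name:
            raise ValueError(f"unsupported character in tag name: {tag_name!r}")
        query += f" WHERE tag_name = '{tag_name}'"
    return query


# Print one query per object type; in a notebook these strings could be
# passed to spark.sql(...) once the view names are confirmed.
for object_type in TAG_VIEWS:
    print(tag_query(object_type))
```

Filtering on a single key, for example `tag_query("column", "classification")`, gives a quick starting point for a sensitivity audit.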

Practical Tips for Implementing a Tagging Strategy

  • Start Small, Then Scale: Begin with a few essential tags like cost_center or project, gradually expanding to more specific use cases.

  • Define Ownership: Assign a governance team to oversee tagging policies, maintaining tag accuracy and relevance.

  • Enforce Tags Using Policies: Set mandatory tags like Department or Project to ensure key data assets are properly labeled.
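One lightweight way to support mandatory tags is a periodic audit that lists which tables lack them. The sketch below is an illustrative Python check, not a built-in Unity Catalog feature: the function `missing_required_tags`, the sample table names, and the input shapes are all hypothetical. It assumes a list of all table names plus (table, tag key) pairs, such as the rows a table-tag query would return.

```python
REQUIRED_TAGS = frozenset({"Department", "Project"})


def missing_required_tags(all_tables, tag_rows, required=REQUIRED_TAGS):
    """Map each table to the sorted list of required tag keys it lacks.

    all_tables: iterable of table names, including tables with no tags.
    tag_rows:   iterable of (table_name, tag_name) pairs.
    Tables that carry every required tag are left out of the result.
    """
    present = {table: set() for table in all_tables}
    for table_name, tag_name in tag_rows:
        present.setdefault(table_name, set()).add(tag_name)

    gaps = {}
    for table, tags in present.items():
        missing = sorted(set(required) - tags)
        if missing:
            gaps[table] = missing
    return gaps


# Hypothetical inventory and tag rows for illustration.
tables = ["sales.orders", "sales.customers", "hr.payroll"]
rows = [
    ("sales.orders", "Department"),
    ("sales.orders", "Project"),
    ("sales.customers", "Department"),
]
print(missing_required_tags(tables, rows))
```

Passing the full table inventory matters because an untagged table produces no tag rows at all and would otherwise be missed.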

Monitoring Costs with Tags: A Key Strategy for Cloud Cost Management

Tags can also be applied to Databricks compute resources like clusters and SQL warehouses, helping track costs by team or project. These compute tags are separate from Unity Catalog tags on data assets. On AWS, default tags propagate to the EC2 instances created from a pool, streamlining cost attribution.

Example: Tagging clusters with Team:DataScience can help finance teams track usage costs for that team.
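To show how such a tag turns into a per-team figure, here is a minimal Python sketch that totals cost by tag value. The record layout (a "tags" dict and a "cost" number) and the sample values are hypothetical and stand in for whatever usage or billing export is available; nothing here reflects a specific Databricks schema.

```python
from collections import defaultdict


def cost_by_tag(usage_records, tag_key, untagged_label="untagged"):
    """Sum the 'cost' of each record, grouped by the value of tag_key.

    Records without the tag are grouped under untagged_label so that
    unattributed spend stays visible.
    """
    totals = defaultdict(float)
    for record in usage_records:
        label = record.get("tags", {}).get(tag_key, untagged_label)
        totals[label] += record["cost"]
    return dict(totals)


# Hypothetical usage records for illustration only.
records = [
    {"cluster": "ds-train", "tags": {"Team": "DataScience"}, "cost": 10.0},
    {"cluster": "ds-adhoc", "tags": {"Team": "DataScience"}, "cost": 2.5},
    {"cluster": "mkt-etl", "tags": {"Team": "Marketing"}, "cost": 4.0},
    {"cluster": "scratch", "tags": {}, "cost": 1.0},
]
print(cost_by_tag(records, "Team"))
```

Reporting the untagged bucket alongside the team totals gives a simple measure of how complete the tagging is.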

Conclusion

Unity Catalog’s tagging features provide powerful tools for structured and efficient data management. CIOs, CDOs, and data teams benefit from streamlined data discovery, improved compliance, and actionable insights into cost drivers. By developing a clear tagging strategy and scaling it thoughtfully, organizations can maintain robust data governance, compliance, and cost control across complex data ecosystems.


Make your data engineering process efficient and cost-effective. Feel free to reach out for a data infrastructure audit.

How WTD Can Help

- Data experts for implementing projects

- On-demand data team for support
