Nov 10, 2024
Smart Tagging for Cloud Data Management with Unity Catalog
Introduction
As data volumes and complexity grow, tagging data becomes essential for effective cloud data management. It helps our teams with data discovery, tracking, compliance, and cost control. Unity Catalog in Databricks offers a tagging system that makes it easier for organizations to apply structured tags across data assets like tables, views, schemas, models, and more. This functionality is invaluable for IT, data management, and finance leaders who need streamlined, transparent access to their data’s metadata.
In this blog, we’ll explore Unity Catalog’s tagging features, how they apply across various data objects, how to retrieve tag information programmatically, and how tagging can support data management strategies for CIOs, CDOs, and data teams.
Understanding Unity Catalog Tags and Their Benefits
Unity Catalog tags enable us to assign key-value pairs to data assets, making it easy to categorize, search, and manage data. This allows stakeholders to answer questions such as:
Which department owns this dataset?
What project or cost center is responsible for this data?
Does this data need special handling or security due to sensitivity?
Key Benefits of Using Tags in Unity Catalog:
Improved Searchability: Quickly locate data assets by tag, especially for large, distributed teams.
Enhanced Organization: Organize datasets by criteria like function or compliance needs.
Better Governance: Tags help meet regulatory and security requirements by providing additional metadata on sensitive data.
Cost Management: Tagging for project or usage tracking enables teams to monitor and optimize data-related expenses.
Where Tags Can Be Applied in Unity Catalog
Unity Catalog’s versatile tagging system spans a range of data assets:
Catalogs: Top-level containers organizing collections of schemas and tables.
Schemas: Logical groupings of tables and views within a catalog.
Tables and Views: Core data structures storing rows and query results.
Volumes: Unstructured data storage containers.
Registered Models and Model Versions: ML models managed within Databricks.
Columns: Individual fields within tables for precise data management.
Unity Catalog also allows us to search for tags across tables, views, and columns, making it easier to locate datasets based on tag terms.
Prerequisites and Constraints for Tagging
To add or modify tags, we need the APPLY TAG privilege on the object, along with USE SCHEMA on the object’s parent schema and USE CATALOG on its parent catalog. Key constraints include:
Limit on Tags: Each object supports up to 20 tags.
Character Limits: Tag keys are limited to 255 characters; values can be up to 1000 characters.
Exact Term Matching for Searches: Tag search matches on exact terms, so a partial tag key or value will not return results.
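The limits above can be checked before any tags are sent to Unity Catalog. The following is a minimal sketch, assuming the limits are exactly as listed in this section (they may differ in your Databricks release); the function name validate_tags is illustrative and not part of any Databricks API.

```python
# Limits as stated in this article; confirm them against current Databricks docs.
MAX_TAGS_PER_OBJECT = 20
MAX_KEY_LENGTH = 255
MAX_VALUE_LENGTH = 1000


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of problems found; an empty list means the tags pass."""
    problems = []
    if len(tags) > MAX_TAGS_PER_OBJECT:
        problems.append(
            f"{len(tags)} tags exceeds the limit of {MAX_TAGS_PER_OBJECT}"
        )
    for key, value in tags.items():
        if len(key) > MAX_KEY_LENGTH:
            problems.append(
                f"key starting {key[:20]!r} is longer than {MAX_KEY_LENGTH} characters"
            )
        if len(value) > MAX_VALUE_LENGTH:
            problems.append(
                f"value for key starting {key[:20]!r} is longer than "
                f"{MAX_VALUE_LENGTH} characters"
            )
    return problems


# A small, valid tag set produces no problems.
print(validate_tags({"cost_center": "Marketing", "environment": "production"}))
```

Running a check like this in a deployment script surfaces oversized keys or values early, instead of waiting for the tagging statement itself to fail.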
Applying Tags in Unity Catalog: Step-by-Step Guide
1. Using the Catalog Explorer Interface
The Databricks Catalog Explorer is a user-friendly interface for applying and managing tags:
Navigate to the Catalog Explorer: Select the Catalog icon.
Choose the Object: Select the catalog, schema, table, or view we wish to tag.
Add or Edit Tags: Click Add Tags for new tags or Edit for existing ones.
Apply Tags to Columns: Tags can also be applied directly to table columns.
2. Using SQL Commands (Databricks Runtime 13.3+)
For more granular or automated tagging, SQL commands such as ALTER TABLE ... SET TAGS allow us to set tags across multiple objects.
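A minimal sketch of what such statements can look like, built here as strings in Python so they can be passed to spark.sql in a Databricks notebook. The table main.sales.orders and the column customer_email are hypothetical, and the helper names are illustrative rather than part of any Databricks API. The helpers refuse quotes and backslashes in tag text instead of guessing at escaping rules.

```python
def sql_literal(text: str) -> str:
    """Wrap text in single quotes; reject characters that would need escaping."""
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported character in tag text: {text!r}")
    return f"'{text}'"


def tags_clause(tags: dict[str, str]) -> str:
    """Render {'k': 'v'} as ('k' = 'v', ...)."""
    pairs = ", ".join(
        f"{sql_literal(key)} = {sql_literal(value)}" for key, value in tags.items()
    )
    return f"({pairs})"


def set_tags_sql(object_type: str, name: str, tags: dict[str, str]) -> str:
    """Build ALTER <object_type> <name> SET TAGS (...), e.g. for TABLE or SCHEMA."""
    return f"ALTER {object_type} {name} SET TAGS {tags_clause(tags)}"


def set_column_tags_sql(table: str, column: str, tags: dict[str, str]) -> str:
    """Build the column-level form: ALTER TABLE ... ALTER COLUMN ... SET TAGS (...)."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET TAGS {tags_clause(tags)}"


def unset_tags_sql(object_type: str, name: str, keys: list[str]) -> str:
    """Build ALTER <object_type> <name> UNSET TAGS ('k', ...)."""
    keys_sql = ", ".join(sql_literal(key) for key in keys)
    return f"ALTER {object_type} {name} UNSET TAGS ({keys_sql})"


statements = [
    set_tags_sql(
        "TABLE",
        "main.sales.orders",  # hypothetical three-level table name
        {"cost_center": "Marketing", "environment": "production"},
    ),
    set_column_tags_sql("main.sales.orders", "customer_email", {"classification": "PII"}),
    unset_tags_sql("TABLE", "main.sales.orders", ["environment"]),
]

for statement in statements:
    # In a Databricks notebook this would be spark.sql(statement).
    print(statement)
```

Generating the statements from a dictionary keeps tag keys and values in one place, which makes it easier to apply the same tags to many objects in a loop.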
Example Use Cases of Tagging
Cost Center Tracking
Tags help allocate storage and processing costs to specific departments:
Tag Key: cost_center
Tag Value: Marketing
Data Sensitivity Classification
For sensitive data, tags help enforce security and compliance requirements:
Tag Key: classification
Tag Value: PII
Environment Segmentation
Tags differentiate environments, making it easier to manage resources in Dev, QA, or Production:
Tag Key: environment
Tag Value: production
Searching for Tagged Objects in Unity Catalog
Unity Catalog supports tag-based search in the Databricks workspace:
How It Works: Input a tag key or key-value pair to filter search results.
Permission-Based Results: Only assets users have access to are shown, maintaining secure access control.
For example, a search query like classification:PII will display only datasets tagged as containing sensitive information.
Retrieving Tag Information with INFORMATION_SCHEMA Queries
For programmatic tag retrieval, Unity Catalog includes INFORMATION_SCHEMA tables that store metadata across catalogs, schemas, tables, volumes, and columns.
Sample Queries:
Catalog Tags
Schema Tags
Table and View Tags
Volume Tags
Column Tags
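As a hedged sketch, queries for each of these levels can be generated from one small helper. It assumes the tag views live under system.information_schema with the names and columns shown in the mapping (catalog_tags, schema_tags, table_tags, volume_tags, column_tags); verify these against your workspace before relying on them. The function name tag_query is illustrative.

```python
from typing import Optional

# View name and identifying columns per level (assumed; check your workspace).
TAG_VIEWS = {
    "catalog": ("catalog_tags", ["catalog_name"]),
    "schema": ("schema_tags", ["catalog_name", "schema_name"]),
    "table": ("table_tags", ["catalog_name", "schema_name", "table_name"]),
    "volume": ("volume_tags", ["catalog_name", "schema_name", "volume_name"]),
    "column": (
        "column_tags",
        ["catalog_name", "schema_name", "table_name", "column_name"],
    ),
}


def tag_query(level: str, tag_name: Optional[str] = None) -> str:
    """Build a SELECT over the tag view for a level, optionally filtered by tag key."""
    view, name_columns = TAG_VIEWS[level]
    columns = ", ".join(name_columns + ["tag_name", "tag_value"])
    query = f"SELECT {columns} FROM system.information_schema.{view}"
    if tag_name is not None:
        if "'" in tag_name or "\\" in tag_name:
            raise ValueError(f"unsupported character in tag name: {tag_name!r}")
        query += f" WHERE tag_name = '{tag_name}'"
    return query


for level in TAG_VIEWS:
    # In a Databricks notebook: spark.sql(tag_query(level)).show()
    print(tag_query(level))

# Example: find every column tagged with the classification key.
print(tag_query("column", "classification"))
```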
These queries enable detailed reports for governance and compliance audits across Databricks resources.
Practical Tips for Implementing a Tagging Strategy
Start Small, Then Scale: Begin with a few essential tags like cost_center or project, gradually expanding to more specific use cases.
Define Ownership: Assign a governance team to oversee tagging policies, maintaining tag accuracy and relevance.
Enforce Tags Using Policies: Set mandatory tags like Department or Project to ensure key data assets are properly labeled.
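One way to check that mandatory tags are actually in place is a periodic audit over tag metadata. The sketch below is only an illustration: it assumes we already have a list of object names and a list of (object name, tag key) pairs, for example collected from the INFORMATION_SCHEMA tag views, and all object names shown are hypothetical.

```python
from collections import defaultdict

# Mandatory keys used as examples in this section.
REQUIRED_TAG_KEYS = frozenset({"Department", "Project"})


def missing_required_tags(all_objects, tag_rows, required=REQUIRED_TAG_KEYS):
    """Map each object to the sorted required keys it lacks; compliant objects are omitted.

    all_objects: iterable of object names to audit (untagged objects included).
    tag_rows: iterable of (object_name, tag_key) pairs for tags that exist.
    """
    seen = defaultdict(set)
    for object_name, tag_key in tag_rows:
        seen[object_name].add(tag_key)

    report = {}
    for name in all_objects:
        missing = required - seen[name]
        if missing:
            report[name] = sorted(missing)
    return report


# Hypothetical audit input.
objects = ["main.sales.orders", "main.sales.customers", "main.hr.payroll"]
rows = [
    ("main.sales.orders", "Department"),
    ("main.sales.orders", "Project"),
    ("main.sales.customers", "Department"),
]

for name, missing in missing_required_tags(objects, rows).items():
    print(f"{name} is missing: {', '.join(missing)}")
```

Passing the full object list separately matters because an object with no tags at all never appears in the tag rows, so it would otherwise slip through the audit.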
Monitoring Costs with Tags: A Key Strategy for Cloud Cost Management
Tags can also be applied to Databricks resources like clusters and SQL warehouses, helping track costs by team or project. Default tags propagate to AWS EC2 instances created from a pool, streamlining cost attribution.
Example: Tagging clusters with Team:DataScience can help finance teams track usage costs for that team.
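As a rough sketch of how such a tag might be attached, the snippet below merges a Team tag into a cluster specification dictionary under the custom_tags field used by the Databricks Clusters API. The cluster name is hypothetical, the runtime version and node type are left as placeholders to fill in, and the helper with_custom_tags is illustrative only.

```python
import json


def with_custom_tags(cluster_spec: dict, tags: dict) -> dict:
    """Return a copy of the spec with tags merged into custom_tags (new values win)."""
    merged = {**cluster_spec.get("custom_tags", {}), **tags}
    return {**cluster_spec, "custom_tags": merged}


base_spec = {
    "cluster_name": "ds-exploration",      # hypothetical name
    "spark_version": "<runtime-version>",  # placeholder, fill in for your workspace
    "node_type_id": "<node-type>",         # placeholder, fill in for your cloud
    "num_workers": 2,
    "custom_tags": {"environment": "production"},
}

tagged_spec = with_custom_tags(base_spec, {"Team": "DataScience"})

# The resulting JSON could be used as the body of a cluster create or edit request.
print(json.dumps(tagged_spec, indent=2))
```

Returning a copy rather than mutating the input keeps a shared base specification reusable across teams, with each team adding only its own tags.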
Conclusion
Unity Catalog’s tagging features provide powerful tools for structured and efficient data management. CIOs, CDOs, and data teams benefit from streamlined data discovery, improved compliance, and actionable insights into cost drivers. By developing a clear tagging strategy and scaling it thoughtfully, organizations can maintain robust data governance, compliance, and cost control across complex data ecosystems.