Jul 21, 2024
Migrating to Databricks Unity Catalog with UCX: A Comprehensive Guide
The migration to Databricks Unity Catalog (UC) represents a significant step forward for organizations seeking to enhance their data governance and security. The process, while intricate, can be streamlined with the right approach and tools. This guide will walk you through the migration process using UCX (Unity Catalog eXchange), a powerful set of workflows and commands that facilitate a smooth transition.
Overview of the Migration Process
At a high level, the migration process comprises four primary steps:
Assessment Workflow: Evaluate the current state and compatibility of your workspace.
Group Migration: Transition workspace-local groups to account-level groups.
Table Migration: Upgrade Hive metastore objects to Unity Catalog.
Code Migration: Update and migrate any necessary code.
Each of these steps involves specific tasks and workflows designed to ensure a comprehensive and effective migration.
Step 1: Assessment Workflow
The assessment workflow is the initial step in the migration process, designed to evaluate the compatibility of your current workspace entities with Unity Catalog. This step identifies any incompatible entities and provides the necessary information for planning the migration.
The assessment workflow can be triggered using the Databricks UI, or via the command line.
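For the command-line route, a minimal sketch follows. It only assembles the Databricks CLI invocations and prints them unless run=True is passed. It assumes the Databricks CLI is installed and authenticated, and that the install and ensure-assessment-run commands exist under those names in your UCX version; the helper names assessment_commands and trigger_assessment are hypothetical.

```python
import shlex
import subprocess


def assessment_commands() -> list[list[str]]:
    """Return the CLI invocations for installing UCX and running the assessment.

    Assumption: these command names match your installed UCX version;
    check the UCX documentation before relying on them.
    """
    return [
        ["databricks", "labs", "install", "ucx"],
        ["databricks", "labs", "ucx", "ensure-assessment-run"],
    ]


def trigger_assessment(run: bool = False) -> list[str]:
    """Print each command and execute it only when run=True."""
    rendered = []
    for cmd in assessment_commands():
        line = shlex.join(cmd)
        rendered.append(line)
        print(line)
        if run:
            # Raises CalledProcessError if the CLI exits with a non-zero status.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    trigger_assessment(run=False)
```

Keeping the default as a dry run makes it easy to review the exact commands before anything touches the workspace.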
Key Tasks in the Assessment Workflow
Crawl Tables: Scans all tables in the Hive Metastore, persisting their metadata in a Delta table.
Crawl Grants: Retrieves and stores permissions for each table.
Estimate Table Size for Migration: Assesses the size of tables that need to be cloned.
Crawl Mounts: Compiles a list of all existing mount points.
Guess External Locations: Identifies external locations necessary for a successful migration.
Assess Jobs, Clusters, Pipelines, and Azure Service Principals: Evaluates these entities for compatibility.
Assess Global Init Scripts: Identifies Azure Service Principals in global init scripts.
After executing the assessment workflow, an assessment dashboard is populated with findings and recommendations.
Step 2: Group Migration Workflow
Before starting the group migration workflow, ensure that the assessment workflow has been completed. This workflow upgrades all Databricks workspace assets and migrates workspace-local groups to account-level groups in the Unity Catalog environment.
Key Tasks in the Group Migration Workflow
Crawl Groups: Scans all groups for the local group migration scope.
Rename Workspace Local Groups: Adds a ucx-renamed- prefix to avoid conflicts.
Reflect Account Groups on Workspace: Adds matching account groups to the workspace.
Apply Permissions to Account Groups: Assigns full permissions from the original group to the account-level group.
Validate Groups Permissions: Ensures all permissions are correctly applied.
Delete Backup Groups: Removes workspace-level backup groups and their permissions.
This workflow ensures that all necessary groups are available in the workspace with the correct permissions, and removes any unnecessary groups and permissions.
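The rename and reflect tasks can be pictured with a small sketch. It is not UCX's implementation. It only illustrates, under the assumption that account groups are matched to workspace-local groups by identical name, how the ucx-renamed- prefix frees the original name for the account-level group. The function plan_group_migration, its return shape, and the group names are hypothetical.

```python
RENAMED_PREFIX = "ucx-renamed-"  # prefix named in the workflow description


def plan_group_migration(workspace_groups, account_groups):
    """Plan the rename step for workspace-local groups.

    Returns (plan, unmatched):
      plan      maps each workspace-local group that has a same-named account
                group to the temporary backup name it would be renamed to.
      unmatched lists workspace-local groups without an account-level
                counterpart, which need attention before migrating.
    """
    account = set(account_groups)
    plan, unmatched = {}, []
    for name in sorted(workspace_groups):
        if name in account:
            plan[name] = f"{RENAMED_PREFIX}{name}"
        else:
            unmatched.append(name)
    return plan, unmatched


if __name__ == "__main__":
    plan, unmatched = plan_group_migration(
        ["analysts", "data-eng", "legacy-admins"],
        ["analysts", "data-eng"],
    )
    print(plan)
    print(unmatched)
```

Groups that end up in the unmatched list are a signal to create or sync the corresponding account group before running the workflow.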
Step 3: Table Migration Workflow
The table migration workflow involves upgrading Hive metastore objects to Unity Catalog using UCX. The process is composed of multiple steps, each handling different types of metastore objects.
Prerequisites
UCX must be installed and configured on the workspace.
The assessment workflow must be run.
Group migration should be completed.
The workspace should be configured with a Unity Catalog metastore.
Steps in the Table Migration Workflow
Mapping Metastore Tables: Create and update a mapping file for metastore tables.
Create the Mapping File: Using the create-table-mapping command.
Update the Mapping File: Modify mappings as needed.
Example:
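A mapping file might look like the sketch below. The column layout (workspace_name, catalog_name, src_schema, dst_schema, src_table, dst_table) is an assumption about the CSV that create-table-mapping produces, so verify it against the file generated in your workspace. The workspace, schema, and table names are invented for illustration, and the helper retarget_schema is hypothetical. It shows one way to update the mapping so that every table from a given source schema lands in a different catalog.

```python
import csv
import io

# Illustrative mapping content; all names here are made up.
MAPPING_CSV = """\
workspace_name,catalog_name,src_schema,dst_schema,src_table,dst_table
my_workspace,my_workspace,sales,sales,orders,orders
my_workspace,my_workspace,sales,sales,customers,customers
my_workspace,my_workspace,hr,hr,employees,employees
"""


def retarget_schema(mapping_csv: str, src_schema: str, new_catalog: str) -> str:
    """Point every table of src_schema at new_catalog and return the new CSV."""
    reader = csv.DictReader(io.StringIO(mapping_csv))
    rows = list(reader)
    for row in rows:
        if row["src_schema"] == src_schema:
            row["catalog_name"] = new_catalog
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(retarget_schema(MAPPING_CSV, "sales", "finance"))
```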
Create Cloud Principals for the Upgrade:
Map Cloud Principals to Cloud Prefixes: Use the principal-prefix-access command.
Create/Modify Cloud Principals and Credentials: Create necessary cloud principals for UC credentials.
Create External Locations: For each location identified in the assessment.
Create Uber Principal: A principal that has access to all external table locations.
Create Catalogs and Schemas: Using the create-catalogs-schemas command.
Upgrade the Metastore:
EXTERNAL_SYNC: Use the SYNC SQL command.
EXTERNAL_HIVESERDE: Use either the CTAS workflow or the in-place migration.
EXTERNAL_NO_SYNC: Create a new managed table in UC and copy data.
DBFS_ROOT_DELTA: Use the DEEP CLONE command.
DBFS_ROOT_NON_DELTA: Use the CTAS method.
VIEW: Recreate views in UC, ensuring dependencies are migrated first.
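To make the per-category strategies concrete, here is a rough sketch that emits one illustrative SQL statement per table category and orders views so that dependencies come first. It is a simplification under stated assumptions, not what the UCX workflows execute: EXTERNAL_HIVESERDE and EXTERNAL_NO_SYNC are both shown with plain CTAS, the in-place option is ignored, and the function names and table names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def upgrade_statement(category: str, src: str, dst: str) -> str:
    """Return an illustrative SQL statement for upgrading one table."""
    if category == "EXTERNAL_SYNC":
        # SYNC registers the existing external table in Unity Catalog.
        return f"SYNC TABLE {dst} FROM {src}"
    if category == "DBFS_ROOT_DELTA":
        # DEEP CLONE copies both data and metadata of a Delta table.
        return f"CREATE TABLE IF NOT EXISTS {dst} DEEP CLONE {src}"
    if category in ("DBFS_ROOT_NON_DELTA", "EXTERNAL_NO_SYNC", "EXTERNAL_HIVESERDE"):
        # CTAS copies the data into a new Unity Catalog table.
        return f"CREATE TABLE IF NOT EXISTS {dst} AS SELECT * FROM {src}"
    raise ValueError(f"no table strategy for category {category!r}")


def view_migration_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order views so each one comes after everything it depends on.

    `dependencies` maps a view to the set of objects it reads from. Objects
    that appear only as dependencies are included in the result as well.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(upgrade_statement("EXTERNAL_SYNC",
                            "hive_metastore.sales.orders", "main.sales.orders"))
    print(view_migration_order({
        "v_report": {"v_clean"},
        "v_clean": {"v_raw"},
        "v_raw": set(),
    }))
```

A topological sort also surfaces circular view definitions early, because graphlib raises CycleError instead of returning an order.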
Post Migration Data Reconciliation Task:
Validate the integrity of migrated tables using the migrate-data-reconciliation workflow.
Once the workflow completes, the output is stored in the $inventory_database.reconciliation_results view and displayed in the Migration dashboard.
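The idea behind a reconciliation check with a threshold can be sketched as follows. This is a deliberately reduced illustration that compares only row counts; the actual workflow may also compare schemas and row contents. The function names and the default threshold of 5 percent are hypothetical placeholders, not UCX settings.

```python
def mismatch_percent(source_rows: int, target_rows: int) -> float:
    """Row-count difference as a percentage of the source row count."""
    if source_rows == 0:
        # An empty source matches only an empty target.
        return 0.0 if target_rows == 0 else 100.0
    return abs(source_rows - target_rows) / source_rows * 100


def reconcile(source_rows: int, target_rows: int,
              threshold_percent: float = 5.0) -> bool:
    """Return True when the mismatch stays within the allowed threshold."""
    return mismatch_percent(source_rows, target_rows) <= threshold_percent


if __name__ == "__main__":
    for src, dst in [(1000, 1000), (1000, 990), (1000, 900)]:
        print(src, dst, reconcile(src, dst))
```

A tighter threshold catches smaller discrepancies but may flag tables that are still receiving writes during the migration window.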
Step 4: Code Migration
The final step in the migration process involves updating and migrating any necessary code. This includes ensuring that scripts, notebooks, and other code assets are compatible with Unity Catalog and leverage its capabilities.
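As a rough illustration of what such an update involves, the sketch below rewrites explicit hive_metastore references in a piece of source code to their Unity Catalog targets. It is a simplistic regex-based example, not how UCX performs code migration: it ignores two-level names such as schema.table, USE statements, and dynamically built table names. The function rewrite_table_refs and the mapping contents are hypothetical.

```python
import re

# Matches fully qualified references such as hive_metastore.sales.orders.
HMS_REF = re.compile(r"\bhive_metastore\.(\w+)\.(\w+)\b")


def rewrite_table_refs(code: str, mapping: dict[tuple[str, str], str]) -> str:
    """Replace hive_metastore.<schema>.<table> with its mapped UC name.

    `mapping` is keyed by (schema, table); references without a mapping
    entry are left untouched so they can be reviewed manually.
    """
    def replace(match: re.Match) -> str:
        key = (match.group(1), match.group(2))
        return mapping.get(key, match.group(0))

    return HMS_REF.sub(replace, code)


if __name__ == "__main__":
    mapping = {("sales", "orders"): "main.sales.orders"}
    snippet = 'df = spark.table("hive_metastore.sales.orders")'
    print(rewrite_table_refs(snippet, mapping))
```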
Additional Considerations
Debugging and Logs: Each workflow run stores debug logs in the logs folder. Enable debug logs with the --debug flag for CLI commands.
Skipping and Moving Objects: Commands like databricks labs ucx skip and databricks labs ucx move help manage objects during migration.
Reverting Objects: Use the databricks labs ucx revert-migrated-tables command to revert objects if needed.
Reconciliation Threshold: Adjust the reconciliation threshold if needed to ensure accurate data validation.
Cluster Configuration: Optimize cluster configuration for tasks like deep cloning large Delta tables.
By following this structured approach and using the UCX workflows for assessment, group, table, and code migration, organizations can move to Databricks Unity Catalog in a controlled, verifiable way while strengthening their data governance and security posture.
Summary
Migrating to Databricks Unity Catalog involves a multi-step process that includes assessment, group migration, table migration, and code migration. Utilizing the UCX workflows and commands simplifies this complex process, ensuring compatibility, proper permissions, and data integrity. This guide provides a comprehensive overview to help you navigate the migration with confidence, ultimately leveraging the full potential of Unity Catalog for improved data management and governance.