Read Unity Catalog tables in Azure Databricks from Snowflake via Iceberg
This article is inspired by the work of Randy Pitcher, and all credit goes to him:
How-to-Read-Databricks-Tables-from-Snowflake
The need:
Many customers run both Databricks and Snowflake and move data between them. Iceberg brings true data interoperability without maintaining extra copies.
Prerequisites:
Databricks
Databricks Runtime 14.3 LTS or above
The table(s) must have column mapping enabled
Disable deletion vectors on the target table(s)
Snowflake
Permissions to create an external volume, a catalog integration, and an Iceberg table (a sketch of the required grants follows below)
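If your role is missing these privileges, an account admin can grant them up front. A minimal sketch, assuming a placeholder role my_role and a target schema my_db.my_schema (adjust names to your environment):
-- Account-level privileges for the integration objects
GRANT CREATE EXTERNAL VOLUME ON ACCOUNT TO ROLE my_role;
GRANT CREATE INTEGRATION ON ACCOUNT TO ROLE my_role;
-- Schema-level privilege for creating the Iceberg table
GRANT CREATE TABLE ON SCHEMA my_db.my_schema TO ROLE my_role;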
Steps:
Databricks
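Not sure whether deletion vectors are enabled on your table? A minimal check in Databricks SQL, using the same table_name placeholder as the commands below:
-- Shows the current value of the deletion vectors property, if it has been set
SHOW TBLPROPERTIES table_name ('delta.enableDeletionVectors');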
If deletion vectors are enabled, use the following command to make the table UniForm compatible:
REORG TABLE table_name APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2));
Otherwise, enable UniForm with the command below:
ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
Get the Unity Catalog metastore location
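The metastore storage path is what goes into STORAGE_BASE_URL of the external volume created later. If you are unsure of it, one hedged way to locate it from SQL (assuming default metastore-level managed storage; products is the demo table used later) is to check where a managed table's files live, since that path sits under the metastore's managed storage root:
-- The "Location" row shows the cloud storage path backing the table
DESCRIBE EXTENDED products;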
Create a Databricks personal access token and have it handy (it is used as the bearer token in the catalog integration below)
Azure
Get the Azure tenant ID. You can use the Get-AzTenant PowerShell cmdlet or find it in the Microsoft Entra ID UI
Snowflake
Create external volume
CREATE EXTERNAL VOLUME az_uc_metastore
  STORAGE_LOCATIONS = (
    (
      NAME = 'az_uc_metastore'
      STORAGE_PROVIDER = 'AZURE'
      STORAGE_BASE_URL = 'azure://<UC metastore location>'
      AZURE_TENANT_ID = 'AzureTenantID'
    )
  );
Grant Snowflake access to the storage location
Run the query below to get the service principal name and the Azure consent URL
DESCRIBE EXTERNAL VOLUME <external_volume_name>;

SELECT
  SPLIT_PART(PARSE_JSON("property_value"::string):"AZURE_MULTI_TENANT_APP_NAME"::string, '_', 1) AS SERVICE_PRINCIPAL_NAME,
  PARSE_JSON("property_value"::string):"AZURE_CONSENT_URL"::string AS AZURE_CONSENT_URL
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "property_value" LIKE '%STORAGE%';
Navigate to the Azure consent URL and accept the permissions request
Grant the service principal the "Storage Blob Data Contributor" role at the storage account level, NOT at the container level
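After the role assignment has propagated, you can optionally confirm that Snowflake can reach the storage location. A minimal check using Snowflake's built-in verification function (the volume name matches the one created above):
-- Runs connectivity checks against the external volume's storage location
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('az_uc_metastore');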
Create a catalog integration to Unity Catalog
CREATE OR REPLACE CATALOG INTEGRATION az_unity_catalog
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'silver'   -- Schema name in Databricks
  REST_CONFIG = (
    CATALOG_URI = 'https://<Databricks-WorkspaceURL>/api/2.1/unity-catalog/iceberg'
    WAREHOUSE = 'dev'            -- Catalog name in Databricks
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = 'PersonalAccessToken'
  )
  ENABLED = TRUE;
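To sanity-check the integration before creating tables, you can describe it and, if the function is available in your account, list the tables Snowflake can see through it. A hedged sketch:
-- Shows the integration's configuration, including the REST endpoint and namespace
DESCRIBE CATALOG INTEGRATION az_unity_catalog;
-- Optionally, list the Iceberg tables visible through the integration
SELECT SYSTEM$LIST_ICEBERG_TABLES_FROM_CATALOG('az_unity_catalog');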
Create the Iceberg table
CREATE OR REPLACE ICEBERG TABLE products_uc_dbr
  EXTERNAL_VOLUME = 'az_uc_metastore'
  CATALOG = 'az_unity_catalog'
  CATALOG_TABLE_NAME = 'products';
Query & Refresh:
Query the table in Snowflake as usual
Run a refresh statement to pick up changes made in Databricks
ALTER ICEBERG TABLE products_uc_dbr REFRESH;
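For example, after new rows are written in Databricks (a minimal sketch; products_uc_dbr is the Snowflake Iceberg table created above):
-- Pick up the latest Iceberg metadata generated by Databricks
ALTER ICEBERG TABLE products_uc_dbr REFRESH;
-- Then query it like any other Snowflake table
SELECT product_category, COUNT(*) AS product_count
FROM products_uc_dbr
GROUP BY product_category;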
Outcome:
One copy of data to work with and easier maintenance
Things to note:
Metadata generation takes a few seconds after a write in Databricks; I have observed around 4 seconds. If the write/merge is the last command in your notebook, test the duration before relying on this in production.
The conversion timestamp can be found in the 'converted_delta_timestamp' property under table properties in the UI
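If you prefer SQL over the UI, a minimal sketch in Databricks to check the same information (DESCRIBE EXTENDED on a UniForm table includes the Iceberg conversion metadata):
-- Look for the Delta UniForm Iceberg metadata in the output,
-- which includes converted_delta_timestamp and converted_delta_version
DESCRIBE EXTENDED products;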
Table schema and data used in the above demo:
create table products(
id int,
product_name string,
product_category string
);
--First run
INSERT INTO products
VALUES
(1, 'Portronics Luxcell Wireless Mini 10K', 'PowerBank');
--Run after creating table in Snowflake
INSERT INTO products
VALUES
(2, 'Anker PowerCore 20100mAh', 'PowerBank'),
(3, 'Mi 20000mAh Li-Polymer Power Bank 3i', 'PowerBank'),
(4, 'URBN 10000 mAh Li-Polymer Ultra Compact Power Bank', 'PowerBank'),
(5, 'Syska 10000 mAh Li-Polymer P1016B Power Pocket100 Power Bank', 'PowerBank'),
(6, 'Ambrane 15000mAh Li-Polymer Powerbank with Type C and USB Ports', 'PowerBank'),
(7, 'Croma 10000mAh Li-Polymer Power Bank with 12W Fast Charge', 'PowerBank'),
(8, 'Philips 11000 mAh Power Bank', 'PowerBank'),
(9, 'Realme 10000mAh Power Bank 2', 'PowerBank'),
(10, 'OnePlus 10000 mAh Power Bank', 'PowerBank');