Snowflake Built-in Quality Metrics

Augusto Kiniama Rosa - Jun 17 - Dev Community

Learn About Snowflake’s Latest Feature: Data Quality and data metric functions


Photo by Tool., Inc on Unsplash

In late March, Snowflake released their Data Quality and data metric functions (DMFs) to Public Preview. These functions are available in Enterprise Edition and above.

What are Data Quality and data metric functions?

Data metric functions are a serverless Snowflake service for measuring and logging the quality of your data. The benefits of these built-in functions are numerous, but let me stick to one: you can now know whether your Service Level Agreement is being met without leaving Snowflake. No more third-party services if you do not want to pay for them.

There are two types of DMFs available: Snowflake-provided system DMFs and user-defined DMFs. You can use DMFs to keep an eye on the state and integrity of your data. They help you measure important metrics such as freshness, as well as counts of duplicates, NULLs, rows, and unique values.

Data Quality refers to understanding the condition and reliability of your data, which encompasses how current the data is and how accurate the actual values are relative to empty or NULL fields in a given column. This is essential for making informed decisions based on data. You can evaluate the quality of your data using data metric functions (DMFs).

When using data metric functions (DMFs) in Snowflake, the results of calling a DMF are recorded in the event table for your account. You can assign a DMF to a table or view and specify how often it should run. For example, you can schedule a DMF to run three times daily on a specific table and adjust the schedule to meet your data quality requirements; all DMFs set on a table follow the same schedule. Additionally, you can configure alerts on top of the schedule.
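As a minimal sketch of that kind of schedule (the table name DB.SCHEMA.MY_TABLE is just a placeholder), a cron expression can run every DMF attached to the table three times a day:

-- Run all DMFs set on this table at 8:00, 12:00 and 16:00 UTC
ALTER TABLE DB.SCHEMA.MY_TABLE
  SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8,12,16 * * * UTC';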

By combining DMFs and alerts, you can receive consistent notifications whenever the data quality thresholds of the tables you measure are breached.
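Here is a rough sketch of that pattern. It assumes the scheduled results are queryable through the SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS view, that an email notification integration named MY_EMAIL_INT exists, and that the DQ_WH warehouse created later in this post is available; the column names are assumptions as well, so check the view in your account before relying on them:

-- Alert if the scheduled NULL_COUNT metric reported any NULL emails in the last hour
CREATE OR REPLACE ALERT DQ_NULL_EMAIL_ALERT
  WAREHOUSE = DQ_WH
  SCHEDULE = '60 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
    WHERE METRIC_NAME = 'NULL_COUNT'
      AND TABLE_NAME = 'CUSTOMERS'
      AND VALUE > 0
      AND MEASUREMENT_TIME >= DATEADD('hour', -1, CURRENT_TIMESTAMP())
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'MY_EMAIL_INT', 'dq-team@example.com',
    'Data quality threshold breached', 'NULL emails detected in CUSTOMERS');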

Once you have set your data quality rules, DMFs strengthen your data governance posture: they help stewards know the current state of their data based on specific metrics, help engineers take immediate action on affected tables and views as soon as issues are detected, and help administrators ensure data quality monitoring is done with predictable cost, consistency, and performance.
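For instance, stewards can check which DMFs are currently attached to a given object with the DATA_METRIC_FUNCTION_REFERENCES Information Schema table function (a hedged example; the table name below is a placeholder):

-- List the DMFs set on a table
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
  REF_ENTITY_NAME => 'DB.SCHEMA.MY_TABLE',
  REF_ENTITY_DOMAIN => 'TABLE'
));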

Once you have established a data quality workflow that involves defining, measuring, and monitoring data, you can apply it to additional workloads.

These are the DMFs available in Public Preview, although you can also create your own.

Here are the system DMFs:

Freshness:

  • SNOWFLAKE.CORE.FRESHNESS — determine the freshness of column data
  • SNOWFLAKE.CORE.DATA_METRIC_SCHEDULED_TIME — define custom freshness metrics

Accuracy:

  • SNOWFLAKE.CORE.NULL_COUNT — number of NULL values in a column

Uniqueness:

  • SNOWFLAKE.CORE.DUPLICATE_COUNT — number of duplicate values in a column
  • SNOWFLAKE.CORE.UNIQUE_COUNT — number of non-NULL unique values in a column

Volume:

  • SNOWFLAKE.CORE.ROW_COUNT — number of rows in a table
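Any of these system DMFs can also be called ad hoc against your own objects; for example (the table and column names here are placeholders):

-- Number of rows in the table
SELECT SNOWFLAKE.CORE.ROW_COUNT(SELECT * FROM DB.SCHEMA.MY_TABLE);
-- Time elapsed (in seconds) since the most recent value in a timestamp column
SELECT SNOWFLAKE.CORE.FRESHNESS(SELECT last_updated FROM DB.SCHEMA.MY_TABLE);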

Cost

DMFs are serverless and run as part of Snowflake's cloud services; they consume credits in the “Data Quality Monitoring” category on your monthly bill and are described in more detail in the Snowflake Consumption Table. These credits cover the compute consumed by all system and user-defined data metric functions that you use.

You are not billed for creating a data metric function; you are billed only when it is executed against an object and when the logged metrics are consolidated into the event table.

Here is a script to find your cost:

USE ROLE DQ_ROLE;
SELECT *
FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_QUALITY_MONITORING_USAGE_HISTORY
WHERE TRUE
AND START_TIME >= CURRENT_TIMESTAMP - INTERVAL '30 days'; -- adapt the number of days to whatever you need

Limitations

Note the following limitations when using DMFs:

  • Only regular tables are supported. You cannot set a DMF on any other kind of table, such as a dynamic table or external table.
  • Setting DMFs on objects:
      • You can only have 1,000 total associations of DMFs on objects per account. Each instance of setting a DMF on a table or view counts as one association.
  • Data sharing:
      • You cannot grant privileges on a DMF to a share or set a DMF on a shared table or view.
  • Setting a DMF on an object tag is not supported.
  • Using the CLONE operation or the CREATE TABLE … LIKE operation does not carry DMF assignments over to the target object.
  • You cannot set a DMF on objects in a reader account.
  • You cannot create a dynamic table or a stream based on the data in the event table.

How do DMFs Work?

The system DMFs are part of the CORE schema of the shared SNOWFLAKE database. This means they are maintained entirely by Snowflake and cannot be modified.

The way it works is that each system DMF measures a single quality attribute; you assign DMFs to a table or view, and they in turn give you data quality measurements.

For the access control step:

-- Access Setup
-- Create Role
USE ROLE ACCOUNTADMIN;
CREATE ROLE IF NOT EXISTS DQ_ROLE;

-- Grants
GRANT CREATE DATABASE ON ACCOUNT TO ROLE DQ_ROLE;
GRANT EXECUTE DATA METRIC FUNCTION ON ACCOUNT TO ROLE DQ_ROLE;
GRANT APPLICATION ROLE SNOWFLAKE.DATA_QUALITY_MONITORING_VIEWER TO ROLE DQ_ROLE;
GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE DQ_ROLE;
GRANT DATABASE ROLE SNOWFLAKE.DATA_METRIC_USER TO ROLE DQ_ROLE;

-- Create a Warehouse
CREATE WAREHOUSE IF NOT EXISTS DQ_WH;
GRANT USAGE ON WAREHOUSE DQ_WH TO ROLE DQ_ROLE;

-- See Grants
SHOW GRANTS TO ROLE DQ_ROLE;

-- Grant the role to SYSADMIN and to your own user
GRANT ROLE DQ_ROLE TO ROLE SYSADMIN;
GRANT ROLE DQ_ROLE TO USER AUGUSTO;


Let’s Test Against Some Data

I already had a table that I wanted to test against, but if you look at the Snowflake lab in the sources below, there is a full lab. Let’s define some rules and schedule some testing.

Let’s first simply use some of the pre-created DMFs. You will notice that we always start by creating a schedule. This is super flexible: the schedule can be an interval in minutes, a cron expression, or a trigger on changes (see the commented alternatives in the script below).

-- Schedule Rules
ALTER TABLE DEV_INGEST.CUSTOMERS SET DATA_METRIC_SCHEDULE = '5 MINUTE';
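-- Alternative schedule forms mentioned above (hedged examples; uncomment the one you want):
-- ALTER TABLE DEV_INGEST.CUSTOMERS SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8,12,16 * * * UTC';
-- ALTER TABLE DEV_INGEST.CUSTOMERS SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';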

-- Look at the existing data, then insert some fake rows (note the duplicate and missing emails)
SELECT * FROM DEV_INGEST.CUSTOMERS;
INSERT INTO DEV_INGEST.CUSTOMERS VALUES
  ('Iona Ruizw','380-739-5850','phasellus.elit.pedexgoogle.org',60717,'Germany',218,'2023-04-13',2023),
  ('','380-739-5851',NULL,60717,'Germany',218,'2023-04-13',2023),
  ('Iona Ruizz','380-739-5850','phasellus.elit.pede@google.org',60717,'Germany',218,'2023-04-13',2023),
  ('Iona Ruiz','380-739-5850',NULL,60717,'Germany',218,'2023-04-13',2023);

-- Associate the system DMFs with the CUSTOMERS table
ALTER TABLE DEV_INGEST.CUSTOMERS ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (email);
ALTER TABLE DEV_INGEST.CUSTOMERS ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.UNIQUE_COUNT ON (email);
ALTER TABLE DEV_INGEST.CUSTOMERS ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (email);

-- Or run the DMFs manually to obtain the metrics right away
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT email FROM DEV_INGEST.CUSTOMERS);
SELECT SNOWFLAKE.CORE.UNIQUE_COUNT(SELECT email FROM DEV_INGEST.CUSTOMERS);
SELECT SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT email FROM DEV_INGEST.CUSTOMERS);

-- Another example
SELECT SNOWFLAKE.CORE.UNIQUE_COUNT(
  SELECT
    phone
  FROM DEV_INGEST.CUSTOMERS
);
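Once the scheduled runs have fired, you can review the logged measurements. A minimal sketch, assuming the results are exposed through the SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS view and that it carries roughly these column names (check the view in your account):

-- Review the latest data quality measurements recorded for the CUSTOMERS table
SELECT measurement_time, metric_name, table_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE table_name = 'CUSTOMERS'
ORDER BY measurement_time DESC;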

Let’s dig into a custom DMF:

-- Custom DMF - hour data freshness
USE DATABASE DEV_INGEST;
USE SCHEMA CUSTOMERS;
CREATE OR REPLACE DATA METRIC FUNCTION data_freshness_hour(
  ARG_T TABLE (ARG_C TIMESTAMP_LTZ))
  RETURNS NUMBER AS
  'SELECT TIMEDIFF(
     hour,
     MAX(ARG_C),
     SNOWFLAKE.CORE.DATA_METRIC_SCHEDULED_TIME())
   FROM ARG_T';

-- Run manually and check whether the data is less than 4 hours old
SELECT data_freshness_hour(SELECT last_updated FROM DEV_INGEST.CUSTOMERS) < 4;
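-- Optionally, associate the custom DMF with the table so it runs on the table's schedule
-- (assumes the last_updated TIMESTAMP_LTZ column used above)
ALTER TABLE DEV_INGEST.CUSTOMERS ADD DATA METRIC FUNCTION data_freshness_hour ON (last_updated);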

-- See what data metric functions you have; anything in the CORE schema is a Snowflake system metric
SHOW DATA METRIC FUNCTIONS IN ACCOUNT;

There are other options as well, such as making the DMF secure or adding a tag to it.

Finally, clean up the objects created by these scripts.

-- Unset the DMFs that were set on the table, for example:
ALTER TABLE DEV_INGEST.CUSTOMERS DROP DATA METRIC FUNCTION
  SNOWFLAKE.CORE.NULL_COUNT ON (email);
ALTER TABLE DEV_INGEST.CUSTOMERS DROP DATA METRIC FUNCTION
  data_freshness_hour ON (last_updated);

-- Cleanup the role and warehouse
DROP ROLE DQ_ROLE;
DROP WAREHOUSE DQ_WH;

Conclusion

Although this is still in Public Preview and the number of built-in rules is relatively small, it is a powerful feature built right into Snowflake. Kudos to Snowflake for continuing to build excellent data governance and data management features into the platform. The Snowflake Data Cloud is already a strong offering, and updates like these make it even stronger.

I would love to see more pre-built quality rules, the ability to give ownership of rules to data quality owners, a more straightforward implementation process, and tighter integration into Snowsight.

I am Augusto Rosa, VP of Engineering for Infostrux Solutions, a Snowflake Data Super Hero, and SME. Thanks for reading my blog post. You can follow me on LinkedIn. Subscribe to Infostrux Medium Blogs https://medium.com/infostrux-solutions for the most interesting Data Engineering and Snowflake news.

Sources:

