Snowflake Advanced Cost Control: Ingest Optimizations

Augusto Kiniama Rosa - Jun 17 - Dev Community

For Large Data Set Ingestion, Advanced Cost Control Measures Are Required That Optimize Data Sets Before or at Ingest

What approaches work if you are constantly ingesting data, real-time or not? If you have petabytes of data or billions of rows, ingestion can add up and account for a large share of the cost of your Snowflake and cloud implementations.

A recent client was ingesting a high volume of data that was reprocessed every day, even though updates could still arrive for records as much as five years old.

Ultimately, you will need to reduce the initial data set or change how you ingest your data.

Tip 1: Snowpipe Streaming

For streaming workloads, move to Snowpipe Streaming. It gives you considerable savings. One person I spoke with recently described their cost dropping from about 500 dollars/day to 27 dollars/day when they moved from Snowpipe to Snowpipe Streaming. Snowflake sales engineers estimate Snowpipe Streaming at roughly one-third the cost of classic Snowpipe, and that example is much better still.
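
Before (and after) such a migration, it helps to know what classic Snowpipe is actually costing you. Here is a minimal sketch using the standard SNOWFLAKE.ACCOUNT_USAGE.PIPE_USAGE_HISTORY view; only the 30-day lookback window is my own choice.

```sql
-- Daily Snowpipe credits and data volume per pipe over the last 30 days,
-- useful as a baseline before moving a feed to Snowpipe Streaming.
SELECT
    pipe_name,
    DATE_TRUNC('day', start_time)          AS usage_day,
    SUM(credits_used)                      AS credits,
    SUM(bytes_inserted) / POWER(1024, 3)   AS gb_inserted
FROM snowflake.account_usage.pipe_usage_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY pipe_name, usage_day
ORDER BY usage_day, credits DESC;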

If you are looking to learn how to implement Snowpipe Streaming, try this lab.

Tip 2: Optimize for Deltas

When dealing with large datasets produced by repeated API calls, it is better to optimize for deltas, that is, the difference between two subsequent load dates. Typically only around 5% of the data changes daily, so instead of processing the entire dataset on each load, we can identify what changed since the previous load and carry forward only that 5%. This approach significantly improves every subsequent step in the data analysis process.

Even when the source is not an API, you still benefit from this tip whenever ingestion requires comparing tables. A large data set may need to be unpacked and cleaned, or deduplicated and merged, and doing that work only on the rows that changed is far cheaper.
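
As a rough sketch of what delta detection can look like in plain SQL (the table and column names below are made up for illustration), compare the freshly staged extract against the target and merge only the rows that are new or changed:

```sql
-- Merge only the delta: rows in the staged extract that are new or whose
-- values differ from what the target already holds.
MERGE INTO customer_target AS t
USING (
    SELECT s.*
    FROM customer_stage s
    LEFT JOIN customer_target t
           ON t.customer_id = s.customer_id
    WHERE t.customer_id IS NULL                        -- new rows
       OR HASH(s.name, s.email, s.updated_at)          -- changed rows
          <> HASH(t.name, t.email, t.updated_at)
) AS d
ON t.customer_id = d.customer_id
WHEN MATCHED THEN UPDATE SET
    name = d.name, email = d.email, updated_at = d.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, name, email, updated_at)
    VALUES (d.customer_id, d.name, d.email, d.updated_at);
```

Everything downstream then works on roughly 5% of the volume instead of the full history.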

Tip 3: Don’t keep the history in the intermediate tables

Often, historical data is kept in intermediate working tables. Our take: don't keep it, for more than one reason. First, it saves storage. Second, it reduces the risk of an inefficient execution plan. Third, it opens up the opportunity to use a better clustering key, as you no longer have to partition your data by load date.
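
For example, something along these lines; the table names, the order_date clustering key, and the one-day Time Travel window are assumptions for illustration, not a prescription:

```sql
-- Rebuild the working table with only the latest load: transient (no Fail-safe)
-- and clustered on a business key instead of load date.
CREATE OR REPLACE TRANSIENT TABLE stg_orders
  CLUSTER BY (order_date)
AS
SELECT *
FROM raw_orders
WHERE load_date = (SELECT MAX(load_date) FROM raw_orders);

-- Keep Time Travel short on a disposable working table to save storage.
ALTER TABLE stg_orders SET DATA_RETENTION_TIME_IN_DAYS = 1;
```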

Tip 4: File size for ingestion: 100–250 MB

It seems like a small thing, but following Snowflake's recommended file size is a great help. It improves ingestion by making COPY INTO statements as efficient as possible, so control file sizes if you can. Files in the 100-250 MB (compressed) range keep the per-file overhead small relative to the amount of data loaded, to the point where it is immaterial, while still letting the warehouse load files in parallel. An efficient data pipeline is a cheaper pipeline.
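
If the data passes through a stage you control, you can even let Snowflake re-chunk it for you. A sketch, with hypothetical stage and table names, that targets roughly 200 MB output files:

```sql
-- Re-stage data in ~200 MB files instead of one giant file or thousands
-- of tiny ones (MAX_FILE_SIZE is in bytes).
COPY INTO @landing_stage/orders/
FROM orders_raw
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
MAX_FILE_SIZE = 200000000;
```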

Tip 5: Parquet format

The final tip, not to be discounted, is to use the Parquet format instead of the typical CSV.

Parquet provides efficient compression, smaller file sizes, and columnar storage. That means better compression ratios and faster read times for analytical queries that only touch specific columns. Smaller files mean faster transfers and reduced storage costs, and the optimized structure and smaller size make parallel processing easier.

Plus, you gain file and column metadata, which makes your queries faster as well.
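
Loading Parquet is no harder than loading CSV. A minimal sketch, with hypothetical stage, table, and file format names:

```sql
-- Define a reusable Parquet file format and load by column name rather than
-- by position.
CREATE FILE FORMAT IF NOT EXISTS ff_parquet TYPE = PARQUET;

COPY INTO orders
  FROM @landing_stage/orders/
  FILE_FORMAT = (FORMAT_NAME = 'ff_parquet')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```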

Conclusion

As we have explored, managing costs while ingesting extensive data sets into Snowflake requires a strategic data handling and processing approach. Organizations can significantly reduce their cloud and Snowflake costs by optimizing data ingestion through Snowpipe Streaming, focusing on deltas, and minimizing unnecessary historical data storage.

Incorporating the strategies mentioned in this blog can help you reduce expenses and improve the overall efficiency of your data management system. These tips should be viewed not only as cost-saving measures but also as essential components of a strong data strategy promoting scalable, efficient, and cost-effective Snowflake operations. To ensure your data processes remain lean and robust, encourage a culture of continuous learning and keep yourself updated with the latest methods and trends.

If you do need help, check out Infostrux Solutions' approaches to Data Engineering.

I am Augusto Rosa, VP of Engineering at Infostrux Solutions, a Snowflake Data Superhero, and SME. Thanks for reading my blog post. You can follow me on LinkedIn.

Subscribe to Infostrux Medium Blogs https://medium.com/infostrux-solutions for the most interesting Data Engineering and Snowflake news.
