- ran into CDC in a brilliant podcast interview with the Hudi creator Vinoth Chandar — https://www.youtube.com/watch?v=PMBvrJNG6rU
- looked into another video where the approach is described in a Databricks conference talk — https://youtu.be/3-8_EmqTG4k?si=pCJyitZeJUrgM6Dj&t=465
def
- the technique has been around for 10-15 years or so, but what is new is the demand for change data
- CDC is the technology that turns a database change log into a stream of events
- kind of like time-series data, but the series is the changes happening to the database itself
- every single write — every insert, update, delete — is captured in the database's change log
- we have something that reads the database log
- not the database directly — we're not querying the tables and consuming database resources that way
- we're reading the log file of the database
- and turning it into a stream of events that you can then do stuff with
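the "read the log, turn it into events" idea can be sketched roughly like this — note the log-line format, field names (`lsn`, `op`, `row`), and the `change_events` helper are all made up for illustration, not any real tool's API:

```python
import json

# Hypothetical change-log lines, shaped like what a log reader might surface:
# each record carries a log sequence number, the operation, table, and row data.
RAW_LOG = [
    '{"lsn": 1, "op": "insert", "table": "users", "row": {"id": 1, "name": "Ada"}}',
    '{"lsn": 2, "op": "update", "table": "users", "row": {"id": 1, "name": "Ada L."}}',
    '{"lsn": 3, "op": "delete", "table": "users", "row": {"id": 1}}',
]

def change_events(log_lines):
    """Turn raw change-log lines into a stream of structured change events."""
    for line in log_lines:
        record = json.loads(line)
        yield {
            "position": record["lsn"],   # log sequence number: total order of changes
            "operation": record["op"],   # insert / update / delete
            "table": record["table"],
            "data": record["row"],
        }

for event in change_events(RAW_LOG):
    print(event["position"], event["operation"], event["table"], event["data"])
```

the key point the notes make is visible here: we never query the `users` table itself, only its log — the events are a byproduct of writes that already happened.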
use
- backup and disaster recovery
- you can create exact point in time replicas of the database
- catches what batch data loading misses — changes that happen in between snapshot frequencies
- matters if you're trying to do something like fraud detection, or train a machine learning model on real-world data sets
- things can happen at a higher frequency than your database snapshots, and you'd miss them; with change data capture you get all of it
- and there's tons of …
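the "exact point-in-time replica" use case above boils down to replaying the event stream up to a chosen log position — a minimal sketch, assuming the hypothetical event shape below (not a real CDC library's format):

```python
# Hypothetical change events, already ordered by log position.
EVENTS = [
    {"position": 1, "operation": "insert", "data": {"id": 1, "name": "Ada"}},
    {"position": 2, "operation": "update", "data": {"id": 1, "name": "Ada L."}},
    {"position": 3, "operation": "delete", "data": {"id": 1}},
]

def replay(events, up_to_position):
    """Rebuild table state as of a given log position by applying events in order."""
    state = {}
    for ev in events:
        if ev["position"] > up_to_position:
            break
        row_id = ev["data"]["id"]
        if ev["operation"] in ("insert", "update"):
            state[row_id] = ev["data"]   # upsert the row
        else:                            # delete
            state.pop(row_id, None)
    return state

print(replay(EVENTS, up_to_position=2))  # state after the update
print(replay(EVENTS, up_to_position=3))  # empty: the row was deleted
```

this is also why a batch snapshot misses things: a snapshot taken at position 3 never sees the row at all, while the event stream shows its full insert → update → delete history.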