Where data comes from
Modern applications generate data in many forms as customers use them, for example:
- log files
- transaction records (order confirmations, emails, etc.)
The systems that work with this data are referred to as OLTP (Online Transaction Processing) systems. They are optimized to write data quickly (appending rows) and to retrieve specific records, such as a single customer or order row.
Eventually, you will want to analyze that data. Initially this can be done in a traditional database using SQL, but as the data grows, queries will take too long to run.
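As a rough illustration of that access pattern, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an OLTP database. The `orders` table, its columns and the helper names `record_order` and `get_order` are made up for this example.

```python
import sqlite3

# In-memory database standing in for an OLTP store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

def record_order(customer, amount):
    """Append one order row, the typical OLTP write."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )
    return cur.lastrowid

def get_order(order_id):
    """Fetch a single order by primary key, the typical OLTP read."""
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders WHERE order_id = ?",
        (order_id,),
    ).fetchone()

first_id = record_order("alice", 19.99)
second_id = record_order("bob", 5.00)
print(get_order(second_id))
```

Both operations touch one row at a time, which is the kind of work a row-based store handles well.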
OLAP - Online Analytical Processing
OLAP systems are designed to handle analytical workloads, which often involve reading lots of data, filtering on a few columns and so forth. Data in these systems uses columnar formats instead of row-based ones for faster analytical performance. The best of these systems are built around columnar standards for files (Apache Parquet), in-memory data (Apache Arrow) and data transport (Apache Arrow Flight).
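To make the row versus column distinction concrete, here is a small sketch in plain Python. It is only an illustration of the layout idea, not how Parquet or Arrow are implemented; the sample records and the `to_columns` helper are invented for this example.

```python
# Row-oriented layout: each record is stored together (good for OLTP lookups).
rows = [
    {"order_id": 1, "customer": "alice", "amount": 19.99},
    {"order_id": 2, "customer": "bob", "amount": 5.00},
    {"order_id": 3, "customer": "alice", "amount": 42.50},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

# Column-oriented layout: all values of one column are stored together.
columns = to_columns(rows)

# An analytical query such as "total revenue" only needs the "amount" column,
# so a columnar engine can skip "order_id" and "customer" entirely.
total_revenue = sum(columns["amount"])
print(total_revenue)
```

With the row layout, the same query would have to read every field of every record just to pick out the amounts.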
ETL (Extract, Transform, Load)
The data generated by our OLTP systems needs to be moved over to our OLAP systems, which means we are going to need to:
- Extract the data from the source
- Transform and prepare the data
- Load it into its destination
It's the job of data engineers to architect, implement and maintain these ETL pipelines.
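The three steps above can be sketched as three small functions. This is a toy pipeline under stated assumptions: both databases are in-memory sqlite3 connections, and the `orders` and `sales` tables, their columns and the cleanup rules are hypothetical.

```python
import sqlite3

def extract(source):
    """Read raw order rows from the source (OLTP) database."""
    return source.execute("SELECT customer, amount_cents FROM orders").fetchall()

def transform(rows):
    """Prepare rows: normalize customer names and convert cents to dollars."""
    return [(customer.strip().lower(), cents / 100) for customer, cents in rows]

def load(destination, rows):
    """Write the prepared rows into the destination (analytics) table."""
    with destination:  # commit once all rows are inserted
        destination.executemany(
            "INSERT INTO sales (customer, amount) VALUES (?, ?)", rows
        )

# Hypothetical source and destination, both in memory for the demo.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (customer TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(" Alice ", 1999), ("BOB", 500)]
)

destination = sqlite3.connect(":memory:")
destination.execute("CREATE TABLE sales (customer TEXT, amount REAL)")

load(destination, transform(extract(source)))
loaded = destination.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes it easier to test and swap out one stage, for example a different source, without touching the others.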
Batch Data
One way to run our ETL is periodically, for example loading all of last week's sales data at the end of each week. This is referred to as batch processing. The downside is that data consumers, like data analysts and data scientists, don't have real-time data, which delays the time to insight.
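A weekly batch like this usually boils down to picking a date window and processing everything inside it. Here is a minimal sketch, assuming weeks run Monday to Sunday; the sample records and the helper names `last_week_range` and `select_batch` are hypothetical.

```python
from datetime import date, timedelta

def last_week_range(today):
    """Return (start, end) of the previous Monday-to-Sunday week; end is exclusive."""
    this_monday = today - timedelta(days=today.weekday())
    return this_monday - timedelta(days=7), this_monday

def select_batch(sales, start, end):
    """Keep only the sales whose date falls inside the batch window."""
    return [sale for sale in sales if start <= sale["sold_on"] < end]

sales = [
    {"sold_on": date(2024, 1, 1), "amount": 10.0},  # Monday of the previous week
    {"sold_on": date(2024, 1, 7), "amount": 20.0},  # Sunday of the previous week
    {"sold_on": date(2024, 1, 8), "amount": 30.0},  # current week, not in this batch
]

# Pretend the job runs on Wednesday, 10 January 2024.
start, end = last_week_range(date(2024, 1, 10))
batch = select_batch(sales, start, end)
print(start, end, len(batch))
```

The sale from the current week stays out of the batch until the next run, which is exactly the delay described above.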
Streaming Data
To have real-time data delivered to your data consumers, you can use data streaming. Data streaming works by having continuous pipelines that constantly extract new data, apply the necessary transforms and load it into the target destination, where consumers can begin using the data. Loading or ingesting streaming data can be a little trickier than batch data, but it gives you access to real-time data.
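The difference from batch is that each record is handled as soon as it arrives instead of waiting for a scheduled run. Here is a minimal sketch of that idea, with a Python generator standing in for a live source such as a message queue. The event fields and function names are made up for this example, and a real pipeline would run indefinitely rather than over a fixed list.

```python
def event_stream(events):
    """Stand-in for a live source; yields events one at a time."""
    for event in events:
        yield event

def transform(event):
    """Apply the same kind of cleanup a batch job would, but to a single event."""
    return {
        "customer": event["customer"].strip().lower(),
        "amount": event["cents"] / 100,
    }

def run_pipeline(source, sink):
    """Transform and load each event as soon as it arrives."""
    for event in source:
        sink.append(transform(event))

# The sink is a plain list here; in practice it would be the OLAP destination.
sink = []
incoming = [
    {"customer": " Alice ", "cents": 1999},
    {"customer": "BOB", "cents": 500},
]
run_pipeline(event_stream(incoming), sink)
print(sink)
```

Consumers reading from the sink see each record shortly after it is produced, at the cost of a pipeline that has to stay up and handle events one by one.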
Summary
Data needs to be moved from OLTP systems into OLAP systems for data consumers, so data engineers build ETL pipelines that ingest that data either in batches or as streams.