TLDR
AWS S3 is a widely used option for building a data lake. Let's see how Singer helps data engineers sync data from AWS S3 (source) to a Postgres database (destination).
Outline
- What’s AWS S3?
- Why use Singer to sync data from different Data Sources?
- Step-by-step process to sync data from AWS S3 (source) to Postgres Database (destination)
- Conclusion
What’s AWS S3?
Amazon Web Services (AWS) S3, or Simple Storage Service, is a scalable, durable, and highly available object storage service. It's designed to store and retrieve any amount of data from anywhere on the internet, making it an ideal choice for various use cases, including data backup, archiving, big data analytics, content distribution, and more.
S3 offers a simple web service interface, allowing users to store and retrieve data using API calls. The service is organized into "buckets", which are essentially containers for objects (files). Each object is identified by a unique key within its bucket and can be managed with metadata, access control, and versioning.
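For example, here's a minimal sketch of that interface using the boto3 Python SDK; the bucket name and object key below are placeholders:
import boto3

# Create an S3 client (credentials are picked up from the environment or ~/.aws)
s3 = boto3.client("s3")

# Store an object under a unique key in the bucket
s3.put_object(Bucket="your_s3_bucket_name", Key="data/users.csv", Body=b"id,name\n1,Alice\n")

# Retrieve the same object by its key
response = s3.get_object(Bucket="your_s3_bucket_name", Key="data/users.csv")
print(response["Body"].read().decode("utf-8"))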
Key features of AWS S3 include:
- Scalability: S3 can store an unlimited amount of data and automatically scales as your storage needs grow.
- Durability: It is designed for 99.999999999% (11 nines) durability, meaning the risk of losing data is extremely low.
- Availability: S3 is designed for 99.99% availability of objects over a given year.
- Security: It supports various security features, such as encryption, access control, and logging.
- Cost-effectiveness: With its pay-as-you-go pricing model, you only pay for the storage and data transfer you use.
Why use Singer to sync data from different Data Sources?
Singer is an open-source framework for data ingestion, which provides a standardized way to move data between various data sources and destinations (such as databases, APIs, and data warehouses). Singer offers a modular approach to data extraction and loading by leveraging two main components: Taps (data extractors) and Targets (data loaders). This design makes it an attractive option for data ingestion for several reasons:
- Extensibility: Growing library of pre-built Taps and Targets, ability to develop custom components
- Flexibility: Mix and match Taps and Targets for customized data pipelines
- Standardization: Enforces a JSON-based message format for consistent communication between components, as shown in the example after this list
- Ease of use: Simple setup and configuration with configuration files
- Community-driven: Benefits from contributions and improvements by a large community
- Language-agnostic: Supports creation of Taps and Targets in different programming languages
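To make the standardization point concrete: a Tap writes a stream of JSON messages to stdout, and a Target reads them from stdin. Following the Singer specification, the three core message types look roughly like this (the "users" stream and its fields are just illustrative):
{"type": "SCHEMA", "stream": "users", "schema": {"properties": {"id": {"type": "integer"}, "name": {"type": "string"}}}, "key_properties": ["id"]}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Alice"}}
{"type": "STATE", "value": {"users": "2023-01-01T00:00:00Z"}}
Because every Tap emits and every Target consumes this same format, any Tap can be piped into any Target.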
Step-by-step process to sync data from AWS S3 (source) to Postgres Database (destination)
To sync data from AWS S3 to downstream systems using Singer, you will need to use a Singer Tap for AWS S3 and a Singer Target for your specific downstream system. Here's a step-by-step guide on how to set up a data pipeline using Singer:
1. Install Singer:
First, install the Singer Python library. It's recommended to use a virtual environment for your Python project. You can install it using pip:
pip install singer-python
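One caveat: Taps and Targets often have conflicting Python dependencies, so the Singer documentation recommends installing each one into its own virtual environment. A sketch of that setup for the Tap and Target used in steps 2 and 3 (the directory layout is just one option):
python -m venv ~/.virtualenvs/tap-s3-csv
~/.virtualenvs/tap-s3-csv/bin/pip install tap-s3-csv

python -m venv ~/.virtualenvs/target-postgres
~/.virtualenvs/target-postgres/bin/pip install target-postgres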
2. Choose a Tap for AWS S3:
At the time of writing, there isn't an official Singer Tap for AWS S3, but there are community-contributed Taps that might work for your use case. One example is "tap-s3-csv", which reads CSV files from S3:
pip install tap-s3-csv
For other file formats or custom Taps, you can search the Singer community or create your own Tap.
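If you do end up writing your own Tap, the singer-python library from step 1 provides helpers that emit spec-compliant messages for you. A minimal sketch of a hypothetical Tap (the "users" stream and its records are made up; a real Tap would read them from S3):
import singer

# Describe the stream with a JSON Schema
schema = {"properties": {"id": {"type": "integer"}, "name": {"type": "string"}}}

# Emit a SCHEMA message for the "users" stream, keyed on "id"
singer.write_schema("users", schema, key_properties=["id"])

# Emit RECORD messages for each row
singer.write_records("users", [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])

# Emit a STATE message so the next run can resume where this one stopped
singer.write_state({"users": "2023-01-01T00:00:00Z"})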
3. Choose a Singer Target:
Select a Singer Target suitable for your downstream system. For example, if you want to sync data to a PostgreSQL database, you can use "target-postgres":
pip install target-postgres
4. Configure the Tap:
Create a configuration file for the Tap (e.g., tap_config.json). This file should contain the necessary AWS S3 and source file details:
{
  "aws_access_key_id": "your_aws_access_key_id",
  "aws_secret_access_key": "your_aws_secret_access_key",
  "bucket": "your_s3_bucket_name",
  "start_date": "2023-01-01T00:00:00Z",
  "file_format": "csv"
}
Replace the placeholders with your actual AWS credentials, bucket name, and other details.
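Keep in mind that the exact keys vary by Tap version; the community tap-s3-csv, for instance, typically also expects a "tables" array describing which files to pick up and how to key them. A hypothetical entry might look like:
"tables": [
  {
    "table_name": "users",
    "search_prefix": "exports",
    "search_pattern": "users.*\\.csv",
    "key_properties": ["id"],
    "delimiter": ","
  }
]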
5. Configure the Target:
Create a configuration file for the Target (e.g., target_config.json). This file should contain the connection details for your downstream system. For example, for target-postgres, it would look like:
{
  "postgres_host": "your_postgres_host",
  "postgres_port": 5432,
  "postgres_database": "your_postgres_database",
  "postgres_username": "your_postgres_username",
  "postgres_password": "your_postgres_password",
  "postgres_schema": "your_postgres_schema"
}
Replace the placeholders with your actual PostgreSQL connection details.
6. Run the Tap and Target:
Finally, run the Tap and Target together in your terminal, connecting their input and output using a pipe:
tap-s3-csv -c tap_config.json | target-postgres -c target_config.json
This command will run the S3 Tap with the provided configuration, read the data from the specified S3 bucket, and send it to the PostgreSQL Target, which will load the data into the specified database.
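One practical refinement: Targets emit STATE messages on stdout as they finish loading, and Taps accept a state file via -s, which is how Singer supports incremental syncs that skip already-processed data. A common pattern (assuming your Tap and Target follow the standard Singer CLI conventions) is:
# First run: capture the state the Target emits
tap-s3-csv -c tap_config.json | target-postgres -c target_config.json >> state.json

# Later runs: hand the latest state back to the Tap
tail -1 state.json > latest_state.json
tap-s3-csv -c tap_config.json -s latest_state.json | target-postgres -c target_config.json >> state.json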
Remember that the exact configuration and execution commands may vary depending on the specific Tap and Target you choose. Always refer to the documentation of the chosen components for detailed instructions and available options.
Conclusion
Using Singer to ingest data from AWS S3 to PostgreSQL offers a flexible, extensible, and easy-to-implement solution for building data pipelines. By combining a suitable S3 Tap with a PostgreSQL Target and configuring them with JSON files, users can efficiently transfer and sync data between these systems. This approach benefits from Singer's growing ecosystem of Taps and Targets, making it an effective choice for handling diverse data ingestion needs.
In episode 3 of the data sources series, we’ll see how to use Mage for data ingestion.
Link to original blog: https://www.mage.ai/blog/data-sources-ep-2-aws-s3-to-postgres-data-sync-using-singer