Introduction
You may be a data scientist, business analyst or data analyst familiar with loading data from Amazon S3 into Amazon Redshift using the COPY command, at AWS re:invent 2022 to help AWS customers move towards a zero-ETL future without the need for a data engineer to build an ETL pipeline, data movements can be simplified with auto-copy from Amazon S3 into Amazon Redshift.
Lesson Outcome
- In this lesson you will learn how to ingest data with auto-copy from Amazon S3 into Amazon Redshift.
What is Amazon Redshift Auto-Copy from Amazon S3 ?
Amazon Redshift auto-copy from Amazon S3 simplifies data ingestion from Amazon S3.
How does this feature work?
Auto-copy from Amazon S3 is a simple, low code data ingestion that automatically loads new files that are detected in your S3 bucket into Amazon Redshift.
New files can be ingested in the formats csv, json, parquet or avro.
You may store a COPY command in a COPY job in Amazon Redshift which will detect new files stored in Amazon S3 and load the data into your table.
What are the benefits?
You can save time for your team by avoiding manual uploads of new data from an S3 bucket into Amazon Redshift with COPY statements.
COPY jobs will be able to detect data that was previously loaded in the data ingestion process.
COPY jobs can prevent duplicated data when an automated job is not required.
COPY jobs can be manually created to reuse copy statements
Solution Architecture
In this high-level architecture, new files are ingested into a single Amazon S3 bucket into separate folders and a copy job inserts the data into the corresponding Table A and Table B.
Prerequisites
Download an interesting open source dataset and save it in your directory. This tutorial includes data from Kaggle at this link
Login into your AWS account as an IAM Admin user. You may view the blog to create an account.
Create an Amazon S3 bucket. You may view this blog to create your first S3 bucket.
Use AWS Glue to create a crawler to inspect the meta data in the data catalogue table. You may follow this blog to learn how to create a crawler.
Dataset
The World Happiness Report dataset is available to download from Kaggle here. There are five datasets ranging from 2015 to 2019
Data Description
There are 12 columns in each csv file:
- Country
- Region
- Happiness Rank
- Happiness Score
- Standard Error
- Economy - GDP per capita
- Family
- Health - Life Expectancy
- Freedom
- Trust - Government Corruption
- Generosity
- Dystopia
Using AWS Glue to create a crawler to inspect the schema in one file from the data catalogue:
Tutorial 1: Getting started with auto-copy files from Amazon S3(Preview) into Amazon Redshift
For this tutorial we will follow the steps outlined in the Amazon Redshift Database Developer Guide
Step 1: Sign into the AWS Management Console with your AWS IAM Admin User account details.
Step 2: Navigate to Amazon Redshift by typing 'Redshift' in the search bar.
Step 3: Create a cluster in Preview by selecting Provision clusters dashboard
In this preview, I will use the AWS region US-East (Ohio) (us-east-2).
Step 4: Click Create preview cluster in the banner.
Step 5: Create a unique name to identify your cluster in lowercase letters.
Next select from the drop-down menu a Preview track e.g. Preview 2022 to allow you to test the new features.
Step 6: Inspect the size of your data or estimate the size.
Step 7: Enter other configurations for your cluster. I selected Help me choose. I have selected additional options to indicate that my estimate is for compressed data in Amazon Redshift and my data is not time-based.
The calculated configuration summary provided recommendations for 2 nodes and compute size using dc2.large.
Step 8: Under Database Configurations, create an admin user password that includes at least one Uppercase letter and a number.
Step 9: Under Cluster Permissions, I created a new IAM role that has AmazonRedshiftAllCommandsAccess policy attached to allow you to perform the following SQL queries such as COPY, UNLOAD and SELECT. The new IAM role can access any S3 bucket and it replaces the default IAM role.
Type 'IAM' in the navigation search bar and click Roles create a new role.
Add additional permissions to access Query Editor 2, Amazon S3 and Amazon Redshift Full Access.
Provide a name of the new service role e.g.'redshift-autocopys3-custom-role'
Add permissions and then click Create role.
Confirm that the new service role has been created in IAM.
Step 10: Click Create cluster in preview, it will take a few minutes to provision. (Note: do not use the preview cluster for your production workloads).
The preview cluster has been created successfully.
Step 11: On the navigation pane on the left, click Query editor v2
Step 12: Double click on the preview cluster (i.e.database) and open the arrow to inspect the tables.
You can inspect the new preview cluster that has been created.
Step 13: Create an Amazon S3 bucket with a globally unique name e.g. world-hapiness-report using default settings and click Create bucket.
Step 14: Create a single folder in the Amazon S3 bucket for automated file ingestion e.g. 's3-autocopy-file'
Create a subfolder named e.g.'datasets'
Upload five csv files.
Tutorial 2: Load data Amazon Redshift and query data.
Step 1: Create a table in Amazon Redshift.
The following SQL code creates the world_happiness table:
Create table world_happiness (
country VARCHAR(100),
region VARCHAR(100),
happiness_rank int,
happiness_score float4,
standard_error float4,
economy float4,
family float4,
health float4,
freedomm float4,
trust float4,
generosity float4,
dystopia float4)
;
Step 18: Create a COPY job to automatically load csv files from Amazon S3 path into the world happiness table in Amazon Redshift.
COPY world_happiness
FROM 's3://world-happiness-report/s3-autocopy-file/datasets/published_reports'
IAM_ROLE 'arn:aws:iam::***********:role/redshift-autocopys3-custom-role'
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
IGNOREHEADER 1
JOB CREATE job_world_happiness1 AUTO ON;
Final Notes
Autocopy files from Amazon S3 to Amazon Redshift should be used for test workloads only during public preview.
After you have finished testing, please delete your provisioned Amazon Redshift cluster to avoid billing charges.
IAM role does not support default during public preview. You may read other best practice guidelines in this blog.
AWS re:Invent 2022
If you missed the keynote from Dr Swami Sivasubramanian, VP of Data and Machine Learning at Amazon Web Services you may watch on demand and also keep informed of the other keynotes here.
If you would like to watch the announcement of the auto-copy files from Amazon S3 (Preview) into Amazon Redshift you may fast forward to the timing at 1 hour and 18 minutes.
Reference
What's new next week? 🌎
- AWS Innovate - Data and AI/ML edition - Americas on 14 March including Pi Day celebrations join the party to celebrate 17 years of launching Amazon S3 and have some delicious🥧
Register for AWS Pi Day at this link
Until the next lesson, happy learning! 🙂