This document covers how to use an AWS Glue crawler to crawl raw data stored in S3, automatically add tables to a Glue database, and run queries against them from Dremio or Athena
Setup Diagram
Steps to follow
Create an S3 bucket and upload the raw data, e.g. CSV or JSON files.
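If you prefer to script this step, here is a minimal boto3 sketch; the bucket name my-raw-data-bucket, the region, and the file orders.csv are placeholders for your own values.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket (regions other than us-east-1 also need a CreateBucketConfiguration)
s3.create_bucket(Bucket="my-raw-data-bucket")

# Upload a raw CSV file under a raw/ prefix
s3.upload_file("orders.csv", "my-raw-data-bucket", "raw/orders.csv")
```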
Go to the AWS Glue console and create a Glue database
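The same database can also be created through the Glue API; a minimal boto3 sketch, assuming the placeholder name raw_data_db:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create the Glue database the crawler will add tables to
glue.create_database(
    DatabaseInput={"Name": "raw_data_db", "Description": "Tables cataloged from raw S3 data"}
)
```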
Go to the Tables page and select Add tables using crawler in the top right corner
This should land you on the AWS Glue crawler setup page
Follow the steps below to fill in the details
- Name - Enter the Crawler name
Add data source
- Data source - Select S3
- Location of S3 data - Select In this account (if that’s the case)
- S3 path - Browse to the S3 bucket that contains the data and don't forget to add a forward slash at the end
- Subsequent crawler runs - Select Crawl all sub-folders
Click Add an S3 data source
Click Next → Configure security settings
Click Create new IAM role and give the role a name. This creates the IAM role the Glue crawler needs to read the data in the S3 bucket
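For reference, the role the console creates for you looks roughly like the sketch below: a role that Glue can assume, the AWS-managed AWSGlueServiceRole policy, and read access to the data bucket. The role name, inline policy name, and bucket name are placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Glue service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName="AWSGlueServiceRole-raw-data",
                AssumeRolePolicyDocument=json.dumps(trust_policy))

# AWS-managed policy required by Glue crawlers
iam.attach_role_policy(RoleName="AWSGlueServiceRole-raw-data",
                       PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")

# Inline policy granting read access to the raw data bucket
iam.put_role_policy(
    RoleName="AWSGlueServiceRole-raw-data",
    PolicyName="s3-read-raw-data",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-raw-data-bucket",
                         "arn:aws:s3:::my-raw-data-bucket/*"],
        }],
    }),
)
```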
Next, Set output and scheduling
- Select the Target Database - you can choose default or create a new one
- Crawler schedule - On Demand
Next → Review and Create → Create Crawler
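The whole wizard corresponds to a single CreateCrawler call; a minimal boto3 sketch using the placeholder names from the previous steps:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="raw-data-crawler",
    Role="AWSGlueServiceRole-raw-data",        # IAM role created above
    DatabaseName="raw_data_db",                # target Glue database
    Targets={"S3Targets": [{"Path": "s3://my-raw-data-bucket/raw/"}]},  # note the trailing slash
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},              # crawl all sub-folders
    # Omitting the Schedule argument creates an on-demand crawler
)
```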
Now the crawler has been created and you can run it
It will take a few minutes to crawl the S3 bucket and once it is done, you should see the state as Ready
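A scripted equivalent is to start the crawler and poll its state until it returns to READY; a minimal sketch, assuming the placeholder crawler name raw-data-crawler:

```python
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.start_crawler(Name="raw-data-crawler")

# Poll until the crawler is back in the READY state (a crawl typically takes a few minutes)
while glue.get_crawler(Name="raw-data-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

print("Crawl finished:", glue.get_crawler(Name="raw-data-crawler")["Crawler"]["LastCrawl"]["Status"])
```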
Now, you should be able to see a table added to the Glue database
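You can verify the same thing from the API by listing the tables in the database; a minimal sketch with the placeholder database name raw_data_db:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Print each table the crawler created, with its S3 location
for table in glue.get_tables(DatabaseName="raw_data_db")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```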
- Go to Dremio → Add the Glue catalog as a source
- Name - Enter the Glue catalog name
- Region - Select the AWS region
- Authentication - AWS Access key
Click Save and run queries on the Glue database from Dremio or Athena!
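For Athena, a scripted query looks roughly like the sketch below; the table name orders, the database raw_data_db, and the results bucket are placeholders, and Athena needs an S3 output location it can write query results to.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a query against the crawled table
run = athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "raw_data_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = run["QueryExecutionId"]

# Wait for the query to finish, then print the rows
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```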