Part I of this article will detail how to streamline your data analytics with AWS Athena using Terraform Infrastructure as Code.
Requirements
- AWS account
- Terraform
Walkthrough
Configure AWS credentials with aws configure
.
At the root of our project, create a file named provider.tf
which we will specify the aws provider. Also we are going to read dynamically our columns and declare a variable for each column.
#provider.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.59.0"
}
}
}
Next, let's define an S3 bucket in a file named s3.tf
which we will use as a source of data.
#s3.tf
resource "aws_s3_bucket" "log_bucket" {
bucket = "my-tf-log-bucket-source-athena"
}
Now we will define a file named athena.tf
, in this file we are going to create an athena workgroup,db, table, and our query.
For our aws_glue_catalog_table we are expecting the table to serealize a json object.
Note: Is important to note that our json files should have no spaces, otherwise it won't parse well other alternative could be the use of LazySimpleSerDe which could convert from byte stream
#athena.tf
resource "aws_glue_catalog_database" "myservice_athena_db" {
name = "myservice"
}
resource "aws_glue_catalog_table" "athena_table" {
name = "mytable"
database_name = aws_glue_catalog_database.myservice_athena_db.name
description = "Table containing the results stored in S3 as source"
table_type = "EXTERNAL_TABLE"
storage_descriptor {
location = "s3://${aws_s3_bucket.log_bucket.bucket}/mydata"
input_format = "org.apache.hadoop.mapred.TextInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
ser_de_info {
name = "s3-stream"
serialization_library = "org.openx.data.jsonserde.JsonSerDe"
parameters = {
"ignore.malformed.json" = "TRUE"
"dots.in.keys" = "FALSE"
"case.insensitive" = "TRUE"
"mapping" = "TRUE"
}
}
dynamic "columns" {
for_each = var.columns
iterator = column
content {
name = column.value.name
type = column.value.type
}
}
}
}
resource "aws_athena_workgroup" "test" {
name = "example123123s"
configuration {
enforce_workgroup_configuration = true
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.log_bucket.bucket}/output/"
}
}
}
resource "aws_athena_named_query" "athena_query" {
name = "test_query"
workgroup = aws_athena_workgroup.test.id
database = aws_glue_catalog_database.myservice_athena_db.name
query = "SELECT json_extract_scalar(person, '$.name') as name, product as source FROM myservice.mytable;"
}
Our json example should look like this (but without spaces)
{
"product": "Live JSON generator",
"version": 3.1,
"releaseDate": "2014-06-25T00:00:00.000Z",
"demo": true,
"person": {
"id": 12345,
"name": "John Doe",
"phones": {
"home": "800-123-4567",
"mobile": "877-123-1234"
},
"email": [
"jd@example.com",
"jd@example.org"
],
"dateOfBirth": "1980-01-02T00:00:00.000Z",
"registered": true,
"emergencyContacts": [
{
"name": "Jane Doe",
"phone": "888-555-1212",
"relationship": "spouse"
},
{
"name": "Justin Doe",
"phone": "877-123-1212",
"relationship": "parent"
}
]
}
}
In our variables.tf we will dynamically call all the columns
#variables.tf
variable "columns" {
type = list(object({
name = string,
type = string,
}))
default = []
description = "The columns in the table, where the key is the name of the column and the value the type"
}
#terraform.tfvars
columns = [
{
name = "product"
type = "string"
},
{
name = "version"
type = "int"
},
{
name = "releaseDate"
type = "string"
},
{
name = "demo"
type = "boolean"
} ,
{
name = "person"
type = "string"
}
]
After finishing our files let's use terraform to create our resources.
terraform init
terraform plan
terraform apply
Update our data.json file through s3 commands or directly in the interface
aws s3 cp data.json s3://<<your_bucket_name>>/mydata/
Now in Athena service in our AWS account we can see our db, table and saved query, running our query should return a value like this.