Open table formats and their architecture
Open table formats, such as Apache Iceberg, Apache Hudi, and Delta Lake, have gained popularity in data analytics mainly because of:
- ACID Transactions: They ensure reliable and consistent data updates, even with concurrent access.
- Schema Evolution: They allow seamless updates to schemas without disrupting existing pipelines, simplifying data management. Metadata tracks changes to the dataset: the files held in the data layer are captured by metadata files held in the metadata layer, and as the data files change, the metadata files attached to them track those changes.
- Optimized Queries: Partitioning and indexing enable faster queries by scanning only relevant data, improving performance and cost-efficiency.
- Time Travel: Users can access historical versions of data for debugging, compliance, or analytics (see the PySpark sketch after this list).
- Interoperability: These formats integrate seamlessly with big data tools like Spark, Flink, and Presto, making them versatile and widely adopted.
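For example, Iceberg's time travel can be exercised directly from Spark SQL. Here is a minimal PySpark sketch, assuming a Spark session launched with the Iceberg runtime and an Iceberg catalog named demo; the catalog, table, and timestamp are hypothetical:

from pyspark.sql import SparkSession

# Assumes pyspark was started with the Iceberg runtime jar and an Iceberg
# catalog named "demo" configured; demo.sales.orders is a hypothetical table.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Standard Iceberg time-travel syntax in Spark SQL: query the table as it
# existed at a given point in time.
df = spark.sql(
    "SELECT * FROM demo.sales.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
)
df.show()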
Open file format
S3 table
Key Features
Amazon S3 Tables is optimized for analytics workloads. It is designed to continuously enhance query performance and reduce storage costs for tabular data. This solution looks very promising if you are working with a lakehouse architecture. It introduces a new type of bucket, the table bucket, that organizes tables as sub-resources. Like any other AWS resource, a table bucket has an ARN and can take a resource policy, and as a unique feature it has a dedicated endpoint (see the boto3 sketch after this list).
- S3 Tables are intended explicitly for storing data in a tabular format, such as daily purchase transactions, streaming sensor data, or ad impressions. This data is organized into columns and rows like a database table.
- Table buckets support storing tables in the Apache Iceberg format. You can query these tables using standard SQL in query engines that support Iceberg.
- Reads and writes are allowed on data files and metadata files; deletes and updates are not allowed, to preserve data integrity.
- Compatible query engines include Amazon Athena, Amazon Redshift, and Apache Spark.
- S3 Tables automatically performs maintenance tasks like compaction and snapshot management to optimize your tables for querying, including removing unreferenced files.
- S3 Tables offers access management at both the table and the bucket level.
- Fully managed Apache Iceberg tables in S3.
- It supports automatic compaction of underlying files to improve query performance and tune them further for better latency.
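As a concrete illustration, here is a minimal boto3 sketch of creating a table bucket. The bucket name is hypothetical, and the s3tables client parameters should be verified against your SDK version:

import boto3

# The s3tables client talks to the dedicated S3 Tables endpoint noted above.
s3tables = boto3.client("s3tables")

# Create a table bucket (the name is hypothetical).
response = s3tables.create_table_bucket(name="amzn-s3-demo-table-bucket")

# Table buckets are first-class resources with their own ARN.
print(response["arn"])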
S3 table bucket namespaces
A namespace logically groups related S3 tables together, giving us greater control over them as a unit (a short boto3 sketch follows this list). It helps with the following:
- Logical segmentation of data and multi-tenancy
- Multi-tenancy support through separate namespaces, which also helps meet data-isolation requirements in regulated industries
- Separation of tables by application, project, and so on
- Prevention of naming conflicts
- Each namespace acts like a "container," allowing tables with the same name in different namespaces without conflicts.
- Better access control
- Policies can grant or restrict access to specific namespaces, ensuring data security and compliance. This also reduces the risk of unauthorized access to unrelated tables in the same bucket.
- Easier data management
- Makes it easier to query, update, or delete related tables in bulk.
- Simplifies metadata management for tables grouped under a namespace.
- Advanced namespace-based workflows
- Helps simplify automation for data pipelines and real-time analytics applications.
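Here is a minimal boto3 sketch of creating a namespace and a table inside it, continuing the multi-tenancy idea above; all names and the account/Region in the ARN are hypothetical, and the parameter shapes should be checked against your boto3 version:

import boto3

s3tables = boto3.client("s3tables")
bucket_arn = "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket"

# Create an "hr" namespace inside the table bucket; the API takes the
# namespace as a single-element list.
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["hr"])

# Create an Iceberg table inside that namespace. Another tenant could reuse
# the same table name under a different namespace without conflict.
s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="hr",
    name="employees",
    format="ICEBERG",
)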
S3 table operations & management
Table Operations
They are quite similar to CRUD operations; hedged boto3 sketches follow this list and the Table Management list below.
- List tables
- Create tables
- Get table metadata location
- Update table metadata location
- Delete table
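Here is a hedged boto3 sketch of these table operations; the ARN, names, and metadata path are placeholders, and exact parameter names should be confirmed against the s3tables API reference:

import boto3

s3tables = boto3.client("s3tables")
bucket_arn = "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket"

# List tables in a namespace
tables = s3tables.list_tables(tableBucketARN=bucket_arn, namespace="hr")

# Get the current Iceberg metadata pointer for a table
loc = s3tables.get_table_metadata_location(
    tableBucketARN=bucket_arn, namespace="hr", name="employees"
)

# Update the metadata pointer to commit a new table version; the version
# token guards against concurrent commits (query engines normally do this).
s3tables.update_table_metadata_location(
    tableBucketARN=bucket_arn,
    namespace="hr",
    name="employees",
    versionToken=loc["versionToken"],
    metadataLocation="s3://example-warehouse/employees/metadata/00001.metadata.json",
)

# Delete the table
s3tables.delete_table(tableBucketARN=bucket_arn, namespace="hr", name="employees")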
Table Management
- Put Table Policy
- Put Table Bucket Policy
- Put Table Maintenance Config
- Put Table Bucket Maintenance Config
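And a hedged sketch of attaching a table bucket policy with boto3; the principal and actions are placeholders (full policy examples follow in the next section), and resourcePolicy is assumed to take the policy document as a JSON string:

import json

import boto3

s3tables = boto3.client("s3tables")
bucket_arn = "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket"

# Build an abridged resource policy and attach it to the table bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:user/Jane"},
            "Action": ["s3tables:GetTableData"],
            "Resource": f"{bucket_arn}/table/*",
        }
    ],
}
s3tables.put_table_bucket_policy(
    tableBucketARN=bucket_arn, resourcePolicy=json.dumps(policy)
)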
Policies related to S3 table operations
Allow access to create and use table buckets
Here, Action lists the specific actions the policy allows.
These actions are S3 Tables-specific:
- s3tables:CreateTableBucket: Grants permission to create a table bucket in S3 Tables.
- s3tables:PutTableBucketPolicy: Allows setting or updating the bucket policy for a table bucket.
- s3tables:GetTableBucketPolicy: Allows retrieving the bucket policy associated with a table bucket.
- s3tables:ListTableBuckets: Allows listing all table buckets within the specified scope.
- s3tables:GetTableBucket: Grants permission to access the metadata of a specific table bucket.
Resource defines the scope of the resources these actions can apply to. "arn:aws:s3tables:region:account_id:bucket/*" specifies all table buckets in the given account (account_id) and Region (region); the * after bucket/ indicates that the permissions apply to all table buckets under that account and Region.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBucketActionsForUser",
            "Effect": "Allow",
            "Action": [
                "s3tables:CreateTableBucket",
                "s3tables:PutTableBucketPolicy",
                "s3tables:GetTableBucketPolicy",
                "s3tables:ListTableBuckets",
                "s3tables:GetTableBucket"
            ],
            "Resource": "arn:aws:s3tables:region:account_id:bucket/*"
        }
    ]
}
Allow access to create and use tables in a table bucket
Here, Action lists the specific S3 Tables actions allowed by the policy. Note that the first policy focused on creating and managing table buckets and associated metadata; it did not include granular operations such as creating tables, querying data, or updating metadata at the table level. These are the operations where namespaces become relevant.
- s3tables:CreateTable: Allows creating new tables in the specified table bucket.
- s3tables:PutTableData: Grants permission to write data to tables within the table bucket.
- s3tables:GetTableData: Allows reading data from tables in the bucket.
- s3tables:GetTableMetadataLocation: Allows retrieving metadata location information for a table.
- s3tables:UpdateTableMetadataLocation: Grants permission to update the metadata location of a table.
- s3tables:GetNamespace: Allows retrieving namespace information associated with the table bucket.
- s3tables:CreateNamespace: Grants permission to create namespaces for organizing table data.
The Resource section specifies:
- Grants permissions on the bucket named amzn-s3-demo-table-bucket
- Grants permissions on all tables within the amzn-s3-demo-table-bucket
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBucketActions",
            "Effect": "Allow",
            "Action": [
                "s3tables:CreateTable",
                "s3tables:PutTableData",
                "s3tables:GetTableData",
                "s3tables:GetTableMetadataLocation",
                "s3tables:UpdateTableMetadataLocation",
                "s3tables:GetNamespace",
                "s3tables:CreateNamespace"
            ],
            "Resource": [
                "arn:aws:s3tables:region:account_id:bucket/amzn-s3-demo-table-bucket",
                "arn:aws:s3tables:region:account_id:bucket/amzn-s3-demo-table-bucket/table/*"
            ]
        }
    ]
}
Table bucket policy that allows read access to a namespace
This policy allows reading S3 tables from a namespace. Here, Action lists the specific S3 Tables actions allowed by the policy.
- s3tables:GetTableData: Allows reading data from tables in the bucket.
- s3tables:GetTableMetadataLocation: Allows retrieving metadata location information for a table.
The Resource section allows all S3 tables under the bucket amzn-s3-demo-table-bucket1, but the s3tables:namespace condition then restricts access to only the hr tables.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:user/Jane"
            },
            "Action": [
                "s3tables:GetTableData",
                "s3tables:GetTableMetadataLocation"
            ],
            "Resource": "arn:aws:s3tables:region:account_id:bucket/amzn-s3-demo-table-bucket1/table/*",
            "Condition": {
                "StringLike": { "s3tables:namespace": "hr" }
            }
        }
    ]
}
S3 table automatic maintenance
S3 Tables provides automated maintenance through configurations that help simplify table management, optimize performance, and reduce operational overhead.
- Table Lifecycle Management
- We can add S3 Tables configurations that include lifecycle policies to automatically handle data expiration, transitions, or archival.
- Automatic snapshot expiration can be configured easily.
- Data Compaction
- S3 Tables automatically compacts small files (often produced by incremental writes) into larger, optimized files, which helps queries run faster and reduces storage cost.
- Schema Evolution
- Automated checks ensure compatibility between new and existing data.
- Metadata Optimization
- Indexing of metadata for faster querying and retrieval of table details.
All of these can be driven by policy-based configuration.
Policy for snapshot management
By configuring MaximumSnapshotAge, we can specify the retention period for table snapshots. The following example ensures S3 Tables will automatically retain only the snapshots from the last 30 days (720 hours):
- MinimumSnapshots: Ensures that at least one snapshot is always retained, regardless of age.
- MaximumSnapshotAge: Specifies the maximum age (in hours) for snapshots to be retained.
aws s3tables put-table-maintenance-configuration \
--table-arn arn:aws:s3tables:region:account_id:bucket/bucket_name/table/table_name \
--maintenance-configuration '{
"SnapshotManagement": {
"MinimumSnapshots": 1,
"MaximumSnapshotAge": 720
}
}'
S3 Table Integration with AWS Analytics
S3 Tables integrates seamlessly with AWS analytics services to enable querying, processing, and insight generation.
Amazon Athena - Run serverless SQL queries on S3 Tables
- Use AWS Glue to create a Data Catalog for S3 Tables.
- Query data directly using SQL in Athena.
- Leverage the Apache Iceberg table format with Parquet data files for optimized performance.
AWS Glue - Automate ETL processes for S3 Tables
- Use Glue Crawlers to discover table metadata (see the sketch after this list).
- Create ETL jobs to transform and load data into S3 Tables or other destinations.
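Here is a hedged boto3 sketch of the crawler step; the crawler name, IAM role, database, and S3 path are all hypothetical:

import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 path and registers the discovered
# table metadata in the Glue Data Catalog.
glue.create_crawler(
    Name="s3-tables-demo-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/data/"}]},
)

# Run the crawler; once it finishes, the tables are queryable from Athena.
glue.start_crawler(Name="s3-tables-demo-crawler")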
S3 Metadata tables
- They include system metadata as well as object tags and user-defined metadata.
- The metadata is stored in an S3 table.
- It is generated in near real time as objects are created, so it is queryable within minutes.
Use cases for S3 metadata tables
- Real-Time Analytics
- Efficient query execution on metadata to identify relevant data partitions.
- Machine Learning Pipelines
- Use metadata tables to filter, select, and partition data for model training.
- Governance and Compliance
- Track data retention and enforce lifecycle policies via metadata.
- Multi-Tenant Data Applications
- Use namespaces within metadata tables to logically isolate tenant data.
- Data Cataloging and Discovery
- Use metadata queries to identify datasets matching specific criteria.
Here is a sample Python function that queries the metadata table through Athena.
import time

import boto3

# Assumed configuration -- replace with your own values. These names are
# illustrative, not AWS defaults.
DATABASE = "s3_metadata_db"            # Glue database holding the metadata table
TABLE = "my_bucket_metadata"           # the S3 metadata table name
S3_OUTPUT = "s3://my-athena-results/"  # bucket for Athena query results

athena_client = boto3.client("athena")


def query_metadata_table(criteria):
    # Build the SQL query from the caller-supplied WHERE clause
    query = f"""
        SELECT *
        FROM {DATABASE}.{TABLE}
        WHERE {criteria}
    """
    print(f"Running query: {query}")

    # Start the Athena query
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": S3_OUTPUT},
    )
    query_execution_id = response["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    print("Waiting for query to complete...")
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ["SUCCEEDED", "FAILED", "CANCELLED"]:
            break
        time.sleep(2)

    if state != "SUCCEEDED":
        raise Exception(f"Query failed with state: {state}")

    # Retrieve the result rows; the first row is the column header
    results = athena_client.get_query_results(QueryExecutionId=query_execution_id)
    datasets = []
    for row in results["ResultSet"]["Rows"][1:]:  # Skip the header row
        datasets.append([col.get("VarCharValue") for col in row["Data"]])

    print(f"Query returned {len(datasets)} datasets matching the criteria.")
    return datasets
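Example usage, with a hypothetical criteria string; column names such as key and size follow the S3 metadata table schema, so verify them against your own table:

# Find large Parquet objects recorded in the metadata table (names hypothetical)
large_parquet = query_metadata_table("key LIKE '%.parquet' AND size > 104857600")
for row in large_parquet:
    print(row)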