Hi everyone,
What a re:Invent it has been so far with so many announcements across the board. My name is Peter Hanssens and I am a Serverless Hero based out of Sydney, Australia where I also run a Data Engineering meetup. I thought I'd spend some time talking about some announcements that are of interest to folks working within the data ecosystem.
Many of the announcements listed below are from Rahul Pathak's leadership session on harnessing the power of data with AWS analytics - well worth a watch if you haven't seen it already.
Redshift
Redshift is a cloud data warehouse that, up until last re:Invent, coupled compute and storage. The RA3 instances, which let you scale compute and storage independently, have now been around for a year, and the new XLPlus instances are available at a much lower price point. That's great news for established startups that want to take advantage of that flexibility.
Here are my top announcements for Redshift:
Amazon Redshift launches RA3.xlplus nodes with managed storage
Automatic Table Optimization - this is huge as you no longer need to think about distribution or sort keys!
Preview - AQUA for Redshift - game-changing query performance - this looks to be a quantum leap forward for Redshift.
Preview - native JSON support - JSON and semi-structured data are a feature of many modern data sources and being able to parse this natively within Redshift means less pre-work in landing data into your warehouse.
Preview - Federated query support for RDS and Aurora MySQL - this makes it even easier to ingest data into your data warehouse.
Preview - Amazon Redshift ML - another feature enabling data engineers to do more within the comforts of a data warehouse using SQL - very keen to see what folks can build with this great functionality.
Preview - Data Sharing - a great new feature that allows companies to share data with other third parties.
Preview - Native console integration with partners - another preview aimed at making data integration much faster with third parties such as Salesforce and Slack.
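To give a feel for the pre-work that native JSON support is meant to reduce, here is a minimal sketch in plain Python of the kind of flattening step that often sits in front of a warehouse load. The event payload and the `flatten` helper are made up for illustration and have nothing to do with Redshift's own API.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the column prefix along.
            flat.update(flatten(value, column, sep))
        else:
            # Scalars (and lists, left untouched here) become column values.
            flat[column] = value
    return flat


# Hypothetical semi-structured event, as it might arrive from a source system.
raw_event = (
    '{"order_id": 42, '
    '"customer": {"id": 7, "address": {"city": "Sydney"}}, '
    '"tags": ["new", "priority"]}'
)

row = flatten(json.loads(raw_event))
print(row)
```

Every new nested field means revisiting code like this, which is exactly the sort of work you would hope to skip when the warehouse can hold and query the JSON directly.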
Glue
AWS Glue is a serverless (Yay!) ETL tool with a data catalogue baked in. There have been some wonderful announcements across re:Invent as well as pre:Invent!
Preview - Elastic Views - source data from RDS, Aurora, and DynamoDB using SQL to query across them and surface the results continuously in a materialised view to a variety of destinations including Redshift, S3 and Elasticsearch Service.
Pre:Invent - Schema Registry - this service enables better collaboration across teams maintaining data schemas and supports schema evolution. It integrates with MSK, Kinesis and Lambda out of the box!
Pre:Invent - DataBrew - this service is all about making data preparation easier, tackling the oft-quoted problem of data scientists spending 80% of their time on data prep - very much looking forward to exploring this service further.
Lake Formation
Lake Formation is a set of best practices for rolling out a data lake on AWS, including security and governance.
Preview - Transactions, Row-level Security, and Acceleration - bringing lakehouse features to the data lake.
HealthLake - using the FHIR industry standard to bring together lots of disparate and unstructured data sources allowing for powerful querying and search capabilities.
EMR
EMR is a big data processing platform that gives you access to open source tools such as Presto, Spark, Flink and Hive to name a few.
EMR Studio - a fully managed Jupyter notebook environment with a rich feature set that you can log into using SSO and your corporate credentials.
EMR on EKS - now you can run Spark jobs on EKS with the rich feature set that EMR brings to the table.
Graviton2 instances - Graviton2 has been a revolution in compute performance and now it's doing its thing with EMR, with up to 30% lower cost and up to 15% improved performance.
AppFlow
AppFlow allows you to securely transfer data between SaaS apps such as Salesforce, Marketo, and Slack and AWS Services such as S3 and Redshift.
- Lookout for Metrics integration - you can now detect anomalies and unexpected changes in your metrics without needing to have machine learning expertise.
Batch
Batch is a service that optimally provisions the type and quantity of compute for batch processes that you would like to run.
- Fargate support - now you can submit your Batch jobs without needing to worry about patching your EC2 instances!
Neptune
Neptune is a fast and reliable managed graph database service - many data teams are using graph databases to store metadata and lineage for their data lakes.
- ML Integration - this allows you to run Graph Neural Networks over your data and return results within hours as opposed to weeks with traditional tabular methods.
Managed Airflow
Last but definitely not least, we have Airflow, a workflow orchestration tool that allows data engineers to create DAGs (directed acyclic graphs) to manage dependencies across various data pipelines. Managing an Airflow cluster can easily require a lot of effort, so having this as a managed service is a huge win for data engineering teams already running their own clusters.
Pre:Invent - MWAA - a new fully managed service that allows you to deploy Airflow at scale rapidly.
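If the DAG idea is new to you, here is a minimal sketch of dependency ordering using only the Python standard library (`graphlib`, Python 3.9+). This is not Airflow's API, and the task names are hypothetical; it only illustrates how declaring dependencies lets a scheduler work out a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_datasets": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_datasets"},
    "refresh_dashboard": {"load_warehouse"},
}


def execution_order(dag):
    """Return one valid run order for the tasks.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why the graph has to be acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

The two extract tasks have no dependencies, so they can come in either order (or run in parallel in a real orchestrator), while everything downstream waits for its upstream tasks to finish.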
Thanks for sticking with me for the long read - hope you enjoyed the wrap - and let me know what your pick of the lot is!