Scalable ETL pipeline for Google Merchant XML Feed and RDS with AWS Glue

Oleksandr Hanhaliuk - Nov 3 - - Dev Community

Handling and transforming data efficiently is essential when managing large, structured XML data like product catalogues from Google Merchant.

AWS Glue offers a serverless and highly scalable ETL service that simplifies this process.

In this article, I’ll walk through how to use AWS Glue to import and transform XML Google Merchant Product Feed, seamlessly integrating them into an RDS Postgres database.


Why Use AWS Glue for Product Feed Integration?

AWS Glue simplifies ETL (Extract, Transform, Load) processes, making handling data in structured formats such as XML easier. With AWS Glue Crawlers, you can automate schema detection, transforming data while importing it from an S3 source and loading it into an RDS target database.

Service level arhitecture

Figure 1: Service level architecture


Step 1: Setting Up S3 and XML Crawler

To get started, we’ll store our XML product feed in an S3 bucket and use Glue Crawlers to automatically detect the XML schema.

1.1 Create an S3 Bucket and Upload XML Product Feed

First, set up an S3 bucket to hold the XML files. Upload sample XML feed:

s3:///import/google-merchant-feed.xml

1.2 Configure Glue Database and XML Classifier

In AWS Glue, set up a database and custom XML classifier to parse the XML structure based on the Google Merchant feed format.

Figure 2: Glue XML database

  • Database: Create a Glue database called xml_merchant.
  • XML Classifier: Define the root XML element, typically “item” or “product” depending on your feed row element. This classifier helps Glue identify each product record.

1.3 Create and Run an XML Crawler

Once the classifier is defined, create a crawler to scan the S3 bucket and populate the database with schema information.

  • Crawler Configuration: Set it to scan the S3 bucket and choose the xml_merchant database.
  • Run Crawler: After configuring, start the crawler and check that it correctly identifies the XML structure. The crawler will automatically populate Glue Data Catalog with the detected schema.

Figure 3: Crawled XML schema


Step 2: Setting Up RDS Crawler for Target Database

AWS Glue also needs schema information for the target RDS database. Configure a crawler to detect schema in the RDS instance.

2.1 Create an RDS Connection

Configure a database connection in Glue to connect to your RDS instance.

  • RDS Credentials: Enter the database credentials and VPC security settings to allow Glue to access the RDS instance.

Figure 4: RDS DB connection - 1

Figure 4: RDS DB connection - 2

2.2 Run RDS Crawler

Run the crawler to parse tables in the Glue Catalog database from RDS. This step ensures Glue has the necessary information to map data from the XML feed into RDS.


Step 3: AWS Glue Jobs for Data Transformation

With Glue Jobs, we can create two processes: one for importing product categories and another for the products themselves. These jobs allow for custom transformations and schema mapping.

Importing Products from Feed

The main Glue Job imports and processes the products themselves. Using Glue’s visual editor, you can configure transformations and map XML fields to target RDS columns.

Figure 5: Glue Visual Job

  • [1] Connect with relational tables in RDS DB — allows you to do select queries related to tables of database to generate final import data
  • [2] Change schema — Map XML fields like product ID, title, price, and category to the corresponding RDS fields. Filter any unnecessary data from the XML to optimize processing.

Figure 6: Map data fields

  • [3] SQL Query — query to combine data from the feed with relational tables
  • [4] Drop fields — drop unnecessary fields
  • [5] Target — define the final target database and table

Step 4: Executing the Glue Jobs

Once your Glue Jobs are set up, you can start by running the category import job and then proceed to the product import job. Both jobs are easily monitored through the Glue Console.

Figure 7: Run job

Other options

There are several options for input, transform and output processes

Figure 8-1: input options

Figure 8-2: transform options

Figure 8-3: output options

Data Transformation and Performance Insights

AWS Glue efficiently transforms and imports data from XML to RDS. In testing, the setup imported 50000 products in just two minutes using a minimal setup of AWS Glue with minimal-sized RDS DB, indicating high performance with minimal resource consumption.


Step 5: Validating the Data Import

After the Glue Jobs have run successfully:

  • Data Validation: Check the imported data in RDS to confirm record counts and data integrity.
  • Error Logging: AWS Glue integrates with CloudWatch, making it easy to monitor and resolve any errors in real time.

Benefits of AWS Glue for XML Product Feed Integration

  • Scalability: AWS Glue handles large XML files efficiently, scaling ETL resources automatically.

  • Schema Detection: Glue Crawlers can detect and adapt to XML schema changes, making future imports simpler.
    Automated Job Scheduling: Jobs can be scheduled based on your data refresh needs, providing a robust and automated data pipeline.

  • Easy to use: AWS Glue is easy to use not only for developers, but for people who doesn’t write code — Business Analysts, Data Scientists, Product Managers, etc


Conclusion

AWS Glue simplifies the process of importing and transforming XML product feeds, offering a reliable and scalable solution for integrating Google Merchant Feed into an RDS database. By automating schema detection, data transformation, and job scheduling, AWS Glue streamlines the ETL pipeline, allowing seamless integration with minimal manual intervention.

. . . . . .