This article walks through common machine learning features that you can integrate into applications, data pipelines, business intelligence tools and other projects, depending on your particular use case. Each feature is implemented as a utility script that calls the Application Programming Interface (API) of a managed ML service provided by AWS.
Prerequisites:
- AWS credentials: Configure AWS access keys via the Command Line Interface (CLI). Environment variables can equally be used to give your scripts secure and persistent access to your AWS account.
- AWS Software Development Kit (SDK) for Python: Boto3
pip install boto3
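If you go the environment-variable route, a quick check like the one below can confirm the variables are visible to your script before any AWS call is made. This is a minimal sketch: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_DEFAULT_REGION` are the standard variable names read by the AWS CLI and Boto3, while the helper name `missing_aws_env_vars` is my own and purely illustrative.

```python
import os

# Standard environment variable names read by the AWS CLI and Boto3.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_env_vars(env=None):
    """Return the names of the expected variables that are unset or empty.

    Hypothetical helper: pass a dict to check something other than os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("AWS environment variables are set.")
```

A missing variable is not necessarily an error: if you ran `aws configure`, Boto3 can also pick up credentials from the shared credentials file that the CLI writes.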
Utility scripts are very useful when you want to test and develop a specific feature of a program before a full-stack application exists. Some utility scripts for performing cloud operations, in both Python and Bash, can be found in this GitHub repository which I compiled:
Below are some utility scripts to implement specific machine learning features that you can use in your projects.
A. Language Translation Utility
Use Case: Translate text files or strings into various languages.
AWS Service: Amazon Translate
Implementation:
Automate the translation of text or files into target languages.
Utility Script:
import boto3
translate = boto3.client('translate')
text = "Utility Scripts to Expedite Machine Learning(ML) Adoption Leveraging AWS ML Service APIs."
response = translate.translate_text(
Text=text,
SourceLanguageCode='en',
TargetLanguageCode='fr'
)
print(response['TranslatedText'])
# Output: 'Scripts utilitaires pour accélérer l'adoption de l'apprentissage automatique (ML) en tirant parti des API de service AWS ML.'
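The use case above also mentions files and multiple target languages. One way that could look is sketched below. The function names `translate_to_many` and `translate_file` are hypothetical, and the client is passed in as a parameter so the helpers can be exercised without touching AWS; only the `translate_text` call and its three parameters come from the script above.

```python
def translate_to_many(client, text, target_languages, source_language="en"):
    """Translate `text` into each target language and return {code: translation}.

    `client` is expected to be a Boto3 Translate client,
    for example boto3.client("translate").
    """
    translations = {}
    for code in target_languages:
        response = client.translate_text(
            Text=text,
            SourceLanguageCode=source_language,
            TargetLanguageCode=code,
        )
        translations[code] = response["TranslatedText"]
    return translations


def translate_file(client, path, target_languages, source_language="en"):
    """Read a UTF-8 text file and translate its contents (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    return translate_to_many(client, text, target_languages, source_language)


# Example usage (requires AWS credentials):
#   import boto3
#   results = translate_to_many(boto3.client("translate"), "Hello", ["fr", "es", "de"])
```

Amazon Translate limits how much text a single `translate_text` request can carry, so a large file would need to be split into smaller pieces first; the sketch does not attempt that.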
B. Text/Data Extraction Utility
Use Case: Extract key data from invoices (for example, invoice numbers and totals) and classify them into categories for financial processing.
AWS Service: Amazon Textract is a machine learning service offered by AWS that automates the extraction of data such as forms, tables and text from scanned documents, reducing manual effort and making it easier to derive important information from different sources. I wrote an article that explores this service in detail. Check it out here:
Implementing an Organisational Cloud Resource Tagging Strategy Using Amazon Textract, AWS Lambda and Boto3.
Wonder Agudah for AWS Community Builders ・ Feb 29 '24
Utility Script:
import boto3
import os
import csv
import re
s3 = boto3.client('s3')
textract = boto3.client('textract')
def upload_files_to_s3(folder, bucket_name):
    for root, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.pdf', '.jpg', '.png')):
                filepath = os.path.join(root, file)
                s3.upload_file(filepath, bucket_name, file)
def extract_text(bucket_name, document_name):
    response = textract.analyze_document(
        Document={'S3Object': {'Bucket': bucket_name, 'Name': document_name}},
        FeatureTypes=['FORMS']
    )
    return response
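def get_block_text(block, block_map):
    # Join the text of the WORD blocks referenced by this block's CHILD relationships
    words = []
    for relationship in block.get('Relationships', []):
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                child = block_map[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)

def parse_invoice_data(textract_response):
    # Index blocks by Id so relationships can be resolved to the blocks they point at
    block_map = {block['Id']: block for block in textract_response['Blocks']}
    data = {}
    for block in textract_response['Blocks']:
        if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
            key = get_block_text(block, block_map).strip().rstrip(':').strip()
            if key in ['Invoice Number', 'Total', 'Date']:
                # A KEY block links to its VALUE block(s) through a VALUE relationship
                for relationship in block.get('Relationships', []):
                    if relationship['Type'] == 'VALUE':
                        data[key] = ' '.join(
                            get_block_text(block_map[value_id], block_map)
                            for value_id in relationship['Ids']
                        ).strip()
    return data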
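def save_to_csv(data, csv_file):
    with open(csv_file, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for item in data:
            # Use .get() so an invoice with a missing field writes an empty cell instead of raising KeyError
            writer.writerow([item.get('Invoice Number', ''), item.get('Date', ''), item.get('Total', '')])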
# Main function
def main():
    folder_path = './invoices'
    bucket_name = 'your-invoice-bucket'
    csv_file = 'invoice_data_output.csv'
    # Upload files to S3
    upload_files_to_s3(folder_path, bucket_name)
    # Process each file in the S3 bucket
    processed_data = []
    for obj in s3.list_objects_v2(Bucket=bucket_name).get('Contents', []):
        file_name = obj['Key']
        print(f"Processing file: {file_name}")
        # Extract text using Textract
        textract_response = extract_text(bucket_name, file_name)
        # Parse relevant data
        parsed_data = parse_invoice_data(textract_response)
        if parsed_data:
            processed_data.append(parsed_data)
    # Save extracted data to CSV
    save_to_csv(processed_data, csv_file)
    print(f"Processed data saved to {csv_file}")

if __name__ == "__main__":
    main()
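One thing to keep in mind with the script above: a single `list_objects_v2` call returns at most 1,000 keys, and the bucket may hold objects that are not invoices at all. The sketch below shows one possible way to handle both using the S3 paginator. The helper name `iter_document_keys` and the `DOCUMENT_SUFFIXES` constant are hypothetical, and the client is passed in so the function can be tried without AWS access.

```python
# Assumed to match the file types uploaded by the script above.
DOCUMENT_SUFFIXES = (".pdf", ".jpg", ".png")


def iter_document_keys(s3_client, bucket_name, suffixes=DOCUMENT_SUFFIXES):
    """Yield every object key in the bucket whose name ends with one of `suffixes`.

    `s3_client` is expected to be a Boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    # The paginator keeps requesting pages until the whole bucket is listed.
    for page in paginator.paginate(Bucket=bucket_name):
        # An empty page has no "Contents" entry, hence the default.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(suffixes):
                yield obj["Key"]


# Example usage (requires AWS credentials):
#   import boto3
#   for key in iter_document_keys(boto3.client("s3"), "your-invoice-bucket"):
#       print(key)
```

The suffix check is case-insensitive here, which is a design choice of this sketch rather than something the original script does.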
C. Text Classification and Analysis Utility
Use Case: Analyze text data for sentiment and language detection.
AWS Service: Amazon Comprehend
Utility Script:
import boto3
comprehend = boto3.client('comprehend')
text = "Education is necessary for development!"
response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
print(response['Sentiment'])
# Output: 'POSITIVE'
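The use case also mentions language detection, which the script above does not cover. Here is a minimal sketch of how the two calls might be chained, assuming a Boto3 Comprehend client is passed in. The function name `analyse_text` and the shape of the returned dictionary are my own choices; `detect_dominant_language` and `detect_sentiment` are the Comprehend operations being called.

```python
def analyse_text(client, text):
    """Detect the dominant language of `text`, then its sentiment.

    `client` is expected to be a Boto3 Comprehend client,
    for example boto3.client("comprehend").
    """
    languages = client.detect_dominant_language(Text=text)["Languages"]
    # Pick the candidate language with the highest confidence score.
    language_code = max(languages, key=lambda lang: lang["Score"])["LanguageCode"]
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    return {
        "language": language_code,
        "sentiment": sentiment["Sentiment"],
        "scores": sentiment["SentimentScore"],
    }


# Example usage (requires AWS credentials):
#   import boto3
#   print(analyse_text(boto3.client("comprehend"), "Education is necessary for development!"))
```

Sentiment analysis supports fewer languages than language detection does, so in practice you may want to check the detected code against the languages Comprehend supports for sentiment before making the second call.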
D. Image Classification and Processing Utility
Use Case: Automate image analysis for object detection and facial recognition.
AWS Service: Amazon Rekognition
Utility Script:
import boto3
import os
import csv
s3 = boto3.client('s3')
rekognition = boto3.client('rekognition')
def upload_to_s3(folder, bucket_name):
    for root, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.png', '.jpg', '.jpeg')):
                filepath = os.path.join(root, file)
                s3.upload_file(filepath, bucket_name, file)
def label_images(bucket_name, output_csv):
    with open(output_csv, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Image', 'Labels'])
        for obj in s3.list_objects(Bucket=bucket_name).get('Contents', []):
            response = rekognition.detect_labels(
                Image={'S3Object': {'Bucket': bucket_name, 'Name': obj['Key']}},
                MaxLabels=10
            )
            labels = [label['Name'] for label in response['Labels']]
            writer.writerow([obj['Key'], ', '.join(labels)])
folder_path = './images'
bucket_name = 'your-bucket-name'
output_csv = 'image_labels.csv'
upload_to_s3(folder_path, bucket_name)
label_images(bucket_name, output_csv)
print(f"Labels stored in {output_csv}")
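The script above covers object labels only. For the facial side of the use case, a sketch along the following lines could be a starting point. It calls Rekognition's `detect_faces` operation, which detects and describes faces but does not identify who they belong to. The function names `summarise_faces` and `detect_faces_in_s3_image`, and the fields kept in the summary, are illustrative assumptions.

```python
def summarise_faces(response):
    """Reduce a detect_faces response to one small dict per detected face."""
    summaries = []
    for face in response.get("FaceDetails", []):
        emotions = face.get("Emotions", [])
        # Keep only the emotion Rekognition is most confident about, if any.
        top_emotion = max(emotions, key=lambda e: e["Confidence"])["Type"] if emotions else None
        age = face.get("AgeRange", {})
        summaries.append({
            "confidence": face.get("Confidence"),
            "top_emotion": top_emotion,
            "age_range": (age.get("Low"), age.get("High")),
        })
    return summaries


def detect_faces_in_s3_image(client, bucket_name, key):
    """Call Rekognition on an image stored in S3 and summarise the faces found.

    `client` is expected to be a Boto3 Rekognition client,
    for example boto3.client("rekognition").
    """
    response = client.detect_faces(
        Image={"S3Object": {"Bucket": bucket_name, "Name": key}},
        Attributes=["ALL"],  # request emotions, age range, etc., not just the defaults
    )
    return summarise_faces(response)
```

Matching a face against known people is a separate Rekognition workflow built on face collections, which is outside the scope of this sketch.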
The utility scripts in this article are by no means exhaustive! Consider them a starting point for exploring the managed machine learning services offered by AWS that you can integrate into your applications and projects. You can go through the official documentation for Boto3:
to explore more code snippets that you can customise and tailor to your own use cases for implementing machine learning capabilities with AWS managed ML services.