Introduction

As technology advances, accessibility is an important part of designing inclusive digital experiences. Many users with visual impairments, motor limitations, or other accessibility needs struggle to navigate new interfaces. Voice interaction can bridge the gap by allowing hands-free navigation and audio feedback. As a personal project, I created an assistive menu using Amazon Transcribe and Polly, with React as the frontend, to see how voice technology can be used to improve accessibility.

What Are Amazon Transcribe & Polly?

Amazon Transcribe – A speech-to-text service that converts spoken language into written text in real-time or batch mode, enabling voice-based interactions in applications.

Amazon Polly – A text-to-speech service that transforms text into natural-sounding speech, providing audio feedback to improve accessibility for users who benefit from auditory prompts.

Why Amazon Transcribe and Polly?

I selected Amazon Transcribe and Amazon Polly as the core elements of my assistive menu. Amazon Transcribe converts speech to text, so users can navigate the interface hands-free with voice commands. Amazon Polly turns text into natural-sounding speech, which improves the experience for users who are visually impaired or who rely on auditory cues. Together, these services let users move through the menu without depending entirely on conventional input methods.

Implementation Overview

I integrated these services by following these key steps.

1. User Speech Capture:

Used the browser's MediaRecorder API (via getUserMedia) to record user input.
Uploaded the audio to S3 and sent it to Amazon Transcribe for transcription.

2. Speech-to-Text Processing:

Transcribe returned a text version of the user's speech.
Parsed the text for relevant commands (e.g., "open menu").

3. Generating Speech Feedback:

Used Amazon Polly to convert system responses into speech.
Played the generated audio to provide verbal feedback.

4. UI Integration:

Built a simple UI that can be updated manually or by voice command.
Provided both text-based and audio-based feedback.
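
The command-parsing step can be sketched as a small lookup over the transcript. This is a minimal illustration, not the project's actual parser: the `COMMANDS` table, the action names, and the `normalize` and `parseCommand` functions are hypothetical, and only the "open menu" phrase comes from the steps above.

```javascript
// Hypothetical phrase-to-action table; only "open menu" appears in the article.
const COMMANDS = [
  { phrase: "open menu", action: "OPEN_MENU" },
  { phrase: "close menu", action: "CLOSE_MENU" },
  { phrase: "read aloud", action: "READ_ALOUD" },
];

// Normalize a transcript: lowercase, strip punctuation, collapse whitespace.
function normalize(transcript) {
  return transcript
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Return the first action whose phrase occurs in the transcript, or null.
function parseCommand(transcript) {
  const text = normalize(transcript);
  const match = COMMANDS.find(({ phrase }) => text.includes(phrase));
  return match ? match.action : null;
}
```

Normalizing first matters because transcripts usually arrive with capitalization and punctuation, so "Open menu." and "open menu" should resolve to the same action.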

Code Implementation

Transcribing Speech:

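// AWS SDK for JavaScript v3 command classes. transcribeClient and bucketName
// are assumed to be configured elsewhere in the project.
import { StartTranscriptionJobCommand, GetTranscriptionJobCommand } from "@aws-sdk/client-transcribe";

export async function transcribeSpeech(setToastMessage, setShowToast) {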
  try {
    setToastMessage("🎤 Listening...");
    setShowToast(true);
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const mediaRecorder = new MediaRecorder(stream);
    const audioChunks = [];

    mediaRecorder.ondataavailable = (event) => {
      audioChunks.push(event.data);
    };

    return new Promise((resolve, reject) => {
      mediaRecorder.onstop = async () => {
        try {
          setToastMessage("📤 Uploading audio...");
          const audioBlob = new Blob(audioChunks, { type: "audio/webm" });
          const s3Uri = await uploadAudioToS3(audioBlob);
          setToastMessage("⏳ Transcription in progress...");

          const transcribeParams = {
            TranscriptionJobName: `transcription-${Date.now()}`,
            LanguageCode: "en-US",
            MediaFormat: "webm",
            Media: { MediaFileUri: s3Uri },
            OutputBucketName: bucketName,
          };

          await transcribeClient.send(new StartTranscriptionJobCommand(transcribeParams));
          let jobData;
          let transcriptionStatus;
          // Poll every 3 seconds. A new job can report QUEUED before IN_PROGRESS,
          // so keep waiting on both states.
          do {
            await new Promise((res) => setTimeout(res, 3000));
            jobData = await transcribeClient.send(new GetTranscriptionJobCommand({
              TranscriptionJobName: transcribeParams.TranscriptionJobName,
            }));
            transcriptionStatus = jobData.TranscriptionJob.TranscriptionJobStatus;
          } while (transcriptionStatus === "IN_PROGRESS" || transcriptionStatus === "QUEUED");

          if (transcriptionStatus !== "COMPLETED") {
            reject("Transcription failed.");
            return;
          }

          setToastMessage("✅ Transcription complete!");
          const resultUrl = new URL(jobData.TranscriptionJob.Transcript.TranscriptFileUri);
          const response = await fetch(resultUrl.href);
          const json = await response.json();
          resolve(json.results.transcripts[0].transcript);
        } catch (error) {
          setToastMessage("❌ Error during transcription.");
          reject(error.message);
        }
      };
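      mediaRecorder.start();
      // Record for a fixed 3 seconds, then stop and release the microphone.
      setTimeout(() => {
        mediaRecorder.stop();
        stream.getTracks().forEach((track) => track.stop());
      }, 3000);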
    });
  } catch (error) {
    throw new Error("Failed to record audio: " + error.message);
  }
}
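
The polling loop in transcribeSpeech has no upper bound, so a stuck job would keep the UI waiting indefinitely. One way to cap it is a small generic helper like the sketch below. `pollUntilDone`, its options, and the default values are hypothetical and not part of the original project; `getStatus` stands for any async function that resolves to a job status string.

```javascript
// Hypothetical helper: poll `getStatus` until it reports a terminal state
// or the attempt budget runs out.
async function pollUntilDone(getStatus, { intervalMs = 3000, maxAttempts = 20 } = {}) {
  // Transcription jobs are still running while in either of these states.
  const pending = new Set(["QUEUED", "IN_PROGRESS"]);
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await getStatus();
    if (!pending.has(status)) {
      return status; // terminal state, e.g. COMPLETED or FAILED
    }
    if (attempt < maxAttempts) {
      // Wait before the next check.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Job still pending after ${maxAttempts} attempts`);
}
```

In transcribeSpeech, `getStatus` could wrap the GetTranscriptionJobCommand call and return `TranscriptionJob.TranscriptionJobStatus`.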

Uploading Audio to S3

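// s3Client and bucketName are assumed to be configured elsewhere in the project.
import { PutObjectCommand } from "@aws-sdk/client-s3";

export async function uploadAudioToS3(audioBlob) {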
  const fileKey = `recordings/${Date.now()}.webm`;
  try {
    const arrayBuffer = await audioBlob.arrayBuffer();
    await s3Client.send(
      new PutObjectCommand({
        Bucket: bucketName,
        Key: fileKey,
        // Uint8Array is a documented Body type in SDK v3; a raw ArrayBuffer is not.
        Body: new Uint8Array(arrayBuffer),
        ContentType: "audio/webm",
      })
    );
    return `s3://${bucketName}/${fileKey}`;
  } catch (error) {
    console.error("Error uploading audio to S3:", error);
    throw error;
  }
}

Synthesizing Speech with Polly

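// pollyClient is assumed to be configured elsewhere in the project.
import { SynthesizeSpeechCommand } from "@aws-sdk/client-polly";

// Returns an object URL pointing at the synthesized MP3 audio.
export async function synthesizeSpeech(text) {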
  const params = {
    OutputFormat: "mp3",
    Text: text,
    VoiceId: "Joanna",
  };

  try {
    const command = new SynthesizeSpeechCommand(params);
    const response = await pollyClient.send(command);
    if (!response.AudioStream) {
      throw new Error("Polly did not return an audio stream.");
    }
    const audioStream = await response.AudioStream.transformToByteArray();
    const audioBlob = new Blob([audioStream], { type: "audio/mpeg" });
    return URL.createObjectURL(audioBlob);
  } catch (error) {
    console.error("Error synthesizing speech:", error);
    throw error;
  }
}
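
synthesizeSpeech returns an object URL, which still has to be played and eventually released. The sketch below shows one way to do that. `speak` and its `createAudio` and `revoke` parameters are hypothetical and not from the original project; `synthesize` is assumed to behave like synthesizeSpeech above (text in, object URL out).

```javascript
// Hypothetical helper: play one synthesized response and free its object URL
// once playback ends or fails. The default factories use browser globals;
// they are parameters so the function can also be exercised outside a browser.
async function speak(
  text,
  synthesize,
  createAudio = (url) => new Audio(url),
  revoke = (url) => URL.revokeObjectURL(url)
) {
  const url = await synthesize(text);
  const audio = createAudio(url);
  const release = () => revoke(url);
  audio.onended = release;
  audio.onerror = release;
  try {
    await audio.play(); // play() returns a promise in modern browsers
  } catch (error) {
    release(); // playback never started, so free the URL right away
    throw error;
  }
  return audio;
}
```

In the menu, a call such as `speak("Menu opened", synthesizeSpeech)` would cover the common case. Browsers may block playback that is not triggered by a user gesture, which is why the `play()` rejection is handled.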

To use Amazon Transcribe, Polly, and S3, you need an IAM user with the correct permissions. This keeps access to AWS services within your project limited to what the menu actually needs.

Essential Policies to Attach:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "transcribe:StartTranscriptionJob",
                "transcribe:GetTranscriptionJob",
                "transcribe:ListTranscriptionJobs"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR_BUCKET_NAME",
                "arn:aws:s3:::YOUR_BUCKET_NAME/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "polly:SynthesizeSpeech"
            ],
            "Resource": "*"
        }
    ]
}

Challenges and Limitations

Transcribe Processing Delay: Although Transcribe was accurate, it took a while to process speech, which delayed the activation of voice commands. The batch flow used here (upload to S3, start a job, then poll every 3 seconds) adds latency on top of the recording time, so the system was not as responsive as I intended.
Multilingual and Accent Challenges in Polly: I noticed that some voices may not accurately capture regional pronunciations or speech patterns. This can lead to a less natural listening experience, especially when switching between languages or using localized terms.
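
One way to soften the accent issue is to choose the Polly voice from the user's language instead of hard-coding "Joanna". The sketch below is an illustration only: `VOICE_BY_LANGUAGE`, `pickVoice`, and `buildPollyParams` are hypothetical names, and the voice IDs should be checked against Polly's current voice list before use.

```javascript
// Hypothetical mapping from a language code to a Polly VoiceId.
const VOICE_BY_LANGUAGE = {
  "en-US": "Joanna",
  "en-GB": "Amy",
  "es-US": "Lupe",
  "fr-FR": "Lea",
  "de-DE": "Vicki",
};

// Fall back to the voice used in synthesizeSpeech when the language is unknown.
const DEFAULT_VOICE = "Joanna";

function pickVoice(languageCode) {
  return VOICE_BY_LANGUAGE[languageCode] ?? DEFAULT_VOICE;
}

// Build SynthesizeSpeech parameters with a language-appropriate voice.
function buildPollyParams(text, languageCode) {
  return { OutputFormat: "mp3", Text: text, VoiceId: pickVoice(languageCode) };
}
```

The returned object has the same shape as the `params` passed to SynthesizeSpeechCommand in synthesizeSpeech, so it could replace the hard-coded values there.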

Check it out on GitHub

You can find my full source code for this assistive menu project on GitHub: Assistive Menu Project

Menu Preview

Enabling Read Aloud

Activating Voice Command

Accessibility Help

Conclusion

Integrating Amazon Transcribe and Polly into a project was an insightful learning experience. It reinforced the importance of accessibility and sparked several ideas for voice-driven applications. If you're interested in adding similar features to your projects, AWS provides a powerful set of tools to get started.

Enhancing Accessibility: Integrating Amazon Transcribe & Polly into an Assistive Menu