In-depth Analysis of Speech Synthesis Technology in HarmonyOS Next

SameX · Feb 19 · Dev Community


This article explores the speech synthesis technology in Huawei's HarmonyOS Next system (up to API 12 as of this writing) and summarizes it based on actual development practice. It is mainly intended for technical sharing and communication; there may be mistakes and omissions, and colleagues are welcome to raise opinions and questions so that we can make progress together. This article is original content, and any form of reprint must credit the source and the original author.

I. Principles and Functional Requirements of Speech Synthesis

(1) Basic Principles

In the speech world of HarmonyOS Next, speech synthesis technology is like a magical wizard that transforms cold text into vivid speech. Its core principles mainly include two major parts: text analysis and the speech synthesis model.

During the text analysis stage, the system first preprocesses the input text with steps such as word segmentation, part-of-speech tagging, and prosody analysis. For example, for the sentence "今天天气真好。" ("The weather is really nice today."), it will first split the sentence into words such as "今天", "天气", and "真好", tag the part of speech of each word, and analyze the prosodic structure of the sentence to determine which words need stress, how the intonation rises and falls, and so on. This step provides the basic information for subsequent speech synthesis.
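To make this stage concrete, here is a minimal, purely illustrative TypeScript sketch of dictionary-based word segmentation with a crude prosody mark. The dictionary, the greedy longest-match rule, and the stress heuristic are all invented for demonstration and are far simpler than what a real engine does.

// Toy text-analysis pass: dictionary-based segmentation plus a naive prosody
// flag. Purely illustrative; real engines use statistical or neural models.
const DICTIONARY = new Set(["今天", "天气", "真好"]);

interface AnalyzedToken {
  word: string;
  stressed: boolean; // crude prosody mark: stress the final content word
}

function analyzeText(sentence: string): AnalyzedToken[] {
  const clean = sentence.replace(/[。,.!?]/g, "");
  const tokens: string[] = [];
  let i = 0;
  while (i < clean.length) {
    // Greedy longest-match lookup against the toy dictionary
    let matched = "";
    for (let len = Math.min(4, clean.length - i); len >= 1; len--) {
      const candidate = clean.slice(i, i + len);
      if (DICTIONARY.has(candidate) || len === 1) {
        matched = candidate;
        break;
      }
    }
    tokens.push(matched);
    i += matched.length;
  }
  // Stand-in for prosody analysis: mark the last token as stressed
  return tokens.map((word, idx) => ({ word, stressed: idx === tokens.length - 1 }));
}

console.log(analyzeText("今天天气真好。"));
// [{word:"今天",...}, {word:"天气",...}, {word:"真好", stressed:true}]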

Then comes the speech synthesis model. Common approaches include parametric synthesis and waveform-concatenation synthesis. Parametric synthesis builds an acoustic model from the text-analysis results to generate speech parameters such as the fundamental frequency and formants, and then converts those parameters into a speech waveform through a vocoder. Waveform-concatenation synthesis instead selects appropriate segments from a large library of pre-recorded speech and splices them together to produce the final speech.
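As a rough picture of the waveform-concatenation idea, the hypothetical sketch below looks up a pre-recorded unit waveform per word and splices the samples end to end. Real systems search very large unit libraries and smooth the joins; this toy deliberately omits both.

// Toy waveform concatenation: look up a pre-recorded unit for each token and
// splice the samples together. The library contents are placeholders.
const SAMPLE_RATE = 16000;

// Hypothetical segment library: token -> recorded waveform samples
const segmentLibrary = new Map<string, Float32Array>([
  ["今天", new Float32Array(3200)], // 0.2 s of (silent) placeholder audio
  ["天气", new Float32Array(3200)],
  ["真好", new Float32Array(4800)],
]);

function concatenateUnits(tokens: string[]): Float32Array {
  const units = tokens.map(t => segmentLibrary.get(t) ?? new Float32Array(0));
  const total = units.reduce((n, u) => n + u.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const u of units) {
    out.set(u, offset); // naive splice: no cross-fade at unit boundaries
    offset += u.length;
  }
  return out;
}

const wave = concatenateUnits(["今天", "天气", "真好"]);
console.log(`${wave.length} samples ≈ ${(wave.length / SAMPLE_RATE).toFixed(2)} s`);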

(2) Analysis of Functional Requirements

  1. Requirement for Multilingual Support: As a global operating system, HarmonyOS Next requires speech synthesis to serve many languages. Languages differ enormously in grammar, pronunciation rules, and prosody. For example, Chinese is a tonal language in which each syllable carries a tone, while English is an intonation language that conveys meaning through intonation changes. Speech synthesis therefore needs language models and pronunciation libraries built for each language's characteristics to keep the synthesized speech accurate and natural.
  2. Requirement for Customizable Speech Styles: Users' demands for speech styles are increasingly diverse, and different scenarios call for different styles. A smart assistant may need a friendly, natural style to interact with users, while an audiobook may need an emotional, expressive style to enhance the listening experience. Speech synthesis therefore has to offer a variety of style options and allow customization to user needs; a small sketch of such presets follows this list.
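One simple way an app might expose customizable styles is a preset table that maps a scenario to a bundle of synthesis parameters. The sketch below is only an assumption about how such presets could be organized: the preset names and values are invented, and the setPitch/setSpeed calls mirror the simplified engine interface used later in this article rather than an official Core Speech Kit API.

// Hypothetical style presets: each scenario maps to a bundle of parameters.
interface SpeechStyle {
  pitch: number; // 1.0 = normal pitch
  speed: number; // 1.0 = normal speaking rate
  description: string;
}

const STYLE_PRESETS: Record<string, SpeechStyle> = {
  assistant: { pitch: 1.1, speed: 1.0, description: "friendly, conversational" },
  audiobook: { pitch: 1.0, speed: 0.9, description: "expressive, unhurried" },
  news:      { pitch: 1.0, speed: 1.1, description: "brisk, neutral" },
};

// Apply a preset to any engine exposing the simplified interface
function applyStyle(engine: { setPitch(p: number): void; setSpeed(s: number): void },
                    styleName: string): void {
  const style = STYLE_PRESETS[styleName] ?? STYLE_PRESETS["assistant"];
  engine.setPitch(style.pitch);
  engine.setSpeed(style.speed);
}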

(3) Comparison of Different Speech Synthesis Technologies

  1. Comparison between Parametric Synthesis and Waveform-Concatenation Synthesis: The advantage of parametric synthesis is that the synthesized speech is well controlled in timbre and prosody, and the model is relatively small and light on resources. However, its naturalness is comparatively low, especially for complex speech phenomena such as liaison and assimilation. Waveform-concatenation synthesis produces more natural, fluent speech because it splices real recorded segments, but it requires a large segment library, occupies substantial storage, and is computationally expensive at synthesis time.
  2. Comparison of Speech Synthesis Technologies from Different Vendors: Vendors' technologies also vary. Some perform well for certain languages or scenarios but fall short elsewhere; for example, a vendor may achieve high accuracy and naturalness in English synthesis yet produce inaccurate pronunciation or unnatural prosody in Chinese. When choosing a speech synthesis technology, weigh these trade-offs against the specific application requirements and target user group.

II. Implementation of Speech Synthesis Function in Core Speech Kit

(1) Introduction to Functional Interfaces and Classes

Core Speech Kit provides developers with a rich set of interfaces and classes for integrating speech synthesis into HarmonyOS Next applications. TextToSpeechEngine is one of the core classes: it exposes interfaces for creating a speech synthesis engine, setting speech parameters, and synthesizing speech. For example, an engine instance can be created through the create method, and methods such as setPitch and setSpeed adjust the pitch and speed of the output.

(2) Code Example and Speech Parameter Settings

The following is a simplified code example of using Core Speech Kit for speech synthesis:

import { textToSpeech } from '@kit.CoreSpeechKit';

// Create a speech synthesis engine
let ttsEngine = textToSpeech.TextToSpeechEngine.create();

// Set speech parameters
ttsEngine.setPitch(1.2); // Set the pitch: 1.0 is normal, values above 1.0 raise it, values below 1.0 lower it
ttsEngine.setSpeed(0.8); // Set the speaking speed: 1.0 is normal, values below 1.0 slow it down, values above 1.0 speed it up

// Text to be synthesized
let text = "欢迎使用HarmonyOS Next语音合成技术。";

// Synthesize the speech
ttsEngine.speak(text);

In this example, an engine instance is first created; the pitch is then set to 1.2 times normal and the speaking speed to 0.8 times normal; finally, the specified text is synthesized and played.

(3) Evaluation of the Naturalness and Fluency of Synthesized Speech

In practical use, the speech synthesis function of Core Speech Kit performs well in naturalness and fluency. For common text, the synthesized speech has accurate pronunciation and natural intonation and conveys the text's meaning and tone well; when reading a news article, for instance, pauses and stresses are handled properly and the result sounds fluent. In some special cases, however, such as rare characters, technical terms, or complex sentence structures, pronunciation may be slightly off or the intonation may sound less natural. Overall, the quality of the synthesized speech meets the needs of most everyday application scenarios.

III. Expansion and Optimization of Speech Synthesis Applications

(1) Expansion of Application Scenarios

  1. Smart Assistant Scenario: In a smart assistant, speech synthesis is a key link in human-machine interaction. Users ask questions or issue commands by voice, and the assistant uses speech synthesis to deliver its answers in a natural, friendly voice. For example, when a user asks about the weather, the assistant must not only understand the question accurately but also answer in a clear, natural voice, such as "今天天气晴朗,气温25摄氏度,适合外出活动。" ("It's sunny today, 25°C, a good day to go out."), so the user experiences something close to talking with a real person.
  2. Audiobook Scenario: For audiobook applications, speech synthesis can turn large amounts of text into vivid narration. By tuning the style and expressiveness of the synthesis, it can give readers an immersive experience: when reading a novel, parameters such as timbre, speed, and pitch are adjusted for different characters and plot developments so that listeners can better feel the emotional shifts in the story. A sketch of this idea follows.
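Here is a hedged sketch of the per-character idea, reusing the simplified setPitch/setSpeed/speak interface from the earlier example; the character names and parameter values are invented for illustration.

// Illustrative audiobook flow: tag each line of text with a speaker and
// apply that character's voice parameters before synthesizing it.
interface CharacterVoice { pitch: number; speed: number; }

const characterVoices = new Map<string, CharacterVoice>([
  ["narrator", { pitch: 1.0, speed: 0.95 }],
  ["hero",     { pitch: 0.9, speed: 1.0  }],
  ["child",    { pitch: 1.3, speed: 1.05 }],
]);

interface ScriptLine { speaker: string; text: string; }

function readAloud(engine: { setPitch(p: number): void; setSpeed(s: number): void; speak(t: string): void },
                   script: ScriptLine[]): void {
  for (const line of script) {
    const voice = characterVoices.get(line.speaker) ?? characterVoices.get("narrator")!;
    engine.setPitch(voice.pitch); // switch voice parameters per speaker
    engine.setSpeed(voice.speed);
    engine.speak(line.text);      // in practice, queue lines and await completion
  }
}

// Usage with the simplified engine from the earlier example:
// readAloud(ttsEngine, [{ speaker: "narrator", text: "..." }]);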

(2) Optimization Strategies

  1. Improving Synthesis Quality through Data Augmentation: To improve synthesis quality, data augmentation can be applied to the speech data used to train the model, for example pitch shifting, speed changing, and noise adding, to increase data diversity. The model then learns speech features across more conditions, improving the robustness and naturalness of the synthesized speech. Collecting more speech data of different types and styles for training also helps. A toy speed-perturbation sketch follows this list.
  2. Optimizing the Model Structure to Reduce Resource Usage: To address the large resource footprint of speech synthesis models, the model structure can be optimized, for example by adopting lightweight neural network architectures to cut parameter counts and computational complexity. Model compression techniques such as pruning and quantization further shrink the model and improve its running efficiency, helping it fit the resource limits of HarmonyOS Next devices; a toy quantization sketch also appears below.
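As an illustration of one augmentation operation, here is a toy speed-perturbation function that resamples a waveform by linear interpolation. Production pipelines use higher-quality resamplers, and the rates chosen here are arbitrary.

// Toy speed perturbation: resample a waveform by a rate factor using linear
// interpolation, producing slowed/sped-up training variants.
function changeSpeed(samples: Float32Array, rate: number): Float32Array {
  const outLen = Math.floor(samples.length / rate);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const pos = i * rate;                       // fractional read position
    const lo = Math.floor(pos);
    const hi = Math.min(lo + 1, samples.length - 1);
    const frac = pos - lo;
    out[i] = samples[lo] * (1 - frac) + samples[hi] * frac; // linear interp
  }
  return out;
}

// Generate three augmented variants of one training utterance
const utterance = new Float32Array(16000); // 1 s placeholder clip at 16 kHz
const variants = [0.9, 1.0, 1.1].map(rate => changeSpeed(utterance, rate));
console.log(variants.map(v => v.length)); // [17777, 16000, 14545]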
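And as an illustration of quantization, here is a toy post-training scheme that maps float32 weights to int8 with a single scale factor, roughly a 4x storage saving. Real compression uses per-channel scales, calibration data, and pruning on top, none of which is shown here.

// Toy post-training quantization: one global scale, symmetric int8 range.
function quantizeInt8(weights: Float32Array): { q: Int8Array; scale: number } {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  const scale = maxAbs / 127 || 1; // avoid divide-by-zero for all-zero weights
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.max(-127, Math.min(127, Math.round(weights[i] / scale)));
  }
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): Float32Array {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) out[i] = q[i] * scale;
  return out;
}

const weights = new Float32Array([0.12, -0.5, 0.33, 0.02]);
const { q, scale } = quantizeInt8(weights);
console.log(dequantize(q, scale)); // values close to the original weights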

(3) Development Experience and Precautions

  1. Pay Attention to Text Preprocessing: When using speech synthesis, take care to preprocess the text. Ensure the format is correct and the encoding is consistent to avoid garbled or unrecognizable characters, and handle special symbols and abbreviations appropriately so the synthesis stays accurate; for example, convert the "&" symbol to "和" ("and") and "etc." to "等等" ("and so on"). A minimal normalization sketch follows this list.
  2. Reasonably Set Speech Parameters: Set speech parameters according to the application scenario and user needs, but avoid over-adjusting them, which can make the speech sound unnatural; speaking speeds that are too fast or too slow both hurt comprehension and experience. Also adjust the parameters over time based on device performance and user feedback to achieve the best synthesis effect.
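As an example of such preprocessing, here is a minimal normalization sketch with an illustrative replacement table; real products maintain far larger rule sets.

// Minimal text-normalization pass before synthesis: unify symbols and expand
// abbreviations so the engine does not stumble over them.
const NORMALIZATION_RULES: Array<[RegExp, string]> = [
  [/&/g, "和"],            // "&" read as "和" ("and") in Chinese text
  [/\betc\.\s*/g, "等等"],  // "etc." read as "等等" ("and so on")
];

function normalizeForTts(text: string): string {
  let out = text.normalize("NFKC"); // unify full-width/half-width variants
  for (const [pattern, replacement] of NORMALIZATION_RULES) {
    out = out.replace(pattern, replacement);
  }
  return out.trim();
}

console.log(normalizeForTts("HarmonyOS & OpenHarmony etc."));
// "HarmonyOS 和 OpenHarmony 等等"

It is hoped that this article gives everyone a deeper understanding of speech synthesis in HarmonyOS Next and helps you apply the technology in real development to bring users a higher-quality speech experience. If you run into other problems in practice, you are welcome to discuss them together!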