Practical Application of Speech Synthesis and Model Optimization in the Intelligent Voice Assistant of HarmonyOS Next
This article explores the practical application of speech synthesis and model optimization technologies in building an intelligent voice assistant on the Huawei HarmonyOS Next system (currently up to API 12), and summarizes lessons from actual development experience. It is mainly intended for technical sharing and exchange; there may be mistakes and omissions, and colleagues are welcome to raise opinions and questions so that we can make progress together. This article is original content, and any form of reprint must indicate the source and the original author.
I. Functional Requirements and Architecture Planning of the Voice Assistant
(1) Sorting Out the Functional Requirements
- Requirements for Speech Command Recognition: The intelligent voice assistant needs to accurately recognize users' speech commands. Regardless of how the user's accent, speaking speed, and intonation change, it should be able to convert the speech into correct text commands. This requires the speech recognition model to have high robustness and accuracy. For example, users may ask about the weather in different ways, such as "What's the weather like today?" or "Help me check the weather today." The voice assistant should be able to accurately understand and recognize these commands.
- Requirements for Speech Synthesis Responses: According to the user's commands, the voice assistant needs to respond with clear, natural, and emotional speech. The quality of speech synthesis directly affects the user's auditory experience. Therefore, it is necessary to provide a variety of speech styles and timbres for users to choose from to meet the needs of different scenarios and user preferences. For example, when broadcasting news, use a formal and stable speech style; when telling stories, adopt a vivid and emotional speech style.
- Requirements for Personalized Services: In order to provide a better user experience, the voice assistant should have the ability to provide personalized services. By learning the user's usage habits, preferences, and historical records, it can provide customized answers and recommendations for users. For example, actively push relevant information according to the content that the user often queries; automatically select the appropriate speech to answer according to the user's preference for the speech style.
(2) Architecture Design Based on HarmonyOS Next
- Speech Input Processing Module: It receives the user's speech input and preprocesses the speech signal (noise reduction, audio format conversion, etc.) to improve the quality of the speech signal and provide better input data for subsequent speech recognition. For example, in a noisy environment, background noise is removed through a noise reduction algorithm, making it easier for the speech recognition model to recognize the user's speech content.
- Natural Language Understanding Module: It conducts semantic understanding and analysis of the text output by the speech input processing module, extracts key information, and determines the user's intention. This module usually uses deep learning models for natural language processing, such as recurrent neural networks (RNNs) or Transformers. For example, when the user asks "Play Jay Chou's songs," the natural language understanding module can work out that the user's intention is to play the music of a specific singer and extract the key information "Jay Chou."
- Model Inference Module: According to the user's intention determined by the natural language understanding module, it invokes the corresponding service or functional module and obtains the result through model inference. For example, if the user's intention is to query the weather, the model inference module calls the weather query service, obtains the weather information, and prepares it for speech synthesis output.
- Speech Synthesis Output Module: The Core Speech Kit is used to convert the result obtained by model inference into speech output. According to the user's preferences and the scenario, an appropriate speech style and timbre are selected for synthesis to make the speech response more natural and vivid (a sketch tying these modules together follows this list).
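To make the division of responsibilities above more concrete, the following is a minimal ArkTS/TypeScript sketch of how the four modules can be chained together. All interface and function names here are illustrative assumptions for this article, not HarmonyOS APIs:
// Illustrative module contracts for the voice assistant pipeline (names are hypothetical).
interface SpeechInputProcessor {
  recognize(audio: ArrayBuffer): Promise<string>;   // denoise, resample and recognize speech as text
}
interface NaturalLanguageUnderstanding {
  parse(text: string): Promise<{ intent: string; slots: Record<string, string> }>;  // e.g. intent "play_music", slot artist = "Jay Chou"
}
interface InferenceService {
  execute(intent: string, slots: Record<string, string>): Promise<string>;  // fulfil the intent (weather query, music playback, ...)
}
interface SpeechSynthesizer {
  speak(text: string): Promise<void>;               // convert the textual result into speech via the Core Speech Kit
}
// The pipeline simply chains the four modules in order.
async function handleVoiceCommand(audio: ArrayBuffer, asr: SpeechInputProcessor,
  nlu: NaturalLanguageUnderstanding, inference: InferenceService, tts: SpeechSynthesizer): Promise<void> {
  const text = await asr.recognize(audio);
  const { intent, slots } = await nlu.parse(text);
  const answer = await inference.execute(intent, slots);
  await tts.speak(answer);
}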
(3) Technical Integration to Improve Overall Performance
- Integration of Speech Synthesis Technology (Core Speech Kit): In the speech synthesis output module, the Core Speech Kit is integrated to implement the speech synthesis function. Through its interfaces, speech parameters such as speaking speed, intonation, and volume are set to meet different speech style requirements. For example, when broadcasting an emergency notice, the speaking speed and volume can be increased; when telling a story, the intonation can be adjusted appropriately to increase emotional expression.
- Integration of Model Optimization Technology (such as Model Quantization): For the natural language processing models and similar components, model quantization is adopted to reduce the model size and computational load. After training is completed, a quantization tool converts the parameters in the model from high-precision data types (such as 32-bit floating-point numbers) to low-precision data types (such as 8-bit integers), as illustrated in the sketch after this list. This not only reduces the storage requirements of the model but also improves its operating efficiency on the device, enabling the voice assistant to run inference more quickly and shorten the response time.
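To make this high-to-low precision mapping concrete, here is a minimal, framework-independent sketch of affine int8 quantization in ArkTS/TypeScript. It only illustrates the arithmetic; the function name is hypothetical, and real toolchains (such as the TensorFlow Lite converter shown later in this article) perform this step internally:
// Illustrative only: map float32 values to int8 via q = round(x / scale) + zeroPoint,
// so that x can be recovered approximately as scale * (q - zeroPoint).
function quantizeToInt8(weights: Float32Array): { q: Int8Array; scale: number; zeroPoint: number } {
  let min = 0;
  let max = 0;
  for (let i = 0; i < weights.length; i++) {   // include 0 so that zero stays exactly representable
    if (weights[i] < min) { min = weights[i]; }
    if (weights[i] > max) { max = weights[i]; }
  }
  const scale = (max - min) / 255 || 1;        // 256 integer levels in [-128, 127]
  const zeroPoint = -128 - Math.round(min / scale);
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.max(-128, Math.min(127, Math.round(weights[i] / scale) + zeroPoint));
  }
  return { q, scale, zeroPoint };
}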
II. Development of Key Functions and Technological Innovation
(1) Implementation and Customization of the Speech Synthesis Function
- Implementation and Customization Example Using the Core Speech Kit: The following is a simple code example showing how to use the Core Speech Kit to implement speech synthesis and customize the speech style (simplified version):
import { textToSpeech } from '@kit.CoreSpeechKit';
import { BusinessError } from '@kit.BasicServicesKit';
// Engine creation parameters (field names follow the Core Speech Kit samples for API 12; verify against your SDK version)
let initParams: textToSpeech.CreateEngineParams = { language: 'zh-CN', person: 0, online: 1 };
// Create the speech synthesis engine
textToSpeech.createEngine(initParams, (err: BusinessError, ttsEngine: textToSpeech.TextToSpeechEngine) => {
  if (err) { console.error(`Failed to create the TTS engine: ${err.message}`); return; }
  // The text to be synthesized
  let text = "Welcome to use the intelligent voice assistant. What can I help you with today?";
  // Speech parameters are passed with the speak request:
  // pitch 1.2 makes the speech more vivid, speed 0.9 slows it slightly, volume 0.8 makes it softer
  let speakParams: textToSpeech.SpeakParams = {
    requestId: 'tts-demo-0001',   // must be unique for each request
    extraParams: { "pitch": 1.2, "speed": 0.9, "volume": 0.8 }
  };
  // Synthesize and play the speech
  ttsEngine.speak(text, speakParams);
});
In this example, a speech synthesis engine is first created with createEngine. The pitch, speaking speed, and volume are then passed through the extraParams of the speak request (raising the pitch to 1.2 for livelier speech, slowing the speed to 0.9, and lowering the volume to 0.8 for a softer sound), and the specified text content is synthesized. This realizes a simple speech synthesis function and a preliminary customization of the speech style.
(2) Demonstration of the Model Optimization Process
- Model Quantization Optimization Process and Code Snippet: Suppose we are using a natural language processing model trained with the TensorFlow framework. The following is a simplified example of the model quantization process:
import tensorflow as tf
from tensorflow.python.tools import optimize_for_inference_lib

# Load the original model (assumed here to be an already-frozen GraphDef; a separate
# freeze step with a checkpoint is only required if the graph still contains variables)
model_path = 'original_model.pb'
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile(model_path, 'rb') as fid:
    graph_def.ParseFromString(fid.read())

# Optimize the graph for inference (strip training-only nodes, fold constants)
optimized_graph_def = optimize_for_inference_lib.optimize_for_inference(
    input_graph_def=graph_def,
    input_node_names=['input'],
    output_node_names=['output'],
    placeholder_type_enum=tf.float32.as_datatype_enum
)

# Import the optimized graph and define the input and output nodes
graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(optimized_graph_def, name='')
input_tensor = graph.get_tensor_by_name('input:0')
output_tensor = graph.get_tensor_by_name('output:0')

# Prepare the calibration dataset (assuming get_calibration_data() is provided elsewhere);
# it is fed to the converter as the representative dataset that determines quantization ranges
calibration_data = get_calibration_data()

def representative_dataset_gen():
    for sample in calibration_data:
        yield [sample]

# Perform full-integer model quantization
with tf.compat.v1.Session(graph=graph) as sess:
    converter = tf.compat.v1.lite.TFLiteConverter.from_session(sess, [input_tensor], [output_tensor])
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset_gen
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    tflite_model = converter.convert()

# Save the quantized model
with open('quantized_model.tflite', 'wb') as f:
    f.write(tflite_model)
In this example, the original TensorFlow model is first loaded and cleaned up for inference with optimize_for_inference, the input and output nodes are identified, and a calibration dataset is prepared. The TFLiteConverter then performs the quantization, using the calibration data as its representative dataset to determine the quantization ranges, and the quantized model is saved in the .tflite format for deployment on HarmonyOS Next devices, reducing both the model size and the computational load.
(3) Introduction of Distributed Computing Capability
- Distributed Architecture Design and Implementation Details: In order to improve the response speed and processing capability of the voice assistant, distributed computing capability is introduced. Modules such as speech input processing, natural language understanding, model inference, and speech synthesis output are distributed to different device nodes to work together. For example, in a HarmonyOS Next ecosystem that includes multiple smart devices (such as smartphones, smart speakers, and smart watches), the speech input processing and speech synthesis output modules can be deployed on devices close to the user (such as smartphones or smart watches) to reduce the transmission latency of audio data; the natural language understanding and model inference modules are deployed on devices with strong computing capabilities (such as smart speakers or cloud servers) to improve processing efficiency.
In the implementation process, the distributed communication mechanism of HarmonyOS Next, such as the distributed soft bus technology, is used to achieve data transmission and task scheduling between devices. For example, when the user issues a speech command on a smartphone, the speech input processing module preprocesses the speech, and then transmits the processed text data to the natural language understanding module on the smart speaker for semantic analysis through the distributed soft bus. After the natural language understanding module analyzes the user's intention, it sends the task request to the cloud server or other devices with strong computing capabilities for model inference through the distributed soft bus. Finally, the inference result is transmitted back to the smartphone or smart speaker through the distributed soft bus, and the speech synthesis output module converts the result into speech and outputs it to the user.
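As an illustration of this flow, the sketch below abstracts the cross-device transport behind a hypothetical RemoteChannel interface. The interface, task names, and device roles are assumptions made for this article; in a real application the calls would be carried by HarmonyOS distributed capabilities such as the distributed soft bus, whose concrete APIs are not shown here:
// Hypothetical transport abstraction: send a named task to a peer device and await its result.
interface RemoteChannel {
  invoke<T>(deviceId: string, task: string, payload: object): Promise<T>;
}
interface NluResult { intent: string; slots: Record<string, string>; }
async function handleCommandDistributed(recognizedText: string, channel: RemoteChannel,
  computeDeviceId: string, tts: { speak(text: string): void }): Promise<void> {
  // 1. Natural language understanding runs on the more capable device (e.g. a smart speaker).
  const nluResult = await channel.invoke<NluResult>(computeDeviceId, 'nlu.parse', { text: recognizedText });
  // 2. Model inference also runs remotely; only the small text result travels back over the soft bus.
  const answer = await channel.invoke<string>(computeDeviceId, 'inference.execute', nluResult);
  // 3. Speech synthesis happens locally, close to the user, to minimize audio latency.
  tts.speak(answer);
}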
III. Performance Testing and User Experience Improvement
(1) Performance Testing Indicators and Data Comparison
- Evaluation of Speech Synthesis Naturalness: The naturalness of speech synthesis is evaluated by combining subjective and objective evaluation. For subjective evaluation, a certain number of users are invited to score the speech synthesis output in terms of fluency, naturalness of intonation, and emotional expression, and the average is taken as the subjective score. For objective evaluation, speech quality indicators such as mel-cepstral distortion (MCD) or PESQ-style estimates of the mean opinion score (MOS) can be used. For example, before optimization the subjective score of speech synthesis may be 70 points (out of 100); after optimization, by adjusting speech parameters and improving the synthesis algorithm, the subjective score rises to 85 points, and the objective indicators likewise show a clear improvement in speech quality.
- Model Inference Latency Testing: A high-precision timer is used to measure the interval from feeding in the input text to receiving the model output, which serves as the model inference latency (a measurement sketch follows this list). Before optimization, for a natural language processing task of medium complexity, the model inference latency may be around 500 milliseconds. After model quantization and distributed computing optimization, the latency drops below 200 milliseconds, greatly improving the response speed of the system.
- Measurement of the Overall System Response Time: The time spent in the whole process from when the user issues a speech command to when they hear the speech response is regarded as the overall system response time. Tests are carried out under different network environments and device load conditions, and the response times before and after optimization are compared. For example, before optimization, the average overall system response time is 2 seconds when the network condition is good, and it may reach more than 5 seconds when the network is congested; after optimization, the response time is shortened to less than 1 second when the network condition is good, and it can be controlled within about 3 seconds when the network is congested, significantly improving the user experience.
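The sketch below shows one simple way to collect the latency figures above; measureMs is a hypothetical helper, and the functions it wraps stand in for the real pipeline entry points:
// Measure how long an asynchronous stage takes, in milliseconds.
// Date.now() has millisecond resolution; substitute a higher-resolution clock where one is available.
async function measureMs<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  const result = await fn();
  console.info(`${label}: ${Date.now() - start} ms`);
  return result;
}
// Usage (runInference and handleVoiceCommand are placeholders for the real pipeline stages):
// const answer = await measureMs('model inference latency', () => runInference(inputText));
// await measureMs('overall system response time', () => handleVoiceCommand(audioBuffer));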
(2) User Experience Optimization Measures
- Optimization of the Speech Synthesis Caching Strategy: In order to reduce the latency of speech synthesis, a caching strategy is adopted (a sketch of such a cache follows this list). For some commonly used answer texts, such as greetings and answers to common questions, the speech synthesis results are cached. When the user requests the same content again, the speech data is retrieved directly from the cache without re-running speech synthesis, thereby improving the response speed. At the same time, according to the usage frequency and freshness of the cached entries, the cache space is managed dynamically, and infrequently used or expired cache data is cleared in a timely manner.
- Adjusting Model Parameters According to User Feedback: Collect user feedback, such as evaluations of the accuracy of speech answers and the quality of speech synthesis. Based on this feedback, analyze the likely problems of the model and adjust its parameters accordingly. For example, if users report that the answers to certain types of questions are inaccurate, the model may have seen too few samples of those types during training; adding relevant samples for retraining or adjusting the model's weight parameters can improve its accuracy.
- Improving the Speech Interaction Process: Optimize the interaction process of the voice assistant to make it more natural and convenient. For example, after the user asks a question, if the voice assistant needs to further ask the user for more information, design more friendly and clear prompt messages to guide the user to naturally provide the required information. At the same time, optimize the recognition and processing logic of speech commands, reduce unnecessary confirmation steps, and improve the interaction efficiency.
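The caching strategy described in the first item above can be as simple as the sketch below; the class name, field names, and eviction thresholds are illustrative and would need to be tuned on the real device:
// Hypothetical cache for synthesized speech, keyed by the answer text.
interface CachedAudio { data: ArrayBuffer; lastUsed: number; }
class TtsCache {
  private entries = new Map<string, CachedAudio>();
  private maxEntries: number;
  private maxAgeMs: number;
  constructor(maxEntries: number = 50, maxAgeMs: number = 24 * 3600 * 1000) {
    this.maxEntries = maxEntries;
    this.maxAgeMs = maxAgeMs;
  }
  get(text: string): ArrayBuffer | undefined {
    const entry = this.entries.get(text);
    if (!entry) { return undefined; }
    if (Date.now() - entry.lastUsed > this.maxAgeMs) {  // expired entry: drop it and resynthesize
      this.entries.delete(text);
      return undefined;
    }
    entry.lastUsed = Date.now();
    return entry.data;
  }
  put(text: string, data: ArrayBuffer): void {
    if (this.entries.size >= this.maxEntries) {
      // Evict the least recently used entry to bound memory usage.
      let oldestKey = '';
      let oldestTime = Number.MAX_VALUE;
      this.entries.forEach((value, key) => {
        if (value.lastUsed < oldestTime) { oldestTime = value.lastUsed; oldestKey = key; }
      });
      this.entries.delete(oldestKey);
    }
    this.entries.set(text, { data: data, lastUsed: Date.now() });
  }
}
The speech synthesis output module would first check such a cache and fall back to the Core Speech Kit only on a miss.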
(3) User Test Feedback and Experience Sharing
- Display of Actual User Test Feedback: During the actual user test, users of different ages and occupations were invited to try out the optimized voice assistant. Users reported that, in terms of speech synthesis quality, the speech is more natural and vivid and sounds more comfortable; in terms of response speed, they can clearly feel that the system answers questions faster and the interaction is more fluent. For example, an office worker who often uses the voice assistant to query information said: "In the past, when querying weather information, the voice assistant always answered a bit slowly. Now it almost answers instantly, and the speech sounds very comfortable. It feels like chatting with a real person."
- Summary of Development Experience and Precautions
- Experience Summary: In the development process, paying attention to user needs and experience is the key. By continuously collecting user feedback and carrying out targeted optimization, the performance and user satisfaction of the voice assistant can be effectively improved. At the same time, rationally using the technical features of HarmonyOS Next, such as distributed computing, speech synthesis, and model optimization technologies, can create an efficient and intelligent voice assistant.
- Precautions: In the model quantization process, pay attention to the selection of the calibration dataset and the setting of quantization parameters to avoid a decline in model performance caused by quantization. In the design of the distributed computing architecture, fully consider the communication latency and reliability between devices to ensure stable and efficient data transmission. In addition, when customizing speech synthesis, carefully design the speech parameters according to different application scenarios and user groups to provide the best speech experience. It is hoped that this article can provide some useful references for developers in the field of intelligent voice assistants and help promote the development of intelligent voice technology. If you encounter other problems in practice, you are welcome to communicate and discuss them together!