Making your own assistant with Qt/QML and Google Speech-to-Text

In previous articles we explored Qt/QML in different ways, so in this one we are going to continue with more cloud integration. We are going to use the Google Cloud Speech-to-Text API to recognize basic voice commands, which we could run at home on an embedded device like a Raspberry Pi with a nice graphical interface as part of a DIY automation system…or you can use it simply as a way to learn more about different parts of Qt/QML.

Setup

We start as usual: you need Qt/QML installed, and for this article you will also need the Google Cloud Speech-to-Text API set up so we can make the API calls used to transcribe speech to text.

In this article we are going to use an API key, purely for simplicity, so make sure you understand what that implies and follow the security recommendations described in the official documentation if you plan to test the code!

Never distribute your app with an API key built in, and if you are using it for your own home, restrict the key by IP address.

Speech-to-Text API

There are a number of ways to communicate with Google Cloud APIs, and in particular with the Speech API: Google client libraries, the HTTP REST API, or the gRPC API.

Given that HTTP is easily supported in Qt without adding extra libraries, that will be our choice, as it makes cross-platform support easy.

Note, however, that the REST interface doesn't support streaming recognition, so if that is a requirement you should go with another interface.

REST API method

The Google Cloud Speech-to-Text REST interface is nicely documented, so one just has to decide which methods to call. As we are going to make one call at a time, we will use the recognize method, so our call will look like:

POST https://speech.googleapis.com/v1/speech:recognize
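
When authenticating with an API key, the key is typically appended as a query parameter, as described in Google's API key documentation (YOUR_API_KEY being a placeholder):

POST https://speech.googleapis.com/v1/speech:recognize?key=YOUR_API_KEY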

This is marked as a synchronous method in the API, which means the audio transcription happens during the usual HTTP request-response flow, so the response immediately contains the recognized audio content as text. This fits our use case: we will only have short messages of up to a few seconds, we will wait for them, and we want results as soon as possible.

In the request body we have to provide a JSON document containing two objects:

{
  "config": {
    object (RecognitionConfig)
  },
  "audio": {
    object (RecognitionAudio)
  }
}

config will hold a set of configuration values for our audio recording, while audio will be the audio recording itself in an appropriate format. Simple, right?

Audio formats

One quite obvious thing is that we need to make a recording and upload it to the API. However, in which format do we make this recording, and which audio codec do we use?

For that we first look at what is supported by the API; the general guidance is:

For best results, the audio source should be captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3.

Then we look at what is supported by Qt, which we get from QAudioRecorder (the class we are going to use), in particular its supportedAudioCodecs method. There is no predefined list, as Qt integrates with the native APIs of each platform and depends on what they provide, so some platforms will support more codecs and others fewer.
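
To see what a given platform offers, you can query the recorder directly; a quick sketch:

#include <QAudioRecorder>
#include <QDebug>

QAudioRecorder recorder;
// prints e.g. "audio/x-flac" on a Linux/GStreamer backend
for (const QString &codec : recorder.supportedAudioCodecs())
    qDebug() << codec;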

Having the same format supported on both sides makes it possible to simply 'pass through' our recording file to the API, avoiding any need for transcoding.

In our particular case we are going to go with audio/x-flac on our Linux installation, as that is what we are targeting, while on Android, for example, one could use audio/amr-nb since FLAC is not natively supported.
You can also go with AMR on all platforms if it is supported, but in my experience speech recognition worked best with the FLAC codec, as also noted in the Google API docs.

Client

On the client side we are going to use native Qt classes to do the required work. There are no QML interfaces for all of the required actions, so we will use the C++ interfaces and expose the relevant methods to QML.

This is actually very convenient, as it makes it possible to use all existing C++ and Qt libraries and extend our QML functionality with whatever we need.

Audio recording

First things first: we obviously have to make an audio recording to be able to transcribe anything. As mentioned, we are going to use QAudioRecorder, which uses QAudioEncoderSettings to manage the recording settings.

The usage is going to be as simple as:

// audioRecorder is a QAudioRecorder and audioSettings a QAudioEncoderSettings
audioSettings.setCodec("audio/x-flac");
audioSettings.setSampleRate(16000); // 16 kHz is a good fit for speech recognition
audioSettings.setQuality(QMultimedia::VeryHighQuality);
audioRecorder.setEncodingSettings(audioSettings);
audioRecorder.setOutputLocation(filePath);

To make a recording we start the process with audioRecorder.record(), and if there was no error the recording will end up at the location specified with audioRecorder.setOutputLocation.
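
Stopping the recording after the configured duration can be done with a single-shot timer; this is just one possible approach, not necessarily what the full source does:

#include <QTimer>

// stop recording after recordDuration milliseconds (one possible approach)
QTimer::singleShot(recordDuration, &audioRecorder, &QAudioRecorder::stop);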

Network requests (HTTP/2)

For network requests we are going to use the standard QNetworkRequest and QNetworkAccessManager. For JSON we use QJsonDocument and related classes to provide the required configuration and the audio file (which has to be provided as a base64-encoded string):

QJsonDocument data {
    QJsonObject {
        {
            "audio",
            QJsonObject { {"content", QJsonValue::fromVariant(fileData.toBase64())} }
        },
        {
            "config",
            QJsonObject {
                {"encoding", "FLAC"},
                {"languageCode", "en-US"},
                {"model", "command_and_search"},
                {"sampleRateHertz", audioSettings.sampleRate()}
            }
        }
    }
};

Note that we chose a specific model, command_and_search, which is described as "Best for short queries such as voice commands or voice search."

We will read our recorded file using QFile and then pass it together with the JSON data in a POST request.
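
Reading the recording back could look like this (a minimal sketch with error handling omitted; the exact code in the full source may differ):

QFile file(audioRecorder.outputLocation().toLocalFile());
file.open(QIODevice::ReadOnly);
const QByteArray fileData = file.readAll(); // base64-encoded into the JSON above

The POST request itself is then: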

request.setUrl(url);
request.setHeader(QNetworkRequest::ContentTypeHeader, "application/json");
// opt in to HTTP/2; Qt negotiates it and falls back to HTTP/1.1 if unavailable
request.setAttribute(QNetworkRequest::HTTP2AllowedAttribute, true);
qam.post(request, data.toJson(QJsonDocument::Compact));

We are also taking advantage of HTTP/2, which is supported by both Google APIs and Qt; you can read more about it in the Google docs.

If you want to confirm HTTP/2 usage, check for QNetworkRequest::HTTP2WasUsedAttribute in the reply.
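
For example (a minimal check inside the reply handler shown below):

const bool usedHttp2 = response->attribute(QNetworkRequest::HTTP2WasUsedAttribute).toBool();
qDebug() << "HTTP/2 used:" << usedHttp2;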

To get our results we simply read the returned JSON from the QNetworkReply:

connect(&qam, &QNetworkAccessManager::finished, [this](QNetworkReply *response)
{
    auto data = QJsonDocument::fromJson(response->readAll());
    response->deleteLater();
    auto error = data["error"]["message"];

    if (error.isUndefined()) {
        auto command = data["results"][0]["alternatives"][0]["transcript"].toString();

        setRunning(false);
        setCommand(command);
    } else {
        setRunning(false);
        setError(error.toString());
    }
});
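
For reference, a successful response has roughly the following shape (abbreviated, with an illustrative transcript; see the RecognizeResponse documentation for the full structure):

{
  "results": [
    {
      "alternatives": [
        { "transcript": "turn on the lights", "confidence": 0.93 }
      ]
    }
  ]
}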

If you get errors or empty responses, check https://cloud.google.com/speech-to-text/docs/troubleshooting.

Connecting Qt C++ and QML

There are many ways of enriching QML with custom Qt C++ methods.

QML and C++ Integration

As usual, the official documentation has you covered better than I can, as it will always be up to date.

The simplest approach would be to expose our C++ instance as a context property in QML, since we don't need more than one instance. However, we are going to create a new QML type that can be instantiated, as that will be the more common case.

We are going to need one method to start our recording/translation process, which we will simply call start.

Then we want to know when the translation process has finished and what the final translation value is, which we cover with a running Boolean property. The command string property will hold the last successful command, and the error property will hold an error string if something went wrong.

As a final convenience, we are also going to make it possible to change the target recording duration from QML by exposing a recordDuration property.

Q_INVOKABLE void start()
{
    setError("");
    setCommand("");
    setRunning(true);
    audioRecorder.record();
}

Q_PROPERTY(int recordDuration READ getRecordDuration WRITE setRecordDuration NOTIFY recordDurationChanged)
Q_PROPERTY(QString command READ getCommand NOTIFY commandChanged)
Q_PROPERTY(QString error READ getError NOTIFY errorChanged)
Q_PROPERTY(bool running READ getRunning NOTIFY runningChanged)

To make a method callable from QML we prefix it with Q_INVOKABLE, and for member variables we want to expose we use the Q_PROPERTY macro.
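
For completeness, here is how one such property might be backed; the member name m_running and this exact implementation are assumptions based on the declarations above, not necessarily the code from the full source:

bool getRunning() const { return m_running; }

void setRunning(bool running)
{
    if (m_running == running)
        return;

    m_running = running;
    emit runningChanged(); // QML bindings on 'running' re-evaluate on this signal
}

// and in the class declaration:
// signals: void runningChanged();
// private: bool m_running = false;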

In the end we register our new type with:

qmlRegisterType<VoiceTranslator>("org.pkoretic.voicetranslator", 1, 0, "VoiceTranslator");
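
This typically happens in main() before the QML is loaded; a minimal sketch, assuming a standard QQmlApplicationEngine setup and a qrc:/main.qml path (both of which may differ in the full source):

#include <QGuiApplication>
#include <QQmlApplicationEngine>
#include <QtQml>
#include "voicetranslator.h" // assumed header name

int main(int argc, char *argv[])
{
    QGuiApplication app(argc, argv);

    // make VoiceTranslator available to QML as an instantiable type
    qmlRegisterType<VoiceTranslator>("org.pkoretic.voicetranslator", 1, 0, "VoiceTranslator");

    QQmlApplicationEngine engine;
    engine.load(QUrl(QStringLiteral("qrc:/main.qml")));
    return app.exec();
}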

QML

For all of the properties we are going to use the power of QML property binding, so our UI updates appropriately when any of those values change.

First we import our new type and create a new instance:

import org.pkoretic.voicetranslator 1.0

ApplicationWindow {
    ...
    VoiceTranslator { id: translator }
}

Then we can develop our QML interface as usual, using the translator's methods and properties.

translator.start() starts the full transcription process: recording, uploading, and getting the results. The translator.running property indicates that the process is in progress, and translator.command is filled when a command is ready. The default recording duration is 2500 ms, but that can easily be changed through the translator.recordDuration property. translator.error will be non-empty if an error occurred.

Label {
    id: txInfo
    text: qsTr("Press OK to say command")
    font.italic: true
    anchors.horizontalCenter: parent.horizontalCenter

    opacity: translator.running ? 0 : 1
    Behavior on opacity { NumberAnimation {} }
}

Button {
    id: btTranslate
    text: qsTr("OK")
    icon.name: "audio-input-microphone"
    anchors.horizontalCenter: parent.horizontalCenter
    Keys.onEnterPressed: clicked()
    Keys.onReturnPressed: clicked()
    onClicked: translator.start()
    enabled: !translator.running
    focus: true

    opacity: translator.running ? 0 : 1
    Behavior on opacity { NumberAnimation {} }
}

BusyIndicator {
    id: busyIndicator
    running: translator.running
    anchors.horizontalCenter: parent.horizontalCenter

    opacity: translator.running ? 1 : 0
    Behavior on opacity { NumberAnimation {} }
}

Label {
    id: lCommand
    text: translator.command
    font.bold: true
    anchors.horizontalCenter: parent.horizontalCenter

    opacity: translator.command.length > 0 && translator.running ? 0 : 1
    Behavior on opacity { NumberAnimation {} }
}

Label {
    id: lError
    text: translator.error
    color: "red"
    anchors.horizontalCenter: parent.horizontalCenter
}
it ain’t much but it’s honest work

Wrapping up

There is obviously more happening, which you can see in the full source code.

One could also add status and state properties that update during each of the steps, as you usually see in Qt classes (and as QAudioRecorder, which we used to develop this example, provides), but that is left as an exercise for the reader.

Note that Google Speech-to-Text is very powerful and supports many languages (both programming and spoken :). However, if you don't need all that, you can always go with offline solutions or interesting alternatives like Mozilla's DeepSpeech engine.

Finally, this is just a first step. What should happen on a command? Show a forecast, create a new TODO item, or integrate further with your home devices using MQTT or KNX.

Happy talking.
