
Interpreting speech with a Raspberry Pi

Or the beginning of creating your own smart speaker

Intro

Imagine you could use a low-cost device to interpret speech without the aid of the big cloud services, with all their complexity, security concerns, and big-brotherly-ness. Well, if you have a DIY mindset, you can!

I wanted to control the Raspberry Pi-based slideshow I have written about many times in the past with voice commands. The question became: how could I do it, and is it even possible at all? Would I need to master the complex APIs provided by Amazon or Google cloud services? Well, it turns out that passable speech-to-text is possible without any external cloud provider, and I am very excited to share what I’ve learned so far.

Equipment

Raspberry Pi 4 (even my old RPi 3 seems to work)

USB microphone (see the quick check after this list)

Raspberry Pi OS
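
Before going further, it’s worth confirming that Linux can actually see your USB microphone, since everything below reads from an ALSA device. A quick check from the command line (the card and device numbers vary with your hardware, so adjust plughw:1,0 to match what arecord -l reports):

arecord -l
arecord -D plughw:1,0 -f S16_LE -r 16000 -d 5 mic-test.wav
aplay mic-test.wav

The first command lists capture devices, the second records five seconds of 16 kHz mono audio, and the third plays it back (if you have a speaker attached).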

Skills

Basic Linux and Python skills are required.

Vosk – your main tool

I’m going to cut to the chase and just tell you that the Vosk API is how I got this all working, though not before I drove into several dead ends.

Here are the Vosk installation instructions, which do work on the RPi:

Vosk Installation (alphacephei.com)
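
On Raspberry Pi OS this boils down to a single pip command (see the linked page for any caveats about Python versions on your system):

pip3 install vosk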

It will be helpful to install and test the examples:

git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
python3 ./test_simple.py test.wav

On my RPi 4 it took 36 s the first time, and 6.6 s the second time, to run this test.wav (the first run presumably includes downloading and loading the model). So I got worried and fully expected it would be just too slow on these underpowered RPi systems.

But I forged ahead and looked for an example which could do real-time speech-to-text. They provide a microphone example. It requires some additional packages, but even after installing them it still produced a nasty segmentation fault, so I gave up on that. Then I noticed an ffmpeg-based example. As it happens, I have lots of prior ffmpeg experience, since I have also posted about live recording of audio with the Raspberry Pi.

It turns out their example simply uses ffmpeg to interpret a file, though I didn’t know that to begin with. But I know my way around ffmpeg well enough to adapt it to process a live stream instead. So I made those changes, and voila. I’m glad to say I was dead wrong about the processing speed: on the RPi 4 it can keep up with its speech-to-text task in real time!
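
The essence of the change is just ffmpeg’s input arguments. Roughly speaking, their example decodes a file, while pointing ffmpeg at the ALSA capture device yields a live stream (again, plughw:1,0 is my device; yours may differ):

ffmpeg -i test.wav -ar 16000 -ac 1 -f s16le -
ffmpeg -f alsa -i plughw:1,0 -ar 16000 -ac 1 -f s16le -

Either way the output is raw 16-bit little-endian mono PCM on stdout, which is exactly what the recognizer wants.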

Basic program to examine your speech in real time

I developed the following Python script based on one of the Python examples from the API. I call it drjtst4.py, just to give it a name:

#!/usr/bin/env python3

import subprocess
import re
from modules import aux_modules

from vosk import Model, KaldiRecognizer, SetLogLevel

SAMPLE_RATE = 16000

SetLogLevel(0)

model = Model(lang="en-us")
rec = KaldiRecognizer(model, SAMPLE_RATE)
start,start_a = 0,0
input_device = 'plughw:1,0'
phrase = ''
accumulating = False
# the wake word "hey photo" is often heard as "a photo" by Vosk...
wake_word_re = '^(hey|a) photo'

with subprocess.Popen(["ffmpeg","-loglevel", "quiet","-f","alsa","-i",
                            input_device,
                            "-ar", str(SAMPLE_RATE) , "-ac", "1", "-f", "s16le", "-"],
                            stdout=subprocess.PIPE) as process:

    while True:
        data = process.stdout.read(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            # end of an utterance: Result() holds the final text
            print('in first part')
            print(rec.Result())
            text = rec.PartialResult()
            # text is a JSON string which parses to a dict
            start,start_a,accumulating,phrase = aux_modules.process_text(wake_word_re,text,start,start_a,accumulating,phrase)
        else:
            # this branch runs on nearly every chunk, since AcceptWaveform
            # only returns True at an utterance boundary
            print('in else part')
            text = rec.PartialResult()
            start,start_a,accumulating,phrase = aux_modules.process_text(wake_word_re,text,start,start_a,accumulating,phrase)
            print(rec.PartialResult())

    # we never seem to get here, since the loop only ends if ffmpeg exits
    print(rec.FinalResult())
    print('In final part')
    text = rec.FinalResult()
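
A note on the read size: 4000 bytes of s16le mono audio at 16 kHz is 2000 samples, or 125 ms per chunk, so the recognizer is fed roughly eight times a second.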

I created a modules directory and in it a file called aux_modules.py. It looks like this:

import re, time, json

def process_text(wake_word_re, text_s, start, start_a, accumulating, phrase):
    max_time = 5.5   # seconds: longest phrase we will accumulate
    inactivity = 10  # seconds: cool-down after an action has been taken
    short_max = 1.5  # seconds: a pause this long ends the phrase
    elapsed = 0
    if time.time() - start_a < inactivity:
        # allow some time to elapse since we just took an action
        return start, start_a, accumulating, phrase
    # convert the JSON string to a dict; the text is under 'partial'
    # (or under 'text' for a final result)
    text_d = json.loads(text_s)
    text = ''
    if 'partial' in text_d:
        text = text_d['partial']
    if 'text' in text_d:
        text = text_d['text']
    if not text == '': phrase = text
    if re.search(wake_word_re, text):
        if not accumulating:
            start = time.time()
            accumulating = True
            print('Wake word detected. Now accumulating text.')
    l = len(re.split(r'\s', text))
    print('text, word ct', text, l)
    if accumulating:
        elapsed = time.time() - start
        print('Elapsed time:', elapsed)
        if l > 1:
            phrase = text
    if elapsed > max_time or (elapsed > short_max and l == 1):
        # a natural ending: either the phrase ran long, or the partial
        # went (nearly) empty after a short pause
        print('This is the total text', phrase)
        # do some action
        # reset everything
        accumulating = False
        phrase = ''
        start_a = time.time()
    return start, start_a, accumulating, phrase
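
If you want to sanity-check the accumulation logic without a microphone, here is a minimal sketch (the fake partial results and the two-second sleeps are my own invention) that simulates what the recognizer feeds process_text. Run it from the same directory as drjtst4.py:

import json, time
from modules import aux_modules

start, start_a, accumulating, phrase = 0, 0, False, ''
# a growing partial result, then the empty partial that signals the end of the utterance
for partial in ['hey', 'hey photo', 'hey photo play slideshow', '']:
    text_s = json.dumps({'partial': partial})
    start, start_a, accumulating, phrase = aux_modules.process_text(
        '^(hey|a) photo', text_s, start, start_a, accumulating, phrase)
    time.sleep(2)

You should see the wake word detected on the second iteration and "This is the total text hey photo play slideshow" on the last.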

And you just invoke it with python3 drjtst4.py.

Sample session output

in else part
text, word ct 1
{
"partial" : ""
}
in else part
text, word ct hey 1
{
"partial" : "hey"
}
in else part
text, word ct hey 1
{
"partial" : "hey"
}
in else part
text, word ct hey 1
{
"partial" : "hey"
}
in else part
Wake word detected. Now accumulating text.
text, word ct hey photo 2
Elapsed time: 0.0004639625549316406
{
"partial" : "hey photo"
}
in else part
text, word ct hey photo 2
Elapsed time: 0.003415822982788086
{
"partial" : "hey photo"
}
in else part
text, word ct hey photo 2
Elapsed time: 0.034906625747680664
{
"partial" : "hey photo"
}
in else part
text, word ct hey photo 2
Elapsed time: 0.09063172340393066
{
"partial" : "hey photo"
}
in else part
text, word ct hey photo 2
Elapsed time: 0.2488384246826172
{
"partial" : "hey photo"
}
in else part
text, word ct hey photo 2
Elapsed time: 0.33771753311157227
{
"partial" : "hey photo"
}
in else part
text, word ct hey photo place 3
Elapsed time: 0.7102789878845215
{
"partial" : "hey photo place"
}
in else part
text, word ct hey photo place 3
Elapsed time: 0.7134637832641602
{
"partial" : "hey photo place"
}
in else part
text, word ct hey photo player 3
Elapsed time: 0.8728365898132324
{
"partial" : "hey photo player"
}
in else part
text, word ct hey photo player 3
Elapsed time: 0.8759913444519043
{
"partial" : "hey photo player"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.0684640407562256
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.0879075527191162
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.3674390316009521
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.3706269264221191
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.5532972812652588
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.5963218212127686
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.74298095703125
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.842745065689087
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 1.9888567924499512
{
"partial" : "hey photo play slideshow"
}
in else part
text, word ct hey photo play slideshow 4
Elapsed time: 2.0897343158721924
{
"partial" : "hey photo play slideshow"
}
in first part
{
"text" : "hey photo play slideshow"
}
text, word ct 1
Elapsed time: 2.3853299617767334
This is the total text hey photo play slideshow
in else part
{
"partial" : ""
}
in else part
{
"partial" : ""
}

A word on accuracy

It isn’t Alexa or Google. No one expected it would be, right? But if you’re a native English speaker it isn’t too bad. You can see it trying to correct itself in the session above: “place” becomes “player” becomes “play slideshow.”

The desire to choose an uncommon wake word of three syllables is at direct odds with how neural networks are trained! So although I wanted my wake word to be “hey photo,” I also allow “a photo”: “a photo” was probably in the training set, whereas “hey photo” certainly was not, hence the bias against recognizing a unique wake word. And no way will I re-train their model – way too much effort. To lower false positives, the wake word has to occur at the beginning of the spoken phrase.
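
That tolerance is just the wake_word_re regex in drjtst4.py, so it’s easy to experiment with. A quick way to check candidate phrases against a widened regex (the extra alternative here is purely illustrative):

import re

# accept common mishearings of "hey photo" at the start of an utterance
wake_word_re = '^(hey|a|they) photo'
for heard in ('hey photo play slideshow', 'a photo play slideshow', 'my photo album'):
    print(heard, '->', bool(re.search(wake_word_re, heard)))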

Turning this into a smart speaker

You can see I’ve got all the pieces set up. At least I think I do! I’ve got my wake word. I don’t have natural language processing, but I think I can forgo that. I have a place in the code where I print out the “final text.” That’s where the spoken command is perceived to have been uttered, and a potential action could be executed at that point.
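
As a sketch of what that could look like (the phrase-to-command mapping and the script paths are hypothetical placeholders, not part of my actual slideshow setup), the "do some action" comment in aux_modules.py could call a small dispatcher:

import re, subprocess

def do_action(phrase):
    # map the recognized phrase to a command; these scripts are hypothetical placeholders
    if re.search(r'play slideshow', phrase):
        subprocess.Popen(['/home/pi/bin/start-slideshow.sh'])
    elif re.search(r'stop slideshow', phrase):
        subprocess.Popen(['/home/pi/bin/stop-slideshow.sh'])
    else:
        print('No action defined for:', phrase)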

Dead ends

To be fleshed out later as time permits.

Conclusion

I have demonstrated that speech-to-text can be achieved on an inexpensive Raspberry Pi without the use of complex cloud APIs such as those provided by Amazon and Google.

I will be building on this facility in subsequent posts as I turn my RPi-powered slideshow into a slideshow which reacts to voice commands!

References and related

Vosk Installation (alphacephei.com)

Raspberry Pi slideshow

This conference USB mic works really well for me.