The lexa mind is a sequence of three large programs:
1st: Speech-to-Text (STT),
2nd: Mycroft (database lookup),
3rd: Text-to-Speech (TTS).
Each section is a separate application that can run either locally or on a cloud service.
The sections communicate using the IP address and port number of the next application in the chain.
So you could use the free Google cloud STT, a local Mycroft that you manage, and a cloud-based Mozilla TTS.
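The chain above can be pictured as three stages, each reachable at a (host, port) address. Here is a minimal sketch; the stage functions, hostnames, and ports are all invented placeholders (the real applications exchange data over the network):

```python
# Toy sketch of the three-stage lexa pipeline. The host/port values and
# the stage functions are hypothetical stand-ins, not real services.
STAGES = {
    "stt": ("stt.example.local", 5000),   # e.g. a cloud STT
    "mycroft": ("127.0.0.1", 5678),       # a local Mycroft you manage
    "tts": ("tts.example.local", 5002),   # e.g. a cloud-based TTS
}

def stt(audio: bytes) -> str:
    """Stand-in for the speech-to-text stage."""
    return "what is the weather tomorrow"

def mycroft(text: str) -> str:
    """Stand-in for the skill-lookup stage."""
    return "mostly cloudy and 10 celsius"

def tts(text: str) -> bytes:
    """Stand-in for the text-to-speech stage."""
    return text.encode("utf-8")  # a real TTS would return WAV audio

def handle_request(audio: bytes) -> bytes:
    # Each hop would really be a network call to STAGES[name].
    return tts(mycroft(stt(audio)))
```

The point of the sketch is the shape of the data flow: audio in, text through the middle, audio back out.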
The process starts when the ears in the lexa body detect a request: often a key phrase like
"Hello Alexa" or "Hey Google", or, with a kind of "always on" setting, anything spoken loudly
enough for the lexa to hear. The lexa decides how long the request is by listening for a stretch of quiet: the kind humans
put at the end of a sentence. If the lexa does not hear an end point it arbitrarily truncates the request
(the number of milliseconds is an "ears" setting).
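That end-of-request logic can be sketched in a few lines. The chunk size, thresholds, and timeouts below are invented for illustration, not the actual "ears" settings:

```python
# Toy end-of-request detector: the request ends after ENDPOINT_MS of
# quiet, or is chopped at TRUNCATE_MS (the "ears" truncation setting).
CHUNK_MS = 20        # audio arrives in 20 ms chunks (assumed)
ENDPOINT_MS = 600    # quiet this long counts as a sentence end
TRUNCATE_MS = 8000   # hard cutoff if no quiet stretch is ever heard
QUIET_LEVEL = 100    # energy below this counts as silence (assumed scale)

def request_length_ms(chunk_energies):
    """Return how many milliseconds of audio belong to the request."""
    quiet_ms = 0
    for i, energy in enumerate(chunk_energies):
        elapsed = (i + 1) * CHUNK_MS
        quiet_ms = quiet_ms + CHUNK_MS if energy < QUIET_LEVEL else 0
        if quiet_ms >= ENDPOINT_MS:      # heard a sentence-ending pause
            return elapsed - quiet_ms    # the request ends where the quiet began
        if elapsed >= TRUNCATE_MS:       # never heard one: truncate here
            return elapsed
    return len(chunk_energies) * CHUNK_MS
```

A second of speech followed by enough quiet ends cleanly; continuous noise gets truncated at the limit, mid-sentence or not.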
You can already see problems with this. If you have several people talking at the same time, not only is
the audio corrupted (the system only works on a single speaker) but there is no obvious end point for the
sentence. The truncation timer would chop the request in mid sentence.
The request packet is sent to a speech-to-text application. This is a very sophisticated neural net that has
been trained to recognize phonetic units called phonemes. It usually takes several phonemes to make a syllable and
several syllables to make a word. When it finds a word it adds it to the text packet and continues
processing the audio stream.
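The phoneme-to-word step can be pictured as a dictionary match over the phoneme stream. This toy version uses a greedy longest-match over a two-word lexicon; the rough ARPAbet-style phoneme spellings are illustrative guesses, and a real STT net scores thousands of candidates statistically instead:

```python
# Toy phoneme-to-word assembly: greedily match the longest known
# phoneme sequence. The tiny lexicon below is invented for illustration.
LEXICON = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}
MAX_WORD_PHONEMES = max(len(k) for k in LEXICON)

def phonemes_to_words(phonemes):
    words, i = [], 0
    while i < len(phonemes):
        for n in range(MAX_WORD_PHONEMES, 0, -1):  # longest match first
            chunk = tuple(phonemes[i:i + n])
            if chunk in LEXICON:
                words.append(LEXICON[chunk])       # add word to the text packet
                i += n
                break
        else:
            i += 1  # unrecognized phoneme: skip it and keep streaming
    return words
```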
Early attempts at STT required training on a single person, and that person had to speak with a pause
between each word. Speaking -- that -- way -- was -- very -- hard. The first advance was continuous speech, but
the STT still had to be trained on a single person. Of course each advance required larger neural nets and more computing power.
The current STT applications are not trained to a specific speaker and can turn a conversation between people into a script
with the people tagged. A:hello B:haven't "scene" you around in a "log" time. A:three "mumps" actually.
B:don't you like this place? A:I've "bin" away on a big "contact". The words in quotes are errors in the STT.
If you watch closed captioning on news stations you'll see these kinds of errors.
Speaker independent STT is so compute intensive it cannot be done on a regular computer. Luckily the
gaming industry has been very demanding, creating games that require massive numbers of graphics calculations so
the grass on the field waves in the wind and the body parts move exactly as they would in a real
explosion. The industry solution was to put thousands of tiny math calculators on a graphics card.
Nvidia has most of the market share in graphics cards and has standardized the interface to these
tiny math units (called cores). CUDA (Compute Unified Device Architecture) works on a whole range of their cards, and
ten year old cards still work with current math libraries like Tensorflow. Pytorch is a popular math library
driven by Python language calls, but it refuses to work on older cards. Another case of some things
working and some not, and discovering which can take weeks of failure.
If you want to train an STT on a new language (it turns out that Klingon has already been done) you need a card with
many thousand CUDA cores and at least 8 GBytes of memory. To run an STT locally you only need a card with a few
hundred cores and less than 1 GB of memory. Or you can use a cloud based STT and communicate with it over the network.
Now you have the audio version of the request converted (sort of) to text, and the Mycroft application tries to
provide a response. So if the request was "What is the weather tomorrow?" Mycroft will break that request into
key words like "query", "weather", "Tuesday, January 10 2023", "Colwood BC Canada", "metric" and hand them
to a skill that does weather queries, with words from the request as parameters. That skill will form an internet request
and get return text like "mostly cloudy and 10 celsius". It then passes that text to the TTS to create a voice response.
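Mycroft's real intent parsers (Adapt and Padatious) are far richer, but the keyword-to-skill step can be sketched like this; the keyword sets, intent names, and the canned forecast replies are all invented for illustration:

```python
# Toy intent matching: pull known keywords out of the request text and
# hand them to a pretend weather "skill". All vocabularies are invented.
QUERY_WORDS = {"what", "when", "how"}
WEATHER_WORDS = {"weather", "temperature", "rain"}

def parse_request(text):
    words = set(text.lower().replace("?", "").split())
    if words & QUERY_WORDS and words & WEATHER_WORDS:
        return {"intent": "weather.query",
                "when": "tomorrow" if "tomorrow" in words else "today"}
    return {"intent": "unknown"}

def weather_skill(intent):
    # A real skill would form an internet request here; we fake the reply.
    forecasts = {"today": "sunny and 8 celsius",
                 "tomorrow": "mostly cloudy and 10 celsius"}
    return forecasts[intent["when"]]
```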
Now not all the text that Mycroft gets from the STT makes sense. It might get "what is the average temperature for
next mump". That is because the person said "month" and the STT processed it as "mump". Mycroft also has to
take all the words it gets from the STT, check each word against all those that sound alike, and sort through
the hundreds of possible sentences to find the request that is statistically most likely. Since "mump" [category
disease] is less likely than "month" [category time] it makes the correction. While this is not as computationally hard
as the job the STT does, it still needs a completely idle multi-cpu system when it starts working on the request.
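A crude version of that sound-alike correction looks like this. The homophone sets and likelihood numbers are made up; a real system scores whole candidate sentences with a language model rather than single words:

```python
# Toy sound-alike correction: for each suspect word, pick the sound-alike
# that is statistically most likely in a weather/time question. The
# candidate lists and probabilities below are invented for illustration.
SOUND_ALIKES = {
    "mump": ["month", "mump"],
    "bin": ["been", "bin"],
}
LIKELIHOOD = {"month": 0.9, "mump": 0.01,   # [time] vs [disease]
              "been": 0.8, "bin": 0.1}

def correct(words):
    out = []
    for w in words:
        candidates = SOUND_ALIKES.get(w, [w])
        out.append(max(candidates, key=lambda c: LIKELIHOOD.get(c, 0.5)))
    return out
```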
So the STT heard the request, got it mostly right and passed the text on. Mycroft fixed up some of the bizarre
words, pulled data from the internet to satisfy the request, then put that data into a sentence format that
humans would find familiar. It then passed that sentence to the Text-to-Speech application.
TTS systems have been around for decades. Stephen Hawking used a TTS that DEC invented in 1983 and sold as
DECtalk. The device had several voices to choose from and
Hawking used "Perfect Paul". The voice sounds very robotic but you can download a version that will run on
your PC if you want to play with it.
It turns out it is easy to get electronics to make sounds that humans can understand as speech, but to get something
you would think was spoken by a real person is extremely compute intensive. Like STT, the solution is a large
neural net running on a graphics card. You can tell it is not human but it is really close.
A written sentence is not the way a human speaks. So the TTS has to break the sentence it got from Mycroft
into phrasing that mimics human flow, or cadence. It then formats that as phonemes to make syllables that,
strung together, sound like human speech.
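The phrasing pass can be pictured as splitting the written sentence at punctuation, the pause points a human would take a breath at. A minimal sketch (real TTS phrasing also uses grammar and learned prosody, and the phoneme generation step is not shown):

```python
# Toy phrasing pass: split a written sentence into spoken phrases at
# punctuation marks. Each phrase would then be turned into phonemes.
import re

def phrases(sentence):
    parts = re.split(r"[,;:.!?]", sentence)
    return [p.strip() for p in parts if p.strip()]
```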
Different people speak with a different accents: a different flow of phonemes and slightly altered phonemes.
That's what makes a cockney different to a southern drawl. You can train a TTS to sound like anyone.
Again you need a graphics card with thousands of CUDA cores and about 8GB of memory. You just feed it about five hours
of recorded audio made up of several thousand sentences using a vocabulary of several thousand words.
In a couple of days it puts out a neural net (several hundred megabytes) with all the parameters set.
If the TTS was trained on your mother's voice you could feed that neural net a sentence and it will sound like
your mother. There is a whole catalog of these canned neural nets that speak like famous people or characters
from famous games.
If you like the sound of GLaDOS from the game
Portal you can have your lexa speak like that.
All that remains is to pass the digital audio from the TTS to a speaker in the lexa body. The details of this
are in Voice but the brief version is the TTS creates a WAV file which is sent
down an ethernet cable to a router in the heart (the communications and electrical hub) of Dlexa.
The data travels over a WiFi link to a microcontroller which converts it and drives a class-D amplifier. The
amplifier drives a DML (distributed mode loudspeaker) transducer stuck on a chunk of extruded polystyrene
(Foamular building insulation). The foam sheet is enclosed in a frame in the dragon's chest for Dlexa; for Pearl
the foam is painted to look like a gold plate and hangs on a frame like an ancient gong.
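The WAV file itself is simple to produce. Here is a sketch using Python's standard wave module that writes a short sine tone in place of real TTS output; the tone frequency and sample rate are arbitrary choices:

```python
# Write a one-second 440 Hz tone as a 16-bit mono WAV, standing in for
# the audio a TTS would hand to the lexa body's speaker chain.
import io
import math
import struct
import wave

def make_wav(freq=440, seconds=1.0, rate=22050):
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)    # mono
        w.setsampwidth(2)    # 16-bit samples
        w.setframerate(rate)
        for n in range(int(seconds * rate)):
            sample = int(20000 * math.sin(2 * math.pi * freq * n / rate))
            w.writeframes(struct.pack("<h", sample))
    return buf.getvalue()
```

The bytes returned start with the standard RIFF/WAVE header, which is exactly what travels down the ethernet cable to the heart.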
And now you have the full chain of actions from someone making a request at the microphone to their
answer coming from a mysterious somewhere (that's the DML effect).
Now that you have some understanding of the hardware and software needed by a lexa brain you can go to
Builds and read the gory details of installing the chain of technologies.