Dlexa the Hedge Dragon

Anatomy of Dlexa's Mind

Software Links

The lexa mind is a sequence of three large programs:
1st:  Speech-to-Text (STT),
2nd: Mycroft (database lookup),
3rd: Text-to-Speech (TTS).

Each of the sections is a separate application that can be run either locally or using a cloud service. The sections communicate using the IP address and port number of each application. So you could use the free Google cloud STT, a local Mycroft that you manage, and a cloud-based Mozilla TTS.
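Since each stage is just a network service identified by a host and port, the whole chain can be pictured as three calls in sequence. The hosts, ports, and function names below are made-up placeholders (the stub returns stand in for real network replies), not the actual service APIs:

```python
# Sketch of the three-stage lexa pipeline. Each stage lives at a
# host:port of your choosing -- local or cloud. All names here are
# illustrative placeholders, not real endpoints.

STAGES = {
    "stt":     ("stt.example.local", 5005),   # speech-to-text
    "mycroft": ("127.0.0.1",         8181),   # intent lookup
    "tts":     ("tts.example.cloud", 5002),   # text-to-speech
}

def call_stage(name, payload):
    """Pretend to POST `payload` to the stage's host:port.
    A real version would use an HTTP or websocket client."""
    host, port = STAGES[name]
    # Stubbed responses so the sketch runs without a network:
    fake = {
        "stt":     "what is the weather tomorrow",
        "mycroft": "mostly cloudy and 10 celsius",
        "tts":     b"RIFF....WAVEfmt ",        # WAV bytes in reality
    }
    return fake[name]

def handle_request(audio_bytes):
    text  = call_stage("stt", audio_bytes)    # audio -> text
    reply = call_stage("mycroft", text)       # text  -> response text
    return call_stage("tts", reply)           # text  -> audio
```

Because every hop is just an address, swapping a local stage for a cloud one only changes an entry in the table.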


The process starts when the ears in the lexa body detect a request: often a key phrase like "Hello Alexa" or "Hey Google", or, with a kind of "always on" setting, anything spoken loud enough for the lexa to hear. The lexa decides how long the request is by listening for a stretch of quiet: the kind humans put at the end of a sentence. If the lexa does not hear an end point it arbitrarily truncates the request (the number of milliseconds is an "ears" setting).
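That end-of-request logic can be sketched as a loop over short audio frames: a run of quiet frames marks a natural end, and a hard limit truncates the request. The thresholds and frame counts here are invented settings, not anything from a real product:

```python
def find_request_end(frames, quiet_level=0.05, quiet_frames_needed=25,
                     max_frames=500):
    """Return the index of the frame where the request ends.

    `frames` is a list of per-frame loudness values (0.0..1.0).
    A run of `quiet_frames_needed` quiet frames marks a natural end;
    otherwise the request is arbitrarily truncated at `max_frames`
    (the "ears" setting from the text). All numbers are illustrative.
    """
    quiet_run = 0
    for i, loudness in enumerate(frames):
        if i >= max_frames:
            return i                  # truncated, likely mid-sentence
        if loudness < quiet_level:
            quiet_run += 1
            if quiet_run >= quiet_frames_needed:
                return i              # natural pause found
        else:
            quiet_run = 0
    return len(frames)
```

You can see the failure mode directly: if a second speaker fills the pause, `quiet_run` keeps resetting and only the truncation limit ends the request.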

You can already see problems with this. If several people are talking at the same time, not only is the audio corrupted (the system works on a single speaker), but there is no obvious end point for the sentence. The truncation time would chop the request off mid-sentence.


The request packet is sent to a speech-to-text application. This is a very sophisticated neural net that has been trained to recognize phonetic units called phonemes. It usually takes several phonemes to make a syllable and several syllables to make a word. When it finds a word it adds it to the text packet and continues processing the audio stream.
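The phonemes-to-words step can be pictured as matching runs of phonemes against a pronunciation dictionary. The phoneme spellings and the two-word dictionary below are toy examples; a real STT scores thousands of candidates with a neural net rather than doing a greedy lookup:

```python
# Toy phoneme stream decoder, illustrating the
# phonemes -> syllables -> words idea only.
PRONUNCIATIONS = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"):   "world",
}

def decode(phonemes):
    words, start = [], 0
    while start < len(phonemes):
        # Try the longest phoneme run first, shrinking until a match.
        for end in range(len(phonemes), start, -1):
            chunk = tuple(phonemes[start:end])
            if chunk in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[chunk])
                start = end
                break
        else:
            start += 1          # skip an unrecognized phoneme
    return " ".join(words)
```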

Early attempts at STT required that it be trained on a single person and that person had to speak with a pause between each word. Speaking -- that -- way -- was -- very -- hard. The first advance was continuous speech but the STT still had to be trained on a single person. Of course each advance required larger neural nets and faster computers.

The current STT applications are not trained to a specific speaker and can turn a conversation between people into a script with the people tagged. A:hello B:haven't "scene" you around in a "log" time. A:three "mumps" actually. B:don't you like this place? A:I've "bin" away on a big "contact". The words in quotes are errors in the STT. If you watch closed captioning on news stations you'll see these kinds of errors.

Speaker-independent STT is so compute intensive it cannot be done on a regular computer. Luckily the gaming industry has been very demanding, creating games that require massive numbers of graphics calculations so the grass on the field waves in the wind and the body parts move exactly as they would in a real explosion. The industry solution was to put thousands of tiny math calculators on a graphics card.

Nvidia has most of the market share in graphics cards and has standardized the interface to the tiny math units (called cores). CUDA (Compute Unified Device Architecture) works on a whole range of their cards, and ten-year-old cards still work with current math libraries like TensorFlow. PyTorch is a popular math library driven by Python language calls, but it refuses to work on older cards. Another case of some things work and some do not, and discovering this can take weeks of failure.

If you want to train an STT on a new language (it turns out that Klingon has already been done) you need a card with many thousands of CUDA cores and at least 8 GBytes of memory. To run an STT locally you only need a card with a few hundred cores and less than 1 GB of memory. Or you can use a cloud-based STT and communicate with it from any old PC.


Now you have the audio version of the request converted (sort of) to text, and the Mycroft application tries to provide a response. So if the request was "What is the weather tomorrow?" Mycroft will break that request into key words like "query", "weather", "Tuesday, January 10 2023", "Colwood BC Canada", "metric" and hand them to a skill that does weather queries, with words from the request as parameters. That skill will form an internet request and get back text like "mostly cloudy and 10 celsius". It then passes that text to the TTS to create a voice response.
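The keyword-extraction step can be sketched as pulling known words out of the text and filling the rest in from device settings. The field names, defaults, and structure below are simplified guesses at how such a skill looks, not Mycroft's actual skill API:

```python
# Simplified intent parsing in the spirit of a Mycroft weather skill.
# Field names and defaults are illustrative only.
DEVICE_SETTINGS = {"location": "Colwood BC Canada", "units": "metric"}

def parse_weather_request(text):
    words = [w.strip("?.,!") for w in text.lower().split()]
    if "weather" not in words:
        return None                          # not this skill's request
    when = "today"                           # default if unspecified
    for w in ("today", "tomorrow", "tonight"):
        if w in words:
            when = w
    return {
        "intent":   "query.weather",
        "when":     when,
        "location": DEVICE_SETTINGS["location"],
        "units":    DEVICE_SETTINGS["units"],
    }

# The skill would then call a weather service with these parameters
# and format the reply, e.g. "mostly cloudy and 10 celsius".
```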

Now not all the text that Mycroft gets from the STT makes sense. It might get "what is the average temperature for next mump". That is because the person said "month" and the STT processed it as "mump". Mycroft also has to take all the words it gets from the STT, check each word against all those that sound alike, and sort through the hundreds of possible sentences to find the request that is statistically most likely. Since "mump" [category: disease] is less likely than "month" [category: time] it makes the correction. While this is not as computationally hard as the job the STT does, it still needs a completely idle multi-CPU system when it starts working on the request.
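The "mump" to "month" correction can be sketched as scoring each sound-alike candidate by how likely it is. The homophone table and frequency scores below are invented for illustration; a real system scores whole candidate sentences with a language model rather than single words:

```python
# Toy homophone correction: pick the statistically likeliest
# sound-alike for each word. All tables here are invented.
SOUND_ALIKES = {
    "mump": ["month", "mumps", "bump"],
    "bin":  ["been", "bin"],
}
WORD_FREQUENCY = {      # rough "how often people say this" scores
    "month": 900, "mumps": 5, "bump": 40, "been": 800, "bin": 60,
}

def correct(word):
    candidates = SOUND_ALIKES.get(word, [word])
    return max(candidates, key=lambda w: WORD_FREQUENCY.get(w, 0))

def correct_sentence(text):
    return " ".join(correct(w) for w in text.split())
```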


So the STT heard the request, got it mostly right and passed the text on. Mycroft fixed up some of the bizarre words, pulled data from the internet to satisfy the request, then put that data into a sentence format that humans would find familiar. It then passed that sentence to the Text-to-Speech application.

TTS systems have been around for decades. Stephen Hawking used a TTS that DEC introduced in the early 1980s and sold as DECtalk. The device had several voices to choose from and Hawking used "Perfect Paul". The voice sounds very robotic, but you can download a version that will run on your PC if you want to play with it.

It turns out it is easy to get electronics to make sounds that humans can understand as speech, but getting something you would think was spoken by a real person is extremely compute intensive. Like STT, the solution is a large neural net running on a graphics card. You can tell it is not human, but it is really close.

A written sentence is not the way a human speaks. So the TTS has to break the sentence it got from Mycroft into phrasing that mimics human flow, or cadence. It then formats that as phonemes to make syllables that, strung together, sound like human speech.
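The phrasing step can be sketched as splitting the sentence at punctuation and attaching a pause to each phrase, which is roughly what a TTS front end does before the neural net sees anything. The pause lengths are made-up values:

```python
import re

# Toy TTS front end: break a written sentence into (phrase, pause)
# pairs, pauses in milliseconds, to mimic human cadence.
# The pause values are invented settings.
PAUSES = {",": 200, ";": 300, ".": 500, "?": 500, "!": 500}

def phrase(sentence):
    # Split on punctuation, keeping the mark via the capture group.
    chunks = re.split(r"([,;.?!])", sentence)
    phrases = []
    for text, mark in zip(chunks[::2], chunks[1::2]):
        text = text.strip()
        if text:
            phrases.append((text, PAUSES.get(mark, 0)))
    return phrases   # trailing text with no mark is dropped in this sketch
```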

Different people speak with different accents: a different flow of phonemes and slightly altered phonemes. That's what makes a cockney different from a southern drawl. You can train a TTS to sound like anyone.

Again you need a graphics card with thousands of CUDA cores and about 8 GB of memory. You just feed it about five hours of recorded audio made up of several thousand sentences using a vocabulary of several thousand words. In a couple of days it puts out a neural net (several hundred megabytes) with all the parameters set.

For example, if the TTS was trained on your mother's voice you could feed that neural net a sentence and it would sound like your mother. There is a whole catalog of these canned neural nets that speak like famous people or characters from famous games.

If you like the sound of GLaDOS from the game Portal you can have your lexa speak like that.


All that remains is to pass the digital audio from the TTS to a speaker in the lexa body. The details of this are in Voice but the brief version is that the TTS creates a WAV file which is sent down an ethernet cable to a router in the heart (the communications and electrical hub) of Dlexa.
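Producing a WAV file like the one shipped to the body can be sketched with Python's standard `wave` module; the pure tone generated here just stands in for real TTS output, and the sample rate is a typical but assumed value:

```python
import math
import struct
import wave

# Write a short 440 Hz tone as a 16-bit mono WAV file -- a stand-in
# for the audio a TTS produces before it goes down the ethernet cable.
RATE = 22050          # samples per second; a common TTS output rate

def write_tone(path, freq=440.0, seconds=0.5):
    with wave.open(path, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(RATE)
        for i in range(int(RATE * seconds)):
            sample = int(20000 * math.sin(2 * math.pi * freq * i / RATE))
            w.writeframes(struct.pack("<h", sample))

write_tone("reply.wav")
```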

The data travels through a WiFi link to a microcontroller which converts it and drives a class-D amplifier. The amplifier drives a DML (distributed mode loudspeaker) transducer stuck on a chunk of extruded polystyrene (Foamular building insulation). The foam sheet is enclosed in a frame in the dragon's chest for Dlexa; for Pearl the foam is painted to look like a gold plate and hangs on a frame like an ancient gong.

And now you have the full chain of actions from someone making a request at the microphone to their answer coming from a mysterious somewhere (that's the DML effect).


Now that you have some understanding of the hardware and software needed by a lexa brain you can go to Builds and read the gory details of installing the chain of technologies.