As Snowden told us, the video and audio recording capabilities of your devices are NSA spying vectors. OSS/Linux is a safeguard against such capabilities. The massive datacenter investments in the US will be used to classify us all into a patriotic (for Israel)/Oligarchist social credit score, and every mega tech company can increase profits through NSA cooperation and is legally obligated to cooperate with all government orders.
Speech-to-text and speech automation are useful tech, though always-listening state-sponsored terrorists are a non-NSA-targeted path for sweeping future social credit classifications of your past life.
Some small LLMs that can be used for speech to text: https://modal.com/blog/open-source-stt


I mean, there are many. TTS and self-hosted automation are huge in the local LLM scene.
We even have open source “omni” models now that can ingest and output speech tokens directly (which means they get more semantic understanding from tone and such, they ‘choose’ the tone to reply with, and the output is streamable word by word). They support all sorts of tool calling.
…But they aren’t easy to run. It’s still in the realm of homelabs with at least an RTX 3060 + hacky Python projects.
If you’re mad, you can self-host LongCat Omni:
https://huggingface.co/meituan-longcat/LongCat-Flash-Omni
And blow Alexa out of the water with an MIT-licensed model from, I kid you not, a Chinese food delivery company.
EDIT
For the curious, see:
Audio-text-to-text (and sometimes TTS): https://huggingface.co/models?pipeline_tag=audio-text-to-text&num_parameters=min%3A6B&sort=modified
TTS: https://huggingface.co/models?pipeline_tag=text-to-speech&num_parameters=min%3A6B&sort=modified
“Anything-to-anything,” generally image/video/audio/text -> text/speech: https://huggingface.co/models?pipeline_tag=any-to-any&num_parameters=min%3A6B&sort=modified
The filters are set to 6B parameters and up to exclude toy/test models.
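If you want to tweak those searches, here is a minimal sketch that rebuilds the same links from their query parameters. The parameter names (`pipeline_tag`, `num_parameters`, `sort`) are copied straight from the URLs above. The helper name `hf_search_url` is made up for illustration, and whether the site accepts any particular other value is an assumption you would have to check.

```python
from urllib.parse import urlencode

HF_MODELS = "https://huggingface.co/models"


def hf_search_url(pipeline_tag: str, min_params: str = "6B", sort: str = "modified") -> str:
    """Build a Hugging Face model-search URL shaped like the links above.

    hf_search_url is a hypothetical helper, not part of any library.
    """
    query = {
        "pipeline_tag": pipeline_tag,
        # urlencode percent-encodes the colon, giving min%3A6B as in the links
        "num_parameters": f"min:{min_params}",
        "sort": sort,
    }
    return f"{HF_MODELS}?{urlencode(query)}"


if __name__ == "__main__":
    for tag in ("audio-text-to-text", "text-to-speech", "any-to-any"):
        print(hf_search_url(tag))
```

Changing `min_params` is the obvious knob if you also want to see the smaller models.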
I do wish there was a smaller LongCat model available. My current AI node has a hard 16GB VRAM limit (yay AMD UMA limitations), so 27B can’t really fit. An 8B dynamically loaded model would fit, and run much better.
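For anyone wondering why 16GB is tight, here is a rough back-of-envelope sketch. It only counts weight storage (parameters times bits per weight) plus a flat overhead for KV cache, activations, and runtime. The 3 GB overhead is a made-up placeholder, not a measurement, and the function names are hypothetical. Real usage depends on the runtime, context length, and quantization format.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float = 16.0, overhead_gb: float = 3.0) -> bool:
    """True if weights plus an ASSUMED flat overhead fit in the VRAM budget.

    overhead_gb is an illustrative guess for KV cache, activations and
    runtime buffers; tune it for your own setup.
    """
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (27, 8):
        for bits in (16, 8, 4):
            verdict = "fits" if fits(size, bits) else "does not fit"
            print(f"{size}B @ {bits}-bit: ~{weight_gb(size, bits):.1f} GB weights, {verdict} in 16 GB")
```

Under these assumptions, 27B at 4-bit is already 13.5 GB of weights alone, which leaves very little room for anything else, while 8B at 4-bit is about 4 GB.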