Coqui TTS is actually deprecated; the company shut down. I have a voice assistant that uses GPT-5.4 and Opus 4.6 through the subsidized Codex and Claude Code plans, with the STT and TTS portions hosted locally via mlx-audio: https://github.com/Blaizzy/mlx-audio
Here are the models I found that work well:
- Qwen ASR and TTS are really good. Qwen ASR is faster than OpenAI Whisper on Apple Silicon from my tests. And the TTS model has voice cloning support so you can give it any voice you want. Qwen ASR is my default.
- Chatterbox Turbo also does voice cloning TTS and is more efficient to run than Qwen TTS. Chatterbox Turbo is my default.
- Kitten TTS is good as a small model, better than Kokoro
- Soprano TTS is surprisingly really good for a small model, but it has glitches that prevent it from being my default
But overall the mlx-audio library makes it really easy to try different models and see which ones I like.
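To give a flavor of what "trying different models" looks like in practice, here's a tiny harness that runs each model through the mlx-audio CLI. Note the module path (`mlx_audio.tts.generate`), the flag names, and the model repo ids below are assumptions from my reading of the README, so verify them against the repo before relying on this:

```python
# Sketch: A/B different TTS models through the mlx-audio CLI.
# Module path, flags, and model ids are ASSUMPTIONS -- check the
# mlx-audio README (https://github.com/Blaizzy/mlx-audio) first.
import subprocess

MODELS = {
    "qwen-tts": "Qwen/Qwen3-TTS",           # hypothetical repo id
    "chatterbox": "ResembleAI/chatterbox",  # hypothetical repo id
    "kitten": "KittenML/kitten-tts-nano",   # hypothetical repo id
}

def tts_command(model_key: str, text: str, out_path: str) -> list[str]:
    """Build one CLI invocation so each model can be tried in turn."""
    return [
        "python", "-m", "mlx_audio.tts.generate",
        "--model", MODELS[model_key],
        "--text", text,
        "--output", out_path,
    ]

def try_all(text: str) -> None:
    """Render the same line with every model for side-by-side listening."""
    for key in MODELS:
        subprocess.run(tts_command(key, text, f"{key}.wav"), check=True)
```

Rendering the same sentence with every model and listening back-to-back is the quickest way I know to pick a default.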
Do you know which HA integration I would use if I want to try out Qwen 3 ASR in HA? Some screenshots in the OP reference Qwen 3 ASR for STT but I can't seem to find any reference to which integration I'd use.
If you want your comments to sound more human — stop using em dashes everywhere. LLMs love them — along with neat structure, “furthermore”-style transitions, and perfectly balanced paragraphs.
Humans write a bit messier — commas, short sentences, abrupt turns.
I think em-dashes were once a reliable indicator (though never proof), but recent models have been fine-tuned to use them much less. Lots of recent AI-generated writing I've seen doesn't have em-dashes. Meanwhile, I've heard many people say that they naturally use em-dashes, and were already and/or are afraid of being accused of AI; so ironically this rumor may be causing people to use their own voice less.
Before, I naturally used hyphens as if they were em-dashes. The kerfuffle over LLM use of em-dashes motivated me to figure out how to type them properly (and configure my system to make that easier). Now I even go over old writing to fix the hyphens.
The RTX 5090 only has 32 GB of VRAM. So the tradeoff: NVIDIA gives you blazing speed in a small memory pool, while Apple Silicon gives you a much larger memory pool at moderate speed.
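The tradeoff becomes concrete with back-of-envelope math. This is a weights-only rule of thumb (it ignores KV cache and activations, which add more on top), and the 10-20% overhead figure is a rough assumption:

```python
# Rough rule of thumb: weights-only memory for an LLM is
# params * bytes_per_weight; KV cache and activations add
# roughly 10-20% more on top (rough assumption, workload-dependent).
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Gigabytes needed just to hold the quantized weights."""
    return params_billions * 1e9 * (bits_per_weight / 8) / 1e9

# A 70B model quantized to 4 bits per weight:
need = weights_gb(70, 4)   # 35.0 GB of weights alone
fits_5090 = need <= 32     # RTX 5090 (32 GB VRAM): does not fit
fits_mac_64 = need <= 64   # 64 GB unified-memory Mac: fits, just slower
```

So a 4-bit 70B model simply doesn't fit on a 5090, while a 64 GB Mac holds it comfortably, which is the whole tradeoff in one line of arithmetic.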
I'm running a local voice agent on a Mac Mini M4. Qwen ASR for STT and Qwen TTS on Apple Silicon via MLX, Claude for the LLM. No API costs besides the Claude subscription but the interesting part is the LLM is agentic because it's using Claude Code. It reads and writes files, spawns background agents, controls devices, all through voice.
The insights about VAD and streaming pipelines in this thread are exactly what I'm looking at for v2. Moving to a WebSocket streaming pipeline with proper voice activity detection would close the latency gap significantly, even with local models.
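For anyone curious what the VAD half of that pipeline amounts to, here's a deliberately simplified sketch: gate fixed-size PCM frames on RMS energy and close an utterance after a run of silent frames. A real pipeline would use a trained VAD (webrtcvad, Silero) instead of a bare energy threshold, and both constants below are made-up values you'd tune per microphone:

```python
# Simplified streaming VAD: segment 16-bit PCM frames into utterances.
# A trained VAD (webrtcvad, Silero) should replace the energy gate in
# anything real; both constants here are illustrative, not tuned.
import math
import struct

SILENCE_FRAMES_TO_END = 25   # ~500 ms of 20 ms frames ends an utterance
ENERGY_THRESHOLD = 500.0     # RMS over 16-bit samples; tune per mic

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a little-endian 16-bit PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def segment_utterances(frames):
    """Yield lists of voiced frames, split on sustained silence."""
    current, silent = [], 0
    for frame in frames:
        if rms(frame) >= ENERGY_THRESHOLD:
            current.append(frame)
            silent = 0
        elif current:
            silent += 1
            if silent >= SILENCE_FRAMES_TO_END:
                yield current
                current, silent = [], 0
    if current:  # flush a trailing utterance at stream end
        yield current
```

In the WebSocket version, each yielded utterance would be shipped to the STT model the moment the silence counter trips, instead of waiting for the whole recording.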
Do you think it would be possible in the future to add developer settings to enable or disable certain features, or to switch to a more lightweight sandboxing method like Apple Seatbelt?
Yup, it uses the Apple Virtualization framework. That means I can't use Claude Cowork inside my own VMs; that's actually how I found out it runs in a VM, because it threw a nested-VM error. All it does is limit functionality, take extra disk space, and add lag. A better sandbox would be Apple Seatbelt, which is what OpenAI uses, but even that isn't perfect: https://news.ycombinator.com/item?id=44283454
I don’t have a strong opinion on how they should handle nested VMs, but I very much disagree that Seatbelt is better. Claude Code (aka `claude`) uses it, and it’s barely good for anything.
Out of curiosity, why are you running Cowork inside a VM in the first place? What does that get you that letting Cowork use its own VM wouldn’t?
OpenAI Codex CLI was able to use it effectively, so at least the AI knows how to use it. Still, it's deprecated and unmaintained; Apple needs to ship something new soon.
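For context on what Seatbelt actually looks like: it's a deny-by-default SBPL profile applied via the (deprecated, but still shipping) `sandbox-exec` binary. The rules below are a toy illustration, not a vetted policy, and the scratch path is made up:

```python
# Illustrative only: a minimal Seatbelt (SBPL) profile and the
# deprecated `sandbox-exec` invocation that applies it. These rules
# are a toy example, not a vetted or complete policy.
PROFILE = (
    '(version 1)\n'
    '(deny default)\n'
    '(allow process-exec)\n'
    '(allow file-read* (subpath "/usr/lib"))\n'
    '(allow file-read* file-write* (subpath "/tmp/agent-scratch"))\n'
)

def sandboxed_command(argv: list[str]) -> list[str]:
    """Wrap a command so macOS runs it under the inline profile above."""
    return ["sandbox-exec", "-p", PROFILE] + argv

# Usage (macOS only):
#   subprocess.run(sandboxed_command(["ls", "/tmp/agent-scratch"]))
```

The appeal is that it's just a process wrapper with no VM overhead; the criticism upthread is that a profile language this coarse is hard to make both usable and airtight.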
I've been running something similar for a few months, which is a voice-first interface for Claude Code running on a local Flask server. Instead of texting from my phone, I just talk to it. It spawns agents in tmux sessions, manages context with handoff notes between sessions, and has a card display for visual output.
The remote control feature is cool but the real unlock for me was voice. Typing on a phone is a terrible interface for coding conversations. Speaking is surprisingly natural for things like "check the test output" or "what did that agent do while I was away."
The tmux crowd in this thread is right that SSH + tmux gets you 90% of the way there. But adding voice on top changes the interaction model. You stop treating it like a terminal and start treating it like a collaborator.
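The tmux part of that setup is small enough to sketch. The session naming scheme and the `claude -p <task>` invocation below are my guesses at how such a wrapper might look, not the commenter's actual code:

```python
# Sketch: spawn and inspect background agents in detached tmux sessions.
# Session names and the `claude -p` invocation are hypothetical.
import subprocess  # used when you actually run the argv lists below

def spawn_cmd(name: str, task: str) -> list[str]:
    """argv to start an agent in a detached tmux session."""
    return ["tmux", "new-session", "-d", "-s", f"agent-{name}",
            "claude", "-p", task]

def capture_cmd(name: str) -> list[str]:
    """argv to read back the agent's pane ('what did that agent do?')."""
    return ["tmux", "capture-pane", "-p", "-t", f"agent-{name}"]

# Usage:
#   subprocess.run(spawn_cmd("tests", "run the test suite"), check=True)
#   out = subprocess.run(capture_cmd("tests"), capture_output=True, text=True)
```

A voice front end then just maps "check the test output" to `capture_cmd` and speaks the result, which is the collaborator-not-terminal shift described above.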
- a _lot_ of people still use the VS Code extension and so we're still putting energy toward keeping it polished (this becomes easier with checks : ))
- our checks product is powered by an open-source CLI (we think this is important), which we recommend for JetBrains users
- the general goal is the same: we start by building tools for ourselves, share them with people in a way that avoids creating walled gardens, and aim to amplify developers (https://amplified.dev)
The friction didn't disappear with AI tools; it just shifted. It's now more about knowing when to trust an AI system versus when to dig into things yourself. The key insight: don't devalue learning things on your own. AI is a tool, but if the tool messes up, you need other tools in your toolbox. If you've only ever leaned on the AI, you're in trouble the moment it fails on something subtle.
If you give an LLM enough context, it writes in your voice. But that requires an intelligent model and very thoughtful context development. Most people don't do this because it takes effort, arguably even more effort than just writing the damn thing yourself. It's like trying to teach another person to talk like you: very hard, because in the worst case it requires your entire life story.
Something that freaked me out a little bit is that I've now written enough online (i.e.: HN comments) that the top models know my voice already and can imitate it on request without having to be fed any additional context.
There's a data centre somewhere in the US running additions and multiplications through a block of numbers that has captured my voice.
Why does this even matter? If it can say what I wanted to say more eloquently and less stiltedly, adding some interesting nuance along the way while still sounding close to me - why not? Meanwhile, I can pick up a rhetorical trick or two from reading the result.
An analogy: for the same reason natural wood is more beautiful than plastic. Natural wood gets its beauty from little faults and irregularities. Growing a tree takes a long time, which makes the wood more valuable. A plastic facsimile can be made to look similar and is cheaper to produce, but it lacks the unique grain and quality of the real thing.
It’s not just the end product that matters. The process and intent behind its genesis matters too.