Coqui TTS is actually deprecated; the company shut down. I have a voice assistant that uses GPT-5.4 and Opus 4.6 through the subsidized Codex and Claude Code plans, with the STT and TTS portions hosted locally via mlx-audio: https://github.com/Blaizzy/mlx-audio
Here are the models I found that work well:
- Qwen ASR and TTS are really good. Qwen ASR is faster than OpenAI Whisper on Apple Silicon from my tests. And the TTS model has voice cloning support so you can give it any voice you want. Qwen ASR is my default.
- Chatterbox Turbo also does voice cloning TTS and is more efficient to run than Qwen TTS. Chatterbox Turbo is my default.
- Kitten TTS is good as a small model, better than Kokoro
- Soprano TTS is surprisingly really good for a small model, but it has glitches that prevent it from being my default
But overall the mlx-audio library makes it really easy to try different models and see which ones I like.
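To give a flavor of what "trying different models" looks like in practice, here's a tiny harness that runs each model through the mlx-audio CLI. Note the module path (`mlx_audio.tts.generate`), the flag names, and the model repo ids below are assumptions from my reading of the README, so verify them against the repo before relying on this:

```python
# Sketch: A/B different TTS models through the mlx-audio CLI.
# Module path, flags, and model ids are ASSUMPTIONS -- check the
# mlx-audio README (https://github.com/Blaizzy/mlx-audio) first.
import subprocess

MODELS = {
    "qwen-tts": "Qwen/Qwen3-TTS",           # hypothetical repo id
    "chatterbox": "ResembleAI/chatterbox",  # hypothetical repo id
    "kitten": "KittenML/kitten-tts-nano",   # hypothetical repo id
}

def tts_command(model_key: str, text: str, out_path: str) -> list[str]:
    """Build one CLI invocation so each model can be tried in turn."""
    return [
        "python", "-m", "mlx_audio.tts.generate",
        "--model", MODELS[model_key],
        "--text", text,
        "--output", out_path,
    ]

def try_all(text: str) -> None:
    """Render the same line with every model for side-by-side listening."""
    for key in MODELS:
        subprocess.run(tts_command(key, text, f"{key}.wav"), check=True)
```

Rendering the same sentence with every model and listening back-to-back is the quickest way I know to pick a default.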
Do you know which HA integration I would use if I want to try out Qwen 3 ASR in HA? Some screenshots in the OP reference Qwen 3 ASR for STT but I can't seem to find any reference to which integration I'd use.
If you want your comments to sound more human — stop using em dashes everywhere. LLMs love them — along with neat structure, “furthermore”-style transitions, and perfectly balanced paragraphs.
Humans write a bit messier — commas, short sentences, abrupt turns.
I think em-dashes were once a reliable indicator (though never proof), but recent models have been fine-tuned to use them much less. Lots of recent AI-generated writing I've seen doesn't have em-dashes. Meanwhile, I've heard many people say that they naturally use em-dashes, and were already and/or are afraid of being accused of AI; so ironically this rumor may be causing people to use their own voice less.
Before, I naturally used hyphens as if they were em-dashes. The kerfuffle over LLM use of em-dashes motivated me to figure out how to type them properly (and configure my system to make that easier). Now I even go over old writing to fix the hyphens.
The RTX 5090 only has 32 GB of VRAM. So the tradeoff: NVIDIA gives you blazing speed in a small memory pool, while Apple Silicon gives you a much larger memory pool at moderate speed.
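The tradeoff becomes concrete with back-of-envelope math. This is a weights-only rule of thumb (it ignores KV cache and activations, which add more on top), and the 10-20% overhead figure is a rough assumption:

```python
# Rough rule of thumb: weights-only memory for an LLM is
# params * bytes_per_weight; KV cache and activations add
# roughly 10-20% more on top (rough assumption, workload-dependent).
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Gigabytes needed just to hold the quantized weights."""
    return params_billions * 1e9 * (bits_per_weight / 8) / 1e9

# A 70B model quantized to 4 bits per weight:
need = weights_gb(70, 4)   # 35.0 GB of weights alone
fits_5090 = need <= 32     # RTX 5090 (32 GB VRAM): does not fit
fits_mac_64 = need <= 64   # 64 GB unified-memory Mac: fits, just slower
```

So a 4-bit 70B model simply doesn't fit on a 5090, while a 64 GB Mac holds it comfortably, which is the whole tradeoff in one line of arithmetic.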
I'm running a local voice agent on a Mac Mini M4. Qwen ASR for STT and Qwen TTS on Apple Silicon via MLX, Claude for the LLM. No API costs besides the Claude subscription but the interesting part is the LLM is agentic because it's using Claude Code. It reads and writes files, spawns background agents, controls devices, all through voice.
The insights about VAD and streaming pipelines in this thread are exactly what I'm looking at for v2. Moving to a WebSocket streaming pipeline with proper voice activity detection would close the latency gap significantly, even with local models.
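For anyone curious what the VAD half of that pipeline amounts to, here's a deliberately simplified sketch: gate fixed-size PCM frames on RMS energy and close an utterance after a run of silent frames. A real pipeline would use a trained VAD (webrtcvad, Silero) instead of a bare energy threshold, and both constants below are made-up values you'd tune per microphone:

```python
# Simplified streaming VAD: segment 16-bit PCM frames into utterances.
# A trained VAD (webrtcvad, Silero) should replace the energy gate in
# anything real; both constants here are illustrative, not tuned.
import math
import struct

SILENCE_FRAMES_TO_END = 25   # ~500 ms of 20 ms frames ends an utterance
ENERGY_THRESHOLD = 500.0     # RMS over 16-bit samples; tune per mic

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a little-endian 16-bit PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def segment_utterances(frames):
    """Yield lists of voiced frames, split on sustained silence."""
    current, silent = [], 0
    for frame in frames:
        if rms(frame) >= ENERGY_THRESHOLD:
            current.append(frame)
            silent = 0
        elif current:
            silent += 1
            if silent >= SILENCE_FRAMES_TO_END:
                yield current
                current, silent = [], 0
    if current:  # flush a trailing utterance at stream end
        yield current
```

In the WebSocket version, each yielded utterance would be shipped to the STT model the moment the silence counter trips, instead of waiting for the whole recording.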
Do you think it would be possible in the future to add developer settings to enable or disable certain features, or to switch to a more lightweight sandboxing method like Apple Seatbelt?
Yup, it uses the Apple Virtualization framework. That means I can't use Claude Cowork inside my own VMs; that's actually how I found out it runs in a VM, because it threw a nested-VM error. All it does is limit functionality, take extra disk space, and add lag. A better sandbox would be Apple Seatbelt, which is what OpenAI uses, but even that isn't perfect: https://news.ycombinator.com/item?id=44283454
I don’t have a strong opinion on how they should handle nested VMs, but I very much disagree that Seatbelt is better. Claude Code (aka `claude`) uses it, and it’s barely good for anything.
Out of curiosity, why are you running Cowork inside a VM in the first place? What does that get you that letting Cowork use its own VM wouldn’t?
OpenAI Codex CLI was able to use it effectively, so at least the AI knows how to use it. Still, it's deprecated and unmaintained; Apple needs to ship something new soon.
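For context on what Seatbelt actually looks like: it's a deny-by-default SBPL profile applied via the (deprecated, but still shipping) `sandbox-exec` binary. The rules below are a toy illustration, not a vetted policy, and the scratch path is made up:

```python
# Illustrative only: a minimal Seatbelt (SBPL) profile and the
# deprecated `sandbox-exec` invocation that applies it. These rules
# are a toy example, not a vetted or complete policy.
PROFILE = (
    '(version 1)\n'
    '(deny default)\n'
    '(allow process-exec)\n'
    '(allow file-read* (subpath "/usr/lib"))\n'
    '(allow file-read* file-write* (subpath "/tmp/agent-scratch"))\n'
)

def sandboxed_command(argv: list[str]) -> list[str]:
    """Wrap a command so macOS runs it under the inline profile above."""
    return ["sandbox-exec", "-p", PROFILE] + argv

# Usage (macOS only):
#   subprocess.run(sandboxed_command(["ls", "/tmp/agent-scratch"]))
```

The appeal is that it's just a process wrapper with no VM overhead; the criticism upthread is that a profile language this coarse is hard to make both usable and airtight.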
I've been running something similar for a few months, which is a voice-first interface for Claude Code running on a local Flask server. Instead of texting from my phone, I just talk to it. It spawns agents in tmux sessions, manages context with handoff notes between sessions, and has a card display for visual output.
The remote control feature is cool but the real unlock for me was voice. Typing on a phone is a terrible interface for coding conversations. Speaking is surprisingly natural for things like "check the test output" or "what did that agent do while I was away."
The tmux crowd in this thread is right that SSH + tmux gets you 90% of the way there. But adding voice on top changes the interaction model. You stop treating it like a terminal and start treating it like a collaborator.
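The tmux part of that setup is small enough to sketch. The session naming scheme and the `claude -p <task>` invocation below are my guesses at how such a wrapper might look, not the commenter's actual code:

```python
# Sketch: spawn and inspect background agents in detached tmux sessions.
# Session names and the `claude -p` invocation are hypothetical.
import subprocess  # used when you actually run the argv lists below

def spawn_cmd(name: str, task: str) -> list[str]:
    """argv to start an agent in a detached tmux session."""
    return ["tmux", "new-session", "-d", "-s", f"agent-{name}",
            "claude", "-p", task]

def capture_cmd(name: str) -> list[str]:
    """argv to read back the agent's pane ('what did that agent do?')."""
    return ["tmux", "capture-pane", "-p", "-t", f"agent-{name}"]

# Usage:
#   subprocess.run(spawn_cmd("tests", "run the test suite"), check=True)
#   out = subprocess.run(capture_cmd("tests"), capture_output=True, text=True)
```

A voice front end then just maps "check the test output" to `capture_cmd` and speaks the result, which is the collaborator-not-terminal shift described above.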
- a _lot_ of people still use the VS Code extension and so we're still putting energy toward keeping it polished (this becomes easier with checks : ))
- our checks product is powered by an open-source CLI (we think this is important), which we recommend for JetBrains users
- the general goal is the same: we start by building tools for ourselves, share them with people in a way that avoids creating walled gardens, and aim to amplify developers (https://amplified.dev)
The friction didn't disappear with AI tools; it just shifted. It's now more about knowing when to trust an AI system versus when to dig into things yourself. The key insight: don't devalue learning things on your own. AI is a tool, but if the tool messes up, you need other tools in your toolbox. If you've only ever leaned on the AI, you're in trouble the moment it fails on something subtle.
If you give an LLM enough context, it writes in your voice. But that requires an intelligent model and very thoughtful context development. Most people don't do this because it takes effort, arguably even more effort than just writing the damn thing yourself. It's like trying to teach another person to talk like you: very hard, because in the worst case it requires your entire life story.
Something that freaked me out a little bit is that I've now written enough online (i.e.: HN comments) that the top models know my voice already and can imitate it on request without having to be fed any additional context.
There's a data centre somewhere in the US running additions and multiplications through a block of numbers that has captured my voice.
Why does this even matter? If it can say what I wanted to say more eloquently and less stiltedly, adding some interesting nuance along the way while still sounding close to me - why not? Meanwhile, I can pick up a rhetorical trick or two from reading the result.
An analogy: for the same reason natural wood is more beautiful than plastic. Natural wood gets its beauty from little faults and irregularities. Growing a tree takes a long time, which makes the wood more valuable. A plastic facsimile can be made to look similar and is cheaper to produce, but it lacks the unique grain and quality of the real thing.
It’s not just the end product that matters. The process and intent behind its genesis matters too.