With an open system/engine, you can train your own personal speech model. For kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), you can do so without all that much difficulty, although the process/documentation could certainly use improvement.
I bootstrapped my personal speech model by recording the commands I spoke while using WSR. My voice is quite abnormal, and it took only 10 hours of speech data to train a model orders of magnitude more accurate than any generic model I've ever used. And of course, I record much of my usage now with Kaldi, so my model improves more and more over time. A virtuous flywheel!
"Everything other than talon has terrible latency": False! I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend, which has extremely low latency. You can adjust how aggressive the VAD (voice activity detection) is to suit your preference, but the speech engine latency can be almost negligible, especially for voice commands (vs prose dictation). However, I agree that "most existing speech recognition engines were not designed with the kind of latency you want for quick one syllable commands", and that low latency is pivotal to being productive with voice commands. I also agree with your other points.
I built a similar app using Kaldi's nnet3 models running embedded; it was so responsive that our demo to an SVP went sideways: when he gave a query, the app responded almost immediately after the sentence ended. The SVP did not realize it had already responded, because the expectation for voice interaction systems was a 2-5 second wait for an answer, which gave the impression that the system was not working properly.
So, the moral of the story: if you do too good a job of making a fast speech engine, especially for multi-turn dialogues, add some delay so it resembles human dialogue more closely.
I have been coding entirely by voice for approximately 10 years now (by hand long before that). Most of that time I have been using the Dragonfly (https://github.com/dictation-toolbox/dragonfly) library to construct my own customized voice coding system. The library is highly flexible and open source, allowing you to easily customize everything to suit what you need to be productive. It is perhaps the power user analogue to Dragon NaturallySpeaking. With it, you can certainly be highly productive coding by voice. However, it does require work to set up and customize to suit you, so it isn't really for the "general population" of computer users to just sit down and use. With regard to accuracy of speech recognition, being open allows you (with sufficient motivation) to train a custom acoustic speech model that recognizes your voice specifically, extremely well.
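To give a flavor of the command-grammar idea, here is a toy sketch in plain Python. The names are invented for illustration; Dragonfly's real API uses Grammar/Rule classes with action objects such as Key() and Text(), but the core concept is the same: spoken phrases, some with captured parameters, mapped to editor actions.

```python
# Toy sketch of a voice-command grammar: map recognized utterances to
# keystroke actions. Illustrative only; not Dragonfly's actual API.

COMMANDS = {
    "save file": "<ctrl-s>",
    "new line": "<enter>",
    "jump top": "<ctrl-home>",
}

def dispatch(utterance):
    """Return the action for a recognized utterance, or None.
    One parameterized command, "line <n>", shows how grammars
    capture spoken arguments."""
    words = utterance.split()
    if len(words) == 2 and words[0] == "line" and words[1].isdigit():
        return "<goto-line:%s>" % words[1]
    return COMMANDS.get(utterance)

print(dispatch("save file"))  # <ctrl-s>
print(dispatch("line 42"))    # <goto-line:42>
```

A real system has hundreds of such rules, plus continuous dictation mixed in, and the engine constrains recognition to the active grammar, which is a large part of why command accuracy can be so high.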
Regarding the software packages you referenced: Yes, Dragon is trash that I want nothing to do with, because of its inefficient interface, its complete inability to accurately understand my voice, and its generally shoddy software quality. Voice Computer (which I hadn't seen before) is therefore eliminated as well, though it doesn't look terrible as a front end to Dragon to better use the OS GUI-accessibility info. Many people like Talon, but I demand something open, which I can modify to suit my needs.
Yep, I have been using my Kaldi backend through Dragonfly exclusively ever since I got v0.1.0 working.
I bootstrapped writing it initially using the Dragonfly WSR (Windows Speech Recognition) backend, because that gave me the best accuracy of the available options at the time. All of my development since the initial working version has been done using each previous version, so it is now essentially self-bootstrapped. My productivity skyrocketed once I switched to Kaldi, due to being able to use a custom speech model trained just for my voice, with orders of magnitude better accuracy plus dramatically lower latency. (And it freed me from being dependent on closed software out of my control.)
I have been coding entirely by voice for approximately 10 years now (by hand long before that), most of that time using the Dragonfly (https://github.com/dictation-toolbox/dragonfly) library to construct my own customized voice coding system. In fact, I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend usable by Dragonfly, entirely by voice. There's also a community of voice coders using Dragonfly and other tools that build on top of it, such as Caster (https://github.com/dictation-toolbox/Caster).
I want to point out that the Dragonfly library has backends other than the closed/commercial Dragon, and there has been significant progress expanding beyond it recently. I have been developing kaldi-active-grammar [0] for the past couple of years as a fully open source, modifiable, cross-platform, and free alternative to Dragon when used with Dragonfly. While KaldiAG lacks some of Dragon's user-friendly niceties, those matter least when coding by voice. And KaldiAG has benefits, such as lower latency, that improve coding by voice tremendously.
I created KaldiAG because I didn't trust relying on closed source software for something so crucial to my productivity, where a decision by an outside party determines whether I can function. As a bonus, open source means I can make it work better to fit my needs than closed source ever could.
You raise good points. For what it's worth, I think all "invalidated" samples are still included in the distribution (invalidated.tsv), with the number of up and down votes for each (but not the reasoning).
Google is certainly doing some great work with this, both Project Euphonia and other research [0]. However, as far as I know, the Euphonia dataset is closed and only usable by Google. A Common Voice disordered speech dataset would (presumably) be open to all, allowing independent projects and research. (I would love to have access to such a dataset.)
Collecting and publishing such a dataset would be great, and would certainly improve the baseline speech recognition for people with disordered speech, but it can only help so much without personalizing to a specific individual. This is true for anybody, but more so for disordered speech. This is an area where I think "generic" solutions will inevitably struggle, even if they are somewhat specialized for "generic dysarthric" speech.
However, this means that the gains to be had from personalized training are greater for disordered speech than for "average" speech. I develop kaldi-active-grammar [0], which specializes the Kaldi speech recognition engine for real-time command & control with many complex grammars. I am also working on making it easier to train personalized speech models, and to fine-tune generic models with training data from an individual. I have posted basic numbers from some small experiments [1]. Such personalized training can be time consuming (depending on how far one wants to take it), but as my parent comment says, disabled people may need to rely more on ASR, which means they have that much more to gain by investing the time in training.
Nevertheless, a Common Voice disordered speech dataset would be quite helpful, both for research, and for pre-training models that can still be personalized with further training. It is good to see (in my sibling comment) that it is being discussed.
I develop kaldi-active-grammar [0]. The Kaldi engine itself is state of the art and open source, but is focused on research rather than usability. My project has a simple interface and comes with a pretty good open source speech model.
However, kaldi-active-grammar specializes in real time command and control, with advanced features that don't really apply to your use case. Vosk [1] is probably a simpler, better fit for you. It likewise uses Kaldi and can use my models, and offers some others of its own as well.
Neither are particularly focused on transcription per se, but they are open.
https://github.com/daanzu/kaldi-active-grammar
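For transcribing recorded audio with Vosk, the usage looks roughly like this. This is a sketch based on the standard vosk-api examples; it assumes `pip install vosk` plus a separately downloaded model directory, and the 4000-frame chunk size is just a conventional choice.

```python
import json
import wave

def transcribe(wav_path, model_path):
    """Transcribe a 16-bit mono WAV file with Vosk.
    Requires the `vosk` package and a model directory; both are
    external to this sketch, so the import is deferred."""
    from vosk import Model, KaldiRecognizer

    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(Model(model_path), wf.getframerate())
    pieces = []
    while True:
        data = wf.readframes(4000)  # feed audio in chunks
        if not data:
            break
        if rec.AcceptWaveform(data):
            # An utterance boundary was detected; collect its text.
            pieces.append(json.loads(rec.Result())["text"])
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

The same streaming loop works for live microphone audio; you just feed captured chunks instead of reading from a file.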