
Huh, spaces. There's way too much software, especially on Windows, that breaks when there are Cyrillic characters in a path. I'll let you guess how I found out.


I had a really odd one last year where a Grave I (well-known brand name) got converted by Office/Excel into a Double Grave I.

The Double Grave I is used in some obscure Orthodox religious texts.


A friend had the username "Rubén" and jfc it broke everything other than Windows itself xD


The problem isn't the Cyrillic or the é but the fact that Windows lets you put those characters in file names in non-Unicode encodings which will create sequences of bytes which are invalid UTF-8. It's 2021, FFS, stop using legacy encodings.
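
To make the "invalid UTF-8" point concrete, here is a minimal sketch. It assumes the name reaches a program as cp1252 bytes, which is only one of several ways this can happen; `is_valid_utf8` is a helper made up for the illustration.

```python
# Sketch: the same name as legacy code page bytes vs. UTF-8 bytes.
# Assumption: the legacy code page is cp1252 (Western European).
name = "Rubén"

legacy_bytes = name.encode("cp1252")  # é becomes the single byte 0xE9
utf8_bytes = name.encode("utf-8")     # é becomes the two bytes 0xC3 0xA9


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(legacy_bytes, is_valid_utf8(legacy_bytes))
print(utf8_bytes, is_valid_utf8(utf8_bytes))
```

The cp1252 bytes fail because 0xE9 starts a three-byte UTF-8 sequence and the following `n` is not a continuation byte.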


All Win32 functions that accept or return strings come in two varieties, with A and W suffixes, e.g. MessageBoxA/MessageBoxW. The A variant works with the system default 8-bit code page (cp1251 in the case of Cyrillic); the W variant works with Unicode in wide chars (UTF-16). There shouldn't be much of a problem with string handling if you stick exclusively with the W functions.
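
A rough way to see the difference without touching the Win32 API at all: the sketch below only simulates what each variant would receive by encoding a path in Python. The path and `ansi_round_trip` are made up for the example, and real A functions may substitute characters differently than Python's `errors="replace"` does.

```python
# Sketch: simulate the byte-level view of A vs. W functions.
# Assumption: "A" means the system code page, "W" means UTF-16LE.
path = "C:\\Users\\Иван\\notes.txt"  # hypothetical path with a Cyrillic user name

ansi_cyrillic = path.encode("cp1251")  # A view on a system whose code page is cp1251
wide = path.encode("utf-16-le")        # W view, independent of the system code page


def ansi_round_trip(text: str, codepage: str) -> str:
    """Hypothetical helper: push text through a code page, replacing what does not fit."""
    return text.encode(codepage, errors="replace").decode(codepage)


print(ansi_round_trip(path, "cp1251"))   # survives on a Cyrillic code page
print(ansi_round_trip(path, "cp1252"))   # Cyrillic letters are lost on a Western one
print(wide.decode("utf-16-le") == path)  # the wide form round-trips regardless
```

The point is that the A view only works when the system code page happens to cover the characters in the path, while the W view does not depend on that setting.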


Using the W functions has been the advice from Microsoft's documentation for ages. But people still use the A functions because they're easier, especially when writing cross-platform software since Windows is the only major OS that made the unfortunate choice of having the base character type 16 bits wide.

Fortunately the future of the Windows API does look better: Microsoft added proper UTF-8 support in Windows 10 version 1903. All you have to do is request it in the application manifest and the A functions will accept and return UTF-8.


> since Windows is the only major OS that made the unfortunate choice of having the base character type 16 bits wide

Apple OSes use something they call "unichar" inside NSStrings. I'm not 100% sure what it is, but it feels like it's the same 16-bit wide character.


It's possible! It seemed like a sensible choice back in the early 90s when the answer to making a system for global use was UCS-2. I know Java was another one that went with that decision.
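
For anyone wondering why 16 bits stopped being enough: a quick sketch that counts UTF-16 code units per character. `utf16_code_units` is a made-up helper, and the sample characters are arbitrary.

```python
# Sketch: count how many 16-bit code units a string needs in UTF-16.
def utf16_code_units(text: str) -> int:
    """Hypothetical helper: number of 16-bit units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-le")) // 2


for ch in ["é", "Ж", "😀"]:
    print(repr(ch), utf16_code_units(ch))
```

Characters in the Basic Multilingual Plane take one unit, which is all UCS-2 could represent. Anything above U+FFFF, such as the emoji here, needs a surrogate pair, so a 16-bit "character" is really a code unit.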


I would rather they added a U suffixed version and better still backported that all the way to Win 7. Now in 3-7 years people can write programs that use the A functions, but have to check the version of Windows and refuse to run if it isn't new enough.


There’s been some talk of repurposing the A variants to work on UTF-8.


> All you have to do is request it in the application manifest and the A functions will accept and return UTF-8.

They really should have gone with WTF-8 [0] since the W functions generally accept WTF-16 and not just the valid UTF-16 subset.

[0] https://simonsapin.github.io/wtf-8/
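
A small sketch of the problem WTF-8 addresses, using Python's `surrogatepass` error handler as a stand-in. This is not a WTF-8 implementation: it produces the same bytes as WTF-8 for an unpaired surrogate, but unlike WTF-8 it would also encode a surrogate pair written as two escapes into two three-byte sequences. The file name is made up.

```python
# Sketch: a name containing an unpaired surrogate, which W functions generally
# tolerate, has no strict UTF-8 representation.
name = "bad\ud800name"  # hypothetical file name with a lone high surrogate

try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False  # strict UTF-8 refuses lone surrogates

# surrogatepass writes the surrogate as a three-byte sequence instead of failing.
relaxed = name.encode("utf-8", errors="surrogatepass")
round_tripped = relaxed.decode("utf-8", errors="surrogatepass")

print(strict_ok)
print(relaxed)
print(round_tripped == name)
```

So a strictly UTF-8 A function cannot name every file that a W function can, which is the gap a WTF-8-style encoding would close.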



