The HTML5 person's crawler will parse some significant fraction of real websites, and the XHTML one won't, because people write HTML5 and not XHTML, even if you as a tool vendor would greatly prefer otherwise.
And yet, forcing people to implement opengraph tags, forcing people to drop flash, forcing people to use HTTPS, forcing people to drop Symantec certs, forcing people to drop SHA1 certs – so often the actors behind the WHATWG have managed to get website authors to change what they use.
There' s a Big difference between "every major website" and "the Web".
"every major website" means 100 companies with skilled developer who can and will react to changes in browsers quickly.
"The web" consists of millions of websites maintained by individuals and small organisations who have no resources to update the way their web pages are coded every year. It contains HTML generated 10, 20, soon 30 years ago. It contains that one app in your intranet with the table layout that you can't replace and that IOT thing you connected to your home Wi-Fi 7 years ago that has no way of upgrading its web interface.
A browser that looses access to "the web" is worst than useless.