I mainly compared it to the native Explorer agents (for example, the one in Claude Code). So far it has beaten the Explorer agents in 98 of 100 cases. I'm already working on a bigger benchmark, but I haven't had much time for it yet. You're welcome to test it out, though :)