Super weird benchmarks

avereveard · 2025-05-21T17:36:32 1747848992

From what I gather, it's fine-tuned to use OpenHands specifically, so it shows value on those benchmarks that target a whole system as a black box (i.e. agent + LLM) more than on ones that directly target the LLM's inputs/outputs.

amarcheschi · 2025-05-21T22:54:35 1747868075

Yup, the first comment says this: https://www.reddit.com/r/LocalLLaMA/comments/1kryybf/mistral...