I have preferred synthetic datasets since the first day I heard about distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of any performance loss (in the speech and language areas).
Could you share some recent articles or papers comparing the two methods, especially for the language modelling case?
I was not convinced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than the teacher.
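To make the two options concrete, here is a rough sketch of the per-token losses being compared. It is only an illustration under my own assumptions: the function names and the toy logits are made up, and the T^2 scaling follows the original Knowledge Distillation paper. The soft-logit route needs the teacher's full output distribution at training time, while the synthetic-dataset route only needs the token the teacher emitted.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over a list of logits.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the original Knowledge Distillation paper.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def hard_label_loss(student_logits, teacher_token):
    # Plain cross-entropy against a token produced by the teacher,
    # i.e. what training on a synthetic dataset amounts to.
    q = softmax(student_logits)
    return -math.log(q[teacher_token])

# Hypothetical toy logits over a 3-token vocabulary.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(soft_target_loss(student, teacher))
print(hard_label_loss(student, teacher_token=0))
```

The friction difference shows up in what has to be stored or served: the first loss needs a vocabulary-sized vector per token from the teacher, the second needs one integer.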
Radio Yerevan: A listener asks: "Is it true that in Moscow, on Red Square, they are giving away cars?"
Our answer: "Yes, it is true. Except it isn't in Moscow, but in Leningrad. And it isn't on Red Square, but on Palace Square. And they aren't cars, but bicycles. And they aren't giving them away, they are stealing them."
I really hate the modern time schedule. It's a nightmare to have been forced to get up at 6am or 7am every workday since childhood. The only relief is waking up naturally on weekends.
The proof will be friendlier to today's programmers if we treat all "Gödel numbers" as the bytecode of a programming language.
It's trivial that functions like "prove" and "subst" can be implemented on top of capabilities like bytecode parsing and expression tree manipulation.
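As a rough illustration of the "subst" half, here is a minimal sketch. Everything in it is my own assumption rather than Gödel's actual encoding: formulas are nested tuples, the "bytecode" is just their serialized text, and variable binding by quantifiers is ignored. "prove" would be a proof checker built on the same decode step and is not shown.

```python
import ast

def encode(tree) -> bytes:
    # Hypothetical "bytecode": the UTF-8 text of the tuple literal.
    return repr(tree).encode("utf-8")

def decode(code: bytes):
    # Parse the bytecode back into an expression tree.
    return ast.literal_eval(code.decode("utf-8"))

def replace_var(tree, name, value):
    # Replace every ("var", name) node with value (no binder handling).
    if tree == ("var", name):
        return value
    if isinstance(tree, tuple):
        return tuple(replace_var(child, name, value) for child in tree)
    return tree

def subst(code: bytes, name: str, arg: bytes) -> bytes:
    # Substitute a constant denoting the bytecode `arg` for variable `name`.
    return encode(replace_var(decode(code), name, ("const", arg)))

# Diagonalization: feed a formula its own bytecode.
formula = encode(("eq", ("var", "x"), ("const", b"abc")))
print(decode(subst(formula, "x", formula)))
```

The last two lines are the step the proof hinges on: a formula receives its own code as an argument, which is just a program reading its own source.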