There are many ways to measure the size of a benchmark. One important measure is the size of the code that accounts for 99% of execution time. If your codebase is a million lines but the hotspot is a thousand lines, the benchmark result is sensitive to optimization quirks, and in some sense the benchmark is small.
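To make that measure concrete, here is a rough sketch of how one might compute it from a flat profile. The `ProfileEntry` struct and the `hot_lines` function are hypothetical names made up for illustration, not part of any profiler's API, and the sketch assumes you already have per-function line counts and self times from whatever tool you use.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One row of a hypothetical flat profile: a function, its size in source
// lines, and the time spent in it (self time, any consistent unit).
struct ProfileEntry {
    std::string name;
    std::size_t lines;
    double seconds;
};

// Greedy estimate of the "effective size" of a benchmark: take functions
// hottest first until their combined time reaches `coverage` (e.g. 0.99)
// of the total, and return how many source lines those functions contain.
std::size_t hot_lines(std::vector<ProfileEntry> profile, double coverage) {
    // Sort by time spent, hottest function first.
    std::sort(profile.begin(), profile.end(),
              [](const ProfileEntry& a, const ProfileEntry& b) {
                  return a.seconds > b.seconds;
              });

    double total = 0.0;
    for (const auto& e : profile) total += e.seconds;

    double covered = 0.0;
    std::size_t lines = 0;
    for (const auto& e : profile) {
        if (covered >= coverage * total) break;  // target share reached
        covered += e.seconds;
        lines += e.lines;
    }
    return lines;
}
```

With a made-up profile where a 1,000-line kernel takes 99.5 of 100 seconds and the remaining 999,000 lines share the rest, `hot_lines(profile, 0.99)` returns 1000, which is the sense in which a million-line benchmark can be "small".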
Anecdotal, but I've seen similar improvements over g++ in my code.