At some point we might need to cross-reference (X-ref) language models. This may involve using a dedicated data interchange format - perhaps an extensible markup language for language model cross-referencing. Or, in short, LMX-XML.
I know it is irrational - compilers already perform optimizations that are non-intuitive - but my lizard brain really doesn't like the idea of ML possibly changing semantics at a layer that is unobservable.
This sort of work isn't going "i dunno, emit some new instructions based on a model and hope it is correct." ML techniques fall into one of two categories:
1. Decisions that have no semantic effect on the program but affect performance. These are things like code layout or register allocation choices that are more cache-friendly. This is what I expect to show up in industrial optimizing compilers more and more over the next bunch of years.
2. Decisions where you can prove that the new code sequence is semantically identical to the original code sequence via some formal technique. This is basically the same as various superoptimization techniques but with a new search strategy and therefore no new concerns about correctness. I'd expect this to remain niche and only used offline to generate particularly optimized sequences in extremely hot and small code paths.
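As a toy illustration of category 2 (hypothetical sketch, not from the thread): search a tiny space of candidate expressions for an equivalent of a target function, and accept a candidate only after exhaustively verifying it on every input. For a fixed-width domain, exhaustive checking is itself a brute-force proof of semantic equivalence, so the search strategy - ML-guided or otherwise - cannot introduce correctness concerns.

```python
# Toy superoptimizer sketch: the search strategy can be anything (including an
# ML model proposing candidates); correctness comes only from the verifier.

MASK = 0xFF  # model 8-bit machine arithmetic

def target(x):
    return (x * 2) & MASK  # "expensive" reference: a multiply

# candidate building blocks a search might propose: (name, function)
OPS = [
    ("x + x",  lambda x: (x + x) & MASK),
    ("x << 1", lambda x: (x << 1) & MASK),
    ("x | 1",  lambda x: (x | 1) & MASK),
    ("x ^ x",  lambda x: 0),
]

def equivalent(f, g):
    """Exhaustive equivalence check over the whole 8-bit domain."""
    return all(f(x) == g(x) for x in range(256))

def superoptimize():
    """Return every candidate that provably matches the target."""
    return [name for name, f in OPS if equivalent(f, target)]

print(superoptimize())  # both the add and the shift survive verification
```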
I think some functions with exhaustively enumerable input domains could be good candidates as well (somewhat rare, I guess?). For example an Int->(...) or Float->(...) function (Doubles are not exhaustively enumerable, though, since 2^64 ~= 10^19 values - about 10,000 peta-items). If the programmer can give an error threshold, e.g. in bits of precision (maximum relative error), that should work quite well too.
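A minimal sketch of that idea, using half precision as the enumerable domain (only 2^16 bit patterns): check a candidate against a reference on every input and report the worst-case relative error in bits. Here the stand-in "optimized" code is just the Taylor cutoff sin(x) ≈ x - x³/6, checked on all half-precision x in (0, π/4].

```python
# Exhaustive verification over an enumerable float domain, with the error
# threshold expressed in bits of precision (maximum relative error).

import math
import struct

def all_f16():
    """Enumerate every finite half-precision value via its bit pattern."""
    for bits in range(1 << 16):
        (x,) = struct.unpack("<e", struct.pack("<H", bits))
        if math.isfinite(x):
            yield x

def approx_sin(x):
    return x - x ** 3 / 6.0  # stand-in "optimized" implementation

def worst_case_bits(lo=0.0, hi=math.pi / 4):
    """Smallest number of correct bits over the whole enumerable domain."""
    worst = 0.0
    for x in all_f16():
        if lo < x <= hi:
            rel = abs(approx_sin(x) - math.sin(x)) / math.sin(x)
            worst = max(worst, rel)
    return -math.log2(worst)

print(f"worst case: {worst_case_bits():.1f} correct bits")
```

If the programmer's threshold is, say, 7 bits, the checker accepts; ask for 12 bits and it rejects. The same loop works for any Int16-sized or smaller domain.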
There are a huge number of these things in compilers. I listed two: code layout and register allocation. And you've got plenty of other things like inlining, outlining, and unrolling decisions.
We "just" need to get the models outputting machine-checkable proofs that semantics are preserved. I wonder what work like this is going on in the math scene.
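In miniature, the kind of machine-checkable statement this would mean (a hypothetical sketch in Lean 4, not anyone's actual pipeline): a peephole rewrite plus a proof, checked by the kernel, that it preserves semantics. Here the proof is discharged by `omega`, Lean's built-in decision procedure for linear integer arithmetic - a model would only need to *produce* such a script, and the checker does the trusting.

```lean
def original  (x : Nat) : Nat := x * 2      -- "expensive" multiply
def optimized (x : Nat) : Nat := x + x      -- strength-reduced form

theorem rewrite_preserves_semantics (x : Nat) :
    optimized x = original x := by
  unfold original optimized
  omega
```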
Yeah, funny enough I was reading the “coding machines” short story yesterday after someone linked it and I had a bit of a giggle imagining the sentient compiler here.
My research (ML for binary function recognition) uses obfuscation and diverse compilation for data augmentation.
Intuitively, obfuscation is a form of anti-optimization. My hope is that it's differentiable, so perhaps given enough knowledge of obfuscation, an ML model can make corresponding optimizations to "undo" them, then for more performance, apply those same un-obfuscations to code that hasn't already been obfuscated.
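A concrete (hypothetical) example of that intuition: mixed-boolean-arithmetic (MBA) obfuscation rewrites a cheap add into an opaque xor/and expression. "Undoing" it means recognizing the pattern and rewriting it back - and over an 8-bit domain the round trip can be verified exhaustively, so the un-obfuscation is provably safe.

```python
# MBA obfuscation as "anti-optimization", and an exhaustive check that the
# rewrite (in either direction) preserves semantics everywhere.

def plain_add(x, y):
    return (x + y) & 0xFF

def obfuscated_add(x, y):
    # classic MBA identity: x + y == (x ^ y) + 2 * (x & y)
    return ((x ^ y) + 2 * (x & y)) & 0xFF

# check all 2^16 input pairs
assert all(
    plain_add(x, y) == obfuscated_add(x, y)
    for x in range(256)
    for y in range(256)
)
print("MBA rewrite verified on all 65536 input pairs")
```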
I think this is a great line of research as it may solve the reasoning portion of AGI. The difficulty is that you would want to prove equivalence. LLMs can output stuff but for programs to execute you need to output correct stuff.
Most of the work I have read in this field is focused on finding potential malware or stolen copyrighted code, so false positives are no big deal. But you don't want anything less than 100% valid code coming out of your compiler.
There is a mathematical description of compilers I read once (I think it was a survey paper) that I found very clear, but I've never been able to find it since. It basically defined a compiler as a type of function. Any good suggestions?
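I don't know which paper that was, but the standard formalization looks roughly like this (a sketch in Lean 4): a compiler is just a function between program representations, and it is correct when compiling commutes with the semantics.

```lean
def Correct {S T M : Type}
    (semS : S → M)            -- source-language semantics
    (semT : T → M)            -- target-language semantics
    (compile : S → T) : Prop :=
  ∀ p : S, semT (compile p) = semS p
```

Real verified-compiler developments (CompCert, for instance) weaken the equality to a refinement relation to handle nondeterminism, but the shape is the same.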