I think a big problem looming on the horizon for using LLMs to help with code is the same "confidently wrong" tone they have when used as a general search engine. I've seen people blindly follow what the LLM told them when, if you'd just read it, it was obviously wrong code (or in one case, a broken JSON config file). It reminds me of the copy/paste problem from Stack Overflow, except there's no voting or feedback from others to signal a more correct answer.
LLMs sound so sure of themselves, and people think "well, I'm dealing with the most advanced technology ever, so it must be right...".
On the implementation side, it's hard for me to get the non-deterministic aspect of LLMs right in my head. I put an LLM-plus-RAG system into production with a team and went through rounds of the usual testing. It would pass 99 times and fail on test 100, so you'd adjust the system prompt. Then it would pass 500 times and fail at 501. Adjust the system prompt again, and it would pass 9 times and fail at 10. That system went to production, but there's a low-level worry in my mind: when is it going to fail to give the correct output? The fact that you can never guarantee the output of an LLM for a given input severely limits where they should be used, IMO. I don't think it's wise to have the output of an LLM be the input to another program; there's no functional relationship between domain and range with an LLM.
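If LLM output does have to feed another program, the best mitigation I know of is wrapping the call in a deterministic validator: parse the output, check it against the shape the downstream code expects, and retry or fail loudly rather than pass garbage along. A minimal sketch, with all names hypothetical and the LLM call stubbed out:

```python
import json

REQUIRED_KEYS = {"sentiment", "confidence"}


def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call (hypothetical).

    A real model may return malformed JSON, extra prose, or
    a different shape entirely on any given call.
    """
    return '{"sentiment": "positive", "confidence": 0.9}'


def guarded_call(prompt: str, max_retries: int = 3) -> dict:
    """Only hand downstream code output that parses and matches the schema.

    This doesn't make the model deterministic; it just narrows the
    failure mode from 'arbitrary text reaches the next program' to
    'raises an error after N attempts'.
    """
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
            return data  # shape is valid; safe to pass along
    raise ValueError(f"no valid output after {max_retries} attempts")
```

The guard turns an unbounded failure into a visible one, which is about the strongest guarantee you can layer on top of a non-deterministic component.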
That problem is usually met with "well, a human would make the same mistake...", but the reason computers exist is to do long, tedious lists of tasks/instructions very fast that humans get wrong. Simulating a human, imperfections and all, with digital logic seems contradictory to me.
edit: Also, just want to point out that the "testing" mentioned in my post was all done manually by humans. You can't automate testing the response of an LLM unless you use another model to grade the response as correct or not, but then you're right back to not being able to trust that the grader will always act consistently.