I have the same problem conceiving how that can possibly work on a probabilistic basis, especially when the code is so well structured and documented.
However, I also concede that I have about the same understanding of how LLMs work on a detailed level as my three-year-old grandson has of the inner workings of the laptop I am using to write this.
We are a few levels above simple probabilistic predictions these days (although the arguments about the core still being probabilistic and potentially unreliable are still valid).
At the simplest level, when you type "pr" it is almost 100% probable that you are going to type something like print("some string: ", some_value). That's basic autocomplete as we had it a few years ago.
Now if I type a comment such as # here we compute the area of the rectangle, it is easy for the LLM to predict that what I want is
def ComputeRectangleArea(width, height):
    area = width * height
    return area
In this case, it doesn't predict the next characters one by one; it predicts that you will want to use a code structure it has already seen. (This is why copyrighted, non-open-source code sometimes appears verbatim: it's the same mechanism that leads to NYT article fragments being quoted, etc.)
Putting a larger program together is just more of the same. A request like "query our database for the dimensions of rectangles we have in stock, compute their area and price, match with the customer request, create a PDF with an offer" is just standard blocks assembled together.
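To make that concrete, here is a rough sketch of what I mean by standard blocks. Every helper name and value here is invented for the illustration; the real thing would talk to an actual database and a PDF library:

# Hypothetical stand-ins for the "standard blocks" the model has seen thousands of times
def query_rectangle_stock():
    # "query our database" - here just hard-coded sample rows
    return [{"id": 1, "width": 2.0, "height": 3.0}, {"id": 2, "width": 1.0, "height": 4.5}]

def match_to_request(rectangles, requested_area):
    # "match with customer request" - pick the closest area
    return min(rectangles, key=lambda r: abs(r["area"] - requested_area))

def assemble_offer(requested_area, price_per_unit_area):
    rectangles = query_rectangle_stock()
    for r in rectangles:
        r["area"] = r["width"] * r["height"]              # "compute their area"
        r["price"] = r["area"] * price_per_unit_area      # "... and price"
    best = match_to_request(rectangles, requested_area)
    # stands in for the "create a PDF with an offer" step
    return f"Offer: rectangle {best['id']}, {best['area']} m2 at {best['price']:.2f}"

print(assemble_offer(requested_area=6.5, price_per_unit_area=12.0))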
Another step that may come into play is "reasoning": if in the above request I forgot to mention where the price per unit of area comes from, it may prompt you, because it "realized" it could not complete a step such as "units of area * price per unit of area" that it has seen before / memorized.
Then it will typically run the program/script you asked it to deliver in a virtual environment and deal with any errors it encounters. If, for example, it gets the "you can't multiply a number with a string" message because it retrieved a string from the db, it will retrieve the pattern of converting strings to numbers from its training data, adjust the program with the missing conversion, and deliver it to you when it is satisfied. And in doing so it may "notice" that such a query + conversion is usually accompanied by a sanity check (is the data available, can it be converted, etc.), and you suddenly get 200 lines of "clean" code with warnings, exceptions, etc.
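The kind of defensive pattern I mean looks roughly like this (the function name and values are made up for the example):

def parse_price(raw_value, default=None):
    """Convert a price retrieved from the DB (often stored as text) to a float."""
    if raw_value is None or str(raw_value).strip() == "":   # is the data available?
        print("warning: missing price, using default")
        return default
    try:
        return float(raw_value)                             # can it be converted?
    except ValueError:
        raise ValueError(f"price {raw_value!r} is not a number")

print(parse_price("12.50"))   # 12.5
print(parse_price("", 0.0))   # warns, returns 0.0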
It is quite possible, in the example I gave above, that the probabilistic nature reared its ugly head in my 5000 lines of code that were built around plane trigonometry when they should have used spherical trigonometry. Whenever a user asks for stuff involving distances and angles, plane trigonometry is probably over-represented (more probable) in the training data. But what that also shows is that the model doesn't have any deep understanding of what it is doing unless it has been explicitly prompted, or has done extensive grounding such as externally checking that what I want it to do, even if I haven't explicitly stated so because it is obvious to me, usually happens on the surface of a sphere.
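To give an idea of what that mistake means in practice, here is a small illustration (not my actual code) of the planar shortcut versus the spherical great-circle (haversine) distance; over an ocean-scale distance the two disagree by a couple of hundred kilometres:

import math

EARTH_RADIUS_KM = 6371.0

def planar_distance_km(lat1, lon1, lat2, lon2):
    # Treats latitude/longitude as flat x/y coordinates: fine over a few km, increasingly wrong at scale
    dx = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    dy = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(dx, dy)

def great_circle_distance_km(lat1, lon1, lat2, lon2):
    # Haversine formula: distance along the surface of the sphere
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))

# Paris to New York: the planar answer is off by a couple of hundred kilometres
print(planar_distance_km(48.86, 2.35, 40.71, -74.01))
print(great_circle_distance_km(48.86, 2.35, 40.71, -74.01))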
In another example I had a script that ran an extremely complex model around which I wanted a user interface.
I first asked it to analyze the code and it correctly noted that I have some very heavy computations involving GPU and multiprocessing.
It did suggest valid optimizations (because it has seen hundreds of similar code segments, many of them better than mine).
Then it created a perfectly functional user interface around it (user interface code is the most boring, predictable and repetitive code possible).
The problem was that for each slider it added for parameter definitions, it implemented "live updates". Moving a slider from 0.8 to 0.9 would not only step granularly through 0.800, 0.801 and so on, but also trigger a complete recomputation of the model at every step. Again, even though the model had noted that the computations were extremely heavy, it did not "understand" that updating the value live was a very bad idea. This probably comes from the over-representation of the "slider with a live effect" pattern in its training data; in other words, its probabilistic nature predicted that live updates were the most probable thing I wanted.
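As an illustration only (I use tkinter here just because it is in the Python standard library, not because it is what my UI was built with), the two ways of wiring a slider look roughly like this:

import tkinter as tk

def run_heavy_model(value):
    # stands in for minutes of GPU + multiprocessing work
    print(f"recomputing model for parameter = {value}")

root = tk.Tk()

# "Live update" anti-pattern: the command fires on every tick of the drag,
# so dragging from 0.8 to 0.9 triggers the heavy computation dozens of times
live_slider = tk.Scale(root, from_=0.0, to=1.0, resolution=0.001, orient="horizontal",
                       command=lambda v: run_heavy_model(float(v)))
live_slider.pack()

# Safer pattern: only recompute when the user releases the slider
deferred_slider = tk.Scale(root, from_=0.0, to=1.0, resolution=0.001, orient="horizontal")
deferred_slider.bind("<ButtonRelease-1>", lambda e: run_heavy_model(deferred_slider.get()))
deferred_slider.pack()

root.mainloop()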
Now, if I am talking with a human who has no programming experience and tell him "in this step I do x billion operations in y minutes", he immediately understands that I will not be able to do 100x billion operations in 1/60 of a minute.
Talking with an LLM is totally different. It may immediately suggest a performance improvement that will allow me to do 1.5x billion operations in y minutes and then, the next minute, attempt to do 100x in 1 second...