Nonetheless, this link still offers no evidence of hardware support in the M1 for transpiling from x86-64.
???? Are we talking about that step, now? The transpiling only happens once (not every launch, just first launch) and it's already pretty fast. If we are designing a CPU, why would we dedicate part of our precious transistor budget to that? Let's spend those transistors making things fast at actual runtime. I figure that's what we're really talking about here, isn't it?
Not necessarily, if the source rewriting is excellent and the resulting new executables are stored for future use.
No, I don't believe that's a correct assumption. Compilers can't know everything about runtime behavior at compilation/transcompilation time. We don't know ahead of time whether some random memory access is going to run afoul of a mismatch in memory ordering assumptions. If we did, we wouldn't lean on guarantees from x86 CPUs in the first place; we'd just write smarter compilers.
Especially because this TSO support is really only required for poorly-written multi-threaded code. Not using locks and latches in multi-threaded code is dumb.
You misunderstand. In many cases, you can get by without them on x86, because the CPU itself makes more guarantees for you about memory accesses. Please think about how much slower x86 code would be (whether optimized by hand, or written by a compiler) if every memory access was wrapped in unnecessary locks and latches.
That would be "dumb."
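To make the difference concrete, here's a minimal sketch of the classic message-passing pattern (C++ just for illustration; translated x86 code has the same shape at the machine level). Under x86's TSO, the producer's two stores can't be observed out of order, so this tends to work without any explicit barriers. On a weakly ordered ARM core it's a genuine bug, which is exactly the kind of code the M1's TSO mode exists to rescue:

    // Classic message-passing race. Under x86 TSO, if the consumer sees
    // ready == true, it is guaranteed to also see data == 42. Under ARM's
    // weak memory model, the relaxed stores/loads may be reordered and
    // the assert below can legitimately fire.
    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);    // no ordering requested
        ready.store(true, std::memory_order_relaxed); // may become visible first on ARM
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
        assert(data.load(std::memory_order_relaxed) == 42); // can fail on weak ARM
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

(The correct portable fix is release/acquire ordering on `ready`, but code that was only ever tested on x86 often never got that fix, and still works fine there.)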
I'm still not convinced that Apple supports two memory models in M1 hardware. The GitHub description never mentions hardware at all.
It sure does. There are multiple references to the register that enables TSO, right there in the readme. Perhaps confusion stems from the fact that the readme just says "register" and not "hardware register." However, that is what they are referring to. There aren't really other sorts of registers besides hardware registers, and generally this means registers in the CPU itself.
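For the curious: based on public write-ups of kernel tools like TSOEnabler, the register in question is reportedly ACTLR_EL1. Here's a hypothetical sketch of flipping the bit; the register name and especially the bit position are my assumptions from those write-ups, not anything Apple documents, and this can only execute at kernel level (EL1) on an aarch64 build:

    #include <cstdint>

    // HYPOTHETICAL: register and bit position are assumptions based on
    // third-party write-ups (e.g. TSOEnabler), not Apple documentation.
    // Must run at EL1; userspace will fault on the mrs/msr instructions.
    static void enable_tso() {
        uint64_t actlr;
        asm volatile("mrs %0, ACTLR_EL1" : "=r"(actlr)); // read aux control register
        actlr |= (1ULL << 1);                            // assumed TSO-enable bit
        asm volatile("msr ACTLR_EL1, %0" :: "r"(actlr)); // write it back
        asm volatile("isb");                             // synchronize the change
    }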
But don't take my word for it.
Let's look at the source code, which clearly shows.... j/k, I can't understand it either.
Let's see what Joseph Groff, Senior Swift Compiler Engineer at Apple, says:
"The A12 only supported TSO on the performance cores. The M1 supports it on all cores."
He sure thinks it's supported at the processor level. That guy probably knows what he's talking about.
Over time, I suspect we'll learn about more silicon-level decisions that benefit the performance of Rosetta-transpiled code. I bet there's some more x86 stuff implemented in silicon on Apple Silicon.
Actually, the first use of real-time compilation in a production product I'm aware of was in relational databases. Complex queries were translated, optimized, rewritten in a compiled language, the object code saved in a repository, and then the query was executed directly, not interpreted. One example of this which comes to mind is ParAccel, which is supposedly a technology basis for Amazon's Redshift cloud data warehousing database. (Or at least was at one time.) Perhaps there are other products which used real-time compilation that I'm just unaware of.
Yeah, this is a super interesting area. I wish I had the time, the chops, and/or the reason to play with it more.
Since antiquity, the query planners in relational databases have been able to cache execution plans in order to speed up subsequent executions of the same query. Did ParAccel actually go a step further and compile things down to *native* code? That is both interesting and slightly bizarre to me. Typically your queries and stored procedures aren't really doing processing on their own -- they're just fetching and writing rows, things that are already implemented in native code.
The most common everyday example of JIT'ing would be the JavaScript engines in modern browsers. V8 and its brethren compile bits of JavaScript to native x86 (or whatever) code when it makes sense to do so.
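At its mechanical core, JIT'ing is just "put machine code bytes somewhere executable and jump to them." A toy sketch (x86-64/POSIX; real engines like V8 are of course vastly more sophisticated, and on Apple Silicon you'd additionally need MAP_JIT and friends):

    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    int main() {
        // mov eax, 42 ; ret
        unsigned char code[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};

        // Error handling trimmed for brevity.
        void *mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        std::memcpy(mem, code, sizeof code);
        mprotect(mem, 4096, PROT_READ | PROT_EXEC); // flip W -> X before running

        auto fn = reinterpret_cast<int (*)()>(mem);
        std::printf("%d\n", fn()); // prints 42
    }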
Ruby has a really neat/hilarious hack for JIT'ing down to native code. The Ruby team is quite small. Writing a single native Ruby-to-metal compiler, much less multiple ones for multiple architectures, is simply out of the question. So they came up with a hack. There are already nice C compilers for every platform. So their MJIT translates Ruby methods to C, and then uses the system's C compiler (gcc, LLVM, whatever) to translate *that* into native code. I think a few other language teams have started looking at that approach.
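The general shape of that trick is easy to sketch, even outside Ruby: emit C source at runtime, shell out to the system compiler, dlopen the result. (This is not Ruby's actual MJIT code; the file names and compiler invocation here are placeholders of my own.)

    #include <cstdio>
    #include <cstdlib>
    #include <dlfcn.h> // may need -ldl when linking on older glibc

    int main() {
        // 1. Emit a trivial function as C source.
        FILE *src = std::fopen("jit_fn.c", "w");
        std::fputs("int add(int a, int b) { return a + b; }\n", src);
        std::fclose(src);

        // 2. Let the system C compiler do the architecture-specific work.
        if (std::system("cc -O2 -shared -fPIC -o jit_fn.so jit_fn.c") != 0)
            return 1;

        // 3. Load the freshly built native code and call into it.
        void *lib = dlopen("./jit_fn.so", RTLD_NOW);
        auto add = reinterpret_cast<int (*)(int, int)>(dlsym(lib, "add"));
        std::printf("%d\n", add(2, 40)); // prints 42
        dlclose(lib);
    }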
I believe native code generation via JIT happens in modern games, too. The shaders are written in C-like languages (GLSL, HLSL, and friends), and they are compiled at runtime down to whatever the GPU natively executes. I'm fuzzy on this, but I know some games have had issues where these compiled shaders weren't properly cached and were unnecessarily regenerated time and time again.
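With OpenGL, for instance, the compile step is an explicit API call, and it's the driver that generates the GPU's native code at that moment -- which is exactly why games want to cache the results. A sketch (assumes a live GL context; error handling trimmed):

    #include <GL/gl.h> // plus a loader such as glad or GLEW for GL 2.0+ entry points

    GLuint compile_fragment_shader(const char *glsl_source) {
        GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(shader, 1, &glsl_source, nullptr);
        glCompileShader(shader); // driver emits native GPU code here, at runtime

        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        return ok ? shader : 0;
    }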