Nonetheless, this link still offers no evidence of hardware support in the M1 for transpiling from x86-64.
???? Are we talking about that step, now? The transpiling only happens once (not every launch, just first launch) and it's already pretty fast. If we are designing a CPU, why would we dedicate part of our precious transistor budget to that? Let's spend those transistors making things fast at actual runtime. I figure that's what we're really talking about here, isn't it?
Not necessarily, if the source rewriting is excellent and the resulting new executables are stored for future use.
No, I don't believe that's a correct assumption. Compilers can't know everything about runtime behavior at compilation/transcompilation time. We don't know ahead of time whether some random memory access is going to run afoul of a mismatch in memory ordering assumptions. If we did, we wouldn't lean on guarantees from x86 CPUs in the first place; we'd just write smarter compilers.
Especially because this TSO support is really only required for poorly-written multi-threaded code. Not using locks and latches in multi-threaded code is dumb.
You misunderstand. In many cases, you can get by without them on x86, because the CPU itself makes more guarantees for you about memory accesses. Please think about how much slower x86 code would be (whether optimized by hand, or written by a compiler) if every memory access was wrapped in unnecessary locks and latches.
That would be "dumb."
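To make the difference concrete, here's a minimal sketch of the classic message-passing pattern (C++ just for illustration; translated x86 code has the same shape at the machine level). Under x86's TSO, the producer's two stores can't be observed out of order, so this tends to work without any explicit barriers. On a weakly ordered ARM core it's a genuine bug, which is exactly the kind of code the M1's TSO mode exists to rescue:

    // Classic message-passing race. Under x86 TSO, if the consumer sees
    // ready == true, it is guaranteed to also see data == 42. Under ARM's
    // weak memory model, the relaxed stores/loads may be reordered and
    // the assert below can legitimately fire.
    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);    // no ordering requested
        ready.store(true, std::memory_order_relaxed); // may become visible first on ARM
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
        assert(data.load(std::memory_order_relaxed) == 42); // can fail on weak ARM
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

(The correct portable fix is release/acquire ordering on `ready`, but code that was only ever tested on x86 often never got that fix, and still works fine there.)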
I'm still not convinced that Apple supports two memory models in M1 hardware. The GitHub description never mentions hardware at all.
It sure does. There are multiple references to the register that enables TSO, right there in the readme. Perhaps confusion stems from the fact that the readme just says "register" and not "hardware register." However, that is what they are referring to. There aren't really other sorts of registers besides hardware registers, and generally this means registers in the CPU itself.
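For the curious: based on public write-ups of kernel tools like TSOEnabler, the register in question is reportedly ACTLR_EL1. Here's a hypothetical sketch of flipping the bit; the register name and especially the bit position are my assumptions from those write-ups, not anything Apple documents, and this can only execute at kernel level (EL1) on an aarch64 build:

    #include <cstdint>

    // HYPOTHETICAL: register and bit position are assumptions based on
    // third-party write-ups (e.g. TSOEnabler), not Apple documentation.
    // Must run at EL1; userspace will fault on the mrs/msr instructions.
    static void enable_tso() {
        uint64_t actlr;
        asm volatile("mrs %0, ACTLR_EL1" : "=r"(actlr)); // read aux control register
        actlr |= (1ULL << 1);                            // assumed TSO-enable bit
        asm volatile("msr ACTLR_EL1, %0" :: "r"(actlr)); // write it back
        asm volatile("isb");                             // synchronize the change
    }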
But don't take my word for it.
Let's look at the source code, which clearly shows.... j/k, I can't understand it either.
Let's see what Joseph Groff, Senior Swift Compiler Engineer at Apple, says:
"The A12 only supported TSO on the performance cores. The M1 supports it on all cores."
He sure thinks it's supported at the processor level. That guy probably knows what he's talking about.
Over time, I suspect we'll learn about more silicon-level decisions that benefit the performance of Rosetta-transpiled code. I bet there's some more x86 stuff implemented in silicon on Apple Silicon.
Actually, the first use of real-time compilation in a production product I'm aware of was in relational databases. Complex queries were translated, optimized, rewritten in a compiled language, the object code saved in a repository, and then the query was executed directly, not interpreted. One example of this which comes to mind is ParAccel, which is supposedly a technology basis for Amazon's Redshift cloud data warehousing database. (Or at least was at one time.) Perhaps there are other products which used real-time compilation that I'm just unaware of.
Yeah, this is a super interesting area. I wish I had the time, the chops, and/or the reason to play with it more.
Since antiquity, the query planners in relational databases have been able to cache execution plans in order to speed up subsequent executions of the same query. Did ParAccel actually go a step further and compile things down to *native* code? That is both interesting and slightly bizarre to me. Typically your queries and stored procedures aren't really doing processing on their own -- they're just fetching and writing rows, things that are already implemented in native code.
The most common everyday example of JIT'ing would be the JavaScript engines in modern browsers. V8 and its brethren compile bits of JavaScript to native x86 (or whatever) code when it makes sense to do so.
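At its mechanical core, JIT'ing is just "put machine code bytes somewhere executable and jump to them." A toy sketch (x86-64/POSIX; real engines like V8 are of course vastly more sophisticated, and on Apple Silicon you'd additionally need MAP_JIT and friends):

    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    int main() {
        // mov eax, 42 ; ret
        unsigned char code[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};

        // Error handling trimmed for brevity.
        void *mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        std::memcpy(mem, code, sizeof code);
        mprotect(mem, 4096, PROT_READ | PROT_EXEC); // flip W -> X before running

        auto fn = reinterpret_cast<int (*)()>(mem);
        std::printf("%d\n", fn()); // prints 42
    }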
Ruby has a really neat/hilarious hack for JIT'ing down to native code. The Ruby team is quite small. Writing a single native Ruby-to-metal compiler, much less multiple ones for multiple architectures, is simply out of the question. So they came up with a hack. There are already nice C compilers for every platform. So their MJIT translates Ruby methods to C, and then uses the system's C compiler (gcc, LLVM, whatever) to translate *that* into native code. I think a few other language teams have started looking at that approach.
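The general shape of that trick is easy to sketch, even outside Ruby: emit C source at runtime, shell out to the system compiler, dlopen the result. (This is not Ruby's actual MJIT code; the file names and compiler invocation here are placeholders of my own.)

    #include <cstdio>
    #include <cstdlib>
    #include <dlfcn.h> // may need -ldl when linking on older glibc

    int main() {
        // 1. Emit a trivial function as C source.
        FILE *src = std::fopen("jit_fn.c", "w");
        std::fputs("int add(int a, int b) { return a + b; }\n", src);
        std::fclose(src);

        // 2. Let the system C compiler do the architecture-specific work.
        if (std::system("cc -O2 -shared -fPIC -o jit_fn.so jit_fn.c") != 0)
            return 1;

        // 3. Load the freshly built native code and call into it.
        void *lib = dlopen("./jit_fn.so", RTLD_NOW);
        auto add = reinterpret_cast<int (*)(int, int)>(dlsym(lib, "add"));
        std::printf("%d\n", add(2, 40)); // prints 42
        dlclose(lib);
    }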
I believe native code generation via JIT happens in modern games, too. The shaders are written in C-like languages (GLSL, HLSL, and friends), and they are compiled at runtime down to whatever the GPU natively executes. I'm fuzzy on this, but I know some games have had issues where these compiled shaders weren't properly cached and were unnecessarily regenerated time and time again.
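With OpenGL, for instance, the compile step is an explicit API call, and it's the driver that generates the GPU's native code at that moment -- which is exactly why games want to cache the results. A sketch (assumes a live GL context; error handling trimmed):

    #include <GL/gl.h> // plus a loader such as glad or GLEW for GL 2.0+ entry points

    GLuint compile_fragment_shader(const char *glsl_source) {
        GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(shader, 1, &glsl_source, nullptr);
        glCompileShader(shader); // driver emits native GPU code here, at runtime

        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        return ok ? shader : 0;
    }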