Then, of course, the M1 could do all sorts of fusion and stuff… – same number of mispredicts? In my previous blog post, I compared the performance of my new ARM-based MacBook Pro with my 2017 Intel-based MacBook Pro. I would try to use debugging tools to generate flame graphs, or river diagrams, of where each algorithm is spending its time. In some cases, the ARM-based MacBook Pro was nearly twice as fast as the older Intel-based MacBook Pro. Sounds like a good reason not to buy a Mac. memory aliasing/forwarding. Well, that’s the point, isn’t it? Note that 256-bit FP operations were added in AVX. If you insist on the two points stipulated above, what’s left? When compared to x86-64 Intel chips, it’s clear that ARM is the best choice for low-power devices. Intel CPUs have 3x 256-bit ports, not 2x. The decimal significand spans 17 digits. The Intel chips Apple currently ships use the 64-bit Intel architecture, which handles compute instructions differently than the upcoming ARM chips. Apple, this summer, announced that change is afoot in its Mac lineup. I like precise data points. They then both crack these in different ways, then fuse the pieces in different ways. In November 2020, Apple began making a big change to its Mac lineup. Issue is of course way higher, but the important number is 6-wide fixed-point issue. It contains no ARM-specific optimization. The Skylake microarchitecture first appeared in 2015, so ARM-based Mac hardware has clearly been in the research and development phase for a … Apple promised some degree of help, but fell well short of guaranteeing Rosetta will keep all applications running once it completes the transition. Basically, where I’m coming from is that this stuff isn’t magic; there are reasons Apple achieves its 2+x IPC. Even knowing the Intel IPC (close to 1? close to 4?) gives one a start in asking what’s limiting performance.
I am compiling both benchmarks identically, using Apple’s built-in Xcode system with the LLVM C++ compiler. How do they compare? While the compiler will spit out some SIMD here and there where it can, SPECfp uses general use-case code without such hand-crafted vectorisation, and as such the performance uplift and impact is very minor. The primary benefits of the ARM chips are in their simplified construction. The programs you use most often should be supported for years, just like they were when Apple transitioned from PowerPC to Intel machines. But if you have one that you can squeeze an extra year or two out of, it’s quite likely that the new ARM chips are going to be worth the wait. I honestly do not know what to think at this point. A typo: I meant that it has 2 ports for floating-point operations. I am aware of the Neural Engine but I considered it to be outside of the scope of this blog post. In fact, the MacBook Pro, MacBook Air, iMac Pro, Mac mini, and Mac Pro are already equipped with Apple-designed Arm processors, in the form of … He is a techno-optimist. If the most common dependency chains are (to guess numbers) around 150 instructions long, and x86’s issue queue is 100 instructions long while Apple’s is 200 long, then Apple can always be running two dependency chains in parallel, while most of the time Intel is operating on only one of them. A newly leaked benchmark shows Apple’s ARM-based A14X Bionic processor outperforming an Intel i9-powered MacBook Pro by a healthy margin. That might provide some insight into commonalities and differences in the underlying libraries and functions. It’s like comparing dogs to cats, or macOS to Windows. I am not new to ARM… I had an AMD ARM server… Yes, I’ve read that page, several times in fact.
And bulk discounts aside, Apple stands to save quite a bit of money by building its own, once research and development costs are recouped, that is. The M1, like most modern ARMv8 CPUs, uses the NEON SIMD extension. They aren’t typically thought of as powerhouses, though they excel at running cool and using less energy. Pros and cons of Apple Silicon vs Intel. “I do not yet understand why the fast_float library is so much faster on the Apple M1. It contains no ARM-specific optimization.” Question. It would be interesting to compare SIMD performance too. You might want to run some comparisons of that for your M1 vs Intel MacBooks… The APIs to look at are in Accelerate. Can you do an I/O-bound benchmark as a reference? So I do not think that branch prediction is important, in the sense that I expect both processors to predict the branches very well. May 26, 2020, Matt Mills: Although it has been official for months, this whole issue of the change that Apple is going to carry out with Intel and the ARM chips is … Perhaps the biggest benefit of the ARM chips is that of compatibility. But since you have the hardware, why not give it a try? If you’re looking for a comparison in terms of “better” and “worse,” this isn’t where you’ll find it. – ability to look ahead past shallow-ish dependency chains (ie deep issue queue) If you rely on your Mac for work, and you’re in desperate need, clearly you can’t wait. But certainly on the Intel side we could learn (?) The Intel processor has nifty 256-bit SIMD instructions. BTW I was wrong. The Apple chip has nothing of the sort as part of its main CPU. It must be wrong, however. If you silo yourself to FP operations only, then only ports 0 and 1 can execute them (though stuff like bitwise logic, e.g. VXORPS, can run on port 5). Have you read and understood my previous comment?
That said, if you prefer to (or have to) bite the bullet and buy an Intel Mac, it’s not the end of the world. It would be interesting to see similar benchmarks for RISC-V. I don’t believe any RISC-V processor is even remotely close to the level of performance of current top-end x86/ARM cores. A7 started at 6 wide, and around A11 bumped that to 8. This "leak" about switching to ARM came right before Intel’s quarterly earnings on the same day. The M1 could retire more instructions per cycle, but could it retire 2x the number of instructions? As of August 2020, Apple hasn’t announced what the first ARM Mac models will be. Unlike on Intel, ARM Mac app developers only need to write a UI that is suitable for mobile, and then they can release the apps for iPhone and iPad. Maybe it is as simple as this: this is VERY ILP-friendly code, and Apple can execute it at an IPC of 8. See my post ARM MacBook vs Intel MacBook: a SIMD benchmark. A computer science professor at the University of Quebec (TELUQ). I am not kidding. – fused ops count? There are good reasons to buy an Intel Mac in 2020, even with an all-new architecture imminent. Daniel’s background stance on this type of benchmarking centers on software with heavy usage of intrinsics and optimised routines. Which gives us info on that side, which we can then compare with as much as Apple tells us. So the only way to know if you’re using an Intel Mac or an Apple Silicon Mac is by using the About This Mac feature. The bigger MacBook Pro (today’s 16-incher) uses Intel’s more muscular H Series mobile chips, up to a Core i9, while Apple’s Mac mini desktop uses a … Unlike typical Intel chips, the M1 features Arm architecture, which is widely regarded as having superior power and thermal efficiency.
On your Mac, click the Apple icon from the top-left corner of the menu bar, then select the “About This Mac” option. So, it’s not as if Apple is taking a shot in the dark by betting big on the same self-made chips in its laptop and desktop lineup. I don’t think it is irresponsible to ask for performance numbers. Apple also offers the Mac mini with two different Intel processors, a 3.0GHz 6-core Intel Core i5 with Turbo Boost up to 4.1GHz, and a 3.2GHz 6-core Intel Core i7 with Turbo Boost up to 4.6GHz. It is not that I do not appreciate the question, and I will try to answer it, but these things take more than 30 seconds. I don’t know how important that is with this type of code. You write that “[t]he Intel processor has nifty 256-bit SIMD instructions. The Apple chip has nothing of the sort as part of its main CPU.” I do not accept any advertisement. I have benchmarked this code on ARM processors before… just not on the A1. As others have noted, there’s plenty of NEON-optimised software out there and it runs perfectly fine. It’s far from perfect, but Xcode/Instruments gives you access to performance counters on M1. Another benefit is that of cost. For some context, I have not given this issue any time at all. Intel vs ARM MacBook. July 2 update below, post originally published July 1. Of course, not all EUs support all operations, but I have no clue what the distribution is like on M1. How can you claim NEON is no match for AVX2 and then ask for performance numbers? – instruction count. Apple has, for years, been stuck paying whatever Intel decides to charge for its chips. Mark Gurman at Bloomberg is reporting that Apple will finally announce that the Mac is transitioning to ARM chips at next week’s Worldwide Developer Conference (WWDC): Apple Inc. is preparing to announce a shift to its own main processors in Mac computers, replacing chips from Intel …
How long does it take to count the number of 1’s in the input files? So is it worth waiting for a new ARM-based Mac, or is it alright to buy a new Mac with an Intel processor if you can’t stomach the two-year wait? There is only so much Apple could do. So, should you buy a new Mac now or wait for ARM? However, you can support the blog with. Later architectures have some other configurations. Up to yesterday, my laptop was a large 15-inch MacBook Pro. This means some software written for this architecture, in theory, won’t work on ARM-based machines. – same number of instructions? There are 3x 256-bit ports (0, 1, 5) on Skylake. The Mac and its users will live on very happily just as they did after the 68k to PowerPC and the PowerPC to Intel architecture migrations. Apple launches a Quick Start program with access to documentation, sample code, and beta versions of macOS Big Sur and Xcode 12. Tuesday, January 19, 2021, by Ben Thompson. At the very least I think it’s important to validate assumptions like “of course they have more or less the same number of instructions executed”. Apple Inc. is preparing to announce a shift to its own main processors in Mac computers, replacing chips from Intel Corp., as early as this month at its annual developer conference, according to people familiar with the … It uses the default Release mode in CMake (flags -O3 -DNDEBUG). If Apple is getting ready to dump Intel for some of its … Because I have studied this code a bit (with performance counters), I know that the fast_float code has very few branch mispredictions. Though not much is known about the new chipset, it is expected to offer better device performance along with improved battery life. They are used in devices where space, heat dissipation, and power draw aren’t as big of a concern as they would be in a mobile phone, for example. Both machines have been updated to the most recent compiler and operating system.
It is possible that Apple has some neat optimizer tricks in its version of LLVM, but this code is quite generic and boring. – (the opposite of the above; dependency chains are very unimportant) ie the code does a lot of “parallel” work (many independent operations at every stage) so that Intel’s 4 (or 5 or whatever, depending on the precise details) decode width and less flexible issue are no match for Apple’s 8-wide decode and extreme flexibility in wide issue. It contains an Intel Kaby Lake processor (3.8 GHz). The Intel chips Apple currently ships use 64-bit Intel architecture, which handles compute instructions differently than the upcoming ARM-based chips. So I could easily come up with examples that make the M1 look bad. Throw in some load/stores and branches and you’re easily also at 8-wide issue. – memory aliasing/forwarding. Is there a lot of writing to a location then immediately reading back from that location? Of course, from that point forward, if both have eliminated the branch misprediction bottleneck, one might do better than the other at pipelining the code. https://developer.apple.com/documentation/accelerate. I’m not sure how you could get at this third one. Rumors suggest a MacBook Air, and a redesigned iMac and MacBook Pro are all in the works. This could point to some level of unification in years to come, especially as we see iOS and iPadOS-like features make their way to macOS in Big Sur. Apple ARM vs Intel, Will Your CPUs Have Enough Performance? Though Windows 10 for ARM has been around for a while now, it’s a complete nightmare that’s full of bugs and glitches.
Though it’s possible that, spec-wise, the ARM chip would be less powerful than the Intel processor, like most things Apple, it’s highly unlikely that it would release any machine that didn’t perform better than its Intel counterpart. At WWDC in June, the company pledged to fully transition the entire Mac line (MacBook, MacBook Pro, iMac, Mac Mini, and Mac Pro) to custom, ARM-based Apple Silicon processors within two years. Do you have benchmark numbers of a comparison between AVX2 on a recent x64 processor (Intel/AMD) and the equivalent on ARM NEON? I used a number parsing benchmark. ARM-based chips are just more energy efficient than their Intel counterparts, and for laptops, this could mean huge gains in battery life. I do not like to argue in the abstract. Intel benchmarks say Apple’s M1 isn’t faster. ARM vs. Intel: As we’ve seen, ARM is better than Intel chips at decoding instructions. So it boils down to … It’s also worth noting that there could be some performance hiccups from apps relying on Rosetta, though it’s unclear at this time if that will be the case. The M1 has 4 units of 128 bits each. I’d guess Clang will generate vectorized code in many cases, so you’ll be able to see.
In my basic tests, I generate random floating-point numbers in the unit interval (0,1) and I parse them back exactly. The PC vs Mac website is a part of the “Go PC” campaign which the company started back in February. I just got a brand-new 13-inch 2020 MacBook Pro with Apple’s M1 ARM chip (3.2 GHz). And that’s okay. I have strong reasons to expect that the numbers of instructions retired on different ARM processors are going to be the same because (1) I expect the compiled binaries to be similar and (2) I expect that there are few mispredicted branches. We expect that, like the A-series chips in iPhones, iPads, and Apple TVs, Apple’s Mac processors will be completely custom. Support for the transitional period spanned six years, which is about the average lifespan of a laptop these days. … Apple’s leading the industry with its chips for smartphones and tablets and can do the same for the Mac. Per core, Intel CPUs usually have 2 ports for 256-bit operations, so in total they work on 512 bits of data (I am not talking about the CPUs with AVX-512; I’m talking about the Skylake-derived CPUs). ARM is on the march. Each port is capable of 256-bit operations (AVX2). Laptops, for example, could switch back and forth between performance modes that max out processing power, and a battery saver mode that tweaks settings to achieve maximum battery life while sacrificing some horsepower. I run the same benchmarking program on both machines. In late June, Apple announced it was shifting production of the Mac’s CPU from Intel x86 over to its own, ARM-based "Apple Silicon."
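The round-trip test described above (random numbers in the unit interval, written out as decimal strings and parsed back exactly) can be sketched as follows. This is a minimal Python illustration, not the C++ benchmark itself: the function name `round_trip_ok`, the seed, and the sample size are hypothetical, and it relies on the fact that 17 significant decimal digits are enough to identify any 64-bit binary floating-point number.

```python
import random


def round_trip_ok(count: int = 100_000, seed: int = 1234) -> bool:
    """Check that random doubles survive a 17-digit decimal round trip."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for _ in range(count):
        x = rng.random()        # uniform double in [0, 1); 0.0 is possible but rare
        text = f"{x:.17g}"      # 17 significant decimal digits
        if float(text) != x:    # exact comparison, no tolerance
            return False
    return True


if __name__ == "__main__":
    print(round_trip_ok())
```

A timing loop around the `float(text)` call would turn this into a crude parsing benchmark, but interpreter overhead would dominate, so it would not reflect what a compiled parser such as fast_float does.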
I did not imply that your question did not matter. Comparing an M1 Mac to an Intel equivalent is like night and day. AMX may not work for the sorts of JSON parsing weirdness for which you use AVX256 (that’ll have to wait for SVE/2, probably next year) but it does solve the problem of “I want to execute dense linear algebra fast”. ARM processors are primarily used in low-power devices like mobile phones and tablets. But we do know a thing or two about the potential of an ARM line. I think in that regard they are on par. … Apple tech) would be to either get an Intel one now, or wait until around this time next year to get a refreshed ARM Mac, as buying first-generation Apple stuff isn’t recommended if … (I assume both the instruction flow and data memory flow are trivial enough that they aren’t blocking.) Both have their benefits, and their drawbacks. M1 probably CAN retire 8 instructions per cycle… It can certainly decode 8 per cycle, so if anything retire will be 8 or higher.
Clarify the obvious basic things. Lastly, using Arm-based technologies makes it possible to have all the iPhone and iPad applications run on the Mac PC, with the ability to have a continuous connection with WiFi and … I do care. x86 probably has a perf counter that gives the average depth of the issue queue, but M1 may not make such a counter user-visible, though I expect it is there. Given that I expect relatively few mispredictions, I expect that the number of instructions retired is going to be roughly the same as it would be on any other ARM processor. An Intel Mac vs ARM: the announced ARM chipset will give Apple complete control of Mac systems, enabling it to fine-tune apps and optimize device performance. Intel chips use the 64-bit Intel architecture, which handles compute processes differently than ARM-based chips will in future devices. There isn’t a superior choice, generally speaking, though one may be superior to the other in terms of how you plan to use it. Probably not, but it really depends on whether or not you need a new PC right now, or if you can afford to wait a year or two. In fact, I raised the question in my blog post because I think it is interesting. Apple AMX (not Intel AMX) is not the Neural Engine; it is on-CPU, no different conceptually from NEON. At the top of the list is the fact that you need a new Mac, now. Vector size is irrelevant to the performance discussion because each µarch will be optimised around its particular setup. But like all of us, I have only 24 hours per day.
My guess is that the ARM rich instructions are a better match to current technology (ie most of the ARM rich instructions can execute as a single cycle, whereas most of the Intel ones end up being cracked into two different types of operations and can’t benefit from any sort of single-cycle “lots of ALU’ing”.) I think that the Apple M1 processor is a breakthrough … Apple’s announcement last month of the move away from Intel to ARM-based processors for the Mac … What about the SPECfp in the AnandTech review? In short, the transition from Intel x86 to ARM processors in the Mac is a win-win-win move. Probably it’s time for me to order a device with an M1… But we won’t discover them if (as so much of the internet insists) every time any particular aspect of the M1 is suggested as being better than x86 (better branch prediction, better memory aliasing support, …) the immediate assumption is that either Apple is not better along that dimension or, “so what if they are, it doesn’t matter”. Intel and ARMv8 both have “rich” instructions, ie instructions that do two things in one (eg on ARM shift-and-add, on Intel load-and-add). Current x86 Intel processors are powerhouses. It’s nowhere near as polished as the Windows 10 you’re used to on Intel-based processors. It’s worth mentioning that when we talk about differences in ARM-based processors and Intel Macs, we’re speaking entirely of potential. There are some key advantages to using Apple’s own ARM-based A-series chips over x86. But if you can wait, wait. Let’s reality-check the claims. Intel just clapped back with a carefully crafted takedown of Apple’s Arm-based M1 chip. In total it is also 512. CPU vs.
System-on-a … – branch mispredicts. The total execution throughput of the M1 isn’t any less than that of your Kaby Lake chip, which is what matters. This is all a way of saying that "ARM transition," while a convenient shorthand, doesn’t fully describe what we expect to happen with upcoming Macs. Would that have been a better choice? That seems like an interesting comparison. You (and other commenters) are aware of NEON, but apparently not of AMX. View all posts by Daniel Lemire. Doubling the register width makes a big difference, at least in some cases. Have you looked at the WikiChip architecture page? Under this campaign, the chipmaker pointed out the shortcomings of the Apple M1 chipset in a series of tweets on its official Twitter handle. With the website, however, Intel is going all-in to encourage consumers to choose its platform over the Cupertino giant’s latest M1 devices. I am aware of NEON, but it is no match for AVX2 in general. MacBooks, for example, might seem vastly underpowered compared to similar Windows-based PCs, though efficiency tweaks lead to the same, or often better, performance in real-world use cases. This means that software written for Intel Macs will not run natively on ARM. AVX2 adds 256-bit integer operations. His research is focused on software performance and data engineering.
In this case, the tests are short and I do not expect the processors to be thermally constrained. That’s a pretty irresponsible stance. How fast can you sort arrays of integers in Java? You could start by looking at the usual suspects: number of instructions executed and retired, and number of branches and branch mispredicts. These system-on-chip (SoC) configurations have led to some of the fastest real-world processors on mobile devices over the last several years according to current benchmarks. My benchmarking software is available on GitHub. Look, if you need a new PC, it’s not worth the wait. Where’s that coming from? Lower clock (3.2 GHz vs 3.8 GHz) but 1.8x the performance, so more than 2x the IPC. This turns out to be false. I have all the numbers for these… Just run my benchmark under Linux; it is instrumented and will give you the counter values straight back (without calling perf). I do not know this for a fact but it is how it looks. For example, Skylake can perform 3x 256-bit VPADDB per clock. No matrix multiplication in sight. No. Since it has a much wider decoding front end, it won’t get hurt by not having a 256-bit operation in a single op.
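The arithmetic behind the "1.8x the performance, so more than 2x the IPC" remark can be made explicit. This is a minimal sketch, assuming the clock speeds quoted in the post (3.2 GHz for the M1, 3.8 GHz for the Kaby Lake) are the frequencies actually sustained during the benchmark, and that both processors execute about the same number of instructions, which is itself one of the assumptions debated in the comments. The function name is hypothetical.

```python
def relative_ipc(speedup: float, clock_a_ghz: float, clock_b_ghz: float) -> float:
    """IPC of machine A relative to machine B.

    Assumes both machines execute the same number of instructions, so
    time = instructions / (IPC * clock) and the instruction count cancels.
    """
    return speedup * clock_b_ghz / clock_a_ghz


if __name__ == "__main__":
    # M1 at 3.2 GHz running 1.8x faster than a Kaby Lake at 3.8 GHz.
    ratio = relative_ipc(speedup=1.8, clock_a_ghz=3.2, clock_b_ghz=3.8)
    print(f"relative IPC: {ratio:.2f}")
```

If the two binaries do not retire the same number of instructions, the instruction-count ratio has to be multiplied in as well, which is why measuring the counters matters.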
The only three issues remaining that I can see are … – CPU width. Having said that, it’s not clear what support Rosetta will offer, and to what applications. You’ll typically find these in laptop and desktop PCs. Although Macs have used processors from Intel since 2006, new Macs … The server variation of Skylake has 2 x 512 bit. For Apple, the shift to its own ARM-based chips gives the firm even greater control over its hardware and software; for developers, the common architecture across all Apple products makes it easier to code apps for Mac, iPhone, and iPad; for consumers, they will get more powerful hardware with a longer … Recently, I have been busy benchmarking number parsing routines where you convert a string into a floating-point number. The AMD Zen 2 IPC is 4 or even slightly better than 4. – micro-ops counts. Another curious test is Lemire’s random number generator. It would need to retire something like 8 instructions per cycle. But ARM processors are more mobile-friendly than Intel processors (in most cases). It is not that I don’t care about the questions you are asking. Apple, though, has accounted for this transitional period with an in-house project called Rosetta, which offers some degree of backwards compatibility with older apps and programs. You can even try something as simple as a portability layer to run your own benchmarks of your own AVX2 packages: https://simd-everywhere.github.io/blog/2020/06/22/transitioning-to-arm-with-simde.html.
The first Apple Silicon processor is called the Apple M1. Intel vs Apple Silicon: Performance. Intel has confirmed it’s releasing at least nine Tiger Lake processors, ranging from a 15-watt thermal envelope to 28 watts for increased performance. For the vast majority of cases NEON should be functionally equivalent to AVX. Though ARM is typically considered the weaker of the two chips, Apple will no doubt configure the chips to get the most out of them, much like it has in its mobile device line. There are no (substantial) memory writes in the hot loops being benchmarked. For floating-point operations there are only 2 ports. First, it’s important to note that the company has already been using custom chips in its iPhone and iPad line for years. Ultimately, it’s up to you. I’m guessing no, as you seem to be completely ignoring it. Take note that wider SIMD doesn’t only affect the EUs; it’ll help with increasing effective PRF size, load/store etc. Here, you’ll find information about the specific software and hardware running on and powering your Mac. One bright spot to an Intel-based Mac is that it still has the ability to dual-boot Windows. I do not yet understand why the fast_float library is so much faster on the Apple M1. M1 has 128-bit NEON registers, but 4 SIMD execution units, all with multiply support, compared to 2+1 in Kaby Lake.
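The port counts traded back and forth in this discussion reduce to simple peak-width arithmetic. Here is a rough sketch using only the figures quoted in the comments (4 x 128-bit units on the M1, 2 x 256-bit floating-point ports, and 3 x 256-bit ports for some integer operations on Skylake-derived cores). It ignores latency, load/store bandwidth, and which operations each unit supports, so it is an upper bound on width and says nothing about real throughput. The function and label names are hypothetical.

```python
def peak_simd_bits_per_cycle(units: int, width_bits: int) -> int:
    """Upper bound on SIMD bits processed per cycle: units times width."""
    return units * width_bits


if __name__ == "__main__":
    # Unit counts and widths as quoted in the discussion, not measured here.
    configs = {
        "M1 (4 x 128-bit NEON)": (4, 128),
        "Skylake-derived, FP (2 x 256-bit)": (2, 256),
        "Skylake-derived, e.g. VPADDB (3 x 256-bit)": (3, 256),
    }
    for name, (units, width) in configs.items():
        print(f"{name}: {peak_simd_bits_per_cycle(units, width)} bits/cycle")
```

On these numbers the floating-point case comes out even, which matches the "in total it is also 512" remark, while the three-port integer case favours the Intel core on raw width.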