This is one of those things where -- as I call it -- a little knowledge goes a 'wrong way.'

Short version: Enabling SMT can make a processor slower in some applications. And yes, both the OS and applications can mitigate that with time and experience. But it's not a simple 'fix the OS scheduler' answer.
Details ...
In a modern, superscalar, pipelined processor, at any given time less than half of the stages in a pipeline are doing anything. Yes, over half of your processor's sub-units are doing nothing at any given moment (that's a mega-oversimplification, but it represents the end-result equivalent). It's even worse in the x86[-64] CISC (Complex Instruction Set Computing) architecture.
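To make that concrete, here's a toy back-of-the-envelope model -- my own illustrative numbers, not measurements from any real chip -- of how stall bubbles leave most stage-cycle slots of a classic 5-stage pipeline empty:

```python
# Toy model of a 5-stage in-order pipeline (IF, ID, EX, MEM, WB).
# Each entry in stalls_per_instr is the number of bubble cycles that
# instruction waits (e.g., on a missed load). Empty stage/cycle slots
# are the idle capacity SMT tries to fill. Purely illustrative.

STAGES = 5

def occupancy(stalls_per_instr):
    """Fraction of stage-cycle slots doing real work for one thread."""
    n = len(stalls_per_instr)
    # Ideal in-order pipeline: n + STAGES - 1 cycles, plus stall bubbles.
    total_cycles = n + STAGES - 1 + sum(stalls_per_instr)
    busy_slots = n * STAGES        # each instruction occupies 5 slots total
    return busy_slots / (total_cycles * STAGES)

# 8 instructions; two of them wait 3 cycles each on, say, missed loads.
print(round(occupancy([0, 0, 3, 0, 0, 3, 0, 0]), 2))   # 0.44
```

Even this generous model (single-issue, modest stalls) lands under half occupancy; real wide superscalar designs waste proportionally more slots.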
Clocked Boolean Logic (CBL), the instruction set, and CISC are all 1960s Computer Science (CS) "constructs" that Electrical Engineers (EEs) were unable to eradicate in time before the x86 'took hold' -- long story.
Short version: CBL and assembler are not how processors actually work -- not at all, and definitely not in the 21st century. You'll hear otherwise from the 500+ colleges teaching CS and programming in the US alone, but at the ~75 that have actual, real EE/semiconductor programs, it's the truth.
I.e.,
1) Boolean logic comes from the pre-integrated-circuit era, when mathematicians and computer scientists threw switches to make logic decisions. They stupidly replaced the switch-thrower with a clock, which is the worst thing from an EE perspective. Boolean logic should never have been used in computers, because it lacks an inherent, self-synchronizing control. But here we are, in a CBL world. The solution EEs have taken is to make sub-units completely CBL-free and asynchronously timed, sometimes not using Boolean logic at all but another form of logic, like Karl Fant's Null Convention Logic (NCL), which is very EMI-resistant (think space and military operations).
Disclaimer: I worked professionally with Karl Fant and Steve Furber (co-inventor of the original ARM -- yes, the architecture in your smartphones and tablets) in the late '90s, when there was a short-lived 'Silicon Valley' based in Orlando (Lockheed Martin's Real3D, which became Intel's GPU lineage to this day; both ATI and nVidia still have fabless design teams in Orlando as well). My degree focus is in semiconductor materials and layout -- the only time I actually used my degree specialty in my career.
2) RISC didn't come about until 3rd-generation languages (the C compiler) became commonplace, which allowed EEs to 'remove the root cause' at the assembler mnemonics -- i.e., assembler is a human form of generating machine instructions. As long as programmers used it, code would only be as efficient as the coder. But the instruction set itself was a math/CS artificial construct. EEs think in MEAGs/'one-hot' (long story), which means if it can boil down to a trace, it's easier to lay out and, more importantly, to time. Ergo ... RISC, Reduced Instruction Set Computing: more direct traces.
Even today, the variable-length 8-to-128+ bit (and that's just for 32-bit mode) x86[-64] 'word' is often decoded into what's known as RISC86 (NexGen designed it in 1986; it's been at the heart of AMD's ALUs since '94), a fixed 32-bit (or newer 40+ bit) 'word'.
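To illustrate the decode problem: the x86 byte counts below are standard encodings, but the fixed-word side is a generic stand-in for a fixed-width internal format, not AMD's actual RISC86 layout.

```python
# Variable-length x86 vs a fixed-width internal word. The x86 byte
# lengths are real encoding sizes; the 4-byte "micro-op" is a made-up
# stand-in, not AMD's actual RISC86 format.

x86_lengths = {
    "ret":                     1,   # C3
    "add eax, ebx":            2,   # 01 D8
    "mov rax, imm64":          10,  # REX.W B8 + 8-byte immediate
    "add [rip+disp32], imm32": 10,  # 81 + ModRM + disp32 + imm32
}

# A fixed-width decoder just slices: word i lives at bytes [4*i, 4*i+4).
def fixed_word(stream: bytes, i: int) -> bytes:
    return stream[4 * i : 4 * i + 4]

# With x86, you cannot find instruction i without decoding 0..i-1 first:
def x86_offsets(lengths):
    off, out = 0, []
    for n in lengths:
        out.append(off)
        off += n
    return out

print(x86_offsets(list(x86_lengths.values())))   # [0, 1, 3, 13]
```

That serial dependency in finding instruction boundaries is exactly what a fixed-width internal format eliminates: every decoder lane knows where its word starts.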
Since the '90s, programmers have not been able to out-smart the EEs who wrote the optimizing C compiler in assembler; at best, only those in-line assembler instructions in the programmer's manual they wrote should be used. It takes years to understand the underlying layout and quirks of any design, which is why assembler has become quite useless since the '90s (other than in-lining, where the EEs tell programmers exactly what to use).
3) Intel's 'CS Einsteins' learned this first-hand when EPIC (Explicitly Parallel Instruction Computing) and Predication (removing branch prediction to regain a lot of silicon) failed spectacularly, just as Digital Semiconductor said they would. You've never heard of EPIC/Predication, only Itanium. The problem with Itanium wasn't lack of x86 compatibility; it was that EPIC/Predication worked even worse than CISC. It tried to optimize the chip at the instruction set level, which every EE major told every CS major would fail -- and it did.
EPIC/Predication was like watching a construction worker design a bridge.
They had no understanding below the digital level -- layout and timing, the relativistic effects of electromagnetic fields, and the limits of the speed of light (yes, as early as 1989, the clock couldn't reach the other side of the chip in a single cycle) -- kinda like a construction worker doesn't know the first thing about elementary engineering statics (much less dynamics), all of which requires a deep understanding of differential equations to even begin. I.e., in the case of semiconductor signals, materials, etc., this is calculus beyond just the first year, whereas civil engineers (bridge builders) can get by with just two semesters for statics (dynamics and EM are another story).
It was like watching a bad joke for 5+ years. The design world today fully admits the greatest RISC design is Alpha -- purposely designed to be the most anal about timing and avoiding bottlenecks, it destroyed x86 (let alone Itanium) in performance. But Digital decided to break itself apart in the late '90s (they designed almost everything at the time -- from chipset logic to network ASICs) to make a huge amount of money for its stockholders (which it did, and very well). At least AMD gained most of their knowledge, especially when it absorbed the spin-off API Networks (API stood for Alpha Processor, Inc.).
Simultaneous Multithreading (SMT) itself is a technique -- a layer of register/microcode-backed abstraction in the processor -- which attempts to fill unused stages of a pipeline with another, albeit faux ('virtualized', sort of), pipeline of stages. In other words, you only have X cores, and you are running n*X threads through them -- where n is usually 2 (any more with today's x86 legacy designs would be far more inefficient).
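A toy issue model makes the trade-off visible. This is nothing like a real core -- one issue slot, no renaming or partitioning cost -- but it shows both why SMT helps stall-heavy threads and why it gains nothing (and, once real overhead is counted, can lose) for threads that already keep the pipeline full:

```python
# 'I' = instruction ready to issue, 'W' = a stall cycle (cache miss,
# dependency, etc.). One shared issue slot per cycle; stall cycles for
# all threads elapse in parallel. Toy model only.

def cycles_to_finish(threads):
    ptrs = [0] * len(threads)
    cycles = 0
    while any(p < len(t) for p, t in zip(ptrs, threads)):
        cycles += 1
        issued = False
        for k, t in enumerate(threads):
            if ptrs[k] >= len(t):
                continue                  # this thread is done
            if t[ptrs[k]] == 'W':
                ptrs[k] += 1              # stall elapses regardless
            elif not issued:
                ptrs[k] += 1              # one issue slot per cycle
                issued = True
    return cycles

stally = list("IWWIWWII")                 # thread that stalls a lot
busy   = list("IIII")                     # thread with no stalls

print(cycles_to_finish([stally]))          # 8 cycles alone
print(cycles_to_finish([stally, stally]))  # 10 for both: big win vs 16
print(cycles_to_finish([busy, busy]))      # 8 vs 4 alone: zero win
```

The last line is the gaming case: two compute-bound critical-path threads just take turns, and any SMT bookkeeping overhead makes that a net loss.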
Sun's, now Oracle's, T-series ('T' essentially for 'thread') actually does far more threads per core. It traces its lineage back to the RISC design of SPARC from Berkeley.
But with x86-64, even using RISC86 microcode, that's impossible. It's a complex beast, and Intel and AMD have saddled it with all sorts of extensions that make wider SMT impractical, limiting it to 2 threads per core.
Side Trivia: Ironically, Sun (whose name was literally taken from the "Stanford University Network") used the University of California at Berkeley's SPARC design instead of sticking with MIPS from ... Stanford.
At the basic level ... this should have nothing to do with the OS, as the processor presents n*X threads as logical cores -- e.g., 8 cores as 16 threads, so the OS schedules 16 'cores'. However, the OS can factor in.
E.g., Windows isn't exactly known for shipping a flexible, modular kernel and library set versus ... say ... Linux (don't get me started).
Because ... the main problem?
There's overhead in SMT. There is register renaming, stack management, and even pipeline stalls involved, depending on the stage and what it was doing -- things that can be really self-defeating if not handled efficiently. Just because the OS doesn't deal with it doesn't mean it's not going on.
And this happens on the Intel i-series cores, just like the AMD Ryzen.
But because Intel has been doing SMT since the NetBurst (Pentium 4) architecture, Windows has learned how to optimize for it. Still, there's a reason many of us disable SMT even on Intel processors for gaming and some other applications -- things that are thread-ignorant and have critical paths on just a thread or two.
SMT is good for either multi-user or well-threaded applications. It's not so good for most traditional gaming engines, which have critical paths on 1-2 threads.
Even worse? Virtually all compilers not only optimize for specific instruction sets and extensions (which are usually loadable), but more importantly, they optimize for scheduling. The classic example is the old, in-order (non-out-of-order) Atom. If you optimized for a full Intel i-series, the code would suck (like 20%+ degradation) on the in-order Atoms. But if you optimized for the Atom, there was only a 1-2% hit on the i-series.
This is AMD's first SMT design. There are bound to be use cases where it degrades performance ... just like the Pentium 4's SMT did. Over time, they will be addressed via AMD microcode updates and then newer processors, as well as at the OS level. You'll see the upstream Linux kernel publish these first, then FreeBSD will use them, which Apple (Darwin and the other codebases of MacOS X, iOS, etc. are based in part on FreeBSD) will pick up, and the NT team at Microsoft will track them as well.
It goes the other way as well, but usually not in a way consumers see.
E.g., on Wall Street in 2008, we had the very first Intel Nehalem processors, and multi-socket systems were crashing left and right due to coherency issues. I'm under NDA, but let's just say there were errata published that you can read about in the Linux kernel.
The root cause: when Intel finally upgraded its legacy 36-bit platform addressing limitation at the i486-designed, i686-extended TLB (Translation Lookaside Buffer) from 1989 to 38-bit/256GiB support in 2008, they had all sorts of issues maintaining coherency between sockets. AMD had gone 40-bit/1TiB with the EV6 bus (from the Alpha -- again, AMD really was way ahead of Intel, thanks to the API acquisition) with the original 32-bit Athlon, and learned a lot with the Athlon MP (multiprocessor), which was then directly applied to the Opteron and its HyperTransport design. Bam! By 2004, AMD had addressed everything from a 1TiB+ RAM-capable TLB to a full, complete I/O MMU in the Opteron from the get-go.
Intel didn't, and not until the 2nd revision of the i-series in the '10s, well over a half-decade later, including the I/O MMU on the new QuickPath Interconnect (QPI) implementations. Again, most people don't see these because they are multi-socket considerations, although they can be multi-core considerations if QPI is in the same socket (e.g., 14-18 core E5 processors). It's why we preferred AMD until the i-series Core processors.
Furthermore, when AMD had a similar issue when they bumped the TLB to support the full limits of the x86-64 'Long Mode' design (which is limited to 48-bit/256TiB, or 52-bit/4PiB paging), they decided to hold off on releasing the multi-socket versions. The IT enthusiast sites lambasted AMD for this, but they never knew about Intel's massive screw-up.
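For reference, all the capacities quoted above fall straight out of the address widths -- a quick sanity check:

```python
# 2**bits bytes, rendered in binary units; checks the figures in the text.
def capacity(bits):
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size, i = 2 ** bits, 0
    while size >= 1024 and i < len(units) - 1:
        size //= 1024
        i += 1
    return f"{size} {units[i]}"

for bits in (36, 38, 40, 48, 52):
    print(bits, capacity(bits))
# 36 -> 64 GiB (old PAE limit), 38 -> 256 GiB, 40 -> 1 TiB,
# 48 -> 256 TiB, 52 -> 4 PiB
```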
Ironically, it was those of us doing cutting-edge, high-speed trading applications on Linux who saved a lot of even the Windows Server world from the same fate with Intel's original 'Core' Nehalem designs.