Perhaps you’re thinking about SMT (e.g. Intel Hyper Threading) when you say “proper multithreading”?
I’m not sure it’s valid to say that only SMT is “proper multithreading”, especially since multithreading as a concept predates it by quite a way.
SMT has quite a few performance issues, since resources such as the L1 and L2 caches and the branch predictor are shared between the threads, which can lead to contention that hurts the performance of all the SMT threads sharing a physical core.
SMP is no less “proper”, and as core counts have increased significantly on commodity CPUs, the use of spinning threads bound to a single core each has become a common paradigm.
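To make the thread-per-core idea concrete, here's a minimal sketch that spawns one worker per core and pins each to its own core. It assumes Linux's `os.sched_setaffinity` (other platforms need a different API, hence the `hasattr` guard), and the worker body is just a placeholder:

```python
# Sketch: one worker thread pinned per core.
# os.sched_setaffinity is Linux-only, so we guard with hasattr;
# on other platforms the workers simply run unpinned.
import os
import threading

def worker(core_id, results):
    if hasattr(os, "sched_setaffinity"):
        # pid 0 means "the calling thread" for this syscall.
        os.sched_setaffinity(0, {core_id})
    # Placeholder for the real per-core work loop.
    results[core_id] = True

n_cores = os.cpu_count() or 1
results = {}
threads = [threading.Thread(target=worker, args=(i, results))
           for i in range(n_cores)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a real spinning-thread design each worker would busy-poll its own queue instead of returning, but the affinity call is the part that keeps a thread from migrating between cores.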
Oversubscription without SMT (i.e. many threads per core) is possible, but unless you have a workload where each thread is I/O bound with a substantial amount of time spent blocking, the overhead of scheduling and context switching means throughput will likely decrease.
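The I/O-bound case is easy to demonstrate: when threads spend their time blocked, the OS overlaps the waits, so oversubscription costs little. A small sketch (the `time.sleep` stands in for a blocking read):

```python
# Sketch: 32 "I/O-bound" threads on however few cores you have.
# Each blocks for 50 ms; because the waits overlap, total wall
# time stays close to 50 ms rather than 32 * 50 ms.
import threading
import time

def io_bound_task():
    time.sleep(0.05)  # stand-in for a blocking recv()/read()

start = time.monotonic()
threads = [threading.Thread(target=io_bound_task) for _ in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
```

Swap the sleep for a CPU-bound loop and the picture inverts: the threads contend for cores and the scheduling overhead is pure loss.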
All SMT does is provide multiple architectural thread contexts (instruction pointers and register state) on the same superscalar core. It increases utilization of the execution units and can therefore increase throughput.
Of course it increases latency, since those resources are not fully exclusive to a particular thread anymore.
Whether or not it's a good thing depends on what you care about. You could also argue that a good program would be able to saturate a single superscalar core with a single thread and thus wouldn't benefit from SMT at all, but I think that would be hard to guarantee in practice.
Well sure, but why would that make it "improper multithreading"? Is polymorphism based on vtables not "proper OOP"? We rely on many abstractions that aren't free in terms of CPU cycles because it makes development easier or less error prone.
And the cost of setting up, say, one thread per HTTP request will likely be negligible, because blocking I/O is where the time is spent anyway.
Any networking program doing blocking I/O is doing it wrong.
Your I/O should only be done synchronously if it's non-blocking.
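A minimal sketch of that pattern using the stdlib `selectors` module: the code is synchronous (no callbacks, no async runtime), but the socket is non-blocking and is only read once the selector reports it ready. The `socketpair` stands in for a network peer:

```python
# Sketch: synchronous, non-blocking I/O via readiness notification.
import selectors
import socket

a, b = socket.socketpair()  # stand-in for a connected network socket
a.setblocking(False)
b.setblocking(False)

sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)

b.sendall(b"hello")

received = b""
# Synchronous readiness loop: recv() is only called once select()
# says it won't block, so the thread never sleeps inside recv().
for key, events in sel.select(timeout=1.0):
    received = key.fileobj.recv(1024)

sel.close()
a.close()
b.close()
```

In a real server the loop would run forever and multiplex many registered sockets; the point is that "synchronous" and "blocking" are orthogonal.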
Disk I/O is muddier. It's actually quite different from networking, since it's more transparently managed by the operating system.
My limited understanding is that there's less cache thrashing (from multiple different workloads scheduled on the same core) and less scheduler overhead (from fewer threads overall).
Just to add that scheduling overhead goes away with SMT (assuming you don’t oversubscribe), but the sharing of caches and branch prediction logic is still an issue as you point out.
But how is this kind of multithreading (one thread per core) better than proper multithreading (many threads per core)?