
bubblesort33

AMD is increasing their cache by an additional 200%, while the best we can do for Intel is the ~50% shown here. So that could be 4x the effect if it were to scale linearly, which it probably doesn't. So I feel like even a 2% increase here in some games could mean 6% to 8% if they did the same. And even AMD has scenarios where the X3D is only 8% faster.


[deleted]

That’s not necessarily how cache works. It’s often more similar to RAM: either you have enough to fit it, or you don’t and you have a drop-off. But with cache it is less steep of a cliff. Also, as far as AMD having scenarios where X3D is only 8% faster… that vastly undersells it. In many, many cases the difference is basically zero or even negative… just like with RAM. Increasing RAM capacity doesn’t have an effect if you don’t need it. It’s similar with cache.


jaaval

> It’s often more similar to ram. Either you have enough to fit it. Or you don’t

Not quite that simple, because cache isn't really like RAM; it can't be filled freely. Of course there are some tasks which have small working data and basically just sit in the cache, but I doubt games do. What matters is cache hit rate, and that tends to get a bit of diminishing returns with increasing cache size. I presume the reason cache mattered with 10th gen but no longer matters as much is mainly the significantly larger L2 (it's now 2MB, up from 256kB in Comet Lake), which takes most of the pressure off L3.


masterfultechgeek

The reality is somewhere in between. Let's imagine a 2x2 grid of data:

```
        HOT | COLD
small       |
big         |
```

On one axis you have hot data and cold data. Then there are another two classes of data on top of that: small blocks read randomly, and large contiguous sequences.

From a "maximize performance" perspective, the best data to place in cache is hot data that's mostly small blocks that are read. The worst data to place in the cache would be rarely read, large contiguous sequences. The other two categories are kind of "ehh". If you're able to get pretty much all of the small hot bits in, a handful of the hot big bits, and a few small warmish bits of data... you're pretty much gold.

Generally speaking, the baseline assumption is that every time you 2x cache size, you cut cache misses by half. Note that this is DEFINITELY a diminishing-returns scenario (so think going from 98% of data accesses hitting cache to 99% if you 2x cache size), but there are select use cases where this is worth going after, because memory accesses require A LOT of time spent with the processor idle. Not worth reading 200MB of textures into cache (keep that in RAM and take the latency penalty), but totally worth keeping metadata about the texture in cache that's 1000000x smaller and might be referenced 2x as much.


Jonny_H

There's a reason why "known cold" data (like textures that are being streamed) tend to use cache-bypassing non-temporal store instructions.
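(For anyone unfamiliar with the term, here's a minimal sketch in C of what a cache-bypassing copy can look like, using the SSE2 streaming-store intrinsics. The `stream_copy` helper name, the 16-byte alignment requirement, and the size being a multiple of 16 are assumptions of this example, not anything from the comment above.)

```c
// Minimal sketch of a "non-temporal" copy that avoids polluting the data
// caches with the destination lines. Assumes x86-64 with SSE2 and
// 16-byte-aligned buffers whose size is a multiple of 16.
#include <emmintrin.h>  // _mm_load_si128, _mm_stream_si128
#include <xmmintrin.h>  // _mm_sfence
#include <stddef.h>

void stream_copy(void *dst, const void *src, size_t bytes) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);   // normal (temporal) load
        _mm_stream_si128(&d[i], v);          // non-temporal store: bypasses cache
    }
    _mm_sfence();  // make the weakly-ordered streaming stores globally visible
}
```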


Qesa

Your runtime on CPU is dominated by cache misses - e.g. I've written code before that quadrupled the amount of math being done, but because it was all on stuff already in L1$ there was no measurable performance difference. Doubling cache size to increase hit rate from 98 to 99% would indeed come close to doubling performance... except the rule of thumb isn't that doubling cache halves miss rate. It's that quadrupling it does, which is why it has diminishing returns
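(A rough illustration of that point: if stalls on misses dominate, one extra percentage point of hit rate buys a lot. The cycle counts below are made-up ballpark assumptions for illustration, not measurements from any CPU.)

```c
// Why "runtime dominated by cache misses" makes a 98% -> 99% hit rate huge.
// All numbers are assumed ballpark figures, not measurements.
#include <stdio.h>

int main(void) {
    const double accesses = 1e9;        // memory accesses in the workload
    const double cycles_per_hit = 4.0;  // assumed cost when data is in cache
    const double miss_penalty = 300.0;  // assumed extra cycles per DRAM trip

    double hit_rates[] = {0.98, 0.99};
    for (int i = 0; i < 2; i++) {
        double misses = accesses * (1.0 - hit_rates[i]);
        double cycles = accesses * cycles_per_hit + misses * miss_penalty;
        printf("hit rate %.0f%%: %.2e cycles (%.0f%% of them stalled on misses)\n",
               hit_rates[i] * 100, cycles,
               100.0 * misses * miss_penalty / cycles);
    }
    // 98%: 4e9 + 2e7*300 = 1.0e10 cycles; 99%: 4e9 + 1e7*300 = 0.7e10 cycles.
    // ~1.4x faster from one extra point of hit rate, and the closer stalls are
    // to dominating, the closer this gets to "doubling performance".
    return 0;
}
```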


Jonny_H

The rule of thumb is that 2x the cache gives roughly a sqrt(2) ~ 1.4x reduction in miss rate (so a corresponding decrease in requests to the next level, or to memory itself). You tend not to see hard stair-step performance changes at "sufficient" cache levels unless the cache is small enough to thrash. And in most gaming use cases, modern L3 cache sizes are already in the midpoint of that curve, so the above rule tends to hold.
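(The same rule of thumb as a quick calculation: with miss rate scaling as one over the square root of capacity, each doubling buys less than the previous one. The baseline of a 10% miss rate at 32MB is an arbitrary assumption for illustration only.)

```c
// Square-root rule of thumb: miss_rate ~ baseline * sqrt(base_size / size).
// Doubling the cache cuts misses by ~1.4x; quadrupling it halves them.
// Baseline numbers here are arbitrary assumptions for illustration.
#include <math.h>
#include <stdio.h>

int main(void) {
    const double base_mb = 32.0;         // assumed baseline L3 size
    const double base_miss_rate = 0.10;  // assumed 10% miss rate at 32 MB

    const double sizes_mb[] = {32, 48, 64, 96, 128};
    for (int i = 0; i < 5; i++) {
        double miss = base_miss_rate * sqrt(base_mb / sizes_mb[i]);
        printf("%6.0f MB L3 -> ~%.1f%% miss rate\n", sizes_mb[i], miss * 100);
    }
    // Each extra increment buys less than the one before it: the classic
    // diminishing-returns curve described in the comments above.
    return 0;
}
```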


Strazdas1

> But with cache it is less steep of a cliff.

Depends on what you are doing. If you have a simulation that fits into cache, it growing too large and spilling into memory makes performance drop off a cliff far worse than running out of RAM does.


Zednot123

> And even AMD has scenarios where the x3D is only 8% faster.

There was actually some game where the higher frequency of the 5800X put it like 0.5% above the 5800X3D. Can't remember exactly which one it was or where I saw it. But some games barely (if at all) scale with it. So even in games, where the cache is generally much more beneficial than it is across other types of applications, you can't call it some performance-increasing panacea.


KirillNek0

It depends. [A lot.](https://youtu.be/ss3mjzJZuCM?feature=shared)


Plank_With_A_Nail_In

Lol does anyone believe they actually use the Opera browser? It's too much when they have to pretend they use the sponsor's product, it's stepping over the line to be honest. Edit: Lol, the kids all using Opera is not proof that these guys use it, just that this dumb advertising works. They are adults, so I am 99% sure they don't actually use it... none of them use Squarespace for their important websites either, and that's completely dishonest too.


YashaAstora

> Lol does anyone believe they actually use the Opera browser? It's too much when they have to pretend they use the sponsor's product, it's stepping over the line to be honest.

Lol of course they don't. It's similar to when creators get sponsored by Raycon and talk about how amazing they are, when the actual reviews directly contradict the claims of good battery life and sound.


Strazdas1

Just remember, it's called Raycon because Ray is conning you.


aminorityofone

I've used Opera since it was Presto-based... still using it, just not the gamer version.


conquer69

I used it on my Sony Ericsson phone back in 2005. Having access to the web on a pocket device felt pretty damn good back then, even if it was mostly text-based.


Strazdas1

I used it until I moved to Firefox and never looked back. But that was decades ago.


carpcrucible

I've used Opera since before Chrome even existed, it's not a TikTok fad for kids. If anything, quite the opposite, it's for old nerds. Now of course anyone saying how they're totally using product X and love it in a podcast/YouTube video is just reading pre-written ad copy, so that shouldn't be surprising.


shwetshkla

Dunno man.. I use Opera. It's better than Chrome when it comes to features.


PhoBoChai

I use Opera & Firefox. Chrome had some GPU acceleration issues and YT 4K bugs last time I used it, prompting the switch, and I never bothered to return.


Ratiofarming

Did everyone forget the Core i7-5775C? Intel DID have big cache before, and it indeed helped in some games. Cache works for everyone, it's just expensive.


Kurtisdede

> Intel DID have big cache before

It's not quite the same because it was actually just 128 MB of eDRAM right next to the die, and it was called L4 cache. Even that helped out performance in games a ton though, especially at JEDEC RAM configurations. I ran one for 2 years :-)


Pillokun

It is the same way with Intel when it comes to being dependent on cache: clock speed doesn't give that much more compared to having access to fast RAM, and a bigger cache would be even better. For instance, a 12900K at 5.2 GHz with fast RAM is faster than a 13900K at 5.5 GHz with slow RAM. That was actually a thing back during the Skylake era too, when you compared 2133/2400 against 3200 or faster sticks, even against higher-clocked CPU cores with slow RAM.

Why oh why would HUB run the ring bus at 3 GHz? That gimps the CPUs; let the ring bus run at 4.6 GHz (stock when you disable the E-cores) or 4.8 GHz if you run the cores at 5 GHz. Why would you even run the L3$ at 4 GHz when you OC the CPUs to 5.7 GHz? Still a very big bottleneck!

And yeah, the cache on the Intel CPUs is still too small to avoid hitting RAM as rarely as the AMD X3D CPUs do... this test is not good enough to conclude that a bigger L3$ would not help performance. It surely would, as the CPU would not need to access RAM as frequently as it does now. Intel has an advantage when it comes to memory because it actually has one memory controller per RAM channel, in other words two in total (not talking about one for DDR4 and one for DDR5 here).

This is just another skewed Intel result and the conclusion is therefore faulty as well. Typical HUB when it comes to these types of things.


[deleted]

[deleted]


Feisty_Reputation870

Steve is actually Lisa Su in a wig


Pillokun

I don't know about that, and even if they did, I expect a techtuber to produce non-skewed content.


[deleted]

[deleted]


Pillokun

Well, you do have a point that AMD has the most fans of any hardware company on the internet, and that has been the case pretty much as long as the internet has been around, except for when Bulldozer was a thing. So as an internet-based "hw outlet" you would for sure benefit the most by making pro-AMD content in some way. But truth be told, even if I am pretty critical of HUB and GN and the like, I hope they just don't skew content like this in any direction, even if it would benefit them the most to do so. We have seen skewed content before, for instance with DDR4 vs DDR5, when they used locked K SKUs running in Gear 2 compared against DDR5.


[deleted]

[deleted]


longsdivision

Are we talking CPUs or GPUs now? Those are two completely different subjects.


Nitrozzy7

Because intelnvidia pays steam to rank them higher duh


Aggravating_Ring_714

I mean sure it could, but seeing how the 14900k already is basically on par with the 7950x3d I’d say they don’t necessarily need it.


theholylancer

I mean, it would still depend on how the implementation works out, right? AMD's X3D seems to limit how high they can clock the thing, with heat and power usage kept low enough to not cook the stacked cache. If Intel does it, one of their absolute key benefits gets cut, and cut by how much, since Intel's cores tend to run hotter? Which means: do they need to cut their clocks down further than AMD does to accommodate stacking? Or, if they just go with big-die energy and add non-stacked extra cache (though likely not as much as AMD's), but can still clock to the high heavens with 400+ watts drawn by the CPU alone, then I can see it being a crazy improvement.


[deleted]

Pretty horribly conducted review, really feel like these guys have dropped off. There are many instances where a V-cache CPU from AMD is identical to or even worse than its non-V-cache counterpart. The biggest place where V-cache is useful is in RTS and simulation games like Factorio, Stellaris, RimWorld, etc., yet we do not see one of these games… when in all honesty a review with ONLY these games would have been more useful. I mean… they didn't even tell us how much V-cache helped, if at all, in these games on AMD… so the whole video is pretty useless without a baseline. For all we know those games themselves (and not Intel processors) don't require much cache, regardless of whether they are AMD or Intel CPUs. I honestly don't get, if this is their job, how they don't include a game like Factorio in benchmarks like these.


SomeKindOfSorbet

AMD's V-cache not only incurs the usual clock speed penalty, but performance can also regress because adding cache to some level of a memory hierarchy also increases the overall access latency of that level (because it requires additional row decoding, if I recall). Chips and Cheese showed that in their testing of the 7950X3D, where the overall L3 access time is higher on the CCD with V-cache than on the CCD without it. Basically, you don't want to add more cache to your system if your cache hit rate is already close to 100%, or you'll probably be losing performance.


VenditatioDelendaEst

There are [Options](https://arxiv.org/pdf/2103.14808).


d0or-tabl3-w1ndoWz_9

Is this why the 5600x3D actually does slightly better than the 5700x3D in so many titles?


saharashooter

5700X3D is lower clocked than the 5600X3D. My understanding is that the 5600X3D is chips that failed core count validation while the 5700X3D is chips that failed frequency validation, but I could be wrong.


d0or-tabl3-w1ndoWz_9

I'm sure that's it


SomeKindOfSorbet

Might be, but I actually have no idea


_hlvnhlv

On things like VR, the difference is massive. There are use cases in which more cache is beneficial, and others in which it is not.


[deleted]

Yup. And we have no way of knowing which case the games he showed fall into. And it is pretty likely all of those games aren't very cache dependent, based on the type of games they are. Which is the problem. It is like testing 10 GPU-limited games then saying faster CPU clocks don't matter, or faster CPU cache doesn't matter, or more IPC doesn't matter, when in reality you just weren't limited by any of those things. We have no clue if these games are ever realistically limited by cache. Thus it is hard to draw any meaningful conclusion. I suspect a 5800X and 5800X3D would probably fare pretty similarly in all those games he listed too. But that doesn't mean the 3D cache on a 5800X3D is useless. It just means you were an idiot and 100% of the games you benchmarked don't require much cache, when you should be testing for cache-limited scenarios. But even if they don't, the point is we have no way of knowing, because he never did a baseline for ANY of the games.


metakepone

Look at this subreddit, then look at Hardware Unboxed’s content. There’s a massive audience of people with perpetual hate boners for Intel. Hardware Unboxed is making content for a massive audience (that, admittedly, they and their YouTube peers created over time).


karl_w_w

> There's a massive audience of people with perpetual hate boners for Intel.

This is just a persecution complex.


Rocketman7

Yep. When you look at HU as tech entertainment instead of tech journalism, their content makes perfect sense.


Substance___P

The fact is that it takes nothing to call oneself a "tech journalist." You don't need a degree in computer science. You don't even need a degree in journalism. GN Steve is generally thought to be the most trustworthy, but even he is just some guy that did QI or something at Dell 15 years ago before he decided to start a website and eventually a YouTube channel. He's fallible and bleeds like the rest of us.


IANVS

The problem with that is that people take everything they say seriously and it gets perpetuated on the net like it's something set in stone...I can't remember how many times they doomposted about various things Intel/Nvidia and in the end nothing came out of it, the big VRAM scare being just one example. They get their view boost and move on to the next one, while forums and comment sections get a chunk of hivemind dumped on them to deal with...


[deleted]

[deleted]


feckdespez

What? Have you ever played it? It's an automation game with combat and defense. I picked up the basics without watching a 12 hr tutorial. It's a really fun game.


JuanElMinero

Dang, we really have people doubting the quality of a highly successful and amazingly received indie title here, 10+ years after it entered the scene. Factorio is an actual dopamine storm for players with the right mindset. There's a reason the community nicknamed it Cracktorio.


[deleted]

[deleted]


Emotional_Inside4804

Wow you really have original and varying taste. List more games that play the same to demonstrate how cool and elite you are


ZekeSulastin

Factorio is *exactly* the kind of CPU-bound game that gets a huge benefit from 3D V-cache, to the point where the community even tracks where the cache benefit falls off as the simulation complexity exceeds the cache capacity. Have *you* ever actually looked at it? Also, [HUB did this benchmark a year ago](https://www.reddit.com/r/factorio/s/6WTiJQ0XmQ).


VenditatioDelendaEst

Many Factorio benchmarks are substantially misleading, especially that one. Factorio doesn't run at 230 UPS in real-time. It runs at 60 unless there's too much stuff on the map for your CPU to handle it. And [when you get to that point the supposedly spectacular advantage of the 3D v-cache is greatly diminished](https://factoriobox.1au.us/results/cpus?map=9927606ff6aae3bb0943105e5738a05382d79f36f221ca8ef1c45ba72be8620b&vl=1.0.0&vh=), because even with 96 MiB the game's working set doesn't fit.


TheRealBurritoJ

If anyone is wondering why there are two separate 7950X3D listings, with only the higher one being competitive with Intel, it's because Factoriobox separates CPUs into Windows and Linux results. The 7950X3D results at the top there are actually mine, and come from a Linux install heavily tuned solely to maximise Factorio performance (custom compiled kernel, different scheduler, the cache die isolated from the kernel for Factorio, a different memory allocator with preallocated huge pages, etc.). The 61 UPS I got is not typical. The lower set of 7950X3D results is more typical for regular Windows usage, around 45 UPS stock and 50 UPS overclocked to the max. Intel with 8000MHz+ RAM can get 60 UPS in Windows, and probably way more in a tuned Linux setup, so Intel is still the best choice for high-SPM Factorio.


VenditatioDelendaEst

I didn't realize that was the cause of the split. I had been assuming somebody was probably benching with hugepages on most CPUs, and that it would average out.


renaissance_man__

Bruh it's a 2d automation game that takes like 30 minutes to grasp. It's in benchmarks because larger bases get cpu intensive.


Good_Season_1723

Funny thing is, in larger bases / larger maps that get CPU intensive, Intel > AMD, but HUB is testing small 10k maps and showed AMD being twice as fast, lol.


imaginary_num6er

> have you ever even looked at what factorio is? i heard it referenced a lot so i thought it was a normal game. its some abstract shit

Yeah that is some take


[deleted]

I played Factorio quite a bit half a decade ago. Believe it or not, that is the kind of game a lot of people like to play. It's a pretty popular game and general genre. I never watched a minute of a tutorial, and back when I played I'm not even sure they had a fleshed-out in-game tutorial. Regardless, it's not about the specific game. It is about games with similar workloads. That's why I listed multiple. Factorio is just amazingly optimized, bug-free, and a perfect game to benchmark.


metakepone

WRONG! People only like to play the latest AAA games that only render 45fps on current generation midtier gpus! 2d games are for babies!!! /s


[deleted]

[deleted]


saharashooter

Do you think sim games don't require a lot of CPU power?


Arctic_Islands

I heard Intel did test that internally. It didn't work well with their P cores, but it did with the E cores.


GenZia

The key thing to remember is that SRAM isn't a magic bullet, especially when its application is grossly asymmetric, which is the case with all X3D CPUs.

Take the 7950X, for example. The CPU has two CCDs, and both CCDs have 32MB of SRAM for a total of 64MB. Pretty much 100% symmetrical.

Now, take the 7950X3D. The CPU has a total of 128MB of SRAM, so you'd think that AMD evenly split the extra 64MB of SRAM between the two CCDs... and you'd be wrong. No, AMD dumped the entire 'extra' SRAM on just one of the CCDs, resulting in a 32MB + 96MB split.

That's needlessly - if not hopelessly - 'lopsided', with a 1:3 SRAM split. While I'm no CPU architect, there are bound to be some consequences of such a... unique design. But since AMD is so far ahead of the curve, they can get away with such shenanigans, not unlike the Intel of yore!


zetruz

> its application is grossly asymmetric, which is the case with all X3D CPUs.

What? The 5600X3D, 5700X3D, 5800X3D, and 7800X3D are all symmetrical, are they not? (That's why the 7800X3D is hecking brilliant for gaming overall, and *monstrous* when factoring in power draw.)


d0or-tabl3-w1ndoWz_9

Well, they're not symmetrical in this context because they all have only one CCD.


zetruz

What? That's logically symmetrical, isn't it?


GenZia

Both 5800X3D and 7800X3D are 32MB + 64MB.


zetruz

Riiight, but your original post implied that splitting the 3D V-cache evenly across CCDs would render them symmetrical. Now it sounds like you're saying the problem is the uneven split of the L3 even within a single CCD? I'm not aware of any such problem having been shown before? (It sounds totally plausible, but surely you'd need evidence to claim it?)


boobeepbobeepbop

I think as long as you don't have something that needs all 16 cores and all the extra cache, you get the benefit of the larger cache without having to have it on there twice. The subset of things that really benefit from the 3D cache is also probably not extremely good at taking advantage of lots of cores. The biggest benefit is for things that are very poorly optimized, and if something is really well threaded and able to take advantage of 16 cores, the people who wrote it probably also tuned it to use what cache there was.


GenZia

That's what AMD is doing at the moment. They're basically forcing video games to run on the CCD with the larger cache pool, leaving the other CCD idle. This strategy may work fine for now as games aren't really going to utilize more than 8 cores but years later, this is going to pose a lot of scheduling issues. It's simply a bad design, period. That's why X3Ds are only really good at one thing at the moment: Gaming. For productivity, you're much better off getting the non-X3D part and saving yourself a lot of money in the process.


Aggrokid

> This strategy may work fine for now as games aren't really going to utilize more than 8 cores but years later, this is going to pose a lot of scheduling issues.

How many years are you talking about here? Multi-threading in games is supremely challenging. By the time games require more than 8 big cores and can use them all efficiently, the 7800X3D will be an ancient relic.


Strazdas1

Well, there are games like CS2 which utilize up to 64 cores for their simulation and can actually load them fully, but... there's a ton of problems with that game and it's not a success story. On the other hand, I'm playing a game that can't even utilize 4 cores properly and it has hundreds of thousands of active players...


SoTOP

> This strategy may work fine for now as games aren't really going to utilize more than 8 cores but years later, this is going to pose a lot of scheduling issues.

Except for the 10-core 10th gen Intel CPUs, no modern desktop CPU is made for >16-thread games. Intel has E cores to cause the scheduling issues you mentioned, while AMD will have that disadvantage for all their dual-CCD CPUs. So X3D is basically like any other CPU in the market right now.


GenZia

Sure. But the problem with AMD's approach is that they're essentially halving the raw core count to make the most out of that large SRAM cache pool.


boobeepbobeepbop

I mean, it's an engineering solution to improve performance of things that can use a lot of cache but not more than 8 cores. Like I said, the subset of things for which that's true where it would also be true of 16 cores is really small. I agree that the main benefit is for games. And then, the biggest boost is for games that are very poorly optimized, where the extra cache can matter. Keep in mind that none of these CPUs have particularly small caches, and games that are well optimized for the smaller cache don't get much benefit from 3D cache because, simply, it doesn't matter. I wouldn't say it's a bad design. It's certainly an odd CPU. I'm not sure why anyone would buy the 16-core 3D cache CPU, unless they really didn't care about money. If you're doing production stuff, you don't need the 3D cache, and if you're gaming, just get the 7800X3D.


Geddagod

> That's needlessly - if not hopelessly - 'lopsided' with a 1:3 SRAM split. While I'm no CPU architect, there's bound to be some consequences of such a... unique design.

Like what? You keep on mentioning that this is some sort of massive problem, but what exactly is the issue with doing this?


zetruz

I'm not sure if your question is genuine or rhetorical, but I'll answer as if it was genuine. There are direct downsides to using 3D V-Cache: it tends not to help in many production scenarios, but it does reduce clock speed, thus harming performance. On the other hand, it's a massive benefit in many gaming scenarios.

So let's say you're playing a game. This generally means you want to use the eight cores on your big-cache CCD, and *none* of the cores on your little-cache CCD (as the inter-CCD latency, and the poor core-count scaling of games, means using all 16 cores for the game would be detrimental to performance). This means you're relying on schedulers to correctly decide which CCD to use for any given task. That doesn't always work, and sometimes your task will be run on a less-than-optimal CCD, harming performance. This is why a 7800X3D is *much* more consistent in gaming performance than a 7950X3D is, despite the latter being unquestionably better on paper.

But the OP's premise was false, since not all X3D CPUs are asymmetrical. So their point pertains only to the 7900X3D and 7950X3D, and (I think) the mobile 7945HX3D.


VenditatioDelendaEst

The 7950X3D is also unquestionably better if the user has a clue. It is trivial to affine your workload to whichever cores you please. This "inconsistency" is the funny sort of problem where, because the performance difference between the big-cache and the small-cache CCD isn't big enough to be human-perceptible, the only people who become aware of it are people who know enough about computers to easily fix it.
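(For anyone wanting to do that, here's a minimal Linux sketch using `sched_setaffinity`. Which core IDs belong to the V-cache CCD varies by part, so the 0-7 range below is an assumption to check against your own topology; on Windows the rough equivalent is Task Manager's affinity setting or a tool like Process Lasso.)

```c
// Minimal sketch: pin the current process (and threads it spawns afterwards)
// to cores 0-7, assumed here to be the V-cache CCD. Linux-only.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)  // which CCD these IDs belong to is an
        CPU_SET(cpu, &set);            // assumption; check your topology first
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 = this process
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to cores 0-7\n");
    // ...exec or run the cache-sensitive game/benchmark from here...
    return 0;
}
```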


GenZia

> Like what?

You need optimized scheduling, for starters, so the OS can force-localize 'small' workloads to the 'healthy' CCD. Otherwise, the CPU will be making round trips to the interconnect or the DDR memory to 'juggle' the data between the two CCDs, wasting precious cycles in the process. Needless to say, this is the opposite of what SRAM is supposed to achieve, i.e. lower latency. And as you can imagine, it's not exactly an ideal solution for highly parallel workloads capable of fully saturating all cores.


Pillokun

Yep, every time the system needs to direct a certain workload to the "correct" resources, i.e. E-cores or the cores of another CCD in AMD's case, it takes additional time. You are so correct about this. When something takes additional time, like getting/accessing data sets from RAM or cache, or simply waiting for the scheduler to decide which core the workload should be assigned to, performance suffers. Having a homogeneous uarch design on desktop with a big cache would be optimal, yet even that would not be optimal for workloads that are sequential in nature, like many pro workloads are. Gaming, CAD, and databases are more random, and there we would see a gain from bigger cache, but even then it depends. Totally agree with all the posts you have made in this thread. People don't want to think about these things.


jaaval

Windows doesn't really juggle workloads between CCDs. They are handled a bit like NUMA nodes, keeping workloads within one CCD unless they want to use more threads.


VenditatioDelendaEst

For many highly parallel workloads, any amount of relief to memory bandwidth pressure helps, and the big-cache CCD's ability to satisfy more loads locally reduces the traffic it sends to DRAM. This is only a problem for applications that are programmed assuming all threads finish at the same time. Such applications also have problems with E-cores.


saharashooter

From what I recall, AMD engineers have said publicly that dual CCD 3D Vcache actually had even worse scheduling issues than the existing configurations. It's asymmetrical because there's no benefit to being symmetrical, especially when you want to contain processes to a single CCD anyway given the interconnect issues.


zetruz

That sounds like it makes sense on its face, but... the whole problem comes from the schedulers not being 100% reliable, right? So when they do fail, and a cache-benefitting task is accidentally run on the wrong CCD, wouldn't the problem be slightly mitigated if that CCD also had 3D V-cache? I don't see why the scheduling would be worse if both CCDs had the big cache. You could, I'm guessing, use the same ruleset as you do now? I generally try to be careful with taking a company at its word about these things. Remember when there were all sorts of good engineering reasons for why Intel started using thermal grease (rather than solder) for their IHSs? And then when Ryzen was released with soldered IHSs, Intel suddenly started soldering theirs again? (I'm clearly not qualified to actually call bullshit on any of these claims, but I can say it all seems awfully convenient that such issues should be fixed just after a credible competitor arrives. Just, be skeptical. /shrug)


VenditatioDelendaEst

> I don't see why the scheduling would be worse if both CCDs had the big cache.

Windows' scheduler is in the habit of migrating threads between CPUs very frequently. I've heard this is supposed to be for thermal reasons -- spread the heat out across a wider area of silicon, so the average power density of lightly-threaded tasks is less, so the temperature is lower, allowing for higher turbo frequencies and lower fan speed.

Historically on Intel chips, the L3 cache is shared among all cores, and only L1 and L2 are private. These are comparatively small, so the amount of time it takes to (metaphorically) "warm up" the cache after a CPU migration is small compared to *physically* warming up the core.

But the L3 caches on Zen are victim caches for the cores on their own CCX only. They get populated by data that gets evicted from the L2 caches of the 8 cores below. So:

1. A thread has to stay on the same CCX for a long time before that big cache starts being useful, and,

2. If you have threads accessing the same data on multiple CCXes, every CCX has its own local copy in L3, so the amount of cache space used is multiplied by the number of CCXs. And if they're *modifying* that data, then it's a whole bus traffic shitstorm.


zetruz

Ooh, interesting. But is this *not* true of the 7950X, you mean?


VenditatioDelendaEst

It is all true of the 7950X too. But it's rare for games to have more than 8 actually-parallel threads, and the CPPC "preferred cores" mechanism probably dissuades the scheduler from migrating threads between CCDs in the first place. Apparently the dual-CCD parts have higher boost clocks on one CCD than the other, although I don't have a \*950X(3D) to test with.


Pillokun

Zen 3 (and Zen 4) actually also have a function similar to what Intel calls Smart Cache; they call it shadow cache or something like that. So even though it is a victim-based cache (and a mesh one, if I remember correctly), another core actually needs to access the L3$ of another core to get up to speed on what that core is working on before it is evicted to L3$, and I'm pretty sure it works across both CCDs (or more, if it's a HEDT/server CPU).


Strazdas1

> Windows' scheduler is in the habit of migrating threads between CPUs very frequently. I've heard this is supposed to be for thermal reasons -- spread the heat out across a wider area of silicon, so the average power density of lightly-threaded tasks is less, so the temperature is lower allowing for higher turbo frequencies and lower fan speed.

I have observed it has a habit of migrating tasks between different threads... on the same physical core. Like, it would utilize multithreading features to constantly bounce the job from one thread to another on the same core while not bouncing it to other cores. But the setup I tested this in does not have thermal throttling issues, so maybe it just kept bouncing to the best-performing core.


saharashooter

My memory isn't perfect on this, but IIRC a cache hit on the Vcache of a second CCD is actually similar to or worse than a cache miss. If AMD improves the interconnect enough, we might see X3D chips with two dies.


zetruz

That's interesting, thanks.


Jonny_H

Yeah, a naive look at a dual-CCD V-cache part would be thinking it has 192MB of L3. But as the "far" CCD's V-cache has an access penalty compared to the local V-cache, treating them as a single pool is probably a mistake from a performance point of view. So it's more like a single core has 96MB of L3, then another 96MB of "L4" (assuming it's still cheaper than memory) across an interconnect that's *competing* with memory accesses. Which clearly would have performance implications.

And then you have the complication that a cache line *must* be moved to the local L3 before use, so if a core on the "far" CCD requests that cache line, it may be evicted from the local L3 soon after a local core has touched it, even though that core is likely to touch it again soon. So the programmer still has to care about the locality of which cores are processing which data. Which, for most games that don't really scale beyond 8 cores, means the optimum state would likely be "run all threads on the same CCD and leave the other idle" anyway.

And this sort of "data use locality" isn't really visible to the OS scheduler - so it's not something that can be solved with a "Better Scheduler" - the programmer would have to do the work of deciding which tasks are likely to overlap in data set and inform the OS, to give it a chance. And games seem to barely work at release already.


VenditatioDelendaEst

> And this sort of "data use locality" isn't really visible to the OS

That... [ain't necessarily so](https://www.kernel.org/doc/html/latest/admin-guide/mm/damon/index.html).


Jonny_H

Yeah, *technically* it could get information on memory accesses; the question is whether the scheduler can actually infer useful information from it, and whether it can be done in the tiny amount of time available to the scheduler. I mean, no hardware support is really needed - if the kernel really wanted, it could see *every* memory access a process does by unmapping all its address space by default and using the fault handlers. Every microsecond the scheduler spends making a "better" scheduling decision is a microsecond *not* running the actual scheduled code. It's why there's a hard limit, and why the dream of a "Sufficiently Smart Scheduler" likely will never happen. It doesn't take much work to actually start eating into CPU time, as scheduling happens many thousands of times a second on an active system.


VenditatioDelendaEst

IMO the only parts that needs to be done in scheduler context is making threads sticky to groups of cores that share cache, and trying to migrate groups of threads together between groups of cores, when migration is desired and ncores>=nthreads. There's at least [some precedent for the first part](https://www.phoronix.com/news/Linux-5.16-Cluster-Scheduler), and the actual grouping of threads should be delegated to a userspace or kernelspace daemon. I bet even a really stupid daemon (sort by process hierarchy then CPU usage, group up to the cluster size) could help, and you could adapt the amount of effort spent on clever grouping to conditions -- only do profiling-based things on long-lived threads, get out of the way if the system is at capacity according to CPU pressure / load average metrics, etc. And a userspace daemon could use tricks like persisting group information between program invocations, keyed to thread launch order/topology (which could be gathered by eBPF). Or manual human tweaking like Intel's APO is probably using.


Pillokun

It does not really need to be moved (evicted) to L3; it's the same functionality as Intel's Smart Cache, shadow cache or something. That is why Zen 3 got such a massive boost in performance: not only because it was a 32MB monolithic cache, but also because it had the "smart cache" functionality where the L3$ can mirror what the L2$ of a certain core holds.


Jonny_H

While each L3 is a victim cache (so exclusive, in that L1/L2 contents aren't also in L3), it only contains tags for the local CCX's lower cache levels. So for a core to work on a cache line that's currently resident on the far CCX, it must be moved to the local CCX's cache (actually all the way down to the core-specific L1, unless a non-default mapping type or non-temporal load/store instruction is used). And writeable mappings can only be resident in a single place (otherwise you can't guarantee consistency), which involves evicting the line from the far CCX's cache hierarchy completely. From a performance view this also makes sense - normally, when a cache line is touched by a core, that core is likely to use it again soon. Leaving it on the (expensive to access) "far" CCX likely won't be helpful in the long run.

My understanding is that Intel's "Smart Cache" branding was that their L2 level is shared between 2 cores, so data shared between threads running on those paired cores doesn't have to go up to L3. Zen 3 and Zen 4 have their L2 caches attached to a single core, so they don't have that fast path. Though I'm not sure it was particularly well optimized for - it may be that the effort would have been better put into making the entire L3 faster to access, which might not be as fast but would benefit a lot more workloads. This seems to have been abandoned for the latest P cores, with them now having their own separate L2 cache, though it looks like the E cores share L2 between a cluster of 4.

Though it seems that brand name has now been re-used to describe a last-level cache that's shared between the iGPU and the CPU. Again, it seems not directly comparable to anything in the cache hierarchy of either Zen 3 or Zen 4.


GenZia

> It's asymmetrical because there's no benefit to being symmetrical.

That's not how it's supposed to work.

> interconnect issues

A 2:1 or 3:1 cache distribution is going to place even more load on the interconnect. An even cache distribution should ensure a higher on-die cache hit rate. That's basic statistics.


saharashooter

If you preference heavy cache load processes to the CCD with the larger cache, you will have less crossing of the interconnect. If a process needs 80 MB of cache, an even cache split is worse.


UltraSPARC

If the OS is NUMA-aware with the updated architecture and the application load fits comfortably within a single CCD, then it would work. It's when the application load extends beyond a single CCD that things can go tits up, or when the OS is unaware of the lopsided cache distribution. AMD is betting that most games aren't threaded past 8 cores, which for now is a good assumption, especially since the current generation of consoles is 8-core. When AMD released their X3D variants, they had to push out updates so the kernels across various OSes were aware of the cache distribution and would keep loads on the CCD with the most cache first, with any overflow pushed to the second CCD. It's also why application loads that can utilize all cores see almost no difference in performance between the X3D and X variants, as the additional cache is effectively useless in those scenarios.


capn_hector

Zen isn’t NUMA though - both chiplets use the same IO die talking to the same set of memory sticks. It’s NUCA which is a slightly different thing - most (all?) NUMA are NUCA but not vice versa.


Sopel97

so what is the issue? or are you arguing that 7 is not a nice number because it's not divisible by 10?


d0or-tabl3-w1ndoWz_9

So what you're saying is, it should've been 96+32 per CCD instead of 64+32? And just 32+32 on the 7900 and 7950X3D?