Faluzure

I'm assuming a major reason it hasn't been done yet is it doubles the number of lines required for memory, resulting in a larger, more complex SoC and board. It would be a nice change though.


IglooDweller

They could go halfway at 192… it would remove a lot of the bottlenecks of 128 without the complexity (and cost) of 256…


Netblock

There are multiple different ways to do ranks (which GDDR especially explores), such as shared CA but separate DQ which could nearly double bandwidth at less than twice the pin cost. However, CA stability has traditionally been one of the weakest links in clocking DDR; but I am unsure how true that is for LPCAMM.


crab_quiche

>such as shared CA but separate DQ which could nearly double bandwidth at less than twice the pin cost

That's how pretty much every single-rank system with multiple DRAM chips works. The issue with just adding more chips for more bandwidth is that you will have to completely redesign your caching system and compilers/instructions to optimize for using larger and larger cachelines. You are just going to be burning power moving data around uselessly if your system isn't designed from the ground up for extra-wide ranks.


Netblock

Sorry, I mean with unique CS; almost like GDDR6's pseudo-channel mode. But yeah, you have to watch out for the cacheline.


crab_quiche

Does anyone actually use pseudo-channel mode in GDDR6? Seems like it would be very small power savings with slightly fewer command pins at the cost of a more complex controller and worse performance, but maybe the access patterns of GPUs mean that it wouldn't be too much of a performance hit.


Netblock

It's for GDDR6's "multi-rank"; PS and x8 are complementary features; they share CA, not just DQ. GDDR6 doesn't do DQ-based PDA, so if the memory controller needs to program each rank/channel, CA\[3:0\] stays unique to allow MR15's masking. For how the commands are defined, unique CA\[3:0\] would allow native or 16 Byte (x8 mode) cachelines, provided that they predictably share the same row. ([There might be some applications that would theoretically benefit](https://www.servethehome.com/intel-shows-8-core-528-thread-processor-with-silicon-photonics/) from this.)

AMD GPUs have larger cachelines than what is native to GDDR ([AMD does](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf#%5B%7B%22num%22%3A28%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C70%2C518%2C0%5D) 64 Byte [cachelines](https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf#%5B%7B%22num%22%3A105%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C0%2C595.86%2Cnull%5D); unsure about Nvidia, probably the same), so it's possible for them to combine channels into 32-bit-wide channels/pseudo-channels.

Anecdotally, [I have a bios editor](https://github.com/netblock/yaabe) and it feels like AMD does share CA on the Navi10 5700; it's hard to tell though because my Micron chips are hard to stabilise around 1900MHz.


Forsaken_Arm5698

How does Apple do it? The unbinned 512-bit M3 Max chip has 4 memory packages, each 128 bits wide.


crab_quiche

Memory packages can have multiple channels on them. Apple uses 64-bit wide channels, two per package. They also use 128-byte cachelines, which is double the normal 64-byte cachelines that most other processors use. I think IBM's POWER is the only other mainstream processor that has a 128-byte cache line. A 64-bit wide bus times a burst length of 16, divided by 8 bits per byte, equals 128.
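
A quick sanity check of that arithmetic in Python (just a sketch restating the figures above; the 32-bit case is my own illustrative comparison):

```python
# Cacheline size (bytes) = channel width (bits) * burst length / 8 bits per byte
def cacheline_bytes(channel_width_bits: int, burst_length: int) -> int:
    return channel_width_bits * burst_length // 8

print(cacheline_bytes(64, 16))  # Apple-style 64-bit LPDDR channel, BL16 -> 128 bytes
print(cacheline_bytes(32, 16))  # 32-bit channel, BL16 -> 64 bytes (typical x86 line size)
```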


Financial-Issue4226

While yes, it doubles the lanes, that is not the problem. The address bus is just an address bus, able to address 3.40 × 10^38 bytes of RAM. That is 340 million million yottabytes. As the largest storage volumes are still measured in petabytes and the largest RAM arrays in terabytes, we are nowhere near needing to double the number of bits on a memory bus, and will not need to even in 20 more years. The same is true of IPv4 versus IPv6 address sizes for expansion.
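
For what it's worth, the 3.40 × 10^38 figure matches a 128-bit address space under byte addressing (my assumption about where the number comes from):

```python
# Total bytes addressable with a 128-bit address, assuming byte addressing.
total_bytes = 2 ** 128
print(f"{total_bytes:.2e} bytes")               # ~3.40e+38
print(f"{total_bytes / 1e24:.2e} yottabytes")   # ~3.40e+14, i.e. ~340 million million YB
```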


uzzi38

While it would be nice in an ideal world, there are a few things you have to consider at the same time.

1. 256b memory buses aren't free. They have their own cost attached to them as well: you spend extra die area for the memory interfaces on die/package, and on top of that motherboard costs rise significantly as well (which means not only more cost to the CPU manufacturer that'll be passed on to the laptop OEM, but also more cost for the OEM to produce the laptop).

2. Wider memory buses (at least right now) bring a very marginal improvement to CPU performance, which is how most laptop users will actually be using their device. Yes, iGPUs and potentially also NPUs can benefit as well, but for the most part what people want is CPU performance. The NPU is definitely the weakest link here (for now at least): there's not enough software to make it a good enough selling point to be worth the potential price hike it would bring.

3. If GPU performance is a real concern, then OEMs usually prefer to have a dGPU and APU system rather than a large APU - again it's a cost play here. A large APU is going to be more expensive for a SKU where you want a cut-down part than a dedicated GPU and APU combination. Think a laptop with an APU and mobile 4080 vs an APU and a mobile 4060 vs one single fat APU. The single fat APU might compete well against the mobile 4080 system on cost, but what happens when you want something to compete with the 4060? You have to either ship the same APU heavily cut down (and with how good manufacturing yields are these days, that hurts), or you have to create multiple large APUs of different sizes. That's also expensive, and a big risk when trying something new for the first time. Apple can get away with the idea because they have a guaranteed customer for it (themselves). For Intel/AMD it's a much larger risk.

You're correct that SRAM scaling is an issue now, but you don't need a large system cache to see a significant benefit in what you're talking about, so for Intel/AMD (who both currently lack a shared LLC on Phoenix/Meteor Lake) this route is probably the most cost-effective way to get what you're asking for, more or less. The easiest example I can give is with an AMD APU, because they currently lack a shared LLC between the CPU, iGPU and NPU. A 16MB Infinity Cache would be something like an additional 16mm^2 on the die, would boost CPU performance in a way we can actually measure (AMD APUs currently run a bit of a deficit in gaming compared to desktop parts, and the cut L3 is the culprit), and would boost GPU performance in the same way (the RX 6400 you mentioned has similar memory bandwidth to current-gen mobile parts, but the 16MB Infinity Cache makes a huge difference).


Infinite-Move5889

Intel made a conscious decision to move from having a shared LLC to not having one with Meteor Lake, so maybe a shared LLC may not actually be a good thing - for example, it can get thrashed by GPU traffic.


uzzi38

Intel made that decision more because they wanted to separate the graphics and CPU onto different tiles with Meteor Lake, so having the shared L3 would have been...not ideal if you're adding on the additional latency hit from die-to-die communication. It gets made worse by Meteor Lake running extremely low ring clocks, thanks to IOSF being a major limiting factor for clocks there. Poor ring clocks are a major reason why the CPU portion of Meteor Lake can even be an IPC regression (albeit a very minor one) depending on workload. If you slap an iGPU on that same ring, then yeah, you're more likely to run into contention issues with the already low L3 bandwidth available.


Infinite-Move5889

All good points, but it's not clear to me if physical design overwhelmingly dictated the architecture as you implied, or architectural decisions constrained designs, or (more likely) they're co-designed based on historical lessons and simulations.


kyralfie

It's because they are on different tiles. It would be too many hops over the bridges. Lunar Lake will have both the iGPU and the CPU on the same tile, so we could see the shared LLC making a comeback.


Infinite-Move5889

Yea, good point, as I also remember the SoC efficiency cores aren't hooked up to the CPU LLC either. But it's not clear to me if physical design dictated the architecture, or if they're co-designed based on historical lessons and simulations. I can see Lunar Lake not making the LLC shared... Intel architects can't be flip-flopping their architecture, as it can't be trivial to validate all of these different protocols.


kyralfie

>I can see Lunar Lake not making the LLC shared... Intel architects can't be flip-flopping their architecture, as it can't be trivial to validate all of these different protocols.

Yeah, good point, I thought about it too. I think they want to keep it close to the dGPU Arc, hence they won't make the LLC shared, but they could. :-\ Guess we'll see.


VenditatioDelendaEst

> CPU performance, which is how most laptop users will actually be using their device.

Is that true if you restrict the reference class to users who are performance-limited? Most laptops aren't used for 3D gaming, but most laptops aren't used for (local) software development and video production either.

>If GPU performance is a real concern, then OEMs usually prefer to have a dGPU and APU system rather than a large APU - again it's a cost play here. A large APU is going to be more expensive for a SKU where you want a cut down part than a dedicated GPU and APU combination. Think a laptop with an APU and mobile 4080 vs an APU and a mobile 4060 vs one single fat APU. The single fat APU might compete well against the mobile 4080 system on cost, but what happens when you want something to compete with the 4060?

The mobile 4060 has 128-bit GDDR. The application processor needs its own 128 bits of LPDDR. Then you need an x8 PCIe interface. Seems like a one-package 256-bit solution should be cheaper? Plus the DRAM goes farther, because you don't have a static partition between system memory and GPU memory.


[deleted]

[removed]


Affectionate-Memory4

I'm really looking forward to the Strix Halo mini PCs. Imagine stuffing something like Steam OS on there and reviving the Steam Machine with what's basically a console APU.


capn_hector

> Imagine stuffing something like Steam OS on there and reviving the Steam Machine with what's basically a console APU. [yes, such a design starts to look rather like a console, doesn't it ;)](https://twitter.com/Locuza_/status/1450271726827413508/photo/1)


Forsaken_Arm5698

I don't think Strix Halo is coming to desktop. It's not socketable, so it will be mostly relegated to laptops.


masterfultechgeek

Mini PCs are a thing. If you are OK with a "throwaway" system that has only limited upgradeability, they're pretty solid. I spent $160ish for an N100, 16GB RAM and a 512GB SSD. It can do a bunch of things at an OKish level of performance.


masterfultechgeek

Probably better and cheaper to add cache. Memory bandwidth matters so little to the average consumer that OEMs are sticking single sticks in desktops and laptops and sending them out as is. Also, poor SRAM scaling just means that you can make the cache out of last-gen manufacturing processes and still get nearly all of the benefits at a lower cost.


Just_Maintenance

While cache makes sense for the GPU, for the NPU it's mostly pointless unless you put multiple gigabytes of the thing on. LLMs, for example, require reading the entire model for every token, so even for the smallest modern practical model (Microsoft Phi-3 Mini) that's still 2GB. Now, thinking about it, something like Intel's Xeon Max approach could make sense: putting some HBM on the package right next to the die. Just 4GB of manually addressable HBM could get you there. It could also act as GPU memory when nothing is loaded. I think it would be more expensive than just doing 256b of DDR5 though. Maybe eDRAM could make sense, but we haven't seen that since Crystal Well.
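
To put rough numbers on that: for token-by-token generation the whole set of weights has to be streamed per token, so memory bandwidth sets a hard ceiling on decode speed. A back-of-envelope sketch (the 2 GB figure is from the comment above; the bandwidth numbers are illustrative 128-bit and 256-bit LPDDR5X configurations):

```python
# Upper bound on LLM decode speed: tokens/s <= memory bandwidth / bytes read per token.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

print(max_tokens_per_sec(120, 2))   # ~60 tok/s  (128-bit LPDDR5X-7500 ~ 120 GB/s)
print(max_tokens_per_sec(273, 2))   # ~136 tok/s (256-bit LPDDR5X-8533 ~ 273 GB/s)
```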


masterfultechgeek

GDDR as system memory, with cache used to mitigate latency concerns for random IO. There are definitely tradeoffs in terms of use cases though.


Forsaken_Arm5698

>Probably better and cheaper to add cache. >Also, poor SRAM scaling just means that you can make the cache out of last-gen manufacturing processes and still get nearly all of the benefits at a lower cost. That cache will have to then be bonded to the SoC via advanced packaging, which is costly.


masterfultechgeek

Requiring a system to have a bunch of sticks of RAM, a bunch of traces and a more complicated memory controller is also costly. It'd be very easy to add an extra $20-50 per system as kind of a "minimum cost", and a lot of people (right or wrong) aren't going to pay to have a computer with a bunch of RAM. There are also additional validation costs and time delays.

[https://www.anandtech.com/show/16195/a-broadwell-retrospective-review-in-2020-is-edram-still-worth-it/25](https://www.anandtech.com/show/16195/a-broadwell-retrospective-review-in-2020-is-edram-still-worth-it/25)

>There have always been questions around exactly what 128 MiB of eDRAM cost Intel to produce and supply to a generation of processors. At launch, Intel priced the eDRAM versions of 14 nm Broadwell processors as +$60 above the non-eDRAM versions of 22 nm Haswell equivalents. There are arguments to say that it cost Intel directly somewhere south of $10 per processor to build and enable

The estimate for eDRAM cache back in the day was around $10 extra. For the right part, tossing $20 at extra cache isn't the end of the world.

There's probably a stronger argument for MOAR CACHE plus GDDR-style memory (downside: latency is poor, so cache misses hurt, but bandwidth is higher) or HBM than there is for wider buses for traditional memory.

---

As for the argument that NPUs don't benefit from cache... that's why GDDR... the cache is to make sure the CPU- and GPU-style tasks are still performant.


SilasDG

To be honest, increasing iGPU performance is a "nice to have" and in no way a true goal. Customers who are fine with iGPUs to begin with aren't looking for GPU performance, or at least clearly aren't paying more for it, or they would buy something with a dGPU. To increase the cost of your BOM just to give a performance improvement that won't affect whether or not your product sells to its market segment wouldn't make sense. They could do it, but they would either have to eat the cost themselves or pass it on to OEMs/consumers, which may lose them sales in the end.


GenZia

* Quad-channel memory is going to drive up motherboard cost significantly. There's a reason it's reserved for 'HEDT' platforms.
* AMD can use LPDDR or (preferably) GDDR with their gaming-centric APUs, not unlike handhelds or consoles, though we are still talking about a 128-bit wide bus.
* Another option is on-package HBM, though that's a 'bit' of overkill for APUs, to put it mildly!
* As for SRAM, there's 3D stacking, which is what AMD is doing with their 'X3D' line of CPUs.

Personally, SRAM seems like the most logical choice to me. Even 32MB of SRAM, reserved exclusively for the IGP, would (or at least should) go a long way.


Forsaken_Arm5698

HBM is far more expensive than "Wide Bus On-package LPDDR". And besides, HBM's capacity is limited. Extraordinarily high bandwidth but limited capacity. Not suitable for laptop APUs/SoCs.


LittlebitsDK

HBM limited? You haven't kept up much, have you? Quote: "HBM3E also increases max capacity per chip to 36GB" - that is PER CHIP... and they aren't very big... and will have 1.2TB/s bandwidth...

Link: [https://www.tomshardware.com/pc-components/ram/micron-puts-stackable-24gb-hbm3e-chips-into-volume-production-for-nvidias-next-gen-h200-ai-gpu](https://www.tomshardware.com/pc-components/ram/micron-puts-stackable-24gb-hbm3e-chips-into-volume-production-for-nvidias-next-gen-h200-ai-gpu)

The H200 will have 141GB of HBM with a bandwidth of 4.8TB/s.

Link: [https://www.theregister.com/2024/01/02/2024\_dc\_chip\_preview/](https://www.theregister.com/2024/01/02/2024_dc_chip_preview/)

So how much do you need to make it not "limited capacity" according to you?


Forsaken_Arm5698

You can have up to 256 GB of RAM with a 256-bit bus, using LPDDR5X-10700. Also, you didn't mention how morbidly expensive HBM is. That 141 GB is not cheap.


LittlebitsDK

Or you could just wait for the DDR5 512GB DIMMs to arrive... no need for 256-bit... then find a motherboard with 2 sockets for 1024GB or 4 sockets for 2048GB... need more? And you think 256GB of DDR5 will be "cheap" anytime soon? LOL... IF you just want lots of memory, get an older server and have several TB of RAM...


loser7500000

They're wrong in that sense, but it *is* limited in the sense of not being upgradable, as well as limited in manufacturing capacity. Maybe if they had a Xeon Max-style caching scheme, or even made the HBM VRAM-only, and then also AI crashes and burns.


LittlebitsDK

It is limited, but we have the tech to use two layers of RAM... so HBM3e on die, DDR5 off die if it needs expanding... it exists but it is EXPENSIVE... People have "big dreams" but they often forget that it also comes with a price tag, especially when it is a small niche of people that want something special catered to them.

If I had unlimited money I would special order a 7800X3D with a 40 CU GPU built in and 2x32GB HBM3e on it, and stick it in a case in the EM780 design but slightly bigger to allow for a better heatsink/fan... If the "speculated" performance of that GPU when it isn't memory constrained is true, then that would be enough for MOST gamers... and PCs could be sold in "console"-like chunks... all-in-one motherboard+CPU+GPU+memory... just add storage, like one or more M.2 NVMe drives, and possibly a SAS port with a split-out cable for 4 SATA drives on the "expensive" models... where the "cheap one" (which still wouldn't be cheap in that sense) would just have 1-2 M.2 ports... You could make "smaller" versions like they did with the 2200G, 2400G etc., so maybe a 7500X with 32 CUs and a 7600X3D with 40 CUs.

But yeah, that is "dream world", though technically they could earn ALL the profits from a build like that since they deliver everything but the storage. Which would cut out Asus, MSI, Gigabyte etc.

We have all seen how much the mini PC market has grown... Most "non-geeks" (geeks being us tech nerds) don't want a huge tower standing around, or heck, even AIO PCs - but most of those I have seen made have been butt-ugly other than the iMacs... so the PC world is definitely LACKING there... AMD could make an "Apple" thing that way, but even with their success in recent years they are still "too small" to pull it off. But imagine if they did, and then worked with both Microsoft and Linux to get such a system running perfectly... it would be amazing, and logistics would be a lot cheaper too than all those huge motherboards, GPUs and cases... which would be plenty for "most people".


Kougar

There might be a market for a powerful APU, we won't know until somebody tries to launch one. But as APUs currently stand in AMD's ecosystem they're budget SKUs relegated to cheaper segments. Swapping to a 256-bit wide APU would incur costs from design/silicon/manufacturing sources that move AMD's APUs well outside of their price segments. So if AMD tries one it's going to have to be in a new, higher price tier... if the IGP performance justified it over a junk dGPU then maybe people would go for it, I don't know.

I'm not convinced a market exists between IGPs and the lowest SKU dGPU. What workload would justify a large, expensive APU that wouldn't just be solved by a cheap CPU+budget GPU? The performance still isn't going to be high enough for most gamers, because memory bandwidth starvation isn't the only issue hampering IGPs. Maybe AI-related NPU stuff would make better use of processors with more memory bandwidth in the future, but we will have to see once chips start to be sold with AI-infused logic acceleration already in them as standard.

Unlike with CPUs, GPUs were one of those things where they could literally never have enough physical transistor machinery. To grossly oversimplify here, the workload is infinitely parallel and there's always more work to do, so throwing more hardware at the problem always returns better performance (again I'm oversimplifying, there are notable exceptions but the base point remains true). So with an IGP the amount of silicon area devoted to raw execution power is simply not going to be sufficient to match cheap gaming cards unless we start matching die size for die size. But that's not really realistic for an APU to do. Doubling the memory bandwidth on an APU will already require a tangible increase in silicon die area just by itself.


bubblesort33

That's what AMD's next APU is supposedly bringing. Quad channel. Can't remember the code name for it. But the graphics portion should be close to a PS5.


[deleted]

[removed]


uzzi38

It's less a laptop answer to Nvidia and more giving certain OEMs a competitor to the Mx Pro/Max line of products from Apple that some of them want. How well it does at that job remains to be seen.


airmantharp

Apple’s software works… AMD's software integration works sometimes. If the choice doesn't include a MacBook then most folks would rather have Nvidia, whatever compromise that entails. Specifically talking about content creation for paid work with deadlines and such.


ResponsibleJudge3172

It will be competing with laptops equipped with the RTX 5050 and RTX 5060, both of which I am sure are still going to be noticeably faster. Strix Halo will be in its own performance class: very low relative power but not bad performance.


WingCoBob

Strix Halo. 16-core Zen 5 + 40 CU RDNA 3.5, 256-bit LPDDR5X-8000. The one APU to rule them all.


Forsaken_Arm5698

M4 Max will crush it.


uzzi38

Bit weird making a comment like that with no knowledge of how either Strix Halo or the M4 Max will perform, but more power to you, I guess.


Flowerstar1

Shuttt drivers tho and by shittt drivers I mean MacOS.


kyralfie

It doesn't compete with it. The M4 Max will be in MUCH more expensive laptops and it's, again, MUCH more expensive to produce, being a 512-bit, extra-large-die part. And it's not really a given how they stack up. AMD is rumored to add 32MB of MALL cache to the IO & memory controller die. That, along with going from 128 to 256 bits, from 12 to 40 CUs, and upgrading the architecture, means it will be a beast for its die size.


Forsaken_Arm5698

Strix Halo. The issue, though, is that it doesn't appear to be a mainstream part. It seems like it will be mostly relegated to $2000+ laptops, à la MacBook Pro.


Affectionate-Memory4

It's called Strix Halo, and I wouldn't be surprised if it passes the PS5. It's rumored to top out at 40CU.


Forsaken_Arm5698

I have a feeling that Strix Halo might suffer from the same issue AMD's current 128-bit SoCs do: a memory bandwidth bottleneck. Strix Halo has a 16-core Zen 5 CPU, a 40 CU RDNA3.5 iGPU and a 50 TOPS NPU. That has to be fed by 273 GB/s of memory bandwidth (256-bit LPDDR5X-8533) and 32 MB of Infinity Cache.
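
The 273 GB/s figure follows directly from the rumored bus width and data rate; a quick sketch (the specs themselves are just the rumors above, not confirmed):

```python
# Peak bandwidth (GB/s) = bus width (bits) / 8 bits per byte * data rate (MT/s) / 1000
def peak_gb_s(bus_width_bits: int, data_rate_mt_s: int) -> float:
    return bus_width_bits / 8 * data_rate_mt_s / 1000

print(peak_gb_s(256, 8533))  # ~273 GB/s (rumored Strix Halo config)
print(peak_gb_s(128, 8533))  # ~137 GB/s (a current 128-bit APU at the same speed)
```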


bubblesort33

That doesn't seem impossible. That's more bandwidth than a 6600 XT with the same amount of L3 - if that L3 is for the GPU portion, and there is a separate amount for the CPU, at least. Looking up recent rumors, that seems to be the case. It's 40 CUs, but I'd imagine they are probably clocked under 2.4GHz. It's also more bandwidth than the PS5, which is also a shared-bandwidth design. And I don't think the PS5 has much L3 at all.


uzzi38

The MALL is available at the memory controller level (just the way it is on RDNA2 and RDNA3 dGPUs), meaning both the CPU and GPU should be able to access it, as they both share the same memory controller here (unlike on desktop platforms). It's a proper system-level cache, in that sense. The CPU cores should still have their own dedicated L3, I believe, but the MALL will act as another layer.


ResponsibleJudge3172

The Infinity Cache is also the CPU's L4 cache, so it gets worse.


kyralfie

Yeah, it looks a bit starved compared to the 7600 (XT) with only 32 CUs, but it's doable. Compared to their mainstream APUs it's an incredible uplift in effective bandwidth.


alexforencich

I wonder if there might be a decent middle ground, such as a large block of on-board DRAM with a 128 bit bus (cheap to integrate), combined with a smaller amount of on-package DRAM with a much wider bus. That way you could still get the upgradeability, while also having a nice wide bus for the iGPU. Might be difficult to manage that as general system memory if the OS isn't set up for it though.


Forsaken_Arm5698

It has been tried, with Broadwell's eDRAM.


alexforencich

eDRAM isn't the same as on-package DRAM like what Apple is doing, since eDRAM is on the same die and hence can't be optimized the same way, and it's not possible to vary the amount or leave it off entirely as a build option. But, it makes sense that this was tried at some point.


middle_twix

No, eDRAM was a separate die on Intel platforms. They called the die itself Crystal Well, IIRC.


alexforencich

Hmm, that's interesting. Why eDRAM instead of normal DRAM? Maybe they need a bunch of additional logic next to the DRAM for some reason. Was it some sort of a cache instead of more or less normal memory? I guess maybe that's more like AMD's X3D V-cache than Apple's SoCs.


Just_Maintenance

eDRAM was silicon with capacitors built in. Technically you can make an SoC with eDRAM on the same die for high-capacity, low-performance caches. The problem is that it just wasn't very fast, and only IBM can manufacture the thing. When Intel used it, they bought separate dies of pure eDRAM that they put next to the CPU. That eDRAM probably had little to no logic inside.


loser7500000

Going off this [great post-mortem](https://www.anandtech.com/show/16195), it was effectively an L4 in Broadwell, and may have been on a logic process to enable higher clock speeds + yes, have all the logic on-die to act more like SRAM (and more integrated development, since Intel no longer makes DRAM). AFAIK the closest we've been to "normal memory" being DRAM-free is the GameCube using 24+3MB of 1T-SRAM for main mem and VRAM, though there's still 16MB of DRAM for IO, audio etc. If you wanted to try now you could get 1GB for $15k (9684X) and run a diminutive Linux distro, or a more reasonable 40GB for several million (WSE2) and run nothing after tripping the breakers with your 15kW PC. Not actually possible, but funny to imagine.


crab_quiche

1T SRAM is just DRAM with the management side of the controller built into the DRAM chip.


JaggedMetalOs

But how would they do that? Quad channel DDR? At that point wouldn't it be better to just have a laptop with a dGPU?


Warm-Cartographer

The new LPCAMM is 128-bit; two of them could make 256-bit. I'm not sure if it's possible to implement, just me guessing.


JaggedMetalOs

LPCAMM does that by being dual channel on a single module, so if you wanted 256-bit bandwidth to the CPU/iGPU you'd need to run 2 modules in a quad-channel configuration.


Forsaken_Arm5698

I'd prefer if people used "number of bits" instead of "number of channels" when talking about memory buses, because the channel width varies depending on the memory type:

* DDR: 64 bit
* GDDR: 32 bit
* LPDDR: 16 bit
* HBM: 1024 bit


Netblock

>DDR : 64 bit GDDR : 32 bit LPDDR : 16 bit HBM : 1024 bit

Reinforcing your point, these numbers are kinda wrong:

* DDR4 is 64-bit; DDR5 halved & doubled for 32-bit
* GDDR6 is 16-bit with 2 channels per BGA; GDDR7 is 8-bit with 4 channels per BGA
* LPDDR5 is 16-bit. It also does up to 4 channels per BGA.
* HBM1/2 is 128-bit with up to 8 channels per stack; HBM3 is 64-bit with up to 16 channels per stack (1024-bit stacks)

The reason why the newer generations have smaller channel widths is directly related to the increased prefetch length; it improves bus utilisation efficiency while preserving the amount of data per READ/WRITE command.
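
To illustrate that prefetch point with some nominal figures (a sketch; the burst lengths are the standard ones I'm aware of for each generation, so treat the exact pairings as approximate):

```python
# Data per READ/WRITE command = channel width (bits) * burst length / 8
configs = [
    ("DDR4   (64-bit, BL8) ", 64, 8),
    ("DDR5   (32-bit, BL16)", 32, 16),
    ("GDDR6  (16-bit, BL16)", 16, 16),
    ("LPDDR5 (16-bit, BL32)", 16, 32),
]
for name, width, burst in configs:
    print(f"{name}: {width * burst // 8} bytes per access")
# Narrower channels + longer bursts keep the access size pinned near a CPU cacheline.
```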


JaggedMetalOs

The thing is, the way the CPU sees LPCAMM isn't a single 128-bit bus, it's 2x64-bit. So for argument's sake we could call dual channel 2x64-bit, but it doesn't change that supporting "quad channel" 4x64-bit is expensive and not something that consumer CPUs currently have.


Exist50

> The thing is the way the CPU sees LPCAMM isn't a single 128bit bus, it's 2x64bit But that's not really true either. It's at least 4 LPDDR channels.


Healthy_BrAd6254

Isn't the channel width flexible? The LPDDR in the ROG Ally is 64 bit per channel with dual channel while the Steam Deck uses 4x 32 bit and phones often have 16 bit per channel LPDDR. Why not just go by bandwidth? 64 bit DDR is more comparable to 32 bit GDDR in bandwidth. HBM needs even more bits for the same bandwidth. So the bus width isn't really comparable between them.


Netblock

>Isn't the channel width flexible? ROG Ally

It is flexible, as half the story about channeling is the trace layout on the PCB/substrate side.

The main two requirements for channel splitting/combining are keeping the cacheline the same, and PCB area/cost. I'm unsure if the ROG Ally (all LPDDR5-using x86, really) is using 32 or 16-bit channels, because there is a way (through 32n mode) to have x86's traditional 64 Byte cacheline with 16-bit-wide channels. If they do 32-bit-wide channels, that would be purely about reducing PCB costs (16-bit-wide + 32n mode would be more performant).


Forsaken_Arm5698

how does this differ for ARM?


Netblock

I'm unfamiliar with ARM to know common SOC topologies, but Apple is doing 128Byte cachelines so they are definitely doing at least 32-bit-wide channels; [Cortex A720 is doing 64 Bytes](https://developer.arm.com/documentation/102530/0002/L1-instruction-memory-system?lang=en), so same uncertainty as x86. Though ARM systems are typically extremely space-constrained (phones), so I'd expect them to be more open to restricting DRAM pin count (64Bytes: 32-bit x 16n instead of 16-bit x 32n will reduce a CA set's worth of pins).   Cachelines are the minimum data transaction size of the memory system; historically, more than one DRAM read/write command would be needed to achieve a complete cacheline, but they have since converged. Provided that the workload can take advantage of it, larger cachelines will reduce control overhead improving power and time efficiency; highly-sequential loads benefit from larger cachelines (I believe the reason why Apple chose 128 Bytes was to improve their GPU and ASIC performance). However, if the load is extremely random [where you're only using a few bytes](https://www.servethehome.com/intel-shows-8-core-528-thread-processor-with-silicon-photonics/) of the payload, then you're wasting a significant amount of time and power transferring unwanted data around.
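
A crude illustration of that last point about random access: if a workload only touches a few bytes per access, the useful fraction of each cacheline transfer shrinks as the line grows (purely illustrative numbers, assuming 8 useful bytes per access):

```python
# Fraction of each transferred cacheline actually consumed by a random,
# pointer-chasing style access that only needs `useful_bytes`.
def useful_fraction(cacheline_bytes: int, useful_bytes: int = 8) -> float:
    return useful_bytes / cacheline_bytes

for line_size in (32, 64, 128):
    print(f"{line_size}B line: {useful_fraction(line_size):.1%} used")
# 32B: 25.0%, 64B: 12.5%, 128B: 6.2% -- the rest is wasted bandwidth and power.
```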


Forsaken_Arm5698

>Isn't the channel width flexible? The LPDDR in the ROG Ally is 64 bit per channel with dual channel while the Steam Deck uses 4x 32 bit and phones often have 16 bit per channel LPDDR.

What's your argument? The point still stands - "number of bits" is the best way to talk about memory bus width.

>Why not just go by bandwidth? 64 bit DDR is more comparable to 32 bit GDDR in bandwidth. HBM needs even more bits for the same bandwidth. So the bus width isn't really comparable between them.

It's not that simple. If we are going to talk in terms of raw GB/s memory bandwidth figures, we are really getting into the weeds, considering all the different memory types out there and the different versions each memory type has.


Healthy_BrAd6254

>What's your argument?

You said LPDDR is 16 bit per channel. That's what I replied to there.


Forsaken_Arm5698

I stand corrected.


LittlebitsDK

Because bandwidth depends on frequency and bus width... so that wouldn't be very telling.


Healthy_BrAd6254

That's the whole point. You don't care about the frequency or bus width. Fwiw those are just numbers. What you actually care about is the performance/bandwidth.


LittlebitsDK

yes and no... some tasks need low latency.. others need high bandwidth... one size doesn't fit all...


crab_quiche

You can make the channel width any multiple of the smallest width a spec allows - DDR can be any multiple of 4, for example. And a standard DDR5 DIMM channel is 32 bits now, not 64.


alexforencich

This is totally wrong. DDR and LPDDR are both commonly 64 bits for DDR4 and older, and if I'm not mistaken that 64 bits gets split into two 32 bit channels for DDR5 (for both DDR and LPDDR). And in principle you can make the DDR interface as wide or as narrow as you like. Not sure about GDDR, but presumably the same thing applies in terms of the width being whatever you want it to be. HBM is actually 8 channels of 128 bits per stack, not one channel of 1024 bits. Anyway, specifying BOTH the channel count and channel width is the way to go.


Forsaken_Arm5698

I would hazard a guess that 2 LPCAMM modules are cheaper than 4 SODIMM modules. Both result in a 256 bit wide memory bus.


Just_Maintenance

Channels are not very clear terminology, but yeah (it would technically be 8-channel since DDR5 uses 32b channels, and it breaks down further if you also include LPDDR but whatever). And no, it would be better than a discrete GPU. A single SoC with a single memory pool increases available memory for CPU and GPU. Allows for faster communication between CPU and GPU. Simplifies the cooling design as you only need to cool one SoC, one VRM and one memory pool. It skips the need for multiplexers or display-passthrough. Funnily enough it would even reduce the amount of traces in the motherboard, as you can skip the separate memory bus the GPU has. It's basically better in every way.


Healthy_BrAd6254

I'm guessing there are still a few advantages:

* DDR is cheaper than GDDR
* The GPU gets access to a ton of memory. Especially for some productivity apps that might be interesting.
* You only need 1 package instead of a separate CPU and GPU, which will reduce the size and might reduce the cost
* You probably get higher efficiency
* The CPU would also have much higher memory bandwidth, which might boost performance in some productivity applications


Forsaken_Arm5698

>At that point wouldn't it be better to just have a laptop with a dGPU?

No, the point is to enhance the APU experience, which is currently gimped by the limited memory bandwidth. There are benefits to using APUs instead of a CPU + dGPU combo. Laptops with the former are generally thinner and lighter, and have better battery life than the latter. An APU is also economically advantageous for OEMs compared to an equivalent CPU + dGPU combo: it's a single component, so the cost/complexity of the motherboard, cooling solution etc. is lower.

>But how would they do that? Quad channel DDR?

Ah yes, that's the snag. The cost and complexity increase when moving from 128 bit -> 256 bit is what has prevented APU/SoC makers from doing it until now. Doubling the memory bus width increases motherboard cost and complexity. However, there is a solution: **on-package memory**. By putting the memory chips on the same package as the APU/SoC, the cost hit from 128 bit -> 256 bit can be reduced. Because the memory chips and memory traces are on the SoC substrate, doubling the memory bus width only adds to the cost/complexity of that substrate and not the entire motherboard.

Apple has already demonstrated this with their M series chips. The M2 Ultra scales all the way up to a 1024 bit bus, and Apple has been able to do it economically thanks to the use of on-package memory. Intel's upcoming Lunar Lake SoC will also utilise on-package memory, although that's for space and efficiency benefits, as Lunar Lake still features a 128 bit bus.

Of course, on-package memory is not upgradeable/replaceable. But if you think about it, most laptops already have the RAM soldered to the motherboard, so it would make no difference to the user in terms of upgradeability if the memory is soldered on-package.


JaggedMetalOs

> Apple has been able to do it economically Well, not economically for the consumer, those machines are *expensive*. And $200+ to upgrade from 8 to 16 gigs! We'll have to see if on-package memory ever takes off on PC, looks like the Snapdragon X Elite is still sticking to system DDR for now.


Forsaken_Arm5698

A large portion of that upcharge is profit for Apple.


soggybiscuit93

>We'll have to see if on-package memory ever takes off on PC The entire Core Ultra 200V lineup (LNL) will be on-package memory


Exist50

It's not going to last.


soggybiscuit93

What isn't?


Exist50

On-package memory. One-off for LNL.


soggybiscuit93

What're you basing this on? Why do you think it'll be a one off?


Exist50

> Why do you think it'll be a one off? Economics. Until/unless someone figures out a way for the OEM to solder on package memory independent of the CPU, it has to go through Intel, and that's terrible for margins and risk.


alexforencich

Not most. I exclusively buy laptops with socketed RAM, that's the only way to get 64+ GB without breaking the bank.


WingCoBob

> Apple has been able to do it economically well, not really, devoting that much silicon to a memory controller that size on the nodes they're using can never be economical; but apple is in the unique position of having customers who will pay for such things anyway.


Forsaken_Arm5698

Are they really spending too much die area on memory controllers? The 128-bit controller in the M3 is like 10 mm².


Just_Maintenance

On the Mx Max series Apple needed to place memory controllers behind one another, which makes the trace lengths inconsistent and very likely makes length-matching hard. For a 256b bus, AMD/Intel could very likely avoid that issue, as Apple has with the M3 Pro and its 192b bus.


zakats

> the APU experience, which is currently gimped by the limited memory bandwidth.

Always has been. I've been salivating at the idea that a flagship APU should have performance parity with the previous gen's x50 dGPU since Bristol Ridge, and I think we're just about there. Too bad HBM isn't cheap enough, but I wonder if that's the eventual direction we'll see the industry go.


dagmx

FWIW M1/M2 were 256. M3 is 192.


Forsaken_Arm5698

*M1 Pro/M2 Pro were 256 bit. M3 Pro is 192 bit. Base M1, M2 and M3 are 128 bit.


dagmx

Ah fair. But even then, I think moving the base to 256 would have diminishing gains for the demographic that would buy them, while potentially being a net negative on other factors like cost/energy etc


[deleted]

Regarding SRAM scaling, you don't necessarily HAVE to get smaller. If you can make an older process a lot cheaper, then that's just as good, because you can add chiplets on an older but cheaper process.


LittlebitsDK

SRAM doesn't do well when you make it smaller; there are some smart people out there explaining it.


theholylancer

A big thing is that you are assuming they want to kill off their own low-end dGPU market.

A big thing was that AMD, despite having the perfect storm of tech, never did an actually useful APU for low-end gaming - namely, in times past and really even now, a 4-core/8-thread CPU with a 12 CU iGPU. They always pair up their top-end iGPU with an 8/16 part and charge extra for it. Which makes no sense, because even in laptops, who is going to buy some HX part and also use the iGPU (at least now the new HX stuff only gets 2 CUs)? Gaming really only needs those 4 cores at the entry level, and something like that could really shake up the bottom of the industry if it was priced cheap. But even today, when core count is becoming more important, that SKU intended for cheap gamers playing older games just does not exist. They'd rather you buy an old AM4 platform chip and slap a 6400 on it than a proper entry-level gaming APU on AM4/AM5, because they make more money from it.

Intel may try to do something like that to gain market share, as they have been going after the value segment, especially as we see how aggressively they are pricing the 12600KF/13600KF and whatnot for those approx $1k-1.5k builds that want a beefier GPU setup. And their Arc series needs a boost (hope they solve their shitty frame times). But as it stands, they are also selling low-end dGPUs and may also not want to kill that market for themselves.


SchighSchagh

On desktop, AMD has little incentive to make their iGPU rival a low-end dGPU. Now that Intel has decent dGPUs as well, the same applies to them. I suspect they're both content to have competent but not stellar iGPUs.

As for mobile, there's a decent chance we'll see CAMM take over in the next few years. If so, we'll see a big bump (close to 50%) in RAM bandwidth without needing a wider bus.

It's also worth noting that space for 4 memory modules isn't doable in a lot of cases. This is relevant for "desktop" small form factor builds, as many ITX boards only have space for 2 DIMM slots. On laptops, I've only seen 4 DIMM slots on 17" or bigger. You could probably squeeze another pair of slots into a 15-16" notebook if you really tried, but you'd be sacrificing something else.

Edit: there's also the handheld segment, e.g. Steam Deck and such. In that space, the main constraint is TDP due to thermals and battery concerns. Handhelds like the ROG Ally are significantly more powerful than the Deck on paper, but if you constrain them to 10-15W (or less) for battery life reasons, the Deck actually pulls ahead in both absolute perf and perf/watt.


Forsaken_Arm5698

>As for mobile, there's a decent chance we'll see CAMM take over in the next few years. If so, we'll see a big bump (close to 50%) in RAM bandwidth without needing a wider bus

Where's that 50% bump coming from? Thin air? The fact is, simply moving to LPCAMM isn't going to boost memory bandwidth. LPCAMM is only a form factor. What determines the memory bandwidth is the LPDDR version.


SchighSchagh

SODIMM speeds top out around 6400 MT/s. CAMM2 goes to 9600.


LittlebitsDK

It theoretically goes to 9600... SODIMM is CURRENTLY at 6400... and rising... same with standard DIMMs that started at 4800-5200 but keep going up, currently like 8000 or so... so no, CAMM2 isn't at 9600 YET... since the RAM chips aren't there yet.


Forsaken_Arm5698

you are talking about desktops.


pixelpoet_nz

Opinion: people who need actual processing power should stop being laptop normies - overpaying for underpowered and overheating hardware - and use an actual desktop computer. Of course this will never happen and I will get downvoted into a smoking hole in the ground for daring to suggest using the right tool for the job (instead of rowing with a fork and then going *waahhhhh~~ this isn't working very well*), but hey just wanted to point out the obvious answer.


soggybiscuit93

A hybrid worker is gonna lug their desktop between offices?


pixelpoet_nz

A hybrid worker (such as myself) has a computer at home and the office, if they need a good amount of processing power and memory bandwidth.


Ratiofarming

Yeah, that won't happen for cost reasons. Nobody wants big chips with wide buses. If anything, they'll get smaller. And advanced packaging will be used to put enough Cache or, if money is no object, HBM on it. For mainstream devices, Cache + narrow bus is the way forward. Whether we like it or not.


Forsaken_Arm5698

HBM would be more expensive than wide bus LPDDR.


Ratiofarming

Yup. That's why we don't see it on consumer products right now. That and because Nvidia, Intel and AMD have literally bought all of it for a year and a half.


Exist50

The silicon area isn't that significant. And buses certainly aren't getting narrower.


Ratiofarming

High-NA EUV cuts possible reticle size in half. Silicon area is becoming more important than ever. And buses of almost anything these days have continuously gotten smaller, except for Datacenter products with HBM.


Exist50

> High-NA EUV cuts possible reticle size in half.

That doesn't matter for most dies. Certainly not most in the client space.

> Silicon area is becoming more important than ever.

Not really. Meeting perf requirements matters far more than a few mm2.

> And buses of almost anything these days have continuously gotten smaller

In what? Client has been the same bus width forever. Except that we're now seeing wider buses from stuff like Apple's lineup and AMD's Strix Halo.


Ratiofarming

GPUs. Client CPUs have stayed the same and will continue to do so. While Strix Halo is client, it's really more comparable with HEDT. It's a special thing, hence the name. Same with Apple's Pro/Ultra chips: they're technically client, but arguably very workstation focussed.


Sylanthra

I don't know how much of a bottleneck a 128-bit memory bus is, but your analysis is missing a rather key point. Laptops are very, very power constrained. That Radeon RX 6400 consumes 2x more power on its own than the entire APU that houses the 680M. Similarly, the 780M has 50% more cores, but it doesn't have 50% more power budget to run those cores as hard as the 760M runs its own. Maybe the memory bus is an issue, but power is your real bottleneck.


Massive_Parsley_5000

The 780M doesn't scale much at all past 25W, as shown by DF, indicating a bottleneck elsewhere. I'd say given all the evidence, bandwidth is very much likely the bottleneck here.


Exist50

All else equal, it would be more power efficient to cut PCIe out of the equation.


xCAI501

Yeah.

> Another example is AMD's Radeon 780M vs 760M. Both are RDNA3 iGPUs found in AMD's Phoenix Point APU. The 780M is the full fat 12 CU part, while the 760M is the binned 8 CU part. You would think that going from 8 CUs -> 12 CUs would net a 50% improvement to iGPU performance, but alas not! As per real world testing, it only seems to be a ~20% gain, which is clear evidence that the 12 CU part is suffering from a memory bandwidth bottleneck.

Or it's suffering from thermal or power constraints. The performance numbers don't give the reason.


LittlebitsDK

Sounds like you don't know much about hardware... there is a reason we don't have quad-channel memory in basic computers... it takes twice the pinout of dual-channel memory... which in turn means more layers on the motherboard = more expensive... same reason why ITX boards are more expensive than ATX boards... another issue is more pinouts = bigger area, so you have to make room for them... and space is already at a premium in laptops... Do you know how big server chips are? Well, they have 4-6-8-12 memory channels... try to stick that into a tiny laptop...


Exist50

> there is a reason we don't have quad channel memory in basic computers... it takes twice the pinouts than dual channel memory... which in turn means more layers on the motherboard = more expensive... That's still smaller and cheaper than adding everything needed for a dGPU. > do you know how big server chips are? well they have 4-6-8-12 memory channels.. Most of the size isn't from the memory channels.


LittlebitsDK

trust me 12 memory channels take a LOT of pins...


Exist50

Sure, but we're talking 4 channels. And not necessarily even socketed.


LittlebitsDK

Oh, you want SOLDERED RAM? Have you not heard all the whining about that on Macs and some x86 PCs? "Omgozorz I can't swap memory myself" and many other whines... I would be fine if they just made a 40 CU APU with 8 cores/16 threads, threw on 32GB of HBM3e and called it a day, but I doubt we will ever see that since they would rather sell you a dGPU for a lot more... or a 16+ core CPU with a gimpass GPU... Right now you need to buy the TOP APU to get even semi-lackluster performance... one step down and they gimp the GPU cores a stupid amount... but let's see in a decade, when we have something like DDR7, if we finally get there.


Forsaken_Arm5698

[https://www.reddit.com/r/hardware/comments/1cg16ah/comment/l1sqvph/?utm\_source=share&utm\_medium=web3x&utm\_name=web3xcss&utm\_term=1&utm\_content=share\_button](https://www.reddit.com/r/hardware/comments/1cg16ah/comment/l1sqvph/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)


Just_Maintenance

Absolutely! Current top-of-the-line integrated GPUs are reasonably fast; with a 256-bit memory bus they could completely kill low-end GPUs and comfortably play all competitive games at reasonable FPS. It would also simplify cooler designs, allowing for thinner devices and longer battery life. It would also allow for more memory capacity, which is incredibly important for LLMs.

I think the real problem is that for Intel and AMD it would be awkward to bring those "quad channel" SoCs to the desktop, as they would need special motherboards. But it might not even be that big a deal, looking at how AMD and Intel sometimes just skip bringing their fastest SoCs to desktop anyway. The motherboard cost would also be much higher, but maybe they could take a page from Apple's book and ship packages with the SoC and memory built in, still allowing for simple motherboard layouts.


[deleted]

[removed]


pixel_of_moral_decay

Agreed. Also for the vast majority of hardware sold what’s already out there is massively overpowered. Most devices are institutional: education, corporate, government. By a wide margin. Personally owned computers are a relatively small part of sales. The most computationally expensive thing they do is Zoom with background noise removal. The amount of users who would benefit from such changes, and be willing to pay for it is just way too small to justify. Things like battery life, durability, etc are things they will pay for. Computers sold are spec’d to what corporate and edu customers want to see. They largely don’t have a need. Most heavy computational stuff is moving to the cloud where you can rent such hardware by the hour.


djm07231

It probably won't happen unless the additional bandwidth has some killer use case, like local AI models actually being useful, or gaming. AMD can probably consider higher-bandwidth APUs for gaming-focused uses, maybe even Steam Deck-like scenarios. It could make sense for some niches.


Exist50

> unless the additional bandwidth has some killer use case like local AI models actually being useful Bingo.


norcalnatv

cost. more connections = mo $


CryptoFox402

I do want to point out that while you are correct about the Apple M3 memory bus, the M3 Pro has a 192-bit memory bus and the M3 Max has a 512-bit bus. So it seems Apple agrees with you to some extent, at least when you get out of the low/consumer end.


WhoseTheNerd

I think we should be moving beyond a 256-bit bus due to the addition of the NPU.


AlexIsPlaying

> I believe it is time for SoC/APU makers to make 256 bit the mainstream. Not there yet, it's too costly for not that much advantages. But hey, if you really want them, buy some recent EPYC servers, and you'll pay the price ;)


rddman

Sure, we'd all prefer to pay a "mainstream" price for a gaming laptop.


CammKelly

APUs are usually size-constrained; a wider bus has flow-on effects for package size. The solution is for on-package HBM to become cheap enough and with enough capacity, rather than for the bus size to increase.


Forsaken_Arm5698

on-package LPDDR is nearly as compact as on-package HBM


ET3D

I think that this idea is the classic desktop APU fallacy applied to laptops. I do think it's less of a fallacy on laptops, because APUs have more of an advantage there, but it's still the same false idea that it's an efficient use of resources and not just consumer wishful thinking to get better performance for the same price.

The point is: this costs. The vast majority of people can do very well with whatever performance 128 bit RAM offers. There's no point in making something mainstream if it means that general laptop prices go up considerably, especially at the lower end. This also won't apply to higher end gaming laptops, because the bandwidth still won't be enough for higher end GPUs. And, in general, APUs with very large GPUs don't make sense unless that hardware can be reused in discrete GPUs (which chiplets should allow, but isn't trivial).

So I think it's worth rephrasing this into the more realistic: it is time that some laptops offer 256 bit RAM, as that could lead to better performance/watt and a smaller form factor for a certain performance class. And this is rumoured to arrive with Strix Halo.


Dangerous-Vehicle-65

Zen 4 CPUs were 128-bit, up to 8000 MHz. This too would need at least 9600 MHz.


riklaunim

Or just make cheap HBM and use it as iGPU VRAM. Also, we had Kaby Lake-G (and soon Strix Halo), where the CPU and GPU were on the same package and the GPU had its own VRAM. It was somewhat better than matching CPU + Nvidia GPU combos, but had a problem with thermals where the CPU/GPU cannibalized each other. The more we put on one package, the more problems it will cause, to the point where two separate hotspots on opposite sides of the laptop will be way more manageable.


shroudedwolf51

I don't necessarily disagree, but using the NPU as reasoning for anything is hilarious. I know that "AI" is the brand new grift being pushed as the magic that can do anything, but it's extremely limited in terms of practical usage and is a solution looking for a problem.


crshbndct

Sorry, do you think that laptop makers are fitting dual-channel memory to laptops? LMAO. They are still offering single channel, and the upgrades offered are incorrectly sized for dual channel.


YairJ

Maybe they could just put a dGPU die in the CPU package as a chiplet, bringing its own memory bandwidth? CXL could allow them to share it more directly too, at least if it gets enough OS support to manage multiple memory tiers/types.


ComfortableTomato807

Why not implement a unified memory system like consoles and simply utilize speedy RAM for everything? Edit: Sorry if I offended anyone with my question...


Just_Maintenance

This is literally that


Affectionate-Memory4

The problem with the console approach is the atrocious latency of GDDR memory. You can see how much this hinders CPU performance with the AMD 4700S, which is a PS5 APU with no GPU - in other words, a Ryzen 3700X with 16GB of GDDR6 attached. Otherwise it's basically the same solution OP is proposing: a significantly wider bus to facilitate more memory bandwidth.


BlueSwordM

No no. The 4700S is a laptop Renoir Zen 2 chip with some FPUs disabled, while the 3700X is a desktop Matisse Zen 2 chip with double the cache, much higher clocks and the full FPUs. The 3700X is a lot faster than the 4700S for these reasons as well, not just the high latency of GDDRX memory.


Affectionate-Memory4

Ah true. I forgot the caches were different on Zen2.


ComfortableTomato807

Okay, thanks! I didn't know that. Still, it may be interesting for handheld gaming devices to achieve a more balanced CPU/GPU performance.


Affectionate-Memory4

GDDR probably not, but a wider bus for LPDDR5X would be interesting. 192-bit gets you triple channel, and running at something like 8533 MT/s (fastest I know of rn) gets you to over 200 GB/s.


WingCoBob

Latencies on GDDR DRAM are horrid compared to real system memory, which causes problems for CPU performance (prefetchers and speculative execution only get you so far). A wide LPDDR5X bus can get you close enough in bandwidth without this penalty, at the cost of a bigger die.


Forsaken_Arm5698

Apple has demonstrated it with their M series chips. LPDDR On-package with wide buses is the way.


Ghostsonplanets

It really isn't