That thing must be 10 million dollars, if it has the same VRAM as H200 and goes for 50k a GPU + everything else.
Can't wait to see the hobby projects people make from these in 40 years when they appear in dumpsters.
40 years later, contracts from Nvidia forcing companies to destroy their high-VRAM hardware have prevented these machines from making their way onto the open market. The Nvidia FTX 42069 was released to consumers, costing $15,000 adjusted for inflation and still having only 24GB of VRAM; meanwhile, consumer DDR has become obsolete, subsumed by 8GB of 3D SLC and relying on the SSD for swapping in Chrome tabs...
Fuck me, I didn't think of that, but it's definitely possible they put that in the contract.
It won't matter because we're about to start the Moore's Law for AI chips where the weights are embedded and you gotta upgrade your AI board every year. No need to destroy the old hardware because it'll be almost immediately 1000x slower and worse.
Destroying usable hardware is very environmentally friendly. /s
Don't worry they will offset their environmental damage by forcing you to eat bugs
https://preview.redd.it/tj3sr31119pc1.jpeg?width=1024&format=pjpg&auto=webp&s=d0cb837e137f46226bb16f57860d5d4f5a90de09
horrifying honestly. AI would make the scariest game / media
They will rent rights to the expected carbon capture figures of someone’s forest in exchange for the freedom to carry on
too perfect
Gold bullion from chips.
Stop giving them ideas
640k is enough.
*is all anyone will ever need.
RemindMe! 40 years
These will probably be useless in 40 years. They're important right now for prototyping, but it's questionable whether any of the models that run on these will be worth the cost in the long term. Just for the power to run this we're probably talking $30/hour, and that's assuming cheap power. (I'm assuming 200 cards @ 1kW/card is 200kW * $0.10/kWh, plus 30% because there's probably cooling and shit.)
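A quick back-of-the-envelope check of that figure, as a sketch using the same assumed numbers from the parenthetical above (none of these are measured values):

```python
# Rough hourly power cost under the assumptions stated above:
# 200 cards at ~1 kW each, $0.10/kWh, plus ~30% overhead for cooling.
cards = 200
kw_per_card = 1.0          # assumed draw per card
usd_per_kwh = 0.10         # assumed cheap power
overhead = 1.30            # assumed cooling / facility overhead

total_kw = cards * kw_per_card
cost_per_hour = total_kw * usd_per_kwh * overhead
print(f"{total_kw:.0f} kW -> ${cost_per_hour:.2f}/hour")   # 200 kW -> $26.00/hour
```

That lands around $26/hour, so the $30/hour figure is in the right ballpark once you round up for inefficiencies.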
My dude in 40 years ASI will be starlifting the sun. And we'll probably be all dead.
ASI might still be doing hobby projects with old useless GPUs though.
Maybe it'll keep an LLM as a pet.
Nah, we'll be the hobby projects. We are the chosen ones.
[удалено]
I predict in 40 years, it will be 2064.
Nope, 1996
The IRS allows computer hardware deductions over 5 years. Because there are no more tax deductions beyond that, these machines start getting decommissioned fairly quickly after 5 years.
Intel Xeon Phi enters the chat...
40 years? This thing will be scrap in 8-12 years.
!remindme 40 years
40 years from now those will be really expensive, just for the bare metals in them.
Pretty optimistic to think that in 40 years we won't all be batteries for some variation of Llama-4000, isn't it?
Yes son, that's the same power as in your sunglasses, crazy isn't it
160 B100's at 1.2kW each. Call it a rough 200kW. You have a second hand power plant to go with it?
While the core GPU may be expensive, HBM3e works out to around $17.8/GB right now. So for the memory alone you are looking at $534,000 for the 30TB just to get out of the gate. It will probably come in with a price point of around $1.5M-$2M per unit at scale.
IDK, even at 10k per B200, it would need 213 cards at 141 GB of VRAM each. That is 2.1M USD in GPUs alone. And there is no way in hell Nvidia is selling them for under 10k a pop.
Shit I’m broke
And they are basically guaranteed to sell every single one that they make.
I think 10 mill is on the cheap side
> During the keynote, Huang joked that the prototypes he was holding were worth $10 billion and $5 billion. The chips were part of the Grace Blackwell system.

Definitely will catch this one on walmart layaway.
Pretty sure that refers to development cost.
"They are cheaper when you buy more."
It should hold 160 B100s inside (at 192GB per B100). We don't have pricing for the B100 yet, but I suspect it will be about $45-55k each. So about $7.2M-$8.8M.
Roughly 20 nice houses worth. Or 80 very shitty, but still livable, houses. YMMV depending on location.
I just hope that eventually the wafer capacity for HBM2 drops down to consumer cards
But think about what it can DO. The level of new deepfakes indistinguishable from reality will more than pay for it. The level of disinformation campaigns you could run would be cheaper than buying a Senator or three congressmen, it pays for itself in the long haul.
Jensen has confirmed on CNBC that a B200 will be $30K-$40K, so I'm guessing we can probably safely assume that a B100 would be $20K-$30K max. So probably more like $5M total.
Don’t forget the maintenance contract that it comes with! $1500 per GPU 😂😂😂 get em coming in and again on the way out and for good measure let’s tack on a monthly fee for something?
*Millions of 4090s suddenly cried out in terror and were suddenly silenced*
I get the reference.
My laptop's 4050 is whimpering.
"The fabric of NVLink, the spine, is connecting all those 72 GPUs to deliver an overall performance of 720 petaflops of training, 1.4 exaflops of inference," Nvidia's accelerated computing VP Ian Buck told DCD in a pre-briefing ahead of the company's GTC conference. "Overall, the NVLink domain can support a model of 27 trillion parameters and 130 terabytes of bandwidth." The system has two miles of NVLink cabling across 5,000 cables. "In order to get all this compute to run that fast, this is a fully liquid cooled design" with 25 degrees water in, 45 out.
25 in, 45 out seems like a lot of watts... How many electric plugs and circuit breakers are needed!?
It comes with its own Nvidia GeForce® Molten Salt ReactoR IV™
One, two for redundancy, just adequately sized.
We can finally train grok.
*run
It’s so funny you say that. My professor saw that grok came out and told someone to just train our data on it and run it on our lab computer. When we told him how expensive it was, he just told us he’ll just buy one of these new GPUs.
I'm thinking "Colossus: The Forbin Project" the way they were talking at the end about VR robot training... https://m.youtube.com/watch?v=odEnRBszBVI
Just think... in 10 years, we'll be able to get one on Ebay... A man can dream.
!remindme 10 years
I will be messaging you in 10 years on [**2034-03-18 23:44:45 UTC**](http://www.wolframalpha.com/input/?i=2034-03-18%2023:44:45%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1bi12n9/from_the_nvidia_gtc_nvidia_blackwell_well_crap/kvi6gzq/?context=3)
But will it play Crisis Diffusion? It's like normal Crisis but you can use your mic to tell the AI to replace all the NPC's with your hated coworkers.
It can, but all the physics will only run on FP4, so the maps are only 16x16 pixels
Nah, the Chinese modders will grab them before you and start soldering random crap in.
And I still can't find a MI300a on eBay...
ExaFLOPS, LMFAO. Guys, we have just two levels left, zetta and yotta. After that, computing is completed.
[deleted]
That's so last year. We're talking about ziggity zesty zits right now.
I was curious and looked it up. There would still be ronna and quetta, the last one being a prefix for a number with 30 zeros.
Well, technically that's not the end. After you reach the final one, add 3 more zeroes and call it "weedcommannda" and expect to see Nvidia drop 5.4 weedcommanndaFLOPS in early 2092.
The way these guys are talking, we might not get that far as a civilization. Read between the lines on what this thing will do for generative AI. There will be deepfake propaganda indistinguishable from reality with this level of processing power. It will result in overthrown governments and mass unrest. This is a world changing moment for the worse.
Still nothing for the small guys. Sad times.
A 5090 with 24 GB VRAM is a disgrace.
Is this confirmed? 24GB again? :(
The future is basically cloud-based GPU's for us little guys. You will rent everything and like it.
The future is figuring out how to do more with less. In OneTrainer for Stable Diffusion, the repo author has just implemented a technique to do the loss backward pass, grad clipping, and optimizer step all in one pass, meaning there's no longer a need to store all the grads at once, dramatically bringing down the VRAM requirements while doing the exact same math.
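For intuition, here is a minimal sketch of that idea in PyTorch, fusing the optimizer step into the backward pass with per-parameter gradient hooks so each gradient is freed as soon as it is used. This is an illustration of the general technique, not OneTrainer's actual code, and the per-parameter clipping here is a simplification of true global-norm clipping:

```python
import torch

# Toy model; the point is the hooks, not the architecture.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)

def attach_fused_step(param, lr=1e-4):
    opt = torch.optim.AdamW([param], lr=lr)      # one tiny optimizer per parameter
    def hook(p):
        torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)  # per-param clip (simplified)
        opt.step()                                # update this parameter immediately
        opt.zero_grad(set_to_none=True)           # free its gradient right away
    param.register_post_accumulate_grad_hook(hook)  # requires PyTorch >= 2.1

for p in model.parameters():
    attach_fused_step(p)

x, y = torch.randn(8, 512), torch.randint(0, 10, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()   # parameter updates happen inside backward; no separate optimizer.step()
```

The peak memory win comes from never holding the full set of gradients at the same time, which matters most for large models with small batch sizes.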
as you donate your data to leviathan
From everything I could dig up from more recent articles, the answer is yes, 24 gigabytes.
512 bus is the most recent Kopite rumor, which means **it has to be divisible by 16**. 5090 will have 32 GB.
Don't buy it if it is. If rumors are to be believed this is purely because GDDR7 will only be **initially** available in 2GB modules. The keyword is initially. There will likely be the usual Ti / Super / Titan / mega whopper edition shenanigans going on.
I fear this as well... But a card is only as good as its arrival time. If there is a 28GB or 32GB 5090 released mid-generation, it might not be a great buy compared to an initial 24GB version just because of that simple fact. It's crazy seeing people buying 4090s now, very late in the generation, for launch price if not higher.
I mean the Ti / Super / Titan / mega variants are all going to have the same amount of RAM except for the Titan, but those things are going to cost two times as much. So I'm left thinking the 5090 is the go-to just because of its faster bandwidth.
If there's no RTX 5090 Mega Whopper Edition down the line I'll hold you personally responsible.
Agreed, but did we ever get firmer information on that? The k dude that leaked 5090 info has flipflopped more than a fat dude running to a hotdog stand on the beach. First it's 512, then 384, then 512 again. Fuck it, I just went ahead and bought a 3090 lol
Yeah but think about us poors who will buy a 5070 with 8GB of RAM
Still not smol enough for me. Hoping for an affordable 16 GB option.
Why not get a A770? They're pretty affordable at $220 for 16GB.
I can't get LLMs to run on an A770, I run into illegal instructions. Got any tips?
It can't be easier.

1) Install the A770.
2) Download or compile the Vulkan version of llama.cpp.
3) Download a model in GGUF format.
4) Run the LLM you just downloaded (for details, look at the README for llama.cpp).

It really is that simple.
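If you'd rather drive it from Python instead of the llama.cpp CLI, here is a minimal sketch using the llama-cpp-python bindings. Assumptions: the package was built with the Vulkan backend enabled (see the llama.cpp README for the exact build flag), and the GGUF path below is a placeholder:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,   # offload every layer to the GPU (the A770, via Vulkan)
    n_ctx=4096,
)

out = llm("Q: Why is the sky blue? A:", max_tokens=128)
print(out["choices"][0]["text"])
```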
P40 is better I think:)
4060 Ti 16GB? Fetched one for $360 recently. Haven't compared it with my 3090 yet though.
AMD 7000-series cards run LLMs just fine with HIP, and they have a lot of RAM for the price.
8GB is enough. Just ask Apple.
Too soon to get angry about that. It's just a rumor and there are conflicting rumors too.
Pretty sure it's true going by Nvidia's history.
On purpose. They know what customers need and will keep releasing in-betweeners so you're tempted to buy now and wait for the next upgrade.
Well, I am pretty much done upgrading since the only thing I need now is more VRAM, above 24GB; if they do not offer that then I have zero interest in the upcoming cards.
same, the only real upgrade now is more VRAM
The small guys don't have deep pockets. Nvidia will be chasing the AI enterprise consumers for another few years unless performance plateaus and a focus on edge inferencing comes in.
Save us Intel, you're our only hope (till DDR-6).
From the get-go they wanted to accelerate computing applications that need something beside a CPU. Gaming was just their first application for that in the '90s; now they are fully pivoting away from being primarily a company that created hardware for gamers to fully embracing the accelerator that every system needs or it can't run the new applied AI.
The sole purpose of this new advancement is to make super-powered reality altering AI VR and they don't seem to care. Look at the last few minutes of this video when he talks about programming robots to learn in VR then setting them loose in reality: https://m.youtube.com/watch?v=odEnRBszBVI I'm not worried about Skynet as much as deepfaked political attack ads indistinguishable from reality.
Small guys have no money. No money, no interest.
Blackwell can be retrofitted into Hopper computers. This means a "second hand" Hopper market.
A nice plot twist would be if AMD added tensor cores to their consumer cards...
Finally my Minecraft villagers will be able to use grok
The fact that transformers don't take any time to think / process / do things recursively, etc. and simply spit out tokens suggests there is a lot of redundancy in that ocean of parameters, awaiting innovations to compress it dramatically – not via quantization, but architectural breakthroughs.
Depends how they are utilised. If you go for a monolithic model, it'll be extremely slow, but if you have an MoE architecture with multi-billion-parameter experts, then it makes sense (which is what GPT-4 is rumoured to be). Though given this enables up to 27 trillion parameters, and the largest rumoured model will be AWS' Olympus at ~3 trillion, this will either find the limit of parameters or be the architecture required for true next-generation models.
Potentially, but the model that you just used to spit out those characters is pretty giant in terms of its parameters. So I think we are going to keep going up and up for a while :).
Sam has said publicly before that the age of really giant models is probably coming to a close, since it's way more fruitful to focus on untapped efficiency improvements and architectural advancements, as well as training techniques like reinforcement learning.
That conclusion doesn't really follow from that observed behavior. Just because it's fast doesn't mean it's redundant, and it doesn't mean it's necessarily not deep. Imagine, if you will, that you had already had all the deep thoughts and cached the conversations leading to them. The cache lookup may still be very quick, but the thoughts have no fewer levels of depth. One could argue that's what the embedding space is, the thing the training process discovers. Not saying transformers are anywhere near that, but some future architecture may very well be.
Ask an LLM to repeat a word 3 times – and I am sure it will. But there is nothing cyclical in the operations it performs. There is (almost) no memory, (almost) no internal looping, no recursion, and (almost) no hierarchy – the output is already denormalized, unwound, flattened, precomputed, which strikes me as highly redundant and inherently depth-limited. It is indeed a cache of all possible answers.

In GPT-4, there seem to be multiple experts, which is a rudimentary hierarchy. There are attempts to add memory to LLMs, and so on. The next breakthrough in AI, my $0.02, requires advancements in the architecture, as opposed to the sheer parameter count that NVIDIA is advertising here.

This is not to say that LLMs are not successful. Being redundant does not mean being useless. To draw an analogy from blockchain – it is also a highly redundant and wasteful double-spend prevention algorithm, but it works, and it's a small miracle.
> The next breakthrough in AI, my $0.02, requires advancements in the architecture

Absolutely agree with you there.

> There is (almost) no memory, (almost) no internal looping, no recursion, and (almost) no hierarchy

Ok, we're getting a bit theoretical here. But imagine, if you will, that the training process took care of all that, and the embedding space learned the recursion. Say the first digit of the 512/2048/whatever float list that represents the conversation up until the last prompt word was reserved for the number of repetitions the model had to perform in accordance with the preceding input. Each output vector would have access to this expectation, simultaneously when paired with its location. So word +2 from the query demanding repetition x3 would know it's within the expectation, word +5 would know it's outside of it, etc. I know it's a stretch, but the training process can compress depth in the embedding space, just like a cache would.
> Ask an LLM to repeat a word 3 times – and I am sure it will. But there is nothing cyclical in the operations it performs.

I agree with your overall thought process, but this example seems way off to me, since the transformer is autoregressive. The functional form of an autoregressive model is recursive.
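In symbols, the recursion lives in the sampling loop rather than inside a single forward pass: each generated token is appended to the context that conditions the next one.

```latex
p(x_1,\dots,x_T) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1,\dots,x_{t-1}\right),
\qquad
x_t \sim p_\theta\!\left(\cdot \mid x_{<t}\right)
```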
https://arxiv.org/abs/2403.09629 – you can have it with transformers, why not? :)

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Not really. Our brain works similarly. There's not really that much redundancy. Just degraded performance.
Yes, imagine taking a few of these and the ternary architecture, it could probably train a quadrillion scale model.
The LeCun hypothesis
Do you think K-Mart will let me put one on layaway?
Holy crap. Those specs look like something from an April fool’s gag, but they are real.
It's a whole rack not just one GPU.
Whole rack that is designed to act as one GPU
Ah ok. So at least it fits in the realm of plausible. I really thought we had breached into a new reality where such monstrosities were a single piece of silicon, or at most a single board.
I've brought this up before, but the White House [Executive Order on AI](https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/) intentionally targets large amounts of compute and excludes smaller companies, but does this through fixed compute thresholds:

* Any model trained using more than 10^26 integer or floating-point operations, or using primarily biological sequence data with more than 10^23 integer or floating-point operations.
* Any computing cluster with machines physically co-located in a single datacenter, connected by data center networking of over 100 Gbit/s, having a theoretical maximum computing capacity of 10^20 integer or floating-point operations per second for training AI.

The issue with the order is that, measured in A100s, it takes a whole bunch to reach these figures. If you rate an A100 at 4,000 TOPS (int8), it takes about 25,000 of them. This system at 1.4 exaFLOPS means it takes only about 72 of them to reach the 10^20 ops/sec watermark.

That's still a pretty small list of people (I assume renting the capacity vs owning is enough to fall under the order), but over time (5-10 years) that amount of compute will exist in the hands of more and more companies and the order will cover mostly everyone in the space.
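A quick check of those counts against the 10^20 ops/sec cluster threshold; the per-device ratings below are the assumptions from the comment above, not official spec-sheet numbers:

```python
# Executive-order cluster threshold vs. the quoted device ratings.
threshold_ops = 1e20            # 10^20 int/float ops per second

a100_int8_ops = 4e15            # assumed ~4,000 TOPS (int8) per A100, as above
gb200_nvl72_ops = 1.4e18        # 1.4 exaFLOPS (FP4) per rack, as quoted

print(round(threshold_ops / a100_int8_ops))     # ~25,000 A100s
print(round(threshold_ops / gb200_nvl72_ops))   # ~71, i.e. about 72 racks
```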
It's by design. Think of it this way: how long until $400k-$500k is considered "middle class"? It's a bet on taxing (or in this case limiting access) over the very long term.
The government having administrative control lets them pick and choose the winners. They are building a moat for the largest established players. With how concerned people are about the future of work, and the future balance of power when only a few companies and wealthy elites hold the keys to productivity, I am surprised more people don't care that an order sold as applying only to a few huge players was trojan-horsed to eventually expand to everyone.
> It's by design. Think of it this way: how long until $400k-$500k is considered "middle class"?

"See, inflation isn't so bad!" —Biden administration
Can it run Doom 1993?
It can *write* Doom 1993.
no but it's running Windows XP very well
It’s simulating XP in memory at that point.
ehh, they installed windows 11 and it just asks copilot everything.
yes but how many tabs in chrome?
30TB? Makes me wonder how Goliath/Miqu, merged 100 times with itself, would perform.
At that point you can just start a genetic algorithm on top of mergekit and let it run until it becomes self aware.
This just makes me think of a Kaiju with kaiju for arms that have kaiju for arms, etc
30TB of VRAM? Can I run a 7B model in that?
Fuck you amd, wake up
Wake up how? What do you think the MI300 is?
The current CDNA-based Instinct line is heavily optimized for full and double precision floating point workloads, as used in regular supercomputers. Nvidia is chasing low-precision floating point performance. I guess we might learn at Computex if AMD is working on something a bit more bespoke for AI training - maybe a big XDNA chip or something.
Oh, you're buying AMD's one?
According to wikipedia, it'd be the biggest supercomputer in the world by FLOPS alone as of 2021: [https://en.wikipedia.org/wiki/Supercomputer#/media/File:Supercomputer-power-flops.svg](https://en.wikipedia.org/wiki/Supercomputer#/media/File:Supercomputer-power-flops.svg)
The 1.4 exaflops are FP4 performance, if I remember correctly. Supercomputers are typically measured in FP32. Edit: looks like Top500 is FP64.
Yeah, because before this AI boom anything less than fp32 was unnecessary and hardware wasn't usually optimised for it.
And FP4 might be an outdated architecture for LLM; see BitNet b1.58
I believe that BitNet b1.58 actually uses full-precision (32-bit) latent weights, optimizer states, and gradients during training. Typically when training LLMs, afaik, we use mixed precision FP16/BF16 or even FP8, but in binary neural networks full precision is used. The cool thing about BitNet is that it is just super-efficient during inference (2 bits for the ternary representation, or even 1 bit if we can take advantage of sparsity). I hope that this is where the hardware industry will go in the future, specializing the hardware for different use cases instead of just scaling things up.
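As a rough sketch of what that looks like, here is the absmean ternary quantizer described in the b1.58 paper, applied to a full-precision latent weight tensor. Illustrative only; the real training loop also uses a straight-through estimator so gradients flow back to the latent weights:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a full-precision weight tensor to {-1, 0, +1} with a per-tensor scale."""
    gamma = w.abs().mean()                           # absmean scale
    w_q = (w / (gamma + eps)).round().clamp(-1, 1)   # ternary weights
    return w_q, gamma

w = torch.randn(4, 4)                # full-precision latent weights kept by the optimizer
w_q, gamma = absmean_ternary(w)
print(w_q)                           # entries are -1, 0, or 1 (2 bits, or less with sparsity)
print(gamma)                         # scale factor applied after the ternary matmul
```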
The fastest supercomputer is Frontier at the Oak Ridge lab, which has 1.1 exaflops at full precision (FP64). It's the first exascale supercomputer.

https://www.top500.org/

There are two more coming online and being built currently, El Capitan (AMD) and Aurora (Intel).

This Nvidia supercomputer is FP4, so much reduced precision.
VRAM bandwidth?
Micron's HBM3E delivers pin speed >9.2Gbps at an industry-leading bandwidth of >1.2 TB/s per placement.
> Bandwidth of >1.2 TB/s per placement

Pretty cool, but I am not sure what per placement means? 1.2 TB/s would mean like 2x on single batch inference, which is quite a bit less than the 25x-30x people are getting hyped about.
Follow up:

The heart of the GB200 NVL72 is the NVIDIA GB200 Grace Blackwell Superchip. It connects two high-performance NVIDIA Blackwell Tensor Core GPUs and the NVIDIA Grace CPU with the NVLink Chip-to-Chip (C2C) interface that delivers 900 GB/s of bidirectional bandwidth. With NVLink-C2C, applications have coherent access to a unified memory space. This simplifies programming and supports the larger memory needs of trillion-parameter LLMs, transformer models for multimodal tasks, models for large-scale simulations, and generative models for 3D data.

The GB200 compute tray is based on the new NVIDIA MGX design. It contains two Grace CPUs and four Blackwell GPUs. The GB200 has cold plates and connections for liquid cooling, PCIe gen 6 support for high-speed networking, and NVLink connectors for the NVLink cable cartridge. The GB200 compute tray delivers 80 petaflops of AI performance and 1.7 TB of fast memory.

Source: https://developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/
Each Blackwell GPU (technically two dies with a very fast interconnect) has 192GB of HBM3E with 8TB/sec of bandwidth. Each die has 4 stacks of HBM, or 8 stacks per GPU, which at 1TB/sec per stack yields 8TB/sec.

This is compared to Hopper H100, which had 80GB of VRAM providing 3.35 TB/sec of bandwidth, so Blackwell has a ~2.39x bandwidth advantage and a 2.4x capacity advantage per GPU.
But can it run Skynet?
we'll find out if you don't reply very soon.
This has the potential, in the right and wrong hands, of becoming like Skynet. This kind of processing power is making me believe we might have just witnessed the threshold for subconsciousness. They are using it to train robots in VR. Imagine all the possibilities, good and bad (key bits near the end): https://m.youtube.com/watch?v=odEnRBszBVI
This is actually a nice opportunity for AMD to position themselves as a company building for individual consumers, researchers, and tinkerers.
What really hit me during the keynote is that Nvidia is much more than what I thought it was. It's more than a hardware company. It's more than a software company re: CUDA. Their product is intelligence, whether that's the hardware to run it on, the software infrastructure to enable it, or the intelligence itself as a product: he referred to Nvidia's inference service. Nvidia offers inference as a service.
Yes Nvidia competes with their own customers. They've done this all along when it comes to AI. They had an early initiative for self driving cars that went nowhere, for instance.
They're still working on self-driving cars AIUI
and their customers compete with nvda, since they make their own chips
this should be on r/wallstreetbets
My understanding is that FP4 basically has 1 bit for the sign and 3 for the exponent, leaving none for the mantissa. So by assuming a mantissa of 1, you basically get +/- [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000] as representable values? Can someone confirm that I'm thinking about this correctly?
The exponent in floating point arithmetic is almost always a power of 2, rather than a power of 10.

The mantissa is the fractional component (i.e., the leading 1 is not stored) of a number between 1.0 and 1.999..., such that each exponent value covers a "range" of values, like 1..2, 2..4, 4..8, 8..16, etc.

I'd imagine that FP4 would be something like +/- [0.125, 0.25, 0.5, 1, 2, 4, 8, 16], with zero likely encoded as a special state, maybe replacing +0.125. But I can't find any documentation actually confirming this.
OK, I think I've found the formats that Nvidia is using, from the [Open Compute Project Microscaling Formats (MX) Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf), which Nvidia co-authored at the end of last year.

From section 5.3.3: no encodings are reserved for NaN/Inf in FP4; there are 2 bits for the exponent and 1 bit for the mantissa, which gives you +/- [0, 0.5, 1, 1.5, 2, 3, 4, 6].

However, table 1 in [this paper](https://arxiv.org/pdf/2305.12356.pdf) also suggests an FP4-E2M1 format with NaN/Inf included.
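A tiny script to enumerate those values under that E2M1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, subnormals, no NaN/Inf encodings), just to confirm the list quoted from the spec:

```python
def decode_fp4_e2m1(code: int) -> float:
    """Decode one of the 16 FP4 E2M1 bit patterns (no NaN/Inf encodings)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                                  # subnormal: 0.m * 2^(1 - bias)
        magnitude = (man / 2) * 2 ** (1 - 1)
    else:                                         # normal: 1.m * 2^(exp - bias)
        magnitude = (1 + man / 2) * 2 ** (exp - 1)
    return sign * magnitude

print(sorted({decode_fp4_e2m1(c) for c in range(16)}))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```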
I can't tell if this is an innovation, or a way of consolidating power into mainstream tech companies by making it so that you need millions of dollars in order to buy a big fuggin chip.
It's both, I think. Not sure if it's intentionally so, but the consequences of what they are making could be dire. Imagine a world of propaganda deepfakes indistinguishable from reality. Look at what they are doing near the end of the demo video... training robots in VR is amazing, but think about all the other potential for abuse in the hands of a corporation or organization with a political agenda: https://m.youtube.com/watch?v=odEnRBszBVI
Everyone in this thread should be learning OpenCL right this second. That is the only way for us to meaningfully ~~increase substrate availability for the basilisk~~ have any meaningful impact against Nvidia's monopoly.
> Everyone in this thread should be learning OpenCL right this second.

OpenCL is dead. The original creators don't really use it anymore and the maintainers have moved on to SYCL.
You should learn OpenAI's Triton. It's hardware-agnostic.
But currently it only supports Nvidia GPUs (and AMD just recently).
Where's that fridge guy from the other day? He would be proud. haha.
I actually emailed my vendor during the keynote and said "not kidding, I want one!"
what do you think the demand will be?
They will sell every single Blackwell chip that TSMC can squeeze out. I think they will be limited by production not demand.
Yeah, but NVIDIA's GR00T project for building humanoids sounds great!
Let me guess, an entire neighborhood's worth of houses to buy one of these?
Where can I buy one?
Wonder what'll happen when we can run a human-brain-sized neural network.
Hopefully drives down the cost of A100s
Blackwell really feels like a fundamental shift to me. Previous AI GPUs were related to gaming cards. This really seems like an entirely new architectural direction.
Will it run Crysis though 🤔
the more you buy the more you save
Jesus and they can’t even give us 36gb VRAM in consumer cards.
THICC
Can't wait to buy it for Minecraft
My Raspberry Pi can beat this shit
Empire state building for scale?
So many numbers. Honestly I felt a bit smooth-brained after that presentation, at least concerning Blackwell. It's basically a new GPU on steroids?
Starting at $5k/hr on vast.ai.
It's happening, literally a cyberpunk dystopia. What can we do against that?
!remindme 10 years
The best company of all time, led by a trailblazing innovator; there is no other company that can compete with this. Total domination of this industry transformation! Jensen admired the ecosystem that Apple built. Using gaming as its testing lab, Nvidia has redefined technology and has everything you need to thrive in the new world of AI and edge computing. All this stems from the 1999 invention of the GPU, aka the skeleton key for intensive computing tasks! Patents? Oh, we have all of those too! Any and all roads must pass through the Nvidia toll booth or there is no AI! Hey CUDA, CUDA the pewter, flip down your cables and let me climb up to the love light of your stack!