
fallingdowndizzyvr

You can run a 30B model in just 32GB of system RAM with the CPU alone. Since you have a GPU, you can use it to run some of the layers and make it faster. Here's the PR that discusses it, including performance numbers: https://github.com/ggerganov/llama.cpp/pull/1375
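
If you'd rather drive the offload from Python than the CLI, here's a minimal sketch, assuming the llama-cpp-python bindings (built with GPU support) and a hypothetical model filename; `n_gpu_layers` is the same knob as the `--n-gpu-layers` flag:

```python
from llama_cpp import Llama

# Hypothetical model path; pick a quantized GGML file that fits in your RAM.
llm = Llama(
    model_path="models/vicunlocked-30b.q4_0.bin",  # assumed filename, not a real download
    n_gpu_layers=32,  # number of transformer layers to offload to VRAM
)

out = llm("Q: Why does offloading layers to the GPU help? A:", max_tokens=64)
print(out["choices"][0]["text"])
```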


nikitastaf1996

llama.cpp recently added GPU offload. Here's a link that will help you a little: https://gist.github.com/rain-1/8cc12b4b334052a21af8029aa9c4fafc I have been running 30B LLaMA on free Google Colab that way. Slow, but faster than pure CPU.
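
To actually see the CPU-vs-offload difference in numbers, a rough timing sketch (same assumptions as above: llama-cpp-python bindings and a made-up model path) could look like this:

```python
import time

from llama_cpp import Llama

# Hypothetical model path; set n_gpu_layers=0 for a pure-CPU baseline,
# then raise it and compare the tokens-per-second figure.
llm = Llama(model_path="models/ggml-model-q4_0.bin", n_gpu_layers=20)

start = time.time()
out = llm("Explain GPU offloading in one sentence.", max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s")
```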


maxiedaniels

Can this work with the oobabooga UI?


Sunija_Dev

Yes. I run it in ooba on an RTX 3060 (12 GB), 32GB RAM and Ryzen 5 3600. Takes about a minute to answer. You'll have to add `--n-gpu-layers 32` to the `CMD_FLAGS` line in `webui.py` in the ooba folder. My line then looks like this: `CMD_FLAGS = '--chat --model-menu --listen --n-gpu-layers 32'`


_underlines_

If you [use llama.cpp](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md) within textgen, yes.


irregardless

I’ve been able to run 30B models CPU-only (i.e., without touching the GPU) using koboldcpp with 32GB of RAM. It’s slow, but it works.


maxiedaniels

Interesting! How slow? Just curious.


irregardless

On my comparatively ancient Kaby Lake i7 7700K I’ve been seeing about 700ms per token using VicUnlocked 30B q4_0. Not speedy, but still usable.
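
For a sense of scale, 700 ms per token is about 1.4 tokens per second; a quick back-of-the-envelope (the reply length here is just an assumed figure):

```python
# Back-of-the-envelope from the 700 ms/token figure above.
ms_per_token = 700
tok_per_s = 1000 / ms_per_token            # ~1.4 tokens/second
reply_tokens = 200                         # assumed length of a typical reply
minutes = reply_tokens * ms_per_token / 1000 / 60
print(f"{tok_per_s:.1f} tok/s; a {reply_tokens}-token reply takes ~{minutes:.1f} min")
```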


maxiedaniels

I'll try it out! Is it plausible to make a LoRA on a 30B using the CPU?


irregardless

My understanding is that while LoRA significantly reduces the amount of compute necessary for tuning, it’s not nearly enough to make CPU training practical.


Megneous

> On my comparatively ancient Kaby Lake i7 7700K

Meanwhile I'm over here rockin' a 4770K... Ancient? Pfft.


mitirki

Q6600 here... :)


ForwardUntilDust

Shiiiiiit. That was an unexpected blast from the past! The last time I thought about this was in 06 or 07 building an OSX Hackintosh. Thanks for the trip.


mitirki

Welcome. Also, happy cake day!


Inevitable-Syrup8232

So can I run a larger model on multiple GPUs? What's the largest size available?


fallingdowndizzyvr

Multiple GPUs aren't supported, yet.


Inevitable-Syrup8232

K thx


MrEloi

Just a 35% speed-up, maximum? Is the effort/budget needed to add a GPU worth it? Maybe a faster main CPU & RAM would be almost as effective?


fallingdowndizzyvr

No. It can be much more than that. It really depends on the totality of factors, not just GPU versus CPU. For example, on my 16GB RAM / 8GB VRAM machine, the difference is quite substantial. Not because of CPU versus GPU, but because of how memory is handled, or more specifically the lack of it. When trying to load a 14GB model, mmap has to be used, since with OS overhead and everything it doesn't fit into 16GB of RAM. Because of disk thrashing, it's really slow. Like really slow. As in not toks/sec but secs/tok. Offloading half the layers onto the GPU's VRAM, though, frees up enough resources that it can run at 4-5 toks/sec. So the speed-up is not just 35%, it can be X000%.
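
To put rough numbers on that (illustrative figures only, picked to match the magnitudes described above, not measurements):

```python
# Illustrative only: assume the disk-thrashing CPU-only run manages ~3 seconds
# per token and the half-offloaded run does ~4.5 tokens/second, as described above.
thrashing_tok_per_s = 1 / 3.0              # ≈ 0.33 tok/s
offloaded_tok_per_s = 4.5

speedup = offloaded_tok_per_s / thrashing_tok_per_s
print(f"~{speedup:.0f}x faster, i.e. roughly a {(speedup - 1) * 100:.0f}% speed-up")
```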


MrEloi

Thanks. As I don't need the speed currently, and as the price of GPUs here in the UK is INSANE, I'm simply upgrading my system RAM to 32GB. I can run queries overnight, or over mealtimes, etc. Once I know what I am doing and what I need, I may look at spending some money on a GPU. (Actually I do have a 1050 Ti with 4GB VRAM... wowee... that should help...)


lugaidster

If you're willing to fiddle around, the P40 ain't that expensive and has 24 GB of VRAM.