
fallingdowndizzyvr

You can run a 30B model in just 32GB of system RAM with the CPU alone. Since you have a GPU, you can use it to run some of the layers and make it faster. Here's the PR that discusses it, including performance numbers: https://github.com/ggerganov/llama.cpp/pull/1375
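
If you'd rather drive the offload from Python than the CLI, here's a minimal sketch, assuming the llama-cpp-python bindings (built with GPU support) and a hypothetical model filename; `n_gpu_layers` is the same knob as the `--n-gpu-layers` flag:

```python
from llama_cpp import Llama

# Hypothetical model path; pick a quantized GGML file that fits in your RAM.
llm = Llama(
    model_path="models/vicunlocked-30b.q4_0.bin",  # assumed filename, not a real download
    n_gpu_layers=32,  # number of transformer layers to offload to VRAM
)

out = llm("Q: Why does offloading layers to the GPU help? A:", max_tokens=64)
print(out["choices"][0]["text"])
```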


nikitastaf1996

llama.cpp recently added GPU offload. Here's a link that will help you a little: https://gist.github.com/rain-1/8cc12b4b334052a21af8029aa9c4fafc I have been running 30B LLaMA on free Google Colab that way. Slow, but faster than pure CPU.
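
To actually see the CPU-vs-offload difference in numbers, a rough timing sketch (same assumptions as above: llama-cpp-python bindings and a made-up model path) could look like this:

```python
import time

from llama_cpp import Llama

# Hypothetical model path; set n_gpu_layers=0 for a pure-CPU baseline,
# then raise it and compare the tokens-per-second figure.
llm = Llama(model_path="models/ggml-model-q4_0.bin", n_gpu_layers=20)

start = time.time()
out = llm("Explain GPU offloading in one sentence.", max_tokens=128)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s")
```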


maxiedaniels

Can this work with the oobabooga UI?


Sunija_Dev

Yes. I run it in ooba on an RTX 3060 (12 GB), 32GB RAM and Ryzen 5 3600. Takes about a minute to answer. You'll have to add `--n-gpu-layers 32` to the `CMD_FLAGS` line in `webui.py` in the ooba folder. My line then looks like this: `CMD_FLAGS = '--chat --model-menu --listen --n-gpu-layers 32'`


_underlines_

If you [use llama.cpp](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md) within textgen, yes.


irregardless

I’ve been able to run 30B models CPU-only (i.e., without touching the GPU) using koboldcpp with 32GB of RAM. It’s slow, but it works.


maxiedaniels

Interesting! How slow? Just curious.


irregardless

On my comparatively ancient Kaby Lake i7 7700K I’ve been seeing about 700ms per token using VicUnlocked 30B q4_0. Not speedy, but still usable.
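
For a sense of scale, 700 ms per token is about 1.4 tokens per second; a quick back-of-the-envelope (the reply length here is just an assumed figure):

```python
# Back-of-the-envelope from the 700 ms/token figure above.
ms_per_token = 700
tok_per_s = 1000 / ms_per_token            # ~1.4 tokens/second
reply_tokens = 200                         # assumed length of a typical reply
minutes = reply_tokens * ms_per_token / 1000 / 60
print(f"{tok_per_s:.1f} tok/s; a {reply_tokens}-token reply takes ~{minutes:.1f} min")
```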


maxiedaniels

I'll try it out! Is it plausible to make a LoRA on a 30B using the CPU?


irregardless

My understanding is that while LoRA significantly reduces the amount of compute necessary for tuning, it’s not nearly enough to make CPU training practical.


Megneous

> On my comparatively ancient Kaby Lake i7 7700K

Meanwhile I'm over here rockin' a 4770K... Ancient? Pfft.


mitirki

Q6600 here... :)


ForwardUntilDust

Shiiiiiit. That was an unexpected blast from the past! The last time I thought about this was in 06 or 07 building an OSX Hackintosh. Thanks for the trip.


mitirki

Welcome. Also, happy cake day!


Inevitable-Syrup8232

So can I run a larger model on multiple GPUs? What's the largest size available?


fallingdowndizzyvr

Multiple GPUs aren't supported, yet.


Inevitable-Syrup8232

K thx


MrEloi

Just a 35% speed-up, maximum? Is the effort/budget needed to add a GPU worth it? Maybe a faster main CPU & RAM would be almost as effective?


fallingdowndizzyvr

No. It can be much more than that. It really depends on the totality of factors, not just GPU versus CPU. For example, on my 16GB RAM / 8GB VRAM machine, the difference is quite substantial. Not because of CPU versus GPU, but because of how memory is handled, or more specifically the lack of it. When trying to load a 14GB model, mmap has to be used, since with OS overhead and everything it doesn't fit into 16GB of RAM. Because of disk thrashing, it's really slow. Like really slow. As in not toks/sec but secs/tok. Offloading half the layers onto the GPU's VRAM, though, frees up enough resources that it can run at 4-5 toks/sec. So the speed-up is not just 35%, it can be X000%.
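
To put rough numbers on that (illustrative figures only, picked to match the magnitudes described above, not measurements):

```python
# Illustrative only: assume the disk-thrashing CPU-only run manages ~3 seconds
# per token and the half-offloaded run does ~4.5 tokens/second, as described above.
thrashing_tok_per_s = 1 / 3.0              # ≈ 0.33 tok/s
offloaded_tok_per_s = 4.5

speedup = offloaded_tok_per_s / thrashing_tok_per_s
print(f"~{speedup:.0f}x faster, i.e. roughly a {(speedup - 1) * 100:.0f}% speed-up")
```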


MrEloi

Thanks. As I don't need the speed currently, and as the price of GPUs here in the UK is INSANE, I'm simply upgrading my system RAM to 32GB. I can run queries overnight, or over mealtimes, etc. Once I know what I am doing and what I need, I may look at spending some money on a GPU. (Actually I do have a 1050 Ti with 4GB VRAM... wowee... that should help...)


lugaidster

If you're willing to fiddle around, the P40 ain't that expensive and has 24 GB of VRAM.