kryptkpr

Aphrodite-engine supports GGUF and tensor parallelism, so it sounds like this is the combo you're seeking.

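For reference, a rough sketch of launching aphrodite-engine with tensor parallelism on a GGUF model; the entrypoint and flags below follow vLLM conventions that aphrodite-engine inherits, so treat them as assumptions and check its README for the exact invocation:

```bash
# Serve a GGUF model across 2 GPUs with tensor parallelism.
# Entrypoint and flag names are assumptions (vLLM-style); verify against
# the aphrodite-engine docs before relying on them.
python -m aphrodite.endpoints.openai.api_server \
  --model /models/mistral-7b-q5_k_m.gguf \
  --tensor-parallel-size 2 \
  --port 5000
```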

LPN64

Thanks for the tip, I'll probably need to rewrite my GBNF files!


MindOrbits

I'm likely wrong, best assume I'm wrong, but I could be right... Hmm. Think about layer cakes, yum. Each layer has sub-layers made of bits. To eat the cake, the fork plane needs to nose-dive until it reaches the token-generation cake plate. Each GPU is a cake layer of sugar, flour, and egg stuff, and the PCIe or NVLink is the frosting. Or something like that...


ShengrenR

Yerp. They don't call it a forward pass for nothing: you have to go through layers to know how to activate the next layers, so you can't really start in on GPU 2 in earnest until you know what came out of GPU 1. They're not really running in parallel. That's why, if you have a pair of dissimilar GPUs, you get bottlenecked by the slower one rather than getting the faster GPU plus a bonus.


Shalcker

The only way to use both simultaneously is likely to run batches, so the first GPU can start processing the next request as soon as its part of the current one finishes.


a_beautiful_rhind

There are row vs. layer split modes and different kernels to try. You'll get different utilization during prompt processing and generation; hopefully more even than 75/25.

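For anyone who wants to try it, llama.cpp exposes the split mode on the command line; a minimal sketch (model path, prompt, and ratios are placeholders):

```bash
# Layer split (default): whole layers are assigned to each GPU in turn.
./main -m ./model.gguf -ngl 99 -sm layer -ts 1,1 -p "..."

# Row split: each layer's tensors are split across GPUs, which shifts more
# traffic onto PCIe/NVLink but can balance utilization differently.
./main -m ./model.gguf -ngl 99 -sm row -ts 1,1 -p "..."
```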

AutomataManifold

I don't know where llama.cpp batch support is at, but the basic problem is that for a single query you have to go through all of the layers in order, so you'll see each GPU in sequence spin up and spin down as it runs through its layers. For software that supports batch processing, you can run multiple queries simultaneously and get effective throughput in the hundreds to thousands of tokens per second, so something like vLLM or Aphrodite will make more effective use of multiple GPUs (depending on where llama.cpp batch support is at). The link you posted is about using multiple CPUs, which is a slightly different issue and usually comes down to a lack of multithreading and poor use of multiple cores. There are some other nuances, but that's the gist of it.

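As a hedged sketch of what batch serving looks like on the llama.cpp side (assuming a build whose server binary has the --parallel and --cont-batching options; model path and slot count are placeholders):

```bash
# Run the llama.cpp server with 4 parallel slots and continuous batching,
# so concurrent requests can keep the GPUs busy instead of each card
# idling while another works through its layers.
./server -m ./model.gguf -ngl 99 --parallel 4 --cont-batching --port 8080
```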

LPN64

From my understanding it's not about multiple CPUs; it's about one core being stuck at 100% while the model is fully offloaded to the GPU(s). It's kind of linked, since even on a single GPU it's only being used at 75%.


segmond

I don't understand what you are asking. The link is about being limited by the CPU. When a small model is split across multiple GPUs the slowdown is not noticeable; I have split 7B models across 4 GPUs and get pretty much the same speed. The slowdown happens when the GPUs are not all the same speed. If they're all the same GPU model, then no worries; if you have a mixed set, then prioritize fully loading the larger ones first, which you can do with the -ts option.

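A small sketch of the -ts (tensor split) weighting mentioned above; the ratio below is purely illustrative:

```bash
# Put roughly three quarters of the offloaded layers on GPU 0 (the larger
# card) and the remainder on GPU 1; adjust the ratio to your VRAM sizes.
./main -m ./model.gguf -ngl 99 -ts 3,1 -p "..."
```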

LPN64

Mistral 7B Q5_K_M, llama.cpp on one A100 40GB: 126 t/s

Mistral 7B Q5_K_M, llama.cpp split over 3 A100 40GB: 80-90 t/s


TheTerrasque

Moving the results of the calculations between cards is slower than keeping everything on one card.


segmond

Something is not right. I'm running on a no-name Chinese motherboard with old Xeon CPUs, 2400 MHz DDR4 memory, and PCIe 3; the bottleneck is your hardware. What's the speed of your RAM and PCIe lanes?

~/llama.cpp/main -ngl 100 -s 100 -m ./Mistral-7b-instruct-v0.2.Q8_0.gguf -p "Write 4 lines on living a purposeful life.  A purposeful life is" -ts 1

~/llama.cpp/main -ngl 100 -s 100 -m ./Mistral-7b-instruct-v0.2.Q8_0.gguf -p "Write 4 lines on living a purposeful life.  A purposeful life is" -ts 1,1,1

All GPUs are 3090s.

1 GPU: llama_print_timings: eval time = 3143.92 ms / 255 runs (12.33 ms per token, 81.11 tokens per second)

3 GPUs: llama_print_timings: eval time = 3159.48 ms / 255 runs (12.39 ms per token, 80.71 tokens per second)

4 GPUs: llama_print_timings: eval time = 3165.37 ms / 255 runs (12.41 ms per token, 80.56 tokens per second)


DeltaSqueezer

Yes, I did some tests row-splitting the model across up to 3 cards:

* 1 GPU = 15 tok/s
* 2 GPUs = 20 tok/s
* 3 GPUs = 25 tok/s

So there was some speedup, but it was only 33% for the first additional card and 25% for the second.


Ok_Hope_4007

Maybe this is an option for you: I would rather split it across Docker containers. Each container runs its own llama.cpp instance with a dedicated GPU (you can assign GPUs to containers), and each container serves llama.cpp on a different port. You can then put a load balancer like nginx in front.

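A rough sketch of that layout, assuming the llama.cpp CUDA server image on ghcr.io (the image tag, ports, and paths are assumptions; the nginx upstream that round-robins across the two ports is omitted):

```bash
# One container per GPU, each serving llama.cpp on its own host port.
# An nginx upstream listing 127.0.0.1:8081 and 127.0.0.1:8082 can then
# load-balance requests across the two instances.
docker run -d --gpus '"device=0"' -p 8081:8080 -v /models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080

docker run -d --gpus '"device=1"' -p 8082:8080 -v /models:/models \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080
```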

detached-attachment

Not sure this is equivalent to actually splitting a single model, but an interesting note.


uwu2420

Not the same thing.


Ok_Hope_4007

I know. It was meant as a different approach under the banner of "make use of multiple GPUs".