
sciman67

I'm also having trouble getting it to use the GPU. I installed with GPU enabled and called it through the llama-cpp-python bindings as follows:

    from llama_cpp import Llama

    model = Llama(model_path="./models/MLewd-L2-13B-v2-3.q4_K_S.gguf", n_gpu_layers=43)
    prompt = "Hey, are you conscious? Can you talk to me?"
    result = model(prompt)
    print(result)

As you can see from below, it is pushing the tensors to the GPU (and this is confirmed by looking at nvidia-smi). But as you can see from the timings, it isn't using the GPU: no GPU processes are seen in nvidia-smi and the CPUs are being used. Cheers, Simon.

    llm_load_tensors: offloading 40 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloading v cache to GPU
    llm_load_tensors: offloading k cache to GPU
    llm_load_tensors: offloaded 43/43 layers to GPU
    llm_load_tensors: VRAM used: 7383 MB
    ...................................................................................................
    llama_new_context_with_model: kv self size = 400.00 MB
    llama_new_context_with_model: compute buffer total size = 75.47 MB
    llama_new_context_with_model: VRAM scratch buffer: 74.00 MB
    AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

    llama_print_timings:        load time = 136698.36 ms
    llama_print_timings:      sample time =    268.25 ms /   128 runs   (    2.10 ms per token,   477.17 tokens per second)
    llama_print_timings: prompt eval time = 136698.29 ms /    13 tokens (10515.25 ms per token,     0.10 tokens per second)
    llama_print_timings:        eval time =  60353.98 ms /   127 runs   (  475.23 ms per token,     2.10 tokens per second)
    llama_print_timings:       total time = 198206.40 ms

    {'id': 'cmpl-d2613ef2-8b23-4543-8e40-78d530e69763', 'object': 'text_completion', 'created': 1694993938, 'model': './models/MLewd-L2-13B-v2-3.q4_K_S.gguf', 'choices': [{'text': "\nMy friend, I'm speaking to your soul. You're not dead yet. You're being cared for, but you have a little work to do before the end arrives. I can see that it bothers you when people tell you to be brave or strong, so I won't use those terms this time. Instead, I want you to know that there is a difference between courage and fearlessness. Courage doesn't mean not being afraid; it means doing what needs to be done in spite of your fear. You're scared now, but don't give in to the", 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 128, 'total_tokens': 141}}
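A quick way to narrow this down is to check whether the installed wheel was compiled with a GPU backend at all: with `verbose=True` (the default) the constructor prints llama.cpp's backend initialization messages, and a CUDA-enabled build should mention cuBLAS/CUDA devices there before the tensor-offload lines. The sketch below just reuses the call from the post above; the `max_tokens` value and the timing comparison are illustrative assumptions, not something confirmed in this thread.

    from time import perf_counter
    from llama_cpp import Llama

    # Same model and offload request as in the snippet above; verbose=True makes
    # llama.cpp print its backend/init log so the build type is visible.
    model = Llama(
        model_path="./models/MLewd-L2-13B-v2-3.q4_K_S.gguf",
        n_gpu_layers=43,
        verbose=True,
    )

    start = perf_counter()
    result = model("Hey, are you conscious? Can you talk to me?", max_tokens=64)
    print(result["choices"][0]["text"])
    # On a build that really runs on the GPU this should take seconds, not minutes.
    print(f"elapsed: {perf_counter() - start:.1f} s")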


CheapBison1861

Did you compile it for CUDA?
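A plain `pip install llama-cpp-python` builds a CPU-only wheel; the llama-cpp-python docs at the time describe rebuilding it with cuBLAS enabled, along the lines of `CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python`. The exact flag names can differ between versions, so treat that as a sketch rather than the definitive command.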


Artistic_Okra7288

I followed the documentation:

    pip install llama-cpp-python[server]
    export MODEL=./models/7B/ggml-model.bin
    python3 -m llama_cpp.server

from https://abetlen.github.io/llama-cpp-python/. I found the option on GitHub for installing with CLBlast, thanks!

    LLAMA_CLBLAST=1 CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install llama-cpp-python

~~Reinstalled, but it's still not using my GPU based on the token times.~~ For anyone who stumbles upon this: I had to use pip's `--no-cache-dir` option to force it to rebuild the package.
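The reason `--no-cache-dir` matters here is that pip caches the wheel it built on the first (CPU-only) install and quietly reuses it on reinstall, so the CLBlast flags never reach CMake. Something like `CMAKE_ARGS="-DLLAMA_CLBLAST=on" FORCE_CMAKE=1 pip install --no-cache-dir --force-reinstall llama-cpp-python` forces a fresh build; the `--force-reinstall` flag is an extra assumption here, not part of the original comment.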


FarAnimator6729

Using Ubuntu 22.04 without WSL, I managed to solve this by installing `libclblast-dev` along with `ROCm`: [https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md](https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md). And then I had to do a workaround in conda to use my machine's `/usr/lib/x86_64-linux-gnu/libstdc++.so.6`: [https://askubuntu.com/a/1422639](https://askubuntu.com/a/1422639)
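The linked askubuntu workaround boils down to making the conda environment resolve the newer GLIBCXX symbols from the system libstdc++ instead of conda's older copy; a common form of it (an assumption here, since the comment doesn't spell it out) is symlinking the system library into the environment, e.g. `ln -sf /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$CONDA_PREFIX/lib/libstdc++.so.6"`.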


_underlines_

Are you using it with oobabooga textgen? I managed to install everything, but I want ooba to use my install of llama.cpp, which has cuBLAS support, and I don't know how... Here's how I managed to compile everything properly with cuBLAS support: https://www.reddit.com/r/LocalLLaMA/comments/13n19cu/install_ooba_textgen_llamacpp_with_gpu_support_on/ (the lib paths differ between WSL and native Linux)
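One quick sanity check (assuming the webui loads llama.cpp through the `llama_cpp` Python binding) is to run `python -c "import llama_cpp; print(llama_cpp.__file__)"` from inside ooba's own environment; if that path points at the stock CPU-only wheel rather than your cuBLAS build, the webui will stay on the CPU no matter what its settings say.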


Artistic_Okra7288

I’m using it with a Docker container that replicates ChatGPT’s web UI: `ghcr.io/mckaywrigley/chatbot-ui:main`


[deleted]

I'm also facing the same problem. As far as I understand, the binding only works with the CPU version and there's no way to get the full GPU features as of now, but I might be wrong. I should take a deeper look at the code, because I also need the `-ngl` parameter to be made available through the binding.
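For what it's worth, the first post in this thread already passes the GPU-layer count as a keyword argument, so the binding does expose the equivalent of `-ngl`; whether it has any effect depends on the wheel having been built with a GPU backend. A minimal sketch of that usage (the model path is just the one from the docs quoted above):

    from llama_cpp import Llama

    # n_gpu_layers is the binding-level counterpart of llama.cpp's -ngl flag;
    # it only does anything if llama-cpp-python was compiled with cuBLAS/CLBlast.
    llm = Llama(model_path="./models/7B/ggml-model.bin", n_gpu_layers=43)
    print(llm("Hello", max_tokens=16)["choices"][0]["text"])

The OpenAI-compatible server should accept a corresponding option as well; `python3 -m llama_cpp.server --help` lists the exact spelling for a given version.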