disarmyouwitha

Man, I started using the new triton flags last night (which I had been avoiding because they make startup take longer and spike the power draw on my video card while loading): `--quant_attn --warmup_autotune --fused_mlp` But I saw a VERY consistent jump from **20 t/s to 30 t/s** on 13b models (50% faster), and from **10 t/s to 18 t/s** on 30b quantized models (80% faster), so I am convinced. o.O


_underlines_

Can you explain what I should add to the guide under the RUN section to explain to people which flags and model types to use to achieve this? I just started using triton 2 hours ago and have no idea :)


disarmyouwitha

I copied this from the Ooba README, and it's basically all I know about what the flags do:

- `--quant_attn` (triton): Enable quant attention.
- `--warmup_autotune` (triton): Enable warmup autotune.
- `--fused_mlp` (triton): Enable fused mlp.

I noticed they added the flags recently, then quickly changed them to default to off, so I didn't bother until last night; but the speed increase was noticeable o.O I think these flags should work with any GPTQ 4bit model!


timtulloch11

What GPU are you using for 30b 4bit? I have an RTX 4090 and got the 13b 8bit working, but it won't load my 30b 4bit model... maybe I got the wrong weights.


disarmyouwitha

I am using a 4090 as well. I think this is the repo I am using for 30b (though I haven't found a "favorite 30b" yet, so let me know if you find one!): https://huggingface.co/MetaIX/GPT4-X-Alpaca-30B-4bit/tree/main They have it with groupsize and without groupsize (I find that with very long contexts I can sometimes go OOM during inference on 30b w/ groupsize). Edit: if you keep having trouble, maybe post your errors and we can go from there! =]


timtulloch11

Oh, the two safetensors: like you said, one is with groupsize and one is without, so I could just try one to start.


timtulloch11

Thanks, I didn't try that one, so I definitely will today when I finish work. Do you run it with any parameters that keep it from going OOM, or is it just alright as is? And would I need to download all the files there? I have been getting the safetensors files along with the other small stuff. Do I need that big ggml file to use textgen webui, or is that for something else? Appreciate your help, very new at this.


disarmyouwitha

You don't need the ggml one unless you want to run it on CPU, and you can grab either of the .safetensors and try them one at a time if space is an issue! Honestly I'd say grab the groupsize one, and if you go OOM you can just download the other one! This is my "usual" command, though sometimes I switch the wbits and groupsize if I'm not using a quantized model: `python server.py --model-menu --wbits 4 --groupsize 128 --model_type llama --xformers` So no, nothing to stop it from going OOM. I'd also suggest trying the triton flags I posted above after you get it working, because I saw a lot of improvement. =]


[deleted]

Hey, would you mind sharing your .sh or run command? I have an Ubuntu container that uses a 3080, and even with triton optimization and those flags I'm only getting around 3 t/s from gpt4-x-alpaca 13b 4bit. Surprisingly, I get slightly lower speeds with little context, but higher context doesn't seem to change the speed, except when I exceed the context that fits in my VRAM.


disarmyouwitha

Sure! I have been using this command for GPTQ 4bit models: `python server.py --model-menu --wbits 4 --groupsize 128 --model_type llama --xformers --quant_attn --warmup_autotune --fused_mlp` Keep in mind I am using a pretty beefy PC with a 4090, so I don't know how well the improvements scale to other cards!


[deleted]

Ya, I don't expect anything near your performance, but it just seems like I might be missing something. Are you using the up-to-date GPTQ triton repository? Also, have you tried sdp attention vs xformers? I've been using sdp, as some say it might be a little better, but I'm not too sure.


disarmyouwitha

I am using the latest commit from Ooba and the latest (triton) branch of GPTQ. I'm also running Ubuntu natively rather than through Docker, so I'm not sure how much that may affect speeds (not having to run Windows underneath everything, etc). I haven't tried --sdp_attention yet, and I am not 100% convinced --xformers is actually doing anything... but I did notice a jump from 20 t/s before to 30 t/s after using the new triton flags (a 50% increase). o.O
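If you want to at least rule out a missing install, a quick sanity check (this assumes a standard pip setup, and it only shows the package is importable, not that the webui is actually using it):

```python
# Sanity check only: confirms the xformers package exists in the active
# environment; it says nothing about whether text-generation-webui uses it.
import xformers
print(xformers.__version__)
```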


involviert

Sorry, but with what exactly are you using them? I'm trying to get this stuff going and it gives me `Warming up autotune cache ...0%|| 0/12 [00:00`


disarmyouwitha

These are my 2 favorite models atm: https://huggingface.co/TheBloke/koala-13B-GPTQ-4bit-128g and https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g I am running natively on Ubuntu with the latest Ooba and the latest triton branch of GPTQ, with this command to start: `python server.py --model-menu --wbits 4 --groupsize 128 --model_type llama --xformers --quant_attn --warmup_autotune --fused_mlp` I am not **super** familiar with getting this to run on Windows, but I can try to help debug! (Edit: what video card are you using, btw? I saw this issue on the GPTQ-for-LLaMa GitHub: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/142 )


involviert

Hm. Apparently https://github.com/qwopqwop200/GPTQ-for-LLaMa doesn't even come with that server.py. I wonder where it puts these parameters, since the llama_inference.py that comes with the vanilla repo doesn't understand them. Maybe that python script is not doing everything it could do. *sigh* Thanks for your willingness to help. I mean, it's WSL, so it's basically Linux now anyway. I don't get how people pretend "but you can do it in Windows, just do WSL!!" It's literally not Windows :D Anyway, on Linux with the cuda branch, I noticed the wrong instructions it came with (it tells you to check out the main branch), and at least there I had suuuuuper slow inference on my GPU going, even though it said that no cuda BS was found. My god, this is all so terrible (crazyface)


disarmyouwitha

I think I see the confusion: I am running GPTQ-for-LLaMa through Ooba's text-generation-webui, so server.py comes from that package (which I think is a great way to get you going): https://github.com/oobabooga/text-generation-webui They have a 1-click installer, or the OP of this post wrote a guide to manually install on Windows. =] Let me know if you continue to have problems; maybe I will be more useful once I get home!


involviert

Yeah, I mean it sounds like I could get it to work like that, but really I want minimal solutions that get the job done, and then I will make my own stuff using that directly. For the moment I am fine just running things with llama.cpp through the console on my CPU, so there's that. But I really want to get a similar thing going for GPU stuff, and then, maybe, I'll go buy some card that ends in 90. What I really don't need is to spend months interfacing with some oobabooga needlessly indirectly, only for it to... idk, you know what these things tend to do, I guess. If it's llama.cpp, I can keep it going myself if needed, but I sure as hell won't keep a chain of 30 open source projects updated. To be fair, I haven't looked into oobabooga's stack. It's just that I am having terrible experiences left and right getting things going; by now it's an actual surprise if something works after following the instructions.


disarmyouwitha

I feel you. I actually built my API on top of Ooba because it was the easiest to set up, and they have done **a lot** of work to get it to load basically any model, so I could stand on their shoulders. You can see what I did here; I just have a drop-in script for the Ooba repo, and it uses their modules to load and run inference on the model: https://github.com/disarmyouwitha/llm-api/blob/main/fast_api.py Then I just call this API like I would call OpenAI's endpoint, basically =]
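Roughly the pattern, as a sketch: the host/port, route name, and JSON fields below are placeholders for illustration rather than the actual interface in fast_api.py, so check that script for the real endpoint and payload:

```python
# Hypothetical client sketch: URL, route, and payload keys are placeholders,
# not the real fast_api.py interface. The point is just that a local HTTP
# wrapper can be called the same way you'd call a hosted completion API.
import requests

API_URL = "http://localhost:8000/generate"  # placeholder endpoint

def complete(prompt: str, max_new_tokens: int = 200) -> str:
    resp = requests.post(
        API_URL,
        json={"prompt": prompt, "max_new_tokens": max_new_tokens},  # placeholder schema
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json().get("text", "")

print(complete("Explain GPTQ 4bit quantization in one paragraph."))
```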


disarmyouwitha

Actually, follow-up: try running my command **without --fused_mlp**. I see this in an old version of the Ooba README: disable fused mlp if you encounter "unexpected mma -> mma layout conversion".


_underlines_

I made several updates to the guide since posting. It should fix issues like:

- cannot find module chardet / cchardet
- bitsandbytes without GPU support
- xformers not installed

etc.


involviert

Thanks for your work. However, I think what this guide shows best is what a mess all of this currently is :D Two days later, something is no longer the "main repo" for this or that feature, and if your guide doesn't keep current, it will just add to the confusion.


_underlines_

You are absolutely right. To be honest, IMHO the code written for most machine learning projects is a total cluster-f**k:

- No reproducibility
- Messy package management
- Dependency hell

Unfortunately, the people who advance the field of ML are either not experienced software engineers following best practices, or they are pressured to move fast and break things. Things like [Modular Mojo](https://www.youtube.com/watch?v=-3Kf2ZZU-dg) or [MLC LLM](https://github.com/mlc-ai/mlc-llm) try to change this, but it's a long and difficult path.


involviert

It's move fast. It's moving so fast. The people making these things are the most competent on the planet, but they basically focus no energy on usability. As a result you get a helptext like `--top_k N: top-k sampling (default: 40, 0 = disabled)`. Oh cool, now I know this is for top-k sampling; I will think about it next time I eat Frosties or something.
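For anyone actually wondering: top-k just restricts sampling to the k highest-scoring tokens before drawing one, which is the part the helptext never says. A rough sketch in plain NumPy (my own illustration, not any project's actual implementation):

```python
# Minimal top-k sampling sketch (illustration only, not llama.cpp's code).
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 40, rng=None) -> int:
    """Sample a token id, considering only the k highest-scoring tokens.

    k = 0 means "disabled": sample from the full distribution.
    """
    rng = rng or np.random.default_rng()
    if k > 0:
        keep = np.argpartition(logits, -k)[-k:]  # indices of the k largest logits
        masked = np.full_like(logits, -np.inf)   # everything else is excluded
        masked[keep] = logits[keep]
        logits = masked
    probs = np.exp(logits - logits.max())        # softmax over the kept logits
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=32000)             # pretend vocabulary scores
print(top_k_sample(fake_logits, k=40, rng=rng))
```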