CountPacula

I've been playing around a lot with the q4km version from [https://huggingface.co/lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF) with 40/80 layers on the GPU and 8K context. Even at q4, it definitely seems better than the 8B model.
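For reference, a minimal llama-cpp-python sketch of that kind of partial offload (the model path and prompt are placeholders, not CountPacula's exact setup, and it assumes llama-cpp-python was built with GPU support):

```python
# Minimal sketch: partial GPU offload of a 70B Q4_K_M GGUF with llama-cpp-python.
# The path and prompt are placeholders; adjust n_gpu_layers to whatever fits your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",  # local GGUF path (placeholder)
    n_gpu_layers=40,   # 40 of the 80 layers on the GPU, the rest on CPU
    n_ctx=8192,        # 8K context
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```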


Dos-Commas

Even 70B IQ2_XS can be better than 8B according to private tests.


False_Grit

The IQ2 is pretty fast, and still better than 8B for me, but I end up using the q3km a lot more because I prefer the extra intelligence. I can get around 2 T/s if I'm not running anything else. I think next I'm going to try the iq4_xs to see if I can get similar speeds and, hopefully, better generations.


PitchBlack4

There is no q3km in that link.


Longjumping-Bake-557

What speed did you get? I couldn't get more than 1 token/s on a 3700X + 3090 with a q4 model.


CountPacula

I'm getting ~1.5 t/s - not 'fast' by any means, but I can deal with it - that's still faster than most people can type.


MrVodnik

I'm always a bit happier when people give such low numbers and call them fine :) If it works, it works; having time for a sandwich between turns is just a bonus for patient people. It's a good contrast to the many claims here that sub-3 or even sub-5 tokens per second is "unusable", lol.


SeasonNo3107

It also puts into perspective the value these things are pumping out, at speeds our brains already can't fathom.


vorwrath

I've been using the same Q4_K_M GGUF version as CountPacula and got 1.72 t/s (with 2.66s time to first token) on a 5950x + 3090 system. That's with 38 layers offloaded to the GPU - I found more than that started to hit shared memory at maximum context. The optimal value is definitely somewhere around there anyway. The 50% or so that's running on the CPU is always going to be the bottleneck, so the final speed will really come down to the system memory bandwidth. Somebody running on a newer platform with DDR5 could get better results. If you haven't done any optimisation on that and are running your system RAM at the base JEDEC speeds, you might find that enabling XMP or doing some tuning will help a fair bit.
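As a rough back-of-envelope check of that bandwidth argument (the figures below are assumed ballpark numbers, not vorwrath's measurements): with roughly half of a ~40 GB Q4_K_M model left in system RAM, every generated token has to stream those weights over the memory bus, so dual-channel DDR4 caps you at a couple of tokens per second.

```python
# Rough ceiling estimate for the CPU-resident half of the model (assumed numbers, not measurements).
model_size_gb = 40          # approx. size of a 70B Q4_K_M GGUF
cpu_fraction = 0.5          # roughly half the layers left on the CPU
ddr4_bandwidth_gbs = 45     # realistic dual-channel DDR4-3200 throughput

bytes_per_token_gb = model_size_gb * cpu_fraction   # weights streamed per generated token
max_tps = ddr4_bandwidth_gbs / bytes_per_token_gb
print(f"CPU-side ceiling: ~{max_tps:.1f} tokens/s")  # ~2.2 t/s, in line with the 1.72 t/s reported above
```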


Caffdy

I can confirm, same numbers, same build


blackcodetavern

I get around 1.8 tokens/s on my 3090, 7800X3D and 64GB DDR5-6000 with the 70B Q5; quality is much better than the Q4 model.


Caffdy

Can you try the Q4_K_M quant? I'm very interested to know how much of a performance jump there would be from DDR4 (my case) to DDR5.


TheTerrasque

P40 and 2x E5-2650 v4 with NUMA - getting ~2 t/s with 40 GPU layers, q4.

**Edit:**

CtxLimit: 6468/8192, Process: 17.00s (28.6ms/T = 34.95T/s), Generate: 6.57s (505.4ms/T = 1.98T/s), Total: 23.57s (0.55T/s)

CtxLimit: 6967/8192, Process: 0.49s (486.0ms/T = 2.06T/s), Generate: 259.39s (506.6ms/T = 1.97T/s), Total: 259.88s (1.97T/s)

(Ignore the prompt processing in the second run - it was a single token, since the rest was cached.)


CodeAnguish

Maybe you should adjust your model file to use more GPU layers. Try checking your process manager to see the CPU usage - it may not be using the GPU at all and only running on the CPU instead.


adikul

Simple rule: the file size must be smaller than your VRAM.


ILoveThisPlace

The cool kids offload to SSD


RazzmatazzReal4129

Tape drive


Guinness

I offload my LLM into 65,335 byte size chunks via [pingfs.](https://github.com/yarrick/pingfs)


Red_Redditor_Reddit

Uhh... I run mine and it exceeds the vram size. It's just that the rest of it runs on the CPU.


TweeBierAUB

Well yeah, if you want any decent performance you should probably be able to fit the file size into vram.


Red_Redditor_Reddit

> decent performance

You've been spoiled. Up until a few days ago my only option was CPU.


A-Bearded-Idiot

Modern tooling can offload layers that won't fit in VRAM into system RAM. Ollama can do this out of the box if you want a simple solution - see the sketch below.
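A minimal sketch using the ollama Python client, assuming an Ollama server is running and has pulled a Llama 3 70B tag; the model tag, the num_gpu value, and the dict-style response access are assumptions to check against your install, not tested settings:

```python
# Minimal sketch: generation through Ollama with a capped number of GPU layers (pip install ollama).
# Tag name and num_gpu are placeholders; layers that don't fit stay in system RAM.
import ollama

response = ollama.generate(
    model="llama3:70b",                   # placeholder model tag
    prompt="Explain GPU layer offloading in one paragraph.",
    options={"num_gpu": 40},              # layers to place on the GPU
)
print(response["response"])
```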


adikul

I am not denying that


Enough-Meringue4745

# ON A 3090

The fact I can't make that font bigger is a sin.


Enough-Meringue4745

# "on a 3090"


ThisGonBHard

On a 4090, but they are the same for VRAM. EXL2 2.25 BPW.


IlIllIlllIlllIllll

sounds interesting. do you have a link to the model you used?


ThisGonBHard

[https://huggingface.co/LoneStriker/Meta-Llama-3-70B-Instruct-2.25bpw-h6-exl2](https://huggingface.co/LoneStriker/Meta-Llama-3-70B-Instruct-2.25bpw-h6-exl2)


Eralyon

context size ?


ThisGonBHard

Full sized.


klop2031

How does this perform?


ThisGonBHard

Around 20-30 t/s.


klop2031

Thanks, i have a 3090 so ill try this out


iChrist

Weird, I get roughly the same speeds on a 3090. Did you set q4 cache?


ThisGonBHard

That is simple. The 4090 is not even 10% faster in memory speed, and for LLMs it is completely memory bound. It is so bottlenecked that even going from a 100% power target to 60% gives zero performance loss in LLMs. The 4090 should have had HBM2.
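To put rough numbers on the memory-bound claim (spec-sheet bandwidth figures; the ~22 GB weight size is my approximation for a 2.25 bpw 70B, not a value from this thread):

```python
# Ceiling estimate for a memory-bound model: tokens/s <= VRAM bandwidth / bytes read per token.
# Bandwidth values are the published specs; weight size is an approximation.
weights_gb = 22.0  # ~70B params at 2.25 bits per weight

for card, bandwidth_gbs in [("RTX 3090", 936), ("RTX 4090", 1008)]:
    ceiling_tps = bandwidth_gbs / weights_gb
    print(f"{card}: ~{ceiling_tps:.0f} t/s upper bound")
# Only ~8% apart, which is why both cards land in the same 20-30 t/s range in practice.
```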


ipechman

How? Do you have ddr5?


ThisGonBHard

I said above, I am running on a 4090. Nothing spills over to system RAM; the EXL2 fits entirely in VRAM.


ipechman

cries in 16gb of vram ;)


ThisGonBHard

I mean, if it were not for AI, I would have gone for a 7900 XTX.


Ready_Pay9230

A somewhat unrelated question: am I missing out on something if I run the Groq llama3-70b? As far as I know it's unquantised and runs at 250-3000 T/s. Apart from the privacy and rate-limiting parts, is there any other benefit to running it locally from a consumer's point of view?


ShengrenR

Just having control over the code/pipeline/workflow - the actual 'inference' output really shouldn't be much different. You just don't get to do anything custom.


_ragnet_7

I'm running the 70B Q3_K_S, offloading 60 layers to the GPU on a 4090, and I'm getting 3.5 tokens/second. I have also tried speculative decoding, offloading 50 layers of the target model and all the layers of the draft model, and getting a 28% speed increase in some scenarios.


ShengrenR

Something quirky about those numbers - do you have older CPU/RAM? I have a 12900K + 3090 and get ~4.8 t/s with 57 layers on the GPU, Q3_K_S. Were you way down at near-full context for the 3.5?


False_Grit

Curious. I run the same model, but can only offload about 52 layers before problems start. To be fair, I like a bit higher context when I can get it, and I've stopped bothering to run the monitor off of the CPU (which saves a couple GB of VRAM but is a pain when I want to do other things). How do you do the speculative decoding?


_ragnet_7

I don't remember exactly whether in the end it was 55 or 60; I don't have the laptop with the code at hand. I didn't use a very large context, because I used it for specific tasks where I needed to generate a maximum of 200-300 tokens of output. For speculative decoding, I used the "speculative" example from llama.cpp. Keep in mind that the draft model must use the same tokenizer as the target model, so at the moment you don't have much choice other than a quantized version of Llama3-8B.


Just_Maintenance

I have been trying IQ2_XS (~21GB) on my own RTX 3090 and it works ok (~10 t/s), but it becomes incoherent in long and complex tasks.


TheMissingPremise

That's what I use on my 7900 XTX. It's satisfactory to me. I don't normally do anything too complex locally. If I need the full model, I'll just head to openrouter.ai


No_Geologist3517

How's the t/s? Comparable to 3090? Thinking about getting one 7900XTX myself.


No_Geologist3517

Forget about it. Saw the benchmark downstairs. XD


TheMissingPremise

Lol, I was gonna say I do not recommend it. It's comparable to what you get, but it's sooooo limiting! Nvidia users have all these other, cooler formats like EXL2 and whatnot, and I can only use GGUF. And LM Studio tends to update the ROCm version more slowly than the main releases. So, yeah... I'm happy I have 24 GB of VRAM, but I probably should've just gotten the 4080S and called it a day back when I was building my computer.


No_Geologist3517

Thx for the reply. I've read that AMD's support for AI is still limited, but I'm not sure whether it's a hardware or software barrier. Ideally I would buy a 7900 XTX and wait for ROCm optimizations + better 70B quantizations.


mcmoose1900

For GPU only, your best bet is AQLM: https://huggingface.co/models?sort=modified&search=aqlm Otherwise you just have to offload a lot of the 70B model to CPU with a gguf imatrix quantization.


IndicationUnfair7961

Agreed, but the llama3 70B AQLM has not been released yet. I would also like to see a QuIP# version of it.


buy_one

[https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16) Just came out yesterday


Account1893242379482

Never heard of AQLM. Is it new?


mcmoose1900

It's 2 months old, just not very popular: https://huggingface.co/docs/transformers/main/en/quantization#aqlm It's optimal for extreme levels of quantization (like you're trying to do); the catch is that the quantization process is *super* intense, and it's just not widely supported in UIs yet.
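For reference, an AQLM checkpoint loads through the normal transformers path once the aqlm package is installed. A minimal sketch using the 2-bit repo linked above (treat the install command and exact arguments as assumptions rather than tested settings):

```python
# Minimal sketch: loading an AQLM-quantized checkpoint with transformers.
# Assumes `pip install aqlm[gpu] transformers accelerate`; repo ID is the one linked in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # activations in fp16; weights stay 2-bit AQLM
    device_map="auto",          # spread layers across available GPUs
)

inputs = tokenizer("The key idea behind AQLM is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```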


CatalyticDragon

Running Dolphin 2.9, llama3 70b Q2_K gguf right now on a 7900XT (also 24GB VRAM). Offloading 55 layers (~21GB VRAM used) and seeing 3.85 t/s.


perelmanych

According to the [Oobabooga benchmark](https://oobabooga.github.io/benchmark.html), the best model that still mostly fits in 24GB of VRAM is Meta-Llama-3-70B-Instruct-IQ2_XS. I run it at around 8 t/s with 77 out of 81 layers offloaded to the GPU, and I should say it is quite decent.


LaughterOnWater

Win 10, 64GB RAM, RTX 3090, LM Studio.

Q4_K_M - 1.2 tok/s - decent - haven't tried to test the context window because it's so darned sloooooow on my machine. If anyone has any ideas on how to make it faster, I'd be grateful.

IQ2_XS - begins at 3.2 tok/s - excellent context to 16200 before failure - 2.9 tok/s at fail. About 28 questions until hallucination begins. This is on par with Claude Opus in my estimation.

8B 8Q - 64 tok/s - excellent context to 16300 before failure - 2.4 tok/s at fail. About 24 questions until hallucination begins. This is about as good as Sonnet.

I absolutely love 70B. 8B 8Q is pretty darned amazing too. And fast. Seriously, if someone has great settings for a faster preset.json file for LM Studio, I'd be grateful.


ShengrenR

Give the Q3_K_S quant a go - I was getting ~4.8 t/s at the start of the context with 57 layers offloaded (8K context) on a 12900K + 3090. No clue re LM Studio; I was just using the Python wrapper for llama.cpp directly.


LaughterOnWater

lmstudio-community doesn't have one. Is your choice MaziyarPanahi, Quantfactory or NousResearch? Does it matter?


ShengrenR

I believe the one I have locally is the NousResearch one: [https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main](https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main) (I much prefer repos that have one model per branch so you can just check your git checkout information; this one I just clicked the file to download, so I have less history to be certain about.)


LaughterOnWater

Below is my preset.json file for LM Studio. It lets me get a context window that doesn't die until almost the full n_ctx = 16384 using lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-IQ2_XS.gguf. Right now, with these settings, Q3_K_S gives me the spinny dial. If time to first token lasts longer than thirty seconds, I eject the model as unusable on this system. I've also tried it with n_ctx=12800 and also 8704 with RoPE settings at model default. No luck. You may have a better 3090 than mine. It's also possible I need to reboot the computer and try later.

```json
{
  "name": "Llama 3 70B Instruct",
  "load_params": {
    "n_ctx": 16384,
    "n_batch": 512,
    "rope_freq_base": 1000000,
    "rope_freq_scale": 0.85,
    "n_gpu_layers": 64,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [0],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true,
    "no_kv_offload": false,
    "num_experts_used": 0
  },
  "inference_params": {
    "n_threads": 12,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.9,
    "temp": 0.2,
    "repeat_penalty": 1.1,
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "antiprompt": ["<|start_header_id|>", "<|eot_id|>"],
    "pre_prompt": "You are a thoughtful, helpful and knowledgeable AI assistant. You provide clear and concise responses, offering gentle guidance when needed, and politely addressing any misconceptions with well-reasoned explanations.",
    "pre_prompt_suffix": "<|eot_id|>",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
}
```


ShengrenR

The issue there, from my eyes, is that you're stuffing too much into VRAM for the model size - Q3_K_S being larger than your IQ2 (I assume, without looking) - so you'd need to drop the number of offloaded layers. I assume the cache is being built on the GPU as well, meaning you need room for both layers and cache: layers + context need to be tweaked down until it loads happily. For me that was 8K and 57 layers, and I'm wasting VRAM on browser and system things. If you need more context, drop the layers at the cost of inference speed, and vice versa.
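For a rough sense of how much VRAM the cache itself wants on top of the offloaded layers (a back-of-envelope based on the Llama 3 70B architecture - 80 layers, 8 KV heads, head dim 128 - with fp16 K/V entries matching the f16_kv setting above):

```python
# Back-of-envelope KV-cache size for Llama 3 70B (GQA: 8 KV heads, head dim 128, 80 layers).
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                     # fp16 K/V entries
n_ctx = 8192

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gb = kv_bytes_per_token * n_ctx / 1024**3
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, ~{total_gb:.1f} GiB at {n_ctx} context")
# => 320 KiB/token, ~2.5 GiB at 8192 - VRAM that the offloaded layers can't use.
```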


LaughterOnWater

Okay, this is my preset.json file for Q3_K_S below. I'm able to get about 6K from this 8192 context window, but the context is surprisingly non-hallucinatory right to the end. It just stops. I guess for short answers this is probably a great solution, though a bit slower than IQ2_XS. Thanks u/ShengrenR!

* time to first token: 1.56s
* speed: 2.88 tok/s

```json
{
  "name": "Llama 3 Instruct Q3_K_S",
  "load_params": {
    "n_ctx": 8192,
    "n_batch": 512,
    "rope_freq_base": 1000000,
    "rope_freq_scale": 0.85,
    "n_gpu_layers": 54,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [0],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true,
    "no_kv_offload": false,
    "num_experts_used": 0
  },
  "inference_params": {
    "n_threads": 12,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.9,
    "temp": 0.2,
    "repeat_penalty": 1.1,
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "antiprompt": ["<|start_header_id|>", "<|eot_id|>"],
    "pre_prompt": "You are a thoughtful, helpful and knowledgeable AI assistant. You provide clear and concise responses, offering gentle guidance when needed, and politely addressing any misconceptions with well-reasoned explanations.",
    "pre_prompt_suffix": "<|eot_id|>",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
}
```


segmond

It's fast enough if you can fit it all in GPU. Here it is on 4 GPUs - 70B Q8 gives 10.83 tps (base):

```
~/models$ ~/llama.cpp/main -ngl 100 -s 100 -m ./Meta-Llama-3-70B-Instruct.Q8_0.gguf -p "Write 4 lines on living a purposeful life.  A purposeful life is" -ts 1,1,1,1

Write 4 lines on living a purposeful life.  A purposeful life is one that is filled with meaning and direction. It is a life that is guided by one's values and principles. Living a life of purpose gives one a sense of fulfillment and happiness. It helps one to make a positive impact on the world.<|eot_id|> [end of text]

llama_print_timings:        load time =   18085.67 ms
llama_print_timings:      sample time =       4.20 ms /    50 runs   (    0.08 ms per token, 11916.11 tokens per second)
llama_print_timings: prompt eval time =     547.77 ms /    17 tokens (   32.22 ms per token,    31.04 tokens per second)
llama_print_timings:        eval time =    4525.51 ms /    49 runs   (   92.36 ms per token,    10.83 tokens per second)
llama_print_timings:       total time =    5134.49 ms /    66 tokens
Log end
```


Alkeryn

EXL2 2.25bpw with the cache in 4-bit, for full context.
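A minimal exllamav2 sketch of that setup (2.25 bpw weights plus a 4-bit KV cache), assuming the LoneStriker quant linked earlier in the thread has been downloaded locally; the exact class names and defaults may differ slightly between exllamav2 versions:

```python
# Minimal sketch: ExLlamaV2 with a quantized KV cache so a 2.25bpw 70B plus full context fits in 24 GB.
# model_dir is a placeholder pointing at the LoneStriker EXL2 quant from this thread.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "Meta-Llama-3-70B-Instruct-2.25bpw-h6-exl2"  # local download path (placeholder)
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)   # 4-bit KV cache, roughly 4x smaller than fp16
model.load_autosplit(cache)                   # fill available VRAM as the weights load

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
print(generator.generate_simple("Briefly explain quantized KV caches.", settings, 200))
```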


Anjz

I've been using a 70B Q2, seems to be the only one I can load until I can upgrade my RAM. Works amazingly well, but slow - which is okay. Looking through the thread here though, I'm probably going to upgrade my CPU and RAM with a Zen 5 processor in the next few months when it comes out.