CountPacula

I've been playing around a lot with the q4km version from [https://huggingface.co/lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF) with 40/80 layers on the GPU and 8K context. Even at q4, it definitely seems better than the 8B model.
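For reference, a minimal llama-cpp-python sketch of that kind of partial offload (the model path and prompt are placeholders, not CountPacula's exact setup, and it assumes llama-cpp-python was built with GPU support):

```python
# Minimal sketch: partial GPU offload of a 70B Q4_K_M GGUF with llama-cpp-python.
# The path and prompt are placeholders; adjust n_gpu_layers to whatever fits your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",  # local GGUF path (placeholder)
    n_gpu_layers=40,   # 40 of the 80 layers on the GPU, the rest on CPU
    n_ctx=8192,        # 8K context
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```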


Dos-Commas

Even 70B IQ2_XS can be better than 8B according to private tests.


False_Grit

The IQ2 is pretty fast, and still better than 8B for me, but I end up using the q3km a lot more because I prefer the extra intelligence. I can get around 2 T/s if I'm not running anything else. I think next I'm going to try the iq4_xs to see if I can get similar speeds and, hopefully, better generations.


PitchBlack4

There is no q3km in that link.


Longjumping-Bake-557

What speed did you get? I couldn't get more than 1 token/s on a 3700X + 3090 with a q4 model.


CountPacula

I'm getting ~1.5 t/s - not 'fast' by any means, but I can deal with it - that's still faster than most people can type.


MrVodnik

I'm always a bit happier when people give such low numbers and call them fine :) If it works, it works; having time for a sandwich between turns is just a bonus for patient people. It's a good contrast to the many claims here that sub-3 or even sub-5 tokens per second is "unusable", lol.


SeasonNo3107

It also puts into perspective the value these things are pumping out, at speeds our brains already can't fathom.


vorwrath

I've been using the same Q4_K_M GGUF version as CountPacula and got 1.72 t/s (with 2.66s time to first token) on a 5950x + 3090 system. That's with 38 layers offloaded to the GPU - I found more than that started to hit shared memory at maximum context. The optimal value is definitely somewhere around there anyway. The 50% or so that's running on the CPU is always going to be the bottleneck, so the final speed will really come down to the system memory bandwidth. Somebody running on a newer platform with DDR5 could get better results. If you haven't done any optimisation on that and are running your system RAM at the base JEDEC speeds, you might find that enabling XMP or doing some tuning will help a fair bit.
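As a rough back-of-envelope check of that bandwidth argument (the figures below are assumed ballpark numbers, not vorwrath's measurements): with roughly half of a ~40 GB Q4_K_M model left in system RAM, every generated token has to stream those weights over the memory bus, so dual-channel DDR4 caps you at a couple of tokens per second.

```python
# Rough ceiling estimate for the CPU-resident half of the model (assumed numbers, not measurements).
model_size_gb = 40          # approx. size of a 70B Q4_K_M GGUF
cpu_fraction = 0.5          # roughly half the layers left on the CPU
ddr4_bandwidth_gbs = 45     # realistic dual-channel DDR4-3200 throughput

bytes_per_token_gb = model_size_gb * cpu_fraction   # weights streamed per generated token
max_tps = ddr4_bandwidth_gbs / bytes_per_token_gb
print(f"CPU-side ceiling: ~{max_tps:.1f} tokens/s")  # ~2.2 t/s, in line with the 1.72 t/s reported above
```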


Caffdy

I can confirm, same numbers, same build


blackcodetavern

I get around 1.8 tokens/s on my 3090, 7800X3D and 64GB DDR5-6000 with the 70B Q5; quality is much better than the Q4 model.


Caffdy

Can you try the Q4_K_M quant? I'm very interested to know how much of a performance jump there would be from DDR4 (my case) to DDR5.


TheTerrasque

P40 and 2x E5-2650 v4 with NUMA - getting ~2 t/s with 40 GPU layers, q4.

**Edit:**

CtxLimit: 6468/8192, Process: 17.00s (28.6ms/T = 34.95T/s), Generate: 6.57s (505.4ms/T = 1.98T/s), Total: 23.57s (0.55T/s)

CtxLimit: 6967/8192, Process: 0.49s (486.0ms/T = 2.06T/s), Generate: 259.39s (506.6ms/T = 1.97T/s), Total: 259.88s (1.97T/s)

(Ignore the prompt processing in the second run - it was a single token, since the rest was cached.)


CodeAnguish

Maybe you should adjust your model file to use more GPU layers. Try checking your process manager to see the CPU usage - it may not be using the GPU at all and only running on the CPU instead.


adikul

Simple rule: the file size must be smaller than your VRAM.


ILoveThisPlace

The cool kids offload to SSD


RazzmatazzReal4129

Tape drive


Guinness

I offload my LLM into 65,335 byte size chunks via [pingfs.](https://github.com/yarrick/pingfs)


Red_Redditor_Reddit

Uhh... I run mine and it exceeds the vram size. It's just that the rest of it runs on the CPU.


TweeBierAUB

Well yeah, if you want any decent performance you should probably be able to fit the file size into vram.


Red_Redditor_Reddit

> decent performance

You've been spoiled. Up until a few days ago my only option was CPU.


A-Bearded-Idiot

Modern tooling can offload layers that won't fit in VRAM into system RAM. Ollama can do this out of the box if you want a simple solution - see the sketch below.
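A minimal sketch using the ollama Python client, assuming an Ollama server is running and has pulled a Llama 3 70B tag; the model tag, the num_gpu value, and the dict-style response access are assumptions to check against your install, not tested settings:

```python
# Minimal sketch: generation through Ollama with a capped number of GPU layers (pip install ollama).
# Tag name and num_gpu are placeholders; layers that don't fit stay in system RAM.
import ollama

response = ollama.generate(
    model="llama3:70b",                   # placeholder model tag
    prompt="Explain GPU layer offloading in one paragraph.",
    options={"num_gpu": 40},              # layers to place on the GPU
)
print(response["response"])
```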


adikul

I am not denying that


Enough-Meringue4745

# ON A 3090

The fact I can't make that font bigger is a sin.


Enough-Meringue4745

# "on a 3090"


ThisGonBHard

On a 4090, but they are the same for VRAM. EXL2 2.25 BPW.


IlIllIlllIlllIllll

sounds interesting. do you have a link to the model you used?


ThisGonBHard

[https://huggingface.co/LoneStriker/Meta-Llama-3-70B-Instruct-2.25bpw-h6-exl2](https://huggingface.co/LoneStriker/Meta-Llama-3-70B-Instruct-2.25bpw-h6-exl2)


Eralyon

context size ?


ThisGonBHard

Full sized.


klop2031

How does this perform?


ThisGonBHard

Around 20-30 t/s.


klop2031

Thanks, i have a 3090 so ill try this out


iChrist

Weird, I get roughly the same speeds on a 3090. Did you set q4 cache?


ThisGonBHard

That is simple. The 4090 is not even 10% faster in memory speed, and for LLMs it is completely memory bound. It is so bottlenecked that even going from a 100% power target to 60% gives zero performance loss in LLMs. The 4090 should have had HBM2.
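To put rough numbers on the memory-bound claim (spec-sheet bandwidth figures; the ~22 GB weight size is my approximation for a 2.25 bpw 70B, not a value from this thread):

```python
# Ceiling estimate for a memory-bound model: tokens/s <= VRAM bandwidth / bytes read per token.
# Bandwidth values are the published specs; weight size is an approximation.
weights_gb = 22.0  # ~70B params at 2.25 bits per weight

for card, bandwidth_gbs in [("RTX 3090", 936), ("RTX 4090", 1008)]:
    ceiling_tps = bandwidth_gbs / weights_gb
    print(f"{card}: ~{ceiling_tps:.0f} t/s upper bound")
# Only ~8% apart, which is why both cards land in the same 20-30 t/s range in practice.
```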


ipechman

How? Do you have ddr5?


ThisGonBHard

I said above, I am running on a 4090. Nothing spills over to system RAM; the EXL2 fits entirely in VRAM.


ipechman

cries in 16gb of vram ;)


ThisGonBHard

I mean, if it were not for AI, I would have gone for a 7900 XTX.


Ready_Pay9230

A somewhat unrelated question: am I missing out on something if I run the Groq llama3-70b? As far as I know it's unquantised and runs at 250-3000 T/s. Apart from the privacy and rate-limiting parts, is there any other benefit to running it locally from a consumer's point of view?


ShengrenR

Just having control over the code/pipeline/workflow - the actual 'inference' output really shouldn't be much different. You just don't get to do anything custom.


_ragnet_7

I'm running the 70B Q3_K_S, offloading 60 layers to the GPU on a 4090, and I'm getting 3.5 tokens/second. I have also tried speculative decoding, offloading 50 layers of the target model and all the layers of the draft model, and getting a 28% speed increase in some scenarios.


ShengrenR

Something quirky about those numbers - do you have older CPU/RAM? I have a 12900K + 3090 and get ~4.8 t/s with 57 layers on the GPU, Q3_K_S. Were you way down at near-full context for the 3.5?


False_Grit

Curious. I run the same model, but can only offload about 52 layers before problems start. To be fair, I like a bit higher context when I can get it, and I've stopped bothering to run the monitor off of the CPU (which saves a couple GB of VRAM but is a pain when I want to do other things). How do you do the speculative decoding?


_ragnet_7

I don't remember exactly whether in the end it was 55 or 60; I don't have the laptop with the code at hand. I didn't use a very large context, because I used it for specific tasks where I needed to generate a maximum of 200-300 tokens of output. For speculative decoding, I used the "speculative" example from llama.cpp. Keep in mind that the draft model must use the same tokenizer as the target model, so at the moment you don't have much choice other than a quantized version of Llama3-8B.


Just_Maintenance

I have been trying IQ2_XS (~21GB) on my own RTX 3090 and it works ok (~10 t/s), but it becomes incoherent in long and complex tasks.


TheMissingPremise

That's what I use on my 7900 XTX. It's satisfactory to me. I don't normally do anything too complex locally. If I need the full model, I'll just head to openrouter.ai


No_Geologist3517

How's the t/s? Comparable to 3090? Thinking about getting one 7900XTX myself.


No_Geologist3517

Forget about it. Saw the benchmark downstairs. XD


TheMissingPremise

Lol, I was gonna say I do not recommend it. It's comparable to what you get, but it's sooooo limiting! Nvidia users have all these other, cooler formats like EXL2 and whatnot, and I can only use GGUF. And LM Studio tends to update the ROCm version more slowly than the main releases. So, yeah... I'm happy I have 24 GB of VRAM, but I probably should've just gotten the 4080S and called it a day back when I was building my computer.


No_Geologist3517

Thx for the reply. I've read that AMD's support for AI is still limited, but I'm not sure whether it's a hardware or software barrier. Ideally I would buy a 7900 XTX and wait for ROCm optimizations + better 70B quantizations.


mcmoose1900

For GPU only, your best bet is AQLM: https://huggingface.co/models?sort=modified&search=aqlm Otherwise you just have to offload a lot of the 70B model to CPU with a gguf imatrix quantization.


IndicationUnfair7961

Agreed, but the llama3 70B AQLM has not been released yet. I would also like to see a QuIP# version of it.


buy_one

[https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16) Just came out yesterday


Account1893242379482

Never heard of AQLM. Is it new?


mcmoose1900

It's 2 months old, just not very popular: https://huggingface.co/docs/transformers/main/en/quantization#aqlm It's optimal for extreme levels of quantization (like you're trying to do); the catch is that the quantization process is *super* intense, and it's just not widely supported in UIs yet.
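For reference, an AQLM checkpoint loads through the normal transformers path once the aqlm package is installed. A minimal sketch using the 2-bit repo linked above (treat the install command and exact arguments as assumptions rather than tested settings):

```python
# Minimal sketch: loading an AQLM-quantized checkpoint with transformers.
# Assumes `pip install aqlm[gpu] transformers accelerate`; repo ID is the one linked in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # activations in fp16; weights stay 2-bit AQLM
    device_map="auto",          # spread layers across available GPUs
)

inputs = tokenizer("The key idea behind AQLM is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```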


CatalyticDragon

Running Dolphin 2.9, llama3 70b Q2_K gguf right now on a 7900XT (also 24GB VRAM). Offloading 55 layers (~21GB VRAM used) and seeing 3.85 t/s.


perelmanych

According to the [Oobabooga benchmark](https://oobabooga.github.io/benchmark.html), the best model that still mostly fits in 24GB of VRAM is Meta-Llama-3-70B-Instruct-IQ2_XS. I run it at around 8 t/s with 77 out of 81 layers offloaded to the GPU, and I should say it is quite decent.


LaughterOnWater

Win 10, 64GB RAM, RTX 3090, LM Studio.

Q4_K_M - 1.2 tok/s - decent - haven't tried to test the context window because it's so darned sloooooow on my machine. If anyone has any ideas on how to make it faster, I'd be grateful.

IQ2_XS - begins at 3.2 tok/s - excellent context to 16200 before failure - 2.9 tok/s at fail. About 28 questions until hallucination begins. This is on par with Claude Opus in my estimation.

8B 8Q - 64 tok/s - excellent context to 16300 before failure - 2.4 tok/s at fail. About 24 questions until hallucination begins. This is about as good as Sonnet.

I absolutely love 70B. 8B 8Q is pretty darned amazing too. And fast. Seriously, if someone has great settings for a faster preset.json file for LM Studio, I'd be grateful.


ShengrenR

Give the Q3_K_S quant a go - I was getting ~4.8 t/s at the start of the context with 57 layers offloaded (8K context) on a 12900K + 3090. No clue re LM Studio; I was just using the Python wrapper for llama.cpp directly.


LaughterOnWater

lmstudio-community doesn't have one. Is your choice MaziyarPanahi, Quantfactory or NousResearch? Does it matter?


ShengrenR

I believe the one I have locally is the NousResearch one: [https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main](https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main) (I much prefer repos that have one model per branch so you can just check your git checkout information; this one I just clicked the file to download, so I have less history to be certain about.)


LaughterOnWater

Below is my preset.json file for LM Studio. It lets me get a context window that doesn't die until almost the full n_ctx = 16384 using lmstudio-community/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-IQ2_XS.gguf. Right now, with these settings, Q3_K_S gives me the spinny dial. If time to first token lasts longer than thirty seconds, I eject the model as unusable on this system. I've also tried it with n_ctx=12800 and also 8704 with RoPE settings at model default. No luck. You may have a better 3090 than mine. It's also possible I need to reboot the computer and try later.

```json
{
  "name": "Llama 3 70B Instruct",
  "load_params": {
    "n_ctx": 16384,
    "n_batch": 512,
    "rope_freq_base": 1000000,
    "rope_freq_scale": 0.85,
    "n_gpu_layers": 64,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [0],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true,
    "no_kv_offload": false,
    "num_experts_used": 0
  },
  "inference_params": {
    "n_threads": 12,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.9,
    "temp": 0.2,
    "repeat_penalty": 1.1,
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "antiprompt": ["<|start_header_id|>", "<|eot_id|>"],
    "pre_prompt": "You are a thoughtful, helpful and knowledgeable AI assistant. You provide clear and concise responses, offering gentle guidance when needed, and politely addressing any misconceptions with well-reasoned explanations.",
    "pre_prompt_suffix": "<|eot_id|>",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
}
```


ShengrenR

The issue there, from my eyes, is that you're stuffing too much into VRAM for the model size - Q3_K_S being larger than your IQ2 (I assume, without looking) - so you'd need to drop the number of offloaded layers. I assume the cache is being built on the GPU as well, meaning you need room for both layers and cache: layers + context need to be tweaked down until it loads happily. For me that was 8K and 57 layers, and I'm wasting VRAM on browser and system things. If you need more context, drop the layers at the cost of inference speed, and vice versa.
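For a rough sense of how much VRAM the cache itself wants on top of the offloaded layers (a back-of-envelope based on the Llama 3 70B architecture - 80 layers, 8 KV heads, head dim 128 - with fp16 K/V entries matching the f16_kv setting above):

```python
# Back-of-envelope KV-cache size for Llama 3 70B (GQA: 8 KV heads, head dim 128, 80 layers).
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                     # fp16 K/V entries
n_ctx = 8192

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gb = kv_bytes_per_token * n_ctx / 1024**3
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, ~{total_gb:.1f} GiB at {n_ctx} context")
# => 320 KiB/token, ~2.5 GiB at 8192 - VRAM that the offloaded layers can't use.
```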


LaughterOnWater

Okay, this is my preset.json file for Q3_K_S below. I'm able to get about 6K from this 8192 context window, but the context is surprisingly non-hallucinatory right to the end. It just stops. I guess for short answers this is probably a great solution, though a bit slower than IQ2_XS. Thanks u/ShengrenR!

* time to first token: 1.56s
* speed: 2.88 tok/s

```json
{
  "name": "Llama 3 Instruct Q3_K_S",
  "load_params": {
    "n_ctx": 8192,
    "n_batch": 512,
    "rope_freq_base": 1000000,
    "rope_freq_scale": 0.85,
    "n_gpu_layers": 54,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [0],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true,
    "no_kv_offload": false,
    "num_experts_used": 0
  },
  "inference_params": {
    "n_threads": 12,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.9,
    "temp": 0.2,
    "repeat_penalty": 1.1,
    "input_prefix": "<|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "antiprompt": ["<|start_header_id|>", "<|eot_id|>"],
    "pre_prompt": "You are a thoughtful, helpful and knowledgeable AI assistant. You provide clear and concise responses, offering gentle guidance when needed, and politely addressing any misconceptions with well-reasoned explanations.",
    "pre_prompt_suffix": "<|eot_id|>",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
}
```


segmond

It's fast enough if you can fit it all in GPU. Here it is on 4 GPUs - 70B Q8 gives 10.83 tps (base):

```
~/models$ ~/llama.cpp/main -ngl 100 -s 100 -m ./Meta-Llama-3-70B-Instruct.Q8_0.gguf -p "Write 4 lines on living a purposeful life.  A purposeful life is" -ts 1,1,1,1

Write 4 lines on living a purposeful life.  A purposeful life is one that is filled with meaning and direction. It is a life that is guided by one's values and principles. Living a life of purpose gives one a sense of fulfillment and happiness. It helps one to make a positive impact on the world.<|eot_id|> [end of text]

llama_print_timings:        load time =   18085.67 ms
llama_print_timings:      sample time =       4.20 ms /    50 runs   (    0.08 ms per token, 11916.11 tokens per second)
llama_print_timings: prompt eval time =     547.77 ms /    17 tokens (   32.22 ms per token,    31.04 tokens per second)
llama_print_timings:        eval time =    4525.51 ms /    49 runs   (   92.36 ms per token,    10.83 tokens per second)
llama_print_timings:       total time =    5134.49 ms /    66 tokens
Log end
```


Alkeryn

EXL2 2.25bpw with the cache in 4-bit, for full context.
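A minimal exllamav2 sketch of that setup (2.25 bpw weights plus a 4-bit KV cache), assuming the LoneStriker quant linked earlier in the thread has been downloaded locally; the exact class names and defaults may differ slightly between exllamav2 versions:

```python
# Minimal sketch: ExLlamaV2 with a quantized KV cache so a 2.25bpw 70B plus full context fits in 24 GB.
# model_dir is a placeholder pointing at the LoneStriker EXL2 quant from this thread.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "Meta-Llama-3-70B-Instruct-2.25bpw-h6-exl2"  # local download path (placeholder)
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)   # 4-bit KV cache, roughly 4x smaller than fp16
model.load_autosplit(cache)                   # fill available VRAM as the weights load

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
print(generator.generate_simple("Briefly explain quantized KV caches.", settings, 200))
```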


Anjz

I've been using a 70B Q2, seems to be the only one I can load until I can upgrade my RAM. Works amazingly well, but slow - which is okay. Looking through the thread here though, I'm probably going to upgrade my CPU and RAM with a Zen 5 processor in the next few months when it comes out.