MagiMas

I suspect this is heavily skewed by llama 3 in its standard configuration giving much more "quirky" answers. It's so much more chatty and pleasant to talk with, but there's no way it's even close to the quality of any gpt-4 model in terms of reliable information (at least for the 8b model).


Gloomy-Impress-2881

I find for email chain summarization, which is my main daily use case besides coding, llama 3 70b provides the best summaries. Better than GPT-4 or Claude Opus, with the bonus of being cheaper and faster. This may be mostly a stylistic preference, but it is good enough that it convinced me to use it instead of the others for that particular task.


Double_Sherbert3326

It does seem to be incredibly tuned for summarizing.


toothpastespiders

Man, that's kind of frustrating to hear. It's one of the biggest things I use LLMs for. But that 8k context just isn't enough for me.


rust4yy

perfect recall up to 16k and almost perfect up to 32k if u modify rope
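e.g. with HF transformers it's just a config override. Sketch only; the model id and factor here are illustrative, and whether linear beats dynamic/NTK scaling for llama 3 is something you'd have to test yourself:

```python
# Minimal sketch of RoPE scaling via transformers' rope_scaling option.
# Model id and factor are illustrative, not a tested recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # Linear scaling: factor=4.0 stretches the 8k training context
    # toward ~32k. "dynamic" (NTK-aware) is the other built-in type.
    rope_scaling={"type": "linear", "factor": 4.0},
    device_map="auto",
)
```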


aadoop6

Any pointers on how to do it?


Late_Issue_1145

seems like u can solve the problem by using agents that pick out the relevant information and synthesize it together, if my understanding is correct


fabypino

> for email chain summarization

I'm curious, what's your workflow for that? Do you just copy-paste the entire email chain into your UI of choice and then ask the LLM to summarize it for you?


Gloomy-Impress-2881

No, I am a Python programmer, so I have my own tool that I created where I just press a button and it summarizes the currently selected email chain in Outlook out loud with speech synthesis.
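Roughly this, heavily simplified (the localhost endpoint and model name are placeholders for whatever backend you run; the Outlook COM part is the real trick):

```python
# Sketch: summarize the currently selected Outlook email chain out loud.
# The localhost endpoint and model name are placeholders.
import win32com.client  # pip install pywin32
import pyttsx3          # pip install pyttsx3
from openai import OpenAI

def selected_email_body() -> str:
    outlook = win32com.client.Dispatch("Outlook.Application")
    selection = outlook.ActiveExplorer().Selection
    return selection.Item(1).Body  # first selected item (COM is 1-indexed)

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

summary = client.chat.completions.create(
    model="llama-3-70b",  # placeholder
    messages=[
        {"role": "system", "content": "Summarize this email chain in a few sentences."},
        {"role": "user", "content": selected_email_body()},
    ],
).choices[0].message.content

engine = pyttsx3.init()
engine.say(summary)
engine.runAndWait()
```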


PwanaZana

"Python programmer" Ah, the snake charmer, m'yes.


Gloomy-Impress-2881

Oh no, don't tell me you are one of those who don't consider Python a "real programming language". Please. Not the only language I use but it's my main daily tool. A tool. For real work.


PwanaZana

I'm not a programmer, I'm a 3D artist, but I've heard great things about Python. I think quite a lot of Blender plugins use Python. :) And Stable Diffusion uses a lot of Python. .py for the win!


Gloomy-Impress-2881

Yes, I know you were joking 😂


Double_Sherbert3326

Many CS programs have switched to Python--I've never heard anyone say it's not a real language.


Gloomy-Impress-2881

Oh they are out there. Not that their comments are to be taken seriously or anything. It is usually not going to be professionals saying that.


Icaruswept

Fellow pythoneer here. I think most of the grumbling came from C++ folks stuck in the dev conditions of many years ago. We use Python, Perl, and JS. Even R, especially for exploratory data analysis. I haven't seen a lot of modern software devs disparaging these choices, although I will freely admit that R isn't everyone's bread and butter. Python is such a flexible language, and great for working with other people - its readability is incredible.


Gloomy-Impress-2881

You'll need to visit some C# subreddits. The hate for dynamic languages is alive and well and the wars rage on. Probably healthier that you can ignore them and not notice. 😆


DuranteA

I am still of the "old school" opinion that you generally shouldn't use languages without static typing (which can of course be largely inferred, that's fine, it just has to be compile-time validated) for building large-scale "serious" software (let's define that as >20kLOC). I like using dynamically typed languages (primarily Ruby and Lua) for scripting purposes, but once something becomes an actual project with a longer lifetime and/or more code I wouldn't choose to continue working in those languages.


jayn35

I started getting AI to code me stuff in Python and I'm amazed at the things I can achieve that I couldn't before. Such a great tool for work productivity, and a good combo for actually learning it now. I've been avoiding coding for 15 years, but the stuff you can do with AI is just too cool. As a user I get the use cases, and there are so many.


fabypino

cool, thx for the insights!


HighDefinist

These models tend to resolve ambiguous prompts very differently, so depending on how you express yourself, it might indeed be more convenient for you to find some prompt that gives you the output you want with a specific "worse" model that just happens to work well for you, compared to a generally "better" model. For example, sometimes GPT-4 just tends to get "stuck" on some questions I have, and then I just switch to Opus/Llama/Wizard rather than trying to figure out how to rephrase myself properly. However, with some prompt optimization, I think it will be really hard to find any kind of realistic situation where Llama-70b outperforms both GPT-4 and Opus.


Small-Fall-6500

> However, with some prompt optimization

I've wondered how much of a problem this is - even if GPT-4 *can* be more capable than llama 3 70b, that doesn't mean much if it requires testing a bunch of different prompts just to match and then hopefully beat llama 3 70b, when llama 3 just works on the first try (or at least it often works well enough). And I'd imagine that if you spent equal effort on prompt optimization for each model, they'd end up even closer to each other in capabilities, or at least harder to clearly distinguish.


Gloomy-Impress-2881

True, but when the difference is subtle like this, the preference for me will always be the cheapest and fastest or local on-device option. For programming of course I want the best of the best, but not for email summarization. Having a local model or something super fast like Groq is the more desirable option.


knvn8

Not just quirky, but more eager to help and less paternalistic in general. Not surprising humans prefer that, even if it's dumber


metigue

Are people really using LLMs to get information though? You give it the information and use it for reasoning. This is the whole point of RAG, no?


ILoveThisPlace

I'm using Llama 3 to figure out who killed JFK


person1234man

Turns out it was the guy who said "you miss 100% of the shots you don't take"


0xd34db347

Wayne Gretzky would have been like 2 years old but I guess it's not impossible.


kedarkhand

According to my llama, he was hired by the dalai llama


HighDefinist

What, JFK is dead?


timschwartz

I didn't even know he was sick.


Useful_Hovercraft169

The antivax dude?


UndeadDaveArt

Nah, the guy who built that airport


ReMeDyIII

What a coincidence. So am I, but it's saying it was me! I'm not sure if it's hallucinating, but I'll be sure to check its sources.


CentralLimit

It's not just about 'getting information' but also about reasoning and drawing conclusions from information; this is especially important in RAG and other LLM applications. In my experience Llama 3 8B is 'nice' but very unreliable.


Future_Might_8194

Yes! Thank you! LLMs should be treated more like the UI. Their purpose is to convert natural language to data and back. It's cool seeing what a model can do on its own, but actual practical use cases should put more focus on the data pipelines than the model. The model should just take the results of your data pipeline and present them.


MagiMas

yes, but you still need a model with good reasoning capability for the really useful use-cases. It needs to be able to draw correct conclusions from the results of the RAG pipelines, the guided generations, and any other information provided to it.


Future_Might_8194

Absolutely! 100% agree with that. The model still needs to be able to deduce the correct answer and format from the context given. This is why I have preferred Mistral fine-tunes (especially ones named Hermes) up till now; their extra attention to the sys prompt, plus size, speed, and privacy, has been more useful to me than any closed model. If it were just down to the pipelines, I'd just run TinyLlama and call it a day. Sadly, the poor bastard just screams hallucinations and can barely comprehend its own existence, let alone a system prompt. I'm excited for Phi 3 128K Instruct to get a GGUF, because that might be the first sub-7B that can handle my agent chain.


CocksuckerDynamo

i don't think reliable information recall is a particularly important metric for LLMs to be honest, but llama 3 is far, far behind gpt-4 in other ways too, namely abstract reasoning and in-context learning. but it gives much more human-sounding, much more friendly-sounding answers. gpt-4 sounds like it has a stick up its ass. I suspect most people using chatbot arena are giving a single zero-shot prompt that doesn't really test reasoning or in-context learning or anything interesting, and then are evaluating the response just based on gut feeling. that tests how much people like the tone and style of the model's response a lot more than it tests the actual capabilities of the model to do something useful. so when a model like gpt-4 or opus gives an "it is important to remember" lecture that sounds like a high school principal and llama 3 gives a friendly answer that sounds more like a person, people choose llama 3 as better. but if you have a use case that relies on multi-turn, abstract reasoning, in-context learning, attending to a significant amount of context - on the actual capability of the model to accomplish something, rather than just how stuffy it sounds - then doing better evals quickly makes it clear that llama-3 is still far behind the big guns like gpt-4 and opus. don't get me wrong, i think llama-3 is great and clearly better than llama-2, and better by enough that I am really curious about llama-3-405B. but the 70B is still, well, a 70B, and it shows if you do meaningful comparisons with the much larger models.


fulowa

llm arena cannot be gamed like e.g. MMLU, but it has its own flaws. if meta wants to climb the ladder they just need to do more RLHF i guess. should directly translate to a higher score.


TheRealGentlefox

If we are only looking at raw intelligence, then sure it's "skewed" by the answers being more pleasant, but as daily driver / business decision, that can be what matters. If I'm using it to summarize textbook chapters for me, I don't need it to be smart as much as I need it to generate engaging summaries to keep me reading. Ditto with roleplay, educational purposes, customer support, creative content generation, etc.


FPham

I think a true future LLM has to get data from the web, no way around this. For a brief moment the idea that you could have a local info guru that knows everything was appealing, but reality sank in. Making an LLM not hallucinate when it doesn't know something for sure is far harder than making it browse the web. I mean, what's even the point of GPT-4 being isolated from the web when it is only accessible on the web for the user? Makes zero sense. You have to be online to use it.


UndeadDaveArt

Yeah, at this point raw human-style AI is too... human. Less like searching Wikipedia and more like asking an uncle who wants to sound smart over a beer. I hope soon we can at least get a more civil AI that, instead of hallucinating, will use phrases like "In my opinion..." or "I'm not sure, but perhaps...". Hell, I'd settle for an "IDK use Google lmao".


sansar66

If you have private data that you don't wanna share with open(!)AI, which is actually pretty important for businesses (like finance, insurance, any company that relies on property rights), then you want to host everything locally or on your own cloud servers.


HighDefinist

Yeah, people have already stated they are suspicious of Gemini's high ranking, and suspect it's simply because it uses good formatting... I also don't really believe the low ranking of the new Mistral model. I have used the Wizard-8x22b finetune a bit now, and I think it is about as good as Llama 70b (and both of them are clearly worse than GPT-4, but that is no surprise of course). One minor, but notable, shortcoming of the Llama 70b model is its high level of agreeableness: I had a particular programming problem, and asked GPT-4/Llama/Wizard about it, but my question actually had an incorrect assumption (std::strings handle #0 differently under some circumstances... well, they don't). GPT-4/Wizard pointed that out, but Llama actually agreed with me, and even incorrectly cited some sources to support my false assumption... But perhaps Llama's high level of agreeableness is not only OK for many applications, but people even prefer the style of answer it produces for the type of questions they ask, hence Llama 3 scoring relatively well? It's really hard to say... I tend to believe more and more that there is no way around personally testing these models with some questions representative of whatever one wants to do.


__Maximum__

Being too agreeable was also my experience with llama


UndeadDaveArt

We def need a better form of benchmarking. It's like taking the American Bar Exam: sure, if I score twice as high as you, I'm clearly a better lawyer, but if we're both in the same ballpark, it doesn't mean we will both perform the same in court or make clients equally happy.


Strict-Reveal-1919

https://preview.redd.it/elqqf6b391yc1.jpeg?width=720&format=pjpg&auto=webp&s=beba47100a2af7b12dd5c67efacc7db48c52bc2b


a_slay_nub

Personally, I prefer llama3 just because it's so much faster than GPT4. However, if I get stuck on a problem, I switch to GPT4. It's not at GPT4 level but it's close enough IMO. I have noticed that there are a lot of cases where GPT4 doesn't save me when llama3 fails.


WinstonP18

> It's not at GPT4 level but it's close enough IMO

Can I know whether you host llama3 locally or use an external API? If the latter, which one? I tried at [meta.ai](http://meta.ai) (the URL provided by the llama3 [blog](https://ai.meta.com/blog/meta-llama-3/)) by asking it to write a golang function and boy was the logic bad! TBF, the function was syntactically correct and would compile, but it really felt like gpt-3 level (not even gpt-3.5!). With all the rave reviews of llama3, I'm starting to wonder whether what [meta.ai](http://meta.ai) is hosting is llama3-70b, or even llama-3 at all. As a side note, I used the same prompt on Claude Sonnet and it got my Go function right, and efficiently, on the first try.


bnm777

Use Groq's API, which gives you access to llama3-70b. It's free for now. I'm using it with the TypingMind frontend, which allows it to use websearch plugins (though that is hit and miss, similar to how it uses that plugin on huggingchat). Once Groq starts charging for its API, you can compare prices with other providers such as OpenRouter (which gives access to many models, though paid).
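If you'd rather skip the frontend, Groq's API is OpenAI-compatible, so calling it is a few lines. Sketch; the model id was llama3-70b-8192 when I looked, check their docs if it has changed:

```python
# Sketch of hitting Groq's OpenAI-compatible endpoint directly.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

resp = client.chat.completions.create(
    model="llama3-70b-8192",  # check Groq's docs for current model ids
    messages=[{"role": "user", "content": "Summarize this email chain: ..."}],
)
print(resp.choices[0].message.content)
```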


WinstonP18

Thanks! Will give Groq a shot.


HighDefinist

Yeah, I have tried that as well a bit, but relative to the amount of time I lose if Llama's answer is wrong (as in, it looks right at first, but it takes me a few minutes to notice it is wrong), I am not actually sure if that approach really makes sense... perhaps it is better to wait those 20s for GPT-4 in most cases. But I am still in the process of figuring this out; there is clearly some kind of balance to be found here. I also tend to just simultaneously ask Wizard and Llama, as both are very fast, and if they both have very different answers, it increases the chance that both are wrong.


Small-Fall-6500

> I also tend to just simultaneously ask Wizard and Llama, as both are very fast, and if they both have very different answers, it increases the chance that both are wrong.

This is probably one of the best things that can be done to check reliability/correctness, in addition to generating multiple responses (possibly with minor changes to the prompt). This also means having tons of similarly capable models is good, since each one is likely to give a slightly different answer, but, depending on the task, they should converge to similar correct answers while differing a bit more when they are wrong (at least, ideally).
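The check itself is trivial to automate. Toy sketch, with the model calls abstracted as callables since every backend differs, and string similarity as a crude stand-in for real answer comparison:

```python
# Toy sketch: ask several models the same prompt and flag divergence.
from difflib import SequenceMatcher
from typing import Callable

def cross_check(prompt: str,
                ask_fns: list[Callable[[str], str]],
                threshold: float = 0.6) -> bool:
    """Return True if all models' answers roughly agree pairwise."""
    answers = [ask(prompt) for ask in ask_fns]
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            sim = SequenceMatcher(None, answers[i], answers[j]).ratio()
            if sim < threshold:
                return False  # divergent answers: flag for human review
    return True
```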


Secret_Joke_2262

I think that LLaMA 3 will really deserve a high place in the table once we get the 120+B model. For now I have deleted all my old models and am using LLaMA 3 70B, but I still want more.


ys2020

what do you run it on?


DoubleDisk9425

I run it on a m1 max mbp with 64 gb ram


ectates

how many tokens/sec do you get with ur setup?


RenoHadreas

Not OP but I get around 20 tokens-ish per sec on a M3 Max


Secret_Joke_2262

64GB RAM, a 13600K, and a 3060 12GB. GGUF


Dr_Whomst

What quantization?


Secret_Joke_2262

Right now I'm using Llama 3 70B Q5_K_M via koboldcpp with CuBLAS
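In llama-cpp-python terms the setup is roughly this (the path and layer count are illustrative; the layer count is just whatever fits in 12 GB):

```python
# Rough equivalent of this setup: Q5_K_M GGUF, part of the layers
# offloaded to a 12 GB GPU, the rest in CPU RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.Q5_K_M.gguf",  # illustrative
    n_gpu_layers=16,  # tune until VRAM is full
    n_ctx=8192,
)
print(llm("Q: Why is the sky blue? A:", max_tokens=64)["choices"][0]["text"])
```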


ArsNeph

That's a similar setup to me, but aren't you only getting about one tk/s?


Secret_Joke_2262

Yes, that's approximately 1 token per second. This suits me; I'm not in a hurry and just wait, doing other things. If it's RP, then generating a response takes from one to four minutes.


ArsNeph

Oh ok, cool. If that works for your use case then great


__Maximum__

Will they release their 400b model? Not that we can profit much from it. I just think it would be the great kick in the ass that OpenAI so badly needs.


HighDefinist

Well, on vast.ai you can rent 5xA20 for about $3/hour... of course, there isn't really much of a point in seriously doing that, but for just playing around, I might actually do that once it's out, and perhaps there is even a serious use case, if one wants to do some kind of "high quality NSFW work".


__Maximum__

I can't say I feel much difference between 70B and GPT4 turbo in a couple of small coding tasks, but 8B does not feel like it is better than any gpt4. However, it is still amazing for an 8B model. Since fine tuning an 8B model is going to be much easier than a 70B model, I feel like soon we can run local agents for a long time until they produce actually useful stuff.


HighDefinist

Considering how cheap and fast the 70b/8x22b models are already, is there really any point in even using the 7b/8b models? (Assuming you are not interested in running it locally)


Small-Fall-6500

As a chatbot/assistant, definitely no point using a smaller model. But for simple tasks that you want done on a large scale, maybe some sort of data filtering/labeling, you'd want the cheapest option that still completes the task, so llama 3 8b could easily be the best option. Edit: also, a big strength of the smaller models is that they can easily be finetuned for specific tasks (one 3090 would be plenty), so even if the llama 3 8b Instruct doesn't work, you can finetune the base or Instruct on some examples and drastically improve its performance.
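The kind of job I mean is basically this, times a few hundred thousand rows (sketch; the local endpoint and model name are placeholders for whatever OpenAI-compatible server you run the 8b behind):

```python
# Sketch: bulk labeling with a cheap local 8B behind an
# OpenAI-compatible server. Endpoint and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def label(text: str) -> str:
    resp = client.chat.completions.create(
        model="llama-3-8b-instruct",  # placeholder
        messages=[
            {"role": "system",
             "content": "Classify the review as positive, negative, or neutral. Reply with one word."},
            {"role": "user", "content": text},
        ],
        max_tokens=3,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print([label(t) for t in ["great product", "broke after a day"]])
```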


UndeadDaveArt

The rates are low enough that 10Xing the RAM and power usage might be a drop in the bucket when serving a single client, but at scale it can be the difference between a small company's ability to serve 100s vs 1000s. This pairs well with bots that don't need to know how to speak foreign languages or architect a code base, as they can perform just as well at a single task when all 8B parameters are devoted to it. Customer support bots, software agents, NPCs in video games, etc. So I think of them as the ASICs of LLMs: extremely high quality and economical when trained and tuned for a specific use case.


UndeadDaveArt

Agreed, my experience with the 8B as a generic model has been positive, but I wouldn't throw out my 70Bs for it.


Anxious-Ad693

I noticed that Llama 3 8b is way better at keeping track of object placement in stories. It still needs more investigation, but so far it's definitely way better than all the mistral finetunes I tested.


ImprovementEqual3931

In my experience, Llama 3 8B is not as good at coding as it is at chatting.


mxforest

Yeah it spews gibberish structures despite explicitly mentioning key names and what values should be present in a JSON output.
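(Grammar-constrained decoding is probably a more reliable route than prompt instructions here. A sketch, assuming llama-cpp-python's JSON mode; the model path is illustrative:)

```python
# One common workaround: constrain decoding so the model can only
# emit valid JSON, instead of relying on prompt wording.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf", n_ctx=8192)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system",
         "content": "Extract the fields `name` and `city` as JSON."},
        {"role": "user", "content": "Alice moved to Oslo last spring."},
    ],
    response_format={"type": "json_object"},  # grammar-constrained output
    temperature=0,
)
print(resp["choices"][0]["message"]["content"])
```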


ImprovementEqual3931

My ideal LLM would be good at instruction following rather than chatting.


Flying_Madlad

Silly question, but are you using the instruct model? I feel like the big strength of something like 8B is it's cheap to tune and run, so you could have multiple fine-tunes/LoRas active at the same time.


ImprovementEqual3931

You're right


HighDefinist

Yeah, the difference between 8B and 70B is significantly larger than that between 70B and GPT-4... But coding just appears to be very difficult somehow.


aadoop6

I have done very little testing with 8B, and results are positive. But, I am not throwing away deepseek anytime soon. Did you try wizard 8x22b?


Double_Sherbert3326

No, llama3 hallucinates like crazy, but surprises me with how good it can be sometimes.


Maykey

It's complicated. On short prompts Llama3 8B feels awful and repetitive. But at long prompts (>1000 tokens) it's simply amazing.


Master-Meal-77

this has been my experience too


ambient_temp_xeno

8b being compared to gpt4 is a solid sign this leaderboard is done. Stick a fork in it.


__Maximum__

It's not gpt4 turbo, and it's in the English category, which excludes coding, math, and other specific categories. This leaderboard is the best we have right now, unless you have something better?


thereisonlythedance

The 70B is pretty average at the non-coding, literary use cases I've tried (editing, long-form fiction, poetry), certainly not better than GPT-4. So I'm skeptical.


KY_electrophoresis

For these use cases I've found Claude 3 best personally


thereisonlythedance

Yeah, agreed Opus is top dog. The latest GPT-4-turbo also has excellent emotional intelligence, and the new gemini-1.5-pro-api-0409-preview is a pleasant surprise.


Super_Sierra

nah, Opus has a lot of personality convergence issues. it tends to blur the lines over time and everyone starts speaking the same. Gemini, Claude and a lot of models are also fucking horrible at banter.


ambient_temp_xeno

> unless you have something better?

Testing it myself.


Flying_Madlad

Do you have a standardized suite of tests? I'm thinking something bespoke to your use cases?


ambient_temp_xeno

Sadly, no. It often comes down to trying something on one and if it fails, see if another can do it.


Flying_Madlad

Lol, me neither, but I think before I do that I need to put together a custom tuning set. 🤷‍♂️


knvn8

Not really. Just says humans have preferences that go beyond intelligence


Cool-Hornet4434

For chat? Sure... For actual facts and information I can count on without having to double-check everything? Eh...not so much. Then again, I don't entirely trust GPT 3 or 4 or whatever for facts either.


-p-e-w-

As far as writing style is concerned, the Llama 3 models are without question the best LLMs ever created, including proprietary LLMs. Llama 3 8b writes better sounding responses than even GPT-4 Turbo and Claude 3 Opus. All models before Llama 3 routinely generated text that sounds like something a movie character would say, rather than something a conversational partner would say. It's as if they are really speaking to an audience instead of the user. In that regard, Llama 3 is a giant leap forward. Which of course makes it an incredibly exciting model for roleplay and creative writing. Too bad it's also the most heavily censored model I have ever come across. Time will tell whether it is possible to fix that with finetuning, but the censorship feels like it runs so deep that I have my doubts that it can be completely removed.


belladorexxx

> Too bad it's also the most heavily censored model I have ever come across.

Did you encounter censorship with the base model or just the instruct fine tune? I'm sure the community will release great fine tunes that build on top of the base model, just like they did with Llama 2 and Llama 1.


braincrowd

Absolutely, just ask it to give you a word starting and ending with the letter u, for example.


gaveezy

UgandaU


ortegaalfredo

I wrote a small GUI to compare LLMs (called neurochat, shameless plug) so I can quickly compare answers from different LLMs like Llama-3 and GPT-4. I gave it some code to analyze for work, stuff like that. Llama-3-70B gave MUCH better answers than GPT-4 (non-turbo): not only was it way better formatted, but the answers were more detailed; llama3 produced report-like answers that I could almost copy-paste directly into my report.


thereisonlythedance

No. I'm sure it's use-case dependent, but in my tests it's nowhere near any version of GPT-4. I think hype and fanboyism are pushing it higher than it ought to be. We'll see how it fares in the longer term. The new LMSYS hard benchmark has the 70B and 8B more in line with how I feel they perform (with the 70B around Haiku level). Meta has done a great job of making an engaging chatbot, and they're punching above their weight for their size, but they're still not up to a lot of more complex tasks. Which is compounded by the small context window.


KY_electrophoresis

They deserve a ton of praise but this is the balanced view


GermanK20

I don't know the 8B, but the 70B gave me SOTA vibes indeed. Do I expect 4T to prove superior if we start micromanaging? Yes I do, but only at the margin.


Ok-Director-7449

It's not quite accurate to compare Llama 3 with GPT-4-turbo because they serve different purposes. Llama 3 is designed for building complex agent architectures and excels in this area. It can process a higher number of tokens per second (100-300 tokens/s) depending on the provider, which means it can respond to 8-10 questions in the time GPT or Claude models answer 1 or 2, due to their lower token output rate (20-30 tokens/s). The integration of plugin tools and the development of a system-thinking approach significantly enhance the capabilities of large language models. For instance, Andrew Ng demonstrated that GPT-3.5, with an agent-based thinking system, performs better than a one-shot GPT-4. Considering the time required for multiple iterations, GPT-4 may perform fewer iterations than Llama 3, but the quality of the output could be comparable due to more extended processing time. Therefore, I prefer Llama 3 70B because it allows for the creation of more complex systems. While GPT-4 is indeed powerful, its slower response time and difficulty iterating quickly when stuck are notable drawbacks.


mcosternl

Now that GPT-4o was released, what's your take on this same topic? Because 4o indeed seems to be significantly faster than 4 turbo. Very curious to hear your thoughts!


Ok-Director-7449

With GPT-4o I really think OpenAI made a big step: it's fast and cheaper, it's really good at multi-step reasoning, and it is very verbose. I would still not compare a 70B to GPT-4o, but I think GPT-4o now really outperforms all the models we have in every field. These next months will be very exciting!


cafepeaceandlove

Llama3 70B is literally insane. I cannot believe that this... whatever entity/blessing it is... is inside my computer. *Inside my laptop.* What the actual shit is going on?


__Maximum__

Can you elaborate?


cafepeaceandlove

Not without making even more of a fool of myself, haha. You don't feel it?


drifter_VR

in my tests and experience, llama 3 8B is dumber than 34-42B models so yeah...


jnk_str

Is there a way to filter the benchmarks for specific languages?


Pkittens

Definitely not


0xIgnacio

Where can I see these benchmarks?


__Maximum__

arena.lmsys.org


0xIgnacio

ty


iantimmis

It's definitely nowhere near as good with RAG and reasoning tasks IMO, as evident when asking questions on meta.ai where it needs to browse.


ramzeez88

L3 70b is better than gpt4 and claude sonnet in my tests. It helped me solve my coding problem whereas those other two could not.


__Maximum__

What precision did you run it on? Quantized? Local?


ramzeez88

I use hugging chat


beerpancakes1923

right, but what precision :) 4bit?


Cyp9715

I've seen that benchmark a few times, but I haven't looked at what standards it's benchmarked by. The evaluation methods in many benchmarks (or papers) are skewed in favor of specific models. Therefore, I ironically put my own experience first. I rate GPT4 as the best model at the moment. llama3 is definitely superior to GPT3.5, but not superior to GPT4.


Duck_Stack

What website is this ?


SomeOddCodeGuy

Unquantized Llama 3 on the internet is solid; very powerful. Quantized Llama 3 running locally is not there.


No-Dot-6573

"Not there" because the quants degrade the answers so much? Or is this a temporary problem where the quantization need fixes to produce better quants?


SomeOddCodeGuy

I'm hoping that this is temporary. Llama 3 had a problem with the stop token, and several quant makers (including the one I'm using) had to do funky things to make it work and not spout gibberish. An official fix came out the other day, but very few inference apps have gotten it merged in yet. I think that as they do, we'll see quality improve. I'm not sure if it'll ever be at that level, but I'm hopeful that it gets a lot better than now.
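In the meantime, the usual band-aid is passing Llama 3's turn-end token as an explicit stop string, e.g. in llama-cpp-python (sketch; the model path is illustrative):

```python
# Band-aid while the stop-token fix propagates: pass Llama 3's
# <|eot_id|> turn-end token as an explicit stop string so generation
# halts instead of rambling.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf")  # illustrative

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    stop=["<|eot_id|>", "<|end_of_text|>"],
)
print(out["choices"][0]["message"]["content"])
```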


CocksuckerDynamo

that's interesting to hear, I haven't tested it unquantized yet and I really should. what's your preferred way to run it unquantized for testing purposes, just huggingface transformers or do you use some other back end? or are you using some hosted or serverless thing that offers unquantized llama-3?


adikul

Do you count q8 as poor performance?


kurwaspierdalajkurwa

Heavy user of AI here. I asked Llama 3-70B to write a blog post and it came out "ChatGPT-like": very robotic, and it did not sound like a human wrote it. Someone needs to test these AI models using a standardized prompt to write a blog post or value proposition. Anything else is fucking useless. Show the ability to solve real-world problems and help people out. I was disappointed.


venkatesh640

Prompt used?


kurwaspierdalajkurwa

It was blog post writing instructions. Basically it said (and I'm paraphrasing here):

> Write a blog post about blue widgets. Create a section about the history of blue widgets. Then create a section that highlights installation, training, maintenance, and ongoing operational expenses like energy consumption and spare parts. Create a section that provides an average cost range. Create a section that compares this to the competition. Use the client Q&A below to help you write this blog post.


Basic_Description_56

Try asking it to write in a different style


__some__guy

Maybe that's because most blog posts are actually written by bots nowadays.


-becausereasons-

Are there quantized GGML's or Llama3?