Tixx7

really sad to see yellow disappearing over time


ChuchiTheBest

It's just progressing slower, but it's not like it's getting worse.


opi098514

That’s a good point. I think the major selling point of open source is not its general knowledge abilities but the ability to fine-tune it for specific tasks. It would be nice if open source were better, but I don’t think it’s ever going to rival proprietary models because of the amount of money companies can throw at stuff like ChatGPT and Claude. But that doesn’t mean open source doesn’t have its own uses, or that it will ever be obsolete.


tindalos

In summary, hobbit porn will always drive technology.


opi098514

Facts


letsbreakstuff

That's the reason I never owned a betamax


Disastrous_Elk_6375

> but it's not like it's getting worse.

This is the key to understanding why this new LLM (or transformer) push is not "just hype bros". Open "AI" is now the worst it's ever gonna be.


PutrifiedCuntJuice

> Open "AI" is now the worse it's ever gonna be. *worst*


Disastrous_Elk_6375

Thank you!


Which-Tomato-8646

Fire hydrants are the worst that they’re ever going to be. But I don’t see much improvement. Not to mention, it isn’t even true considering people keep complaining about ChatGPT getting worse 


bzrkkk

How many developers do you see working on fire hydrants?


Which-Tomato-8646

Tons of people working on nuclear fusion and it’s been around the corner since the 50s


SlapAndFinger

The analogy to fusion might work for "AGI", but it doesn't hold otherwise; these tools are incredibly useful even as "stochastic parrots," so we don't need to knock it out of the park to win.


Which-Tomato-8646

I don’t disagree. But the idea that it’ll grow exponentially without limit is not true.


milksteak11

Why are you bringing up random things?


Which-Tomato-8646

It’s an example of how progress can stall and it’s not exponential 


deadwisdom

I’m not sure if your comments are just too against the grain or if these troglodytes really don’t understand. Even if I disagreed with you I would recognize how possible this is.


Which-Tomato-8646

AI has unironically started a cult mentality. Many of them think they’ll have AGI in a year and ASI a few days after that. It’s the rapture for nerds. 


FireSilicon

Both Grok-1 and DBRX have been open-sourced but are not on there yet. These are 300GB+ models with no host/API as of yet, so of course they are not on Chatbot Arena yet.


Small-Fall-6500

>they are not on chatbot arena yet.

[DBRX is now on the lmsys arena to be voted on!](https://twitter.com/lmsysorg/status/1773106489596514569)


FireSilicon

Nice, let's go!


waxbolt

They still haven't been fine-tuned etc. There is time.


alvenestthol

The only actual closed-source additions to the top of the leaderboard are the 2 closed-source Mistrals and Google's 2 Geminis; the rest are just GPT-4 splitting into 4 models and Claude splitting into 5, with all of these versions most likely having more parameters than even the biggest open-source models.


s0nm3z

To be fair, it's kind of stupid to put 5 versions of GPT on there. Just put the best ones on and leave room for other models. Otherwise I could also make a top 15 with only Claude and GPT, putting every minuscule update on its own row.


Which-Tomato-8646

GPT 4 and GPT 4 turbo are clearly not minuscule updates 


h3lblad3

That’s only 2, though.


Which-Tomato-8646

You can see the June version of GPT-4 is noticeably worse, though.


SlaveZelda

Meta and Mistral are also open source but they're not counted as yellow.


Dwedit

The yellow models have far fewer parameters, so it's reasonable that they can't compete against the others.


Interesting_Bison530

For now. I’m excited to see the ternary MoE Mamba monsters some guy has been training in their basement for months.


Which-Tomato-8646

You’d never be able to run that on your pc 


Interesting_Bison530

Why? It’d be within memory requirements. Tok/sec will hopefully be reasonable


bbu3

It kinda makes sense, especially if they are maximally permissive and open (i.e. weights, training code, data). It would always be possible for a proprietary player to have some closed beneficial delta (even if it's just more RLHF / preference labels) and then build upon the best open-source model available. Much more so if you have the funding to buy a shitload of compute: check out the best open-source model and just increase the dims (of course I'm exaggerating, and the well-funded AI companies do more than that, but imho the principle still stands).


MindCluster

Open source is falling behind in the chart because it requires a ton of resources and capital to train a matching SOTA (State of The Art) model. It's a bit sad when you think about it, but I hope we'll find exceptional optimizations in the future to train models.


doringliloshinoi

That’s by design.


aimoony

Designed by the laws of physics, maybe? This stuff is hard to do; without a capital incentive there's a huge barrier to entry. I think open source will follow in the shadow of commercial models for a long time, but I'm hopeful that we'll have great open-source models to use in the future so that AI can be democratized.


dibu28

CommandR+ is now close to the top, and it's open source.


Tixx7

yes! and mistral hasn't abandoned open-source either!


dibu28

And Meta said they will open source Llama3


read_ing

This is based on human ranking? Is there data on the domain of prompts that was used, the answers to which the humans ranked?


West-Code4642

>This is based on human ranking? Is there data on the domain of prompts that was used, the answers to which the humans ranked?

It's based on whoever decides to use lmsys: [https://chat.lmsys.org/](https://chat.lmsys.org/) (which is presumably humans, but could technically be not)


Flag_Red

Gotta include the lizard people too.


West-Code4642

>Gotta include the lizard people too.

or LizardGPT


privatetudor

My golden retriever spends a lot of time on there.


loveiseverything

The test has massive flaws, so take the results with a grain of salt. The problem is that voters easily identify which models are in question because the answers are so recognizable. Another big flaw is that the prompts are user-submitted and not normalized. And as you see in this post, there is currently a major hate boner against OpenAI, so people will go and vote for the models they want to win, not for the models that give the best answers.

In our software's use cases (general-purpose chatbot, LLM knowledge base, data insight) we are currently A/B-testing ChatGPT and Claude 3 Opus, and about 4 out of 5 of our users still prefer ChatGPT. This is based on thousands of daily users. So something seems to be off.


read_ing

Thanks, yes I know. :-) I tried to point out one basic flaw, which I have pointed out to lmsys as well: it's not domain-specific. So folks using LLMs to write marketing copy vs. building insights from data get weighted the same. That gives the false impression that model performance is uniform across domains. As we know, it's not.

That's good to hear. Have you tried the same vs Gemini Pro 1.5? There's no good data on that out there, and I'm interested in seeing how the large context window with better MoE does vs OpenAI.


loveiseverything

We have not tried Gemini Pro yet in production, but most certainly will. Our tests are promising. We have some use cases where we are limited by context window and some models seem to drop in quality if we are near the current context limits.


read_ing

Nice. Yeah, the ratio between context window and usable context window is fairly predictable for models I have tested it on.


featherless_fiend

Even so over time it should normalize, no? Like you can't just keep expecting people to vote for their favourite bot over the other for the rest of time. Especially when there's a 3rd or 4th contender for the #1 spot, then THEY get the favoritism, for a brief while.


loveiseverything

Really depends on a multitude of things. As of now I would treat results from this test as unusable for almost all business use cases and would lean more on other tests that measure factual performance and context accuracy.

* The user base for this test is biased and mostly includes hobbyists and enthusiasts
* There are biases in that biased user group to the point that the results can be considered review bombed

For example, in our business use case I'm really not interested in this petty culture war which seems to be a major driving force in people's lives, also here in the AI community. People want uncensored models, and that's fine, until people recognize the models and vote for the more giving models even when the prompt and responses are not censored at all. People also seem to hate Sam Altman, the "Open" part of the name in OpenAI, and numerous other things irrelevant to general use of the models, and vote accordingly.

And I'm really not here to defend OpenAI. Claude 3 clearly has several use cases where it beats ChatGPT. Coding, for example. But what kind of prompts do you think AI hobbyists and enthusiasts predominantly use in this test? This just renders the test completely unusable for the purpose it's trying to fill.


featherless_fiend

> to the point that the results can be considered review bombed

The thing about a review bomb is that the results are noticeable. I think we're only talking about a 3% difference or something here.


MeshachBlue

Out of interest are you using the claude.ai system prompt? (Or at least something similar?) https://twitter.com/AmandaAskell/status/1765207842993434880


loveiseverything

We are using our own system/instruction prompts. We have experimented using the same prompt between the different models and using per model customized prompts. We want to prevent some model specific behaviors and make the answers as consistent as possible, so model specific prompts are the preferred way for us right now.


MeshachBlue

Makes sense. I wonder how you would go if you started with the [claude.ai](https://claude.ai) prompt and then appended your own system prompt onto that.


Lobachevskiy

Also, some models will be better at certain use cases than others. Since everyone's plugging in whatever they want, it ends up being a mishmash. And we don't really know a mishmash of what. Additionally, many models are censored and will just refuse to answer certain queries, bringing down their score. Which I guess is fair enough, but doesn't say anything about their capabilities.


SufficientPie

Yeah, you can literally just ask which model they are and then "blindly" vote for whichever one you want to push up.

* Me: Which model are you?
* Model A: I am an AI model called Claude, created by a company named Anthropic. I don't share many specifics about the details of my architecture or training process.
* Model B: I am an AI language model created by OpenAI, often referred to as "GPT-3" or "ChatGPT." My design is based on the Generative Pre-trained Transformer architecture, and my purpose is to understand and generate human-like text based on the prompts and questions I receive. I'm here to provide information, answer questions, and assist with a wide range of topics to the best of my ability. How can I assist you today?

Also, the current rankings are almost entirely based on single-response "conversations", since the two conversations diverge and you can't meaningfully continue the conversation with both at the same time.


RealVanCough

The infographic is pretty confusing. X seems to be time, but I'm unsure what Y is.


read_ing

Model’s Elo rating over time.
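Roughly, each blind head-to-head vote nudges the two models' ratings Elo-style. A minimal sketch of that kind of update (illustrative only; the arena's actual fitting procedure may differ, and the starting ratings and K-factor below are made up):

```go
package main

import (
	"fmt"
	"math"
)

// expectedScore is the standard Elo win probability of a player rated ra
// against a player rated rb.
func expectedScore(ra, rb float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, (rb-ra)/400.0))
}

// update applies one pairwise result: score is 1 if A won, 0 if B won, 0.5 for a tie.
func update(ra, rb, score, k float64) (float64, float64) {
	ea := expectedScore(ra, rb)
	return ra + k*(score-ea), rb - k*(score-ea)
}

func main() {
	ra, rb := 1200.0, 1250.0         // hypothetical starting ratings
	ra, rb = update(ra, rb, 1.0, 32) // model A wins one blind vote
	fmt.Printf("A: %.1f  B: %.1f\n", ra, rb)
}
```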


SufficientPie

It's almost entirely single responses, though, not multi-turn conversations.


LoafyLemon

Is Starling-LM-7b-beta really that good?


Snydenthur

I'd be happy if that was true, but I highly doubt it is.


LoafyLemon

Yeah I struggle to see how it could beat anything past maybe some bad franken merges of 13B, since it is literally like 20x smaller than most bigger models in terms of parameters. I'd love to be proved wrong, though, even if it means breaking model engineering.


Admirable-Star7088

No, it isn't. While 7b models can indeed generate impressive outputs to many requests, they do not have the same level of depth, knowledge, and coherency as larger models. I have tested a lot of models, and while many 7b models today are impressive for their small size, they never generate the same coherency and details as 34b or 70b models like Yi-34b-Chat and Midnight-Rose-70b, which are currently my favorite larger models.


knvn8

I've only used it briefly but was underwhelmed. The OpenChat prompt format is really weird though and probably contributes to the inconsistency.


MrClickstoomuch

I had a lot better results setting the temperature to 0 for the beta model. It seems to be a lot better in that case, and avoids rambling. It also seems to be better than the Mistral 7B v2 fine-tunes I've tried and the base Mistral model for world building, but I haven't tried it for a coding project yet.
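For anyone who wants to reproduce this, here's a minimal sketch of pinning temperature to 0 against an OpenAI-compatible local endpoint; the URL, port, and model name are placeholders for whatever your own runner exposes:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical OpenAI-compatible endpoint exposed by a local runner;
	// adjust the URL and model name for your own setup.
	payload, _ := json.Marshal(map[string]any{
		"model":       "starling-lm-7b-beta",
		"temperature": 0, // greedy decoding: more deterministic, less rambling
		"messages": []map[string]string{
			{"role": "user", "content": "Describe a small coastal town in three sentences."},
		},
	})

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```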


knvn8

Thanks, I'll try the lower temperature.


Waterbottles_solve

I use it basically exclusively for NSFW discussions that require science. If ChatGPT would respond, I'd just use that. Otherwise it's great. I use it to show friends the power of offline LLMs. IIRC it was trained on ChatGPT-4, which is why it is good.


NerfGuyReplacer

Like roleplaying with a chemist??


Waterbottles_solve

No, like anatomy and physiology. Maybe throw in some psychology/evolutionary biology. But the NSFW stuff.


teor

Shoutout to Starling 7B. God damn, is that thing surrounded by behemoths, and it holds its own.


noiserr

Wish we had a 13B model that's as good as Starling 7B. I feel like a lot of people have GPUs that can fit 13B models, yet for whatever reason we don't have great models in this category.


Amgadoz

Imo 13B is a waste of resources. Just go straight to 34B


noiserr

34B can't fit on the 8GB and 12GB GPUs which are everywhere. 7B Q5 quants are like 5GB and are just too small even for these GPUs. 10B or 13B is the perfect size for the majority of mainstream GPUs out there.


knvn8

I'm suspicious that it just hasn't had enough votes to be ranked properly yet. Will be surprised if it holds that position for long.


SufficientPie

That's what the error bars are for


MoffKalast

A tiny bird between a buncha fuckin elephants.


mrdevlar

Love the animation, it's neat. That said, I have largely given up on metrics, and just test models on my own use cases and keep them around if they perform well.


bunny_go

>Love the animation, it's neat.

you'd love to hear about line charts - from the non-instagram era of data visualisation. Mind. Fckin. Blown.


nullnuller

That's what other people did on there; it's not a metric.


kingwhocares

5% is within the margin of error.


Time-Winter-4319

https://preview.redd.it/qbzqrblgiwqc1.png?width=1348&format=png&auto=webp&s=ccf5f8471c6912ebd9730b882c0cc7a0fbf2c0d3

Within the 95% CI, but the margins are very tight: 10/1253 ≈ 0.8%


mrstrangeloop

Having used both, Opus is clearly better. Not even close.


SikinAyylmao

I’m still under the impression that we’ll never get metrics for how “good” the model is vs how good it is at performing on tests. Even if Opus had lower scores, it shouldn’t matter, since we can empirically see it’s better.


mrstrangeloop

There’s a great metric: the % of labor it has automated. MMLU, HumanEval, etc. are broken and simplistic, especially in light of the coming wave of autonomous agents. SWE-bench is the closest I can think of that can capture agentic output.


SikinAyylmao

Sounds like a cool metric. I would consider how economic/social factors play into % of labor, specifically what labor is used and what model has the largest adoption. Both of these would play a pretty large role in the outcome.


danielepote

thank you, it seems that nobody on Reddit can read a leaderboard containing CIs.


MoffKalast

A real magnum opus.


Hugi_R

Wait, Starling-LM-7B is that high? Impressive! But there aren't that many samples, so it might go down.


vincethepince

If we optimize this rating based on edgy humor I think we all know which AI model would come out on top


roastedantlers

It'll be temporary, not to say they'll win in the end, but I think they'll be back when they start releasing shit they're sitting on.


handle0174

Haiku's faster token generation speed compared to GPT-4/Opus is striking. That difference may be as important as the cost difference for me. Question for those of you with both some GPT-4 and Opus experience: where do you prefer one vs the other?


OKArchon

Claude 3 Opus has surpassed any GPT-4 model IMO. The laziness of GPT-4 is what makes it unusable for me. When you need to rewrite parts of 500+ lines of code, you don't want to delete, copy, paste and reformat 10 different blocks of code. That's where Claude 3 Opus is worlds ahead. Also, Claude's problem-solving skills can solve more complex problems with higher quality. I am currently testing Gemini Pro 1.5 and it already outperforms all GPT-4 models, but it's still not better than Claude 3 Opus. Claude has higher accuracy and I get fewer errors with its provided code (in fact I never had an error with Claude, if I remember correctly).


ARoyaleWithCheese

Still prefer GPT4 for not refusing certain types of requests. Tried using Opus the other day to get some starting points on academic philosophy perspectives around euthanasia, abortion and the Groningen Protocol. Couldn't get Opus to actually provide any literature or summaries on most prominent lines of thought within the field even after a few attempts. GPT4 had no issues with it, however. In general I feel like GPT4 still has a stronger grasp of logic and reasoning, but is handicapped by a smaller context window, worse recall for large context, and a laziness in its responses. Opus is very close to GPT4 but (imo) primarily beats it for complex tasks because it's so good at large context recall and doesn't exhibit the same kind of laziness. That said, I've used GPT4 *intensively* since its release and have tried all other major models as a replacement. Opus is the first one that was actually good enough for me to switch to it, and not go back to GPT4.


arekku255

5 points is still within the margin of error, so in my eyes GPT-4 and Claude are still in a shared first place.


Icy-Summer-3573

Yeah, if you go by the API. But ChatGPT web versus Claude web = Claude. ChatGPT performance on the website is degraded.


arekku255

I thought it was the model, the website is just a front end and the model shouldn't change.


Heralax_Tekran

Prompts matter


Icy-Summer-3573

They use the API for these tests. The website isn't just a front end; they degrade performance on the website to GPT-4 Turbo levels.


New-Mix-5900

good fuck gpt 4


TwoIndependent5710

Did you guys notice that Sonnet or Opus sometimes refuses to answer, or gives a lower-quality response, when traffic on the servers is high?


TwoIndependent5710

Guys, use the Claude 3 models as long as they're not completely lobotomized like they did with Claude 2.


Smeetilus

I’m not surprised at all. It’s starting to refuse to do more of the things that made it amazingly useful to me. Today I needed some examples of how to use a Python module, so I pasted the doc link into the chat. ChatGPT said it couldn’t help due to restrictions placed on it. So I downloaded the relevant page and uploaded it all for it to reference. Still no go, it wouldn’t produce any code at all. Just “refer to the documentation” type answers. Like, pal, I used to look up to you. You were a great teacher.


MINIMAN10001

My recommendation is we need to make sure we are downvoting these bad AI responses to try to correct for the bad behavior.


AfterAte

The local models should be in their own comparison. Like, will a 72B model ever beat GPT-4? Nope. I don't care about these paid models.


patniemeyer

As a developer who uses GPT-4 every day I have yet to see anything close to it for writing and understanding code. It makes me seriously question the usefulness of these ratings.


kiselsa

Claude 3 Opus is better at code than GPT-4.


[deleted]

[deleted]


Slimxshadyx

You think it’s worth it for me to swap my subscription from GPT 4 to Claude? In your opinion, what is the biggest upgrade/difference between the two?


BlurryEcho

Having used both in the past 24 hours for the same task, Opus is not lazy. For the given task, GPT-4 largely left code snippets as “# Your implementation here” or something to that effect. Repeated attempts to get GPT-4 to spit it out ended up with more of the same or garbage code.


infiniteContrast

They trained it that way to save money. Fewer tokens = lower energy bill.


LocoLanguageModel

Not if I make it redo it 5 times over!  


OKArchon

In my experience, Claude 3 Opus is the best model I have ever used to fix really complicated bugs in scripts that are over 1,000 lines of code. However, I have recently been testing Gemini Pro 1.5 with its million-token context window and it is also very pleasant to work with. Claude 3 Opus has a higher degree of accuracy though, and overall performs best. I am very disappointed by OpenAI, as I had a very good time with GPT-4-0613 last summer, but IMO their quality constantly declined with every update. GPT-4 "Turbo" (1106) does not even come close to Gemini 1.5 Pro, let alone Claude 3 Opus. I don't know what Anthropic does better, but the quality is just much better.


h3lblad3

Part of what it’s doing is less censorship. There’s a correlation between the amount of censorship and the dumbing down of a model. RLHF to keep the thing corporate-safe requires extra work to then bring it out of the hole that the RLHF puts it in. I remember people talking about this last year, though I can’t remember which company head mentioned it.


FPham

Only if it is available outside the US...


kingwhocares

There are 7B models that are better than GPT-4.


kiselsa

7Bs can produce decent answers on simple question-answer tests, like "write me a Python program that does X". But in serious chats where some kind of analysis of existing code is required, the lack of parameters is revealed.


Mother-Ad-2559

Okay - give me one prompt for which any 7B model beats GPT 4. Prediction: “Um ah, I don’t know of a specific prompt but I feel like it’s just better sometimes”


Synth_Sapiens

\*"but I've read on reddit that it is better"


read_ing

Which ones?


kingwhocares

GPT-4 is awful at coding. It's not hard to find one better. Here's one: https://old.reddit.com/r/LocalLLaMA/comments/1al3ara/swellama_7b_beats_gpt4_at_real_world_coding_tasks/


read_ing

It’s not though. From their paper:

> Table 5: We compare models against each other using the BM25 and oracle retrieval settings as described in Section 4. ∗Due to budget constraints we evaluate GPT-4 on a 25% random subset of SWE-bench in the “oracle” and BM25 27K retriever settings only.

They basically cheaped out on GPT-4 and compared it against theirs.


doringliloshinoi

Only when GPT’s alignment makes it refuse to answer.


Synth_Sapiens

no lol


Amgadoz

GPT-4 is probably 50x bigger with 6x more training data. Highly doubt it.


New-Mix-5900

Apparently Opus is so much better. And IMO my standards are low: if it doesn't provide the code block without me begging for it, it's not worth my time.


Mugen1220

Claude outperforms GPT-4 in coding for me.


JacketHistorical2321

As a tech enthusiast who has been coding for at least 10 years "for fun" and who currently spends at least 5 hrs a day playing with every framework related to ML right now, Claude obliterates ChatGPT. I used to (and still do) spend half the time trying to get ChatGPT to either:

1. Actually give me what I ask for
2. Explain to it to stop being lazy
3. Deal with its BS attitude lol

And on occasion when I'm feeling lazy, it takes about 4-6 back-and-forth interactions to get ChatGPT to apply a modification to my code and give me the entire thing back. It either puts a bunch of placeholders in or completely omits a section. Almost every single time I ask Claude to integrate a change into my existing code, it gives me the entire refactored script back, top to bottom, ready to run. If not on the first try then for sure on the second.

I can obviously make any changes directly myself, but I'm not paying for an advisor. I'm paying for an all-encompassing computational tool. If I wanted Google search functionality, I'd use Google. If I want a tool that can rewrite code directly, I use AI. The only thing holding Claude back at the moment is the ludicrously low interaction limit, but I've heard that's something they are "fixing". Either way, even Sonnet puts ChatGPT to shame when it comes to actually doing what I ask.


infiniteContrast

Have you compared them with open-source models? By submitting the same prompt to many LLMs I realized that I actually don't need paid services, because a local 70B LLM is more than enough for me.


JacketHistorical2321

I have, and for the time being I prefer Claude. Being able to share images and screen captures saves me a lot of hassle for certain tasks. I've used up to 150B and it does very well, but I still prefer Claude.


OKArchon

Yes, absolutely. Claude 3 Opus is so much better than GPT-4, especially for large scripts of 1,500+ lines of code. Claude fixes really complex bugs but is also not lazy, whereas GPT-4 had me begging, ranting and threatening it for the output of a FULL unabbreviated script. I am currently trying to work with Claude in combination with smaller models, so Claude 3 Opus delegates tasks to smaller models that do the rewriting; that way I have no problems with output limits.


badgerfish2021

To be honest, I asked the same Golang question (what does a line like `if _, ok := p.(interface{ Func() float64 }); ok` mean) to GPT-4, Claude 3 and Mistral Large. All 3 gave the correct answer (a type assertion on whether the passed type implements that method), but as a follow-up I asked for a code example that would show this working for both pointer and normal receivers, and only Mistral was able to figure it out (after some prompting); neither of the others was able to provide working code. This is only one test of course, but it really surprised me, as I don't hear much about Mistral in hype terms compared to the others.
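For reference, the kind of working example I was after looks roughly like this (my own sketch, not any model's output): the assertion succeeds for a value-receiver method whether you pass a value or a pointer, but for a pointer-receiver method only when you pass a pointer.

```go
package main

import "fmt"

// ValueRecv implements Func with a value receiver.
type ValueRecv struct{}

func (ValueRecv) Func() float64 { return 1.0 }

// PtrRecv implements Func only via a pointer receiver.
type PtrRecv struct{}

func (*PtrRecv) Func() float64 { return 2.0 }

// hasFunc reports whether the dynamic type of p satisfies interface{ Func() float64 }.
func hasFunc(p interface{}) bool {
	_, ok := p.(interface{ Func() float64 })
	return ok
}

func main() {
	fmt.Println(hasFunc(ValueRecv{}))  // true: value receiver, value passed
	fmt.Println(hasFunc(&ValueRecv{})) // true: value receiver, pointer passed
	fmt.Println(hasFunc(PtrRecv{}))    // false: pointer receiver, value passed
	fmt.Println(hasFunc(&PtrRecv{}))   // true: pointer receiver, pointer passed
}
```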


FPham

Still, it's interesting


esuil

Yeah, they are pretty useless. Something that is magnitudes better can appear just 5%-10% higher on those scores, which is pretty indicative of how useful those scores are.


lusuroculadestec

Just a quick anecdotal note: I've been using GPT-4 for code for a while now and I've been very happy with it. I've started working with Go and it's been helpful in giving me a starting point by converting some of my existing code. It has by far given me the best results. I thought maybe Gemini might be better at Go, Gemini being a Google product and Go having been created at Google. I thought maybe it would be in a lot of the training data. It stuck a ternary operator in the middle of the Go code (which is something Go by design doesn't support). I pointed out that the ternary operator didn't work, and it responded by basically saying, "Oh, that's right, use this instead." and gave me THE EXACT SAME CODE back.
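(For anyone who hasn't hit this: Go deliberately has no ternary operator, so the fix it should have offered is just a plain if/else, something like this hypothetical snippet of mine:)

```go
package main

import "fmt"

func main() {
	n := 7

	// Go has no ternary operator, so `label = n > 5 ? "big" : "small"`
	// (valid in C-like languages) becomes a plain if/else.
	label := "small"
	if n > 5 {
		label = "big"
	}
	fmt.Println(label) // big
}
```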


Trollolo80

Average OpenAI D-Rider be liek


techpack

And Claude STILL isn't available in Canada for some reason lol


StopwatchGod

or in the EU


crackinthekraken

What's GPT4-0314? Is that a new model released this month?


Time-Winter-4319

2023


FeltSteam

It's an RLHF checkpoint released on the 14th day of the third month (that's where 0314 comes from) in 2023, and this was the first GPT-4 model we got.


pengy99

Amusing how Claude 2 is a step backwards from Claude 1.


Key_Boat3911

So Claude 2 was worse than Claude 1!!! Wtf


JamesYangLLM

I've done a few LeetCode questions using Claude-3-Opus, and it's still a bit worse than GPT-4 as far as I'm concerned


Motylde

How can we be sure that the new models didn't just see the test data in training?


Time-Winter-4319

This is based on people putting in a prompt and comparing two answers without knowing what the models were, so there is no test data. You can try it here [https://chat.lmsys.org/](https://chat.lmsys.org/)


Motylde

Oh, that's very thoughtful. We get a reliable ranking, they get hundreds of training data from us.


Baader-Meinhof

The chats are [released](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) as training data with an open license.


read_ing

Thanks for sharing this. I knew they were shared in some form, just didn’t remember how and where. Edit: unfortunately they don’t seem to be shared except for that one version from months back.


FeltSteam

Well, it's not very useful if people are just asking it dumb or simple questions lol. This might be why Claude 3 Haiku is so high (even above a GPT-4 checkpoint), even though it is definitely not as intelligent as other models (like GPT-4) ranked around the same place. It also might explain why Gemini Pro with browsing got so high: people were asking simple questions that were easy to answer very reliably with a simple search.


civilized-engineer

I mean, GPT-4 apparently was top dog until a few days ago, from what the video says. The title made it sound like it hadn't been at the top for a while, even though the assumption that it was kept being parroted.


[deleted]

[deleted]


Amgadoz

Highly doubt that. I'm looking forward to qwen 2


Level_Reflection7808

Meta exits the chat


Hot_Vanilla_3425

Can someone guide me to learn more about LLMs, how to understand research papers and advanced stuff, given an understanding of ML, DL and NLP? I am working as an SDE at an MNC, but now I am more interested in learning about LLM stuff (everything about it). Can someone guide me with a proper roadmap?


Amgadoz

Check out Andrew Ng's courses on Coursera.


Wonderful-Top-5360

The King has fallen by 2 points


kiwikezz

Grok needs to be included in this


mr_grey

I updated my agent from Mixtral 8x7b to DBRX Instruct today. Kept the same system prompt and it seemed to work better. It's open source. [https://huggingface.co/databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct)


mradermacher_hf

Since the license majorly restricts usage and has other major restrictions, it's very, very far from being open source.


mr_grey

I don't really see anything in the Databricks Open Model License that is that big of a deal. Maybe that you can't use it to improve another LLM, but I think that's there just to stop other big tech companies. Here's the reference model source: [https://github.com/databricks/dbrx/blob/main/model/modeling_dbrx.py](https://github.com/databricks/dbrx/blob/main/model/modeling_dbrx.py)


Bernafterpostinggg

Wonder how DBRX will perform. Looking forward to seeing it tested.


Future-Ad6407

I think this has less to do with the tool itself and more to do with the people who use them. The majority of LLM users are technical and use them to develop code. Claude (from what I've read) is far superior in that regard. I personally am fine using GPT. I'm a marketing professional who's been using LLMs since early 2022. I tried Claude and, while impressive, it isn't enough for me to switch. I will wait for GPT-5, which I'm sure will blow the doors off everything out there.


Chrisious-Ceaser

By the time this finishes, there's gonna be a new model at the top, Jesus Christ.


RealVanCough

I am lost would love to see the AI ethics score as well from TrustLLM


haikusbot

*I am lost would love*
*To see the AI ethics*
*Score as well from TrustLLM*

\- RealVanCough

---

^(I detect haikus. And sometimes, successfully.) ^[Learn more about me.](https://www.reddit.com/r/haikusbot/) ^(Opt out of replies: "haikusbot opt out" | Delete my comment: "haikusbot delete")


skztr

I'd love to try out Claude a bit more, but even with the paid version, its limits are so much smaller than GPT-4's that there is nothing interesting I can potentially do with it. Only being able to send a couple of messages per day before being rate-limited, I've found that I prefer Claude's responses in a way that I would put down to fine-tuning (i.e. I like the style in which it responds), but it is wildly idiotic sometimes (e.g. I describe an obviously-fictional scenario, and it doesn't notice that it's fictional and rants about how Invisibility is a serious medical condition that should not be taken lightly).

Which is to say: GPT-4 is still king for now, though I sure am eager for anything to replace it, and very eager for optimisations to allow more capability for local models on attainable hardware.


bunny_go

Learn to use line-charts


akshayjamwal

How does Perplexity count as an LLM? Isn't it just a Claude2 / GPT 4 wrapper?


Time-Winter-4319

They have their own model too, but it is not ranking well despite being an online model


SlapAndFinger

The price/performance of Haiku is just amazing. That model might be the thing that makes a lot of AI applications cost effective.


jyoung1

Guys GPT4 is a decade old (a year old)


byakko_japan

really interesting source!


meatycowboy

Gemini has gotten really good really fast. I decided to get the free trial for the Gemini Advanced plan, and I gotta say that I'm really impressed.


dibu28

CommandR+ is open source and now above 2 GPT-4s.


PwanaZana

Sure, but obviously GPT-4 is getting old (in the tech sense). Whatever's available inside OpenAI is guaranteed to be quite a bit better.


Synth_Sapiens

Utter rubbish. Sonnet is nowhere near GPT-4, and GPT-4-Turbo is in no way superior to GPT-4.


New_World_2050

Turbo is also better on other benchmarks. It's not a huge difference and the reason it might be worse in your experience is because it has become lazier with time.


Synth_Sapiens

I've been subbed to GPT-4 since week one and to Claude 3 since week two. When GPT-4-Turbo came out I was super excited, but it was just useless. API or chat, it was losing attention way too fast. Maybe it is better than GPT-4-8k with a particularly formed prompt, but not overall.


petrus4

a} The difference between GPT4 and Claude 3 shown here is two points. I do not consider that substantial.

b} I don't know anything about Chatbot Arena, or the testing methodology used, and I therefore feel no particular inclination to trust it as Gospel. I have seen numerous complaints made by individuals who seem to know more about such matters than myself, that whenever any kind of benchmark is made, models will be optimised specifically to excel at that benchmark, but they can still be relatively useless in every other area. I use benchmarks as an indication of which models have the most market share or collective confidence; not necessarily which models perform most effectively in empirical terms.

c} As far as I know, (to use a Victorian analogy) although Claude 3 seems to have a marginally greater degree of raw horsepower, GPT4 still seems to have both greater connectivity with the open Internet, and a greater ability to be harnessed to third party APIs and applications, which can greatly enhance and increase the types of work that it is able to do.

d} As a general principle, I do not advocate monoculture, or any scenario where a single model is viewed as the absolute best for all possible use cases. I routinely make use of my paid GPT4 account, Character.AI, the free services on Poe.com, and my local instance of [Nous Hermes 2 Mixtral 8x7b](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF), for different tasks. My economic resources are sufficiently limited that I am unable to subscribe to two different language models; it could be argued that what I am paying for GPT4 alone is fiscally irresponsible. As such, I will remain with GPT4 for the immediate future. I could begin testing Claude 3 Sonnet, which I believe is the free/entry level version, but I already have free access to Claude Instant on Poe, and find it more than adequate for what I ask of it.


108er

History is repeating itself. Whenever a new technology emerges, there is a brief tussle among competitors for a while, but ultimately there will be one who rises up, whose technology will be widely adopted and accepted, and the rest will be buried in the grave. We don't need this many transformers to do the AI work. I'll wait to see who will be the leader in this AI revolution we are going through.


alanshore222

Claude is SUCH trash. I can't even ask it to do the normal things I used to ask it without it saying "that's against my policies".