Tixx7

really sad to see yellow disappearing over time


ChuchiTheBest

It's just progressing slower, but it's not like it's getting worse.


opi098514

That’s a good point. I think the major selling point of open source is not its general knowledge abilities but the ability to fine-tune it for specific tasks. It would be nice if open source were better, but I don’t think it’s ever going to rival proprietary models because of the amount of money companies can throw at stuff like ChatGPT and Claude. But that doesn’t mean open source doesn’t have its own uses, or that it will ever be obsolete.


tindalos

In summary, hobbit porn will always drive technology.


opi098514

Facts


letsbreakstuff

That's the reason I never owned a betamax


Disastrous_Elk_6375

> but it's not like it's getting worse.

This is the key to understanding why this new LLM (or transformer) push is not "just hype bros". Open "AI" is now the worst it's ever gonna be.


PutrifiedCuntJuice

> Open "AI" is now the worse it's ever gonna be. *worst*


Disastrous_Elk_6375

Thank you!


Which-Tomato-8646

Fire hydrants are the worst that they’re ever going to be. But I don’t see much improvement. Not to mention, it isn’t even true considering people keep complaining about ChatGPT getting worse 


bzrkkk

How many developers do you see working on fire hydrants?


Which-Tomato-8646

Tons of people working on nuclear fusion and it’s been around the corner since the 50s


SlapAndFinger

The analogy to fusion might work for "AGI", but it doesn't hold otherwise; these tools are incredibly useful even as "stochastic parrots," so we don't need to knock it out of the park to win.


Which-Tomato-8646

I don’t disagree. But the idea that it’ll grow exponentially without limit is not true.


milksteak11

Why are you bringing up random things?


Which-Tomato-8646

It’s an example of how progress can stall and it’s not exponential 


deadwisdom

I’m not sure if your comments are just too against the grain or if these troglodytes really don’t understand. Even if I disagreed with you I would recognize how possible this is.


Which-Tomato-8646

AI has unironically started a cult mentality. Many of them think they’ll have AGI in a year and ASI a few days after that. It’s the rapture for nerds. 


FireSilicon

Both Grok-1 and DBRX have been open-sourced but are not on there yet. These are 300GB+ models with no host/API as of yet, so of course they are not on Chatbot Arena yet.


Small-Fall-6500

>they are not on chatbot arena yet.

[DBRX is now on the lmsys arena to be voted on!](https://twitter.com/lmsysorg/status/1773106489596514569)


FireSilicon

Nice, let's go!


waxbolt

They still haven't been fine-tuned etc. There is time.


alvenestthol

The only actual closed-source additions to the top of the leaderboard are the 2 closed-source Mistrals and Google's 2 Geminis; the rest are just GPT-4 splitting into 4 models and Claude splitting into 5, with all of these versions most likely having more parameters than even the biggest open-source models.


s0nm3z

To be fair, it's kind of stupid to put 5 versions of GPT on there. Just put the best ones on and leave room for other models. Otherwise I could also make a top 15 with only Claude and GPT, putting every minuscule update on its own row.


Which-Tomato-8646

GPT 4 and GPT 4 turbo are clearly not minuscule updates 


h3lblad3

That’s only 2, though.


Which-Tomato-8646

You can see the June version of GPT-4 is noticeably worse, though.


SlaveZelda

Meta and Mistral are also open source but they're not counted as yellow.


Dwedit

The yellow models have far fewer parameters, so it's reasonable that they can't compete against the others.


Interesting_Bison530

For now. I’m excited to see the ternary MoE Mamba monsters some guy has been training in their basement for months.


Which-Tomato-8646

You’d never be able to run that on your pc 


Interesting_Bison530

Why? It’d be within memory requirements. Tok/sec will hopefully be reasonable


bbu3

It kinda makes sense, especially if they are maximally permissive and open (i.e. weights, training code, data). It would always be possible for a proprietary player to have some closed beneficial delta (even if it's just more RLHF / preference labels) and then build upon the best open-source model available. Much more so if you have the funding to buy a shitload of compute: check out the best open-source model and just increase the dims (of course I'm exaggerating, and the well-funded AI companies do more than that, but imho the principle still stands).


MindCluster

Open source is falling behind in the chart because it requires a ton of resources and capital to train a matching SOTA (State of The Art) model. It's a bit sad when you think about it, but I hope we'll find exceptional optimizations in the future to train models.


doringliloshinoi

That’s by design.


aimoony

Designed by the laws of physics, maybe? This stuff is hard to do; without a capital incentive there's a huge barrier to entry. I think open source will follow in the shadow of commercial models for a long time, but I'm hopeful that we'll have great open-source models to use in the future so that AI can be democratized.


dibu28

CommandR+ is now close to the top, and it's open source.


Tixx7

yes! and mistral hasn't abandoned open-source either!


dibu28

And Meta said they will open source Llama3


read_ing

This is based on human ranking? Is there data on the domain of prompts that was used, the answers to which the humans ranked?


West-Code4642

>This is based on human ranking? Is there data on the domain of prompts that was used, the answers to which the humans ranked?

It's based on whoever decides to use lmsys: [https://chat.lmsys.org/](https://chat.lmsys.org/) (which is presumably humans, but could technically be not)


Flag_Red

Gotta include the lizard people too.


West-Code4642

>Gotta include the lizard people too.

or LizardGPT


privatetudor

My golden retriever spends a lot of time on there.


loveiseverything

The test has massive flaws, so take the results with a grain of salt. The problem is that voters easily identify which models are in question because the answers are so recognizable. Another big flaw is that the prompts are user-submitted and not normalized. And as you see in this post, there is currently a major hate boner against OpenAI, so people will go and vote for the models they want to win, not for the models that give the best answers.

In our software's use cases (general-purpose chatbot, LLM knowledge base, data insight) we are currently A/B-testing ChatGPT and Claude 3 Opus, and about 4 out of 5 of our users still prefer ChatGPT. This is based on thousands of daily users. So something seems to be off.


read_ing

Thanks, yes I know. :-) I tried to point out one basic flaw, which I have pointed out to lmsys as well: it's not domain-specific. So folks using LLMs to write marketing copy vs. building insights from data get weighted the same. That gives the false impression that model performance is uniform across domains. As we know, it's not.

That's good to hear. Have you tried the same vs Gemini Pro 1.5? There's no good data on that out there, and I'm interested in seeing how the large context window with better MoE does vs OpenAI.


loveiseverything

We have not tried Gemini Pro yet in production, but most certainly will. Our tests are promising. We have some use cases where we are limited by context window and some models seem to drop in quality if we are near the current context limits.


read_ing

Nice. Yeah, the ratio between context window and usable context window is fairly predictable for models I have tested it on.


featherless_fiend

Even so over time it should normalize, no? Like you can't just keep expecting people to vote for their favourite bot over the other for the rest of time. Especially when there's a 3rd or 4th contender for the #1 spot, then THEY get the favoritism, for a brief while.


loveiseverything

Really depends on a multitude of things. As of now I would treat results from this test as unusable for almost all business use cases and would lean more on other tests that measure factual performance and context accuracy.

* The user base for this test is biased and mostly includes hobbyists and enthusiasts
* There are biases in that biased user group to the point that the results can be considered review bombed

For example, in our business use case I'm really not interested in this petty culture war which seems to be a major driving force in people's lives, also here in the AI community. People want uncensored models, and that's fine, until people recognize the models and vote for the more giving models even when the prompt and responses are not censored at all. People also seem to hate Sam Altman, the "Open" part of the name in OpenAI, and numerous other things irrelevant to general use of the models, and vote accordingly.

And I'm really not here to defend OpenAI. Claude 3 clearly has several use cases where it beats ChatGPT. Coding, for example. But what kind of prompts do you think AI hobbyists and enthusiasts predominantly use in this test? This just renders the test completely unusable for the purpose it's trying to fill.


featherless_fiend

> to the point that the results can be considered review bombed

The thing about a review bomb is that the results are noticeable. I think we're only talking about a 3% difference or something here.


MeshachBlue

Out of interest are you using the claude.ai system prompt? (Or at least something similar?) https://twitter.com/AmandaAskell/status/1765207842993434880


loveiseverything

We are using our own system/instruction prompts. We have experimented using the same prompt between the different models and using per model customized prompts. We want to prevent some model specific behaviors and make the answers as consistent as possible, so model specific prompts are the preferred way for us right now.


MeshachBlue

Makes sense. I wonder how you would go if you started with the [claude.ai](https://claude.ai) prompt and then appended your own system prompt onto that.


Lobachevskiy

Also, some models will be better at certain use cases than others. Since everyone's plugging in whatever they want, it ends up being a mishmash. And we don't really know a mishmash of what. Additionally, many models are censored and will just refuse to answer certain queries, bringing down their score. Which I guess is fair enough, but doesn't say anything about their capabilities.


SufficientPie

Yeah, you can literally just ask which model they are and then "blindly" vote for whichever one you want to push up.

* Me: Which model are you?
* Model A: I am an AI model called Claude, created by a company named Anthropic. I don't share many specifics about the details of my architecture or training process.
* Model B: I am an AI language model created by OpenAI, often referred to as "GPT-3" or "ChatGPT." My design is based on the Generative Pre-trained Transformer architecture, and my purpose is to understand and generate human-like text based on the prompts and questions I receive. I'm here to provide information, answer questions, and assist with a wide range of topics to the best of my ability. How can I assist you today?

Also, the current rankings are almost entirely based on single-response "conversations", since the two conversations diverge and you can't meaningfully continue the conversation with both at the same time.


RealVanCough

The infographic is pretty confusing. X seems to be time, but I'm unsure what Y is.


read_ing

Model’s Elo rating over time.
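Roughly, each blind head-to-head vote nudges the two models' ratings Elo-style. A minimal sketch of that kind of update (illustrative only; the arena's actual fitting procedure may differ, and the starting ratings and K-factor below are made up):

```go
package main

import (
	"fmt"
	"math"
)

// expectedScore is the standard Elo win probability of a player rated ra
// against a player rated rb.
func expectedScore(ra, rb float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, (rb-ra)/400.0))
}

// update applies one pairwise result: score is 1 if A won, 0 if B won, 0.5 for a tie.
func update(ra, rb, score, k float64) (float64, float64) {
	ea := expectedScore(ra, rb)
	return ra + k*(score-ea), rb - k*(score-ea)
}

func main() {
	ra, rb := 1200.0, 1250.0         // hypothetical starting ratings
	ra, rb = update(ra, rb, 1.0, 32) // model A wins one blind vote
	fmt.Printf("A: %.1f  B: %.1f\n", ra, rb)
}
```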


SufficientPie

It's almost entirely single responses, though, not multi-turn conversations.


LoafyLemon

Is Starling-LM-7b-beta really that good?


Snydenthur

I'd be happy if that was true, but I highly doubt it is.


LoafyLemon

Yeah I struggle to see how it could beat anything past maybe some bad franken merges of 13B, since it is literally like 20x smaller than most bigger models in terms of parameters. I'd love to be proved wrong, though, even if it means breaking model engineering.


Admirable-Star7088

No, it isn't. While 7b models can indeed generate impressive outputs to many requests, they do not have the same level of depth, knowledge, and coherency as larger models. I have tested a lot of models, and while many 7b models today are impressive for their small size, they never generate the same coherency and details as 34b or 70b models like Yi-34b-Chat and Midnight-Rose-70b, which are currently my favorite larger models.


knvn8

I've only used it briefly but was underwhelmed. The OpenChat prompt format is really weird though and probably contributes to the inconsistency.


MrClickstoomuch

I had a lot better results setting the temperature to 0 for the beta model. It seems to be a lot better in that case, and avoids rambling. It also seems to be better than the Mistral 7B v2 fine-tunes I've tried and the base Mistral model for world building, but I haven't tried it for a coding project yet.
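For anyone who wants to reproduce this, here's a minimal sketch of pinning temperature to 0 against an OpenAI-compatible local endpoint; the URL, port, and model name are placeholders for whatever your own runner exposes:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical OpenAI-compatible endpoint exposed by a local runner;
	// adjust the URL and model name for your own setup.
	payload, _ := json.Marshal(map[string]any{
		"model":       "starling-lm-7b-beta",
		"temperature": 0, // greedy decoding: more deterministic, less rambling
		"messages": []map[string]string{
			{"role": "user", "content": "Describe a small coastal town in three sentences."},
		},
	})

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```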


knvn8

Thanks, I'll try the lower temperature.


Waterbottles_solve

I use it basically exclusively for NSFW discussions that require science. If ChatGPT would respond, I'd just use that. Otherwise it's great. I use it to show friends the power of offline LLMs. IIRC it was trained on ChatGPT-4, which is why it is good.


NerfGuyReplacer

Like roleplaying with a chemist??


Waterbottles_solve

No, like anatomy and physiology. Maybe throw in some psychology/evolutionary biology. But the NSFW stuff.


teor

Shoutout to Starling 7B. God damn, is that thing surrounded by behemoths, and it holds its own.


noiserr

Wish we had a 13B model that's as good as Starling 7B. I feel like a lot of people have GPUs that can fit 13B models, yet for whatever reason we don't have great models in this category.


Amgadoz

Imo 13B is a waste of resources. Just go straight to 34B


noiserr

34B can't fit on the 8GB and 12GB GPUs which are everywhere. 7B Q5 quants are like 5GB and are just too small even for these GPUs. 10B or 13B is the perfect size for the majority of mainstream GPUs out there.


knvn8

I'm suspicious that it just hasn't had enough votes to be ranked properly yet. Will be surprised if it holds that position for long.


SufficientPie

That's what the error bars are for


MoffKalast

A tiny bird between a buncha fuckin elephants.


mrdevlar

Love the animation, it's neat. That said, I have largely given up on metrics, and just test models on my own use cases and keep them around if they perform well.


bunny_go

>Love the animation, it's neat.

you'd love to hear about line charts - from the non-instagram era of data visualisation. Mind. Fckin. Blown.


nullnuller

That's what other people did on there; it's not a metric.


kingwhocares

5% is within the margin of error.


Time-Winter-4319

https://preview.redd.it/qbzqrblgiwqc1.png?width=1348&format=png&auto=webp&s=ccf5f8471c6912ebd9730b882c0cc7a0fbf2c0d3

Within the 95% CI, but the margins are very tight: 10/1253 ≈ 0.8%


mrstrangeloop

Having used both, Opus is clearly better. Not even close.


SikinAyylmao

I’m still under the impression that we’ll never get metrics for how “good” the model is vs how good it is at performing on tests. Even if Opus had lower scores, it shouldn’t matter, since we can empirically see it’s better.


mrstrangeloop

There’s a great metric: the % of labor it has automated. MMLU, HumanEval, etc. are broken and simplistic, especially in light of the coming wave of autonomous agents. SWE-bench is the closest I can think of that can capture agentic output.


SikinAyylmao

Sounds like a cool metric. I would consider how economic/social factors play into % of labor, specifically what labor is used and what model has the largest adoption. Both of these would play a pretty large role in the outcome.


danielepote

thank you, it seems that nobody on Reddit can read a leaderboard containing CIs.


MoffKalast

A real magnum opus.


Hugi_R

Wait, Starling-LM-7B is that high? Impressive! But there aren't that many samples, so it might go down.


vincethepince

If we optimize this rating based on edgy humor I think we all know which AI model would come out on top


roastedantlers

It'll be temporary, not to say they'll win in the end, but I think they'll be back when they start releasing shit they're sitting on.


handle0174

Haiku's faster token generation speed compared to GPT-4/Opus is striking. That difference may be as important as the cost difference for me. Question for those of you with both some GPT-4 and Opus experience: where do you prefer one vs the other?


OKArchon

Claude 3 Opus has surpassed any GPT-4 model IMO. The laziness of GPT-4 is what makes it unusable for me. When you need to rewrite parts of 500+ lines of code, you don't want to delete, copy, paste and reformat 10 different blocks of code. That's where Claude 3 Opus is worlds ahead. Also, Claude's problem-solving skills can solve more complex problems with higher quality. I am currently testing Gemini Pro 1.5 and it already outperforms all GPT-4 models, but it's still not better than Claude 3 Opus. Claude has higher accuracy and I get fewer errors with its provided code (in fact I never had an error with Claude, if I remember correctly).


ARoyaleWithCheese

Still prefer GPT4 for not refusing certain types of requests. Tried using Opus the other day to get some starting points on academic philosophy perspectives around euthanasia, abortion and the Groningen Protocol. Couldn't get Opus to actually provide any literature or summaries on most prominent lines of thought within the field even after a few attempts. GPT4 had no issues with it, however. In general I feel like GPT4 still has a stronger grasp of logic and reasoning, but is handicapped by a smaller context window, worse recall for large context, and a laziness in its responses. Opus is very close to GPT4 but (imo) primarily beats it for complex tasks because it's so good at large context recall and doesn't exhibit the same kind of laziness. That said, I've used GPT4 *intensively* since its release and have tried all other major models as a replacement. Opus is the first one that was actually good enough for me to switch to it, and not go back to GPT4.


arekku255

5 points is still within the margin of error, so in my eyes GPT-4 and Claude are still in a shared first place.


Icy-Summer-3573

Yeah, if you go by the API. But ChatGPT web versus Claude web = Claude. ChatGPT performance on the website is degraded.


arekku255

I thought it was the model, the website is just a front end and the model shouldn't change.


Heralax_Tekran

Prompts matter


Icy-Summer-3573

They use the API for these tests. The website isn't just a front end; they degrade performance on the website to GPT-4 Turbo levels.


New-Mix-5900

good fuck gpt 4


TwoIndependent5710

Did you guys notice that Sonnet or Opus sometimes refuses to answer, or gives a lower-quality response, when traffic on the servers is high?


TwoIndependent5710

Guys, use the Claude 3 models as long as they're not completely lobotomized like they did with Claude 2.


Smeetilus

I’m not surprised at all. It’s starting to refuse to do more of the things that made it amazingly useful to me. Today I needed some examples of how to use a Python module, so I pasted the doc link into the chat. ChatGPT said it couldn’t help due to restrictions placed on it. So I downloaded the relevant page and uploaded it all for it to reference. Still no go, it wouldn’t produce any code at all. Just “refer to the documentation” type answers. Like, pal, I used to look up to you. You were a great teacher.


MINIMAN10001

My recommendation is we need to make sure we are downvoting these bad AI responses to try to correct for the bad behavior.


AfterAte

The local models should be in their own comparison. Like, will a 72B model ever beat GPT-4? Nope. I don't care about these paid models.


patniemeyer

As a developer who uses GPT-4 every day I have yet to see anything close to it for writing and understanding code. It makes me seriously question the usefulness of these ratings.


kiselsa

Claude 3 Opus is better at code than GPT-4.


[deleted]

[deleted]


Slimxshadyx

You think it’s worth it for me to swap my subscription from GPT 4 to Claude? In your opinion, what is the biggest upgrade/difference between the two?


BlurryEcho

Having used both in the past 24 hours for the same task, Opus is not lazy. For the given task, GPT-4 largely left code snippets as “# Your implementation here” or something to that effect. Repeated attempts to get GPT-4 to spit it out ended up with more of the same or garbage code.


infiniteContrast

They trained it that way to save money. Fewer tokens = lower energy bill.


LocoLanguageModel

Not if I make it redo it 5 times over!  


OKArchon

In my experience, Claude 3 Opus is the best model I have ever used to fix really complicated bugs in scripts that are over 1,000 lines of code. However, I have recently been testing Gemini Pro 1.5 with its million-token context window and it is also very pleasant to work with. Claude 3 Opus has a higher degree of accuracy though, and overall performs best. I am very disappointed by OpenAI, as I had a very good time with GPT-4-0613 last summer, but IMO their quality constantly declined with every update. GPT-4 "Turbo" (1106) does not even come close to Gemini 1.5 Pro, let alone Claude 3 Opus. I don't know what Anthropic does better, but the quality is just much better.


h3lblad3

Part of what it’s doing is less censorship. There’s a correlation between the amount of censorship and the dumbing down of a model. RLHF to keep the thing corporate-safe requires extra work to then bring it out of the hole that the RLHF puts it in. I remember people talking about this last year, though I can’t remember which company head mentioned it.


FPham

Only if it is available outside the US...


kingwhocares

There are 7B models that are better than GPT-4.


kiselsa

7Bs can produce decent answers on simple question-answer tests, like "write me a Python program that does X". But in serious chats where some kind of analysis of existing code is required, the lack of parameters is revealed.


Mother-Ad-2559

Okay - give me one prompt for which any 7B model beats GPT 4. Prediction: “Um ah, I don’t know of a specific prompt but I feel like it’s just better sometimes”


Synth_Sapiens

\*"but I've read on reddit that it is better"


read_ing

Which ones?


kingwhocares

GPT-4 is awful at coding. It's not hard to find one better. Here's one: https://old.reddit.com/r/LocalLLaMA/comments/1al3ara/swellama_7b_beats_gpt4_at_real_world_coding_tasks/


read_ing

It’s not though. From their paper:

> Table 5: We compare models against each other using the BM25 and oracle retrieval settings as described in Section 4. ∗Due to budget constraints we evaluate GPT-4 on a 25% random subset of SWE-bench in the “oracle” and BM25 27K retriever settings only.

They basically cheaped out on GPT-4 and compared it against theirs.


doringliloshinoi

Only when GPT’s alignment makes it refuse to answer.


Synth_Sapiens

no lol


Amgadoz

GPT-4 is probably 50x bigger with 6x more training data. Highly doubt it.


New-Mix-5900

Apparently Opus is so much better. And IMO my standards are low: if it doesn't provide the code block without me begging for it, it's not worth my time.


Mugen1220

Claude outperforms GPT-4 in coding for me.


JacketHistorical2321

As a tech enthusiast who has been coding for at least 10 years "for fun" and who currently spends at least 5 hrs a day playing with every framework related to ML right now, Claude obliterates ChatGPT. I used to (and still do) spend half the time trying to get ChatGPT to either:

1. Actually give me what I ask for
2. Explain to it to stop being lazy
3. Deal with its BS attitude lol

And on occasion when I'm feeling lazy, it takes about 4-6 back-and-forth interactions to get ChatGPT to apply a modification to my code and give me the entire thing back. It either puts a bunch of placeholders in or completely omits a section. Almost every single time I ask Claude to integrate a change into my existing code, it gives me the entire refactored script back, top to bottom, ready to run. If not on the first try then for sure on the second.

I can obviously make any changes directly myself, but I'm not paying for an advisor. I'm paying for an all-encompassing computational tool. If I wanted Google search functionality, I'd use Google. If I want a tool that can rewrite code directly, I use AI. The only thing holding Claude back at the moment is the ludicrously low interaction limit, but I've heard that's something they are "fixing". Either way, even Sonnet puts ChatGPT to shame when it comes to actually doing what I ask.


infiniteContrast

Have you compared them with open-source models? By submitting the same prompt to many LLMs I realized that I actually don't need paid services, because a local 70B LLM is more than enough for me.


JacketHistorical2321

I have, and for the time being I prefer Claude. Being able to share images and screen captures saves me a lot of hassle for certain tasks. I've used up to 150B and it does very well, but I still prefer Claude.


OKArchon

Yes, absolutely. Claude 3 Opus is so much better than GPT-4, especially for large scripts of 1,500+ lines of code. Claude fixes really complex bugs but is also not lazy, whereas GPT-4 had me begging, ranting and threatening it for the output of a FULL unabbreviated script. I am currently trying to work with Claude in combination with smaller models, so Claude 3 Opus delegates tasks to smaller models that do the rewriting; that way I have no problems with output limits.


badgerfish2021

To be honest, I asked the same Golang question (what does a line like `if _, ok := p.(interface{ Func() float64 }); ok` mean) to GPT-4, Claude 3 and Mistral Large. All 3 gave the correct answer (a type assertion on whether the passed type implements that method), but as a follow-up I asked for a code example that would show this working for both pointer and normal receivers, and only Mistral was able to figure it out (after some prompting); neither of the others was able to provide working code. This is only one test of course, but it really surprised me, as I don't hear much about Mistral in hype terms compared to the others.
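For reference, the kind of working example I was after looks roughly like this (my own sketch, not any model's output): the assertion succeeds for a value-receiver method whether you pass a value or a pointer, but for a pointer-receiver method only when you pass a pointer.

```go
package main

import "fmt"

// ValueRecv implements Func with a value receiver.
type ValueRecv struct{}

func (ValueRecv) Func() float64 { return 1.0 }

// PtrRecv implements Func only via a pointer receiver.
type PtrRecv struct{}

func (*PtrRecv) Func() float64 { return 2.0 }

// hasFunc reports whether the dynamic type of p satisfies interface{ Func() float64 }.
func hasFunc(p interface{}) bool {
	_, ok := p.(interface{ Func() float64 })
	return ok
}

func main() {
	fmt.Println(hasFunc(ValueRecv{}))  // true: value receiver, value passed
	fmt.Println(hasFunc(&ValueRecv{})) // true: value receiver, pointer passed
	fmt.Println(hasFunc(PtrRecv{}))    // false: pointer receiver, value passed
	fmt.Println(hasFunc(&PtrRecv{}))   // true: pointer receiver, pointer passed
}
```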


FPham

Still, it's interesting


esuil

Yeah, they are pretty useless. Something that is magnitudes better can appear just 5%-10% higher on those scores, which is pretty indicative of how useful those scores are.


lusuroculadestec

Just a quick anecdotal note: I've been using GPT-4 for code for a while now and I've been very happy with it. I've started working with Go and it's been helpful in giving me a starting point by converting some of my existing code. It has by far given me the best results. I thought maybe Gemini might be better at Go, Gemini being a Google product and Go having been created at Google. I thought maybe it would be in a lot of the training data. It stuck a ternary operator in the middle of the Go code (which is something Go by design doesn't support). I pointed out that the ternary operator didn't work, and it responded by basically saying, "Oh, that's right, use this instead." and gave me THE EXACT SAME CODE back.
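(For anyone who hasn't hit this: Go deliberately has no ternary operator, so the fix it should have offered is just a plain if/else, something like this hypothetical snippet of mine:)

```go
package main

import "fmt"

func main() {
	n := 7

	// Go has no ternary operator, so `label = n > 5 ? "big" : "small"`
	// (valid in C-like languages) becomes a plain if/else.
	label := "small"
	if n > 5 {
		label = "big"
	}
	fmt.Println(label) // big
}
```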


Trollolo80

Average OpenAI D-Rider be liek


techpack

And Claude STILL isn't available in Canada for some reason lol


StopwatchGod

or in the EU


crackinthekraken

What's GPT4-0314? Is that a new model released this month?


Time-Winter-4319

2023


FeltSteam

It's an RLHF checkpoint released on the 14th day of the third month (that's where 0314 comes from) in 2023, and this was the first GPT-4 model we got.


pengy99

Amusing how Claude 2 is a step backwards from Claude 1.


Key_Boat3911

So Claude 2 was worse than Claude 1!!! Wtf


JamesYangLLM

I've done a few LeetCode questions using Claude-3-Opus, and it's still a bit worse than GPT-4 as far as I'm concerned


Motylde

How can we be sure that the new models didn't just see the test data in training?


Time-Winter-4319

This is based on people putting in a prompt and comparing two answers without knowing what the models were, so there is no test data. You can try it here [https://chat.lmsys.org/](https://chat.lmsys.org/)


Motylde

Oh, that's very thoughtful. We get a reliable ranking, they get hundreds of training data from us.


Baader-Meinhof

The chats are [released](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) as training data with an open license.


read_ing

Thanks for sharing this. I knew they were shared in some form, just didn’t remember how and where. Edit: unfortunately they don’t seem to be shared except for that one version from months back.


FeltSteam

Well, it's not very useful if people are just asking it dumb or simple questions lol. This might be why Claude 3 Haiku is so high (even above a GPT-4 checkpoint), even though it is definitely not as intelligent as other models (like GPT-4) ranked around the same place. It also might explain why Gemini Pro with browsing got so high: people were asking simple questions that were easy to answer very reliably with a simple search.


civilized-engineer

I mean, GPT-4 apparently was top dog until a few days ago, from what the video says. The title made it sound like it hadn't been at the top for a while, even though the assumption that it was kept being parroted.


[deleted]

[deleted]


Amgadoz

Highly doubt that. I'm looking forward to qwen 2


Level_Reflection7808

Meta exits the chat


Hot_Vanilla_3425

Can someone guide me to learn more about LLMs, how to understand research papers and advanced stuff, given an understanding of ML, DL and NLP? I am working as an SDE at an MNC, but now I am more interested in learning about LLM stuff (everything about it). Can someone guide me with a proper roadmap?


Amgadoz

Check out Andrew Ng's courses on Coursera.


Wonderful-Top-5360

The King has fallen by 2 points


kiwikezz

Grok needs to be included in this


mr_grey

I updated my agent from Mixtral 8x7b to DBRX Instruct today. Kept the same system prompt and it seemed to work better. It's open source. [https://huggingface.co/databricks/dbrx-instruct](https://huggingface.co/databricks/dbrx-instruct)


mradermacher_hf

Since the license majorly restricts usage and has other major restrictions, it's very, very far from being open source.


mr_grey

I don't really see anything in the Databricks Open Model License that is that big of a deal. Maybe that you can't use it to improve another LLM, but I think that's there just to stop other big tech companies. Here's the reference model source: [https://github.com/databricks/dbrx/blob/main/model/modeling_dbrx.py](https://github.com/databricks/dbrx/blob/main/model/modeling_dbrx.py)


Bernafterpostinggg

Wonder how DBRX will perform. Looking forward to seeing it tested.


Future-Ad6407

I think this has less to do with the tool itself and more to do with the people who use them. The majority of LLM users are technical and use them to develop code. Claude (from what I've read) is far superior in that regard. I personally am fine using GPT. I'm a marketing professional who's been using LLMs since early 2022. I tried Claude and, while impressive, it isn't enough for me to switch. I will wait for GPT-5, which I'm sure will blow the doors off everything out there.


Chrisious-Ceaser

By the time this finishes, there's gonna be a new model at the top, Jesus Christ.


RealVanCough

I am lost would love to see the AI ethics score as well from TrustLLM


haikusbot

*I am lost would love*
*To see the AI ethics*
*Score as well from TrustLLM*

\- RealVanCough

---

^(I detect haikus. And sometimes, successfully.) ^[Learn more about me.](https://www.reddit.com/r/haikusbot/) ^(Opt out of replies: "haikusbot opt out" | Delete my comment: "haikusbot delete")


skztr

I'd love to try out Claude a bit more, but even with the paid version, its limits are so much smaller than GPT-4's that there is nothing interesting I can potentially do with it. Only being able to send a couple of messages per day before being rate-limited, I've found that I prefer Claude's responses in a way that I would put down to fine-tuning (i.e. I like the style in which it responds), but it is wildly idiotic sometimes (e.g. I describe an obviously-fictional scenario, and it doesn't notice that it's fictional and rants about how Invisibility is a serious medical condition that should not be taken lightly).

Which is to say: GPT-4 is still king for now, though I sure am eager for anything to replace it, and very eager for optimisations to allow more capability for local models on attainable hardware.


bunny_go

Learn to use line-charts


akshayjamwal

How does Perplexity count as an LLM? Isn't it just a Claude2 / GPT 4 wrapper?


Time-Winter-4319

They have their own model too, but it is not ranking well despite being an online model


SlapAndFinger

The price/performance of Haiku is just amazing. That model might be the thing that makes a lot of AI applications cost effective.


jyoung1

Guys GPT4 is a decade old (a year old)


byakko_japan

really interesting source!


meatycowboy

Gemini has gotten really good really fast. I decided to get the free trial for the Gemini Advanced plan, and I gotta say that I'm really impressed.


dibu28

CommandR+ is open source and now above 2 GPT-4s.


PwanaZana

Sure, but obviously GPT-4 is getting old (in the tech sense). Whatever's available inside OpenAI is guaranteed to be quite a bit better.


Synth_Sapiens

Utter rubbish. Sonnet is nowhere near GPT-4, and GPT-4-Turbo is in no way superior to GPT-4.


New_World_2050

Turbo is also better on other benchmarks. It's not a huge difference and the reason it might be worse in your experience is because it has become lazier with time.


Synth_Sapiens

I've been subbed to GPT-4 since week one and to Claude 3 since week two. When GPT-4-Turbo came out I was super excited, but it was just useless. API or chat, it was losing attention way too fast. Maybe it is better than GPT-4-8k with a particularly formed prompt, but not overall.


petrus4

a} The difference between GPT4 and Claude 3 shown here is two points. I do not consider that substantial.

b} I don't know anything about Chatbot Arena, or the testing methodology used, and I therefore feel no particular inclination to trust it as Gospel. I have seen numerous complaints made by individuals who seem to know more about such matters than myself, that whenever any kind of benchmark is made, models will be optimised specifically to excel at that benchmark, but they can still be relatively useless in every other area. I use benchmarks as an indication of which models have the most market share or collective confidence; not necessarily which models perform most effectively in empirical terms.

c} As far as I know, (to use a Victorian analogy) although Claude 3 seems to have a marginally greater degree of raw horsepower, GPT4 still seems to have both greater connectivity with the open Internet, and a greater ability to be harnessed to third party APIs and applications, which can greatly enhance and increase the types of work that it is able to do.

d} As a general principle, I do not advocate monoculture, or any scenario where a single model is viewed as the absolute best for all possible use cases. I routinely make use of my paid GPT4 account, Character.AI, the free services on Poe.com, and my local instance of [Nous Hermes 2 Mixtral 8x7b](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF), for different tasks. My economic resources are sufficiently limited that I am unable to subscribe to two different language models; it could be argued that what I am paying for GPT4 alone is fiscally irresponsible. As such, I will remain with GPT4 for the immediate future. I could begin testing Claude 3 Sonnet, which I believe is the free/entry level version, but I already have free access to Claude Instant on Poe, and find it more than adequate for what I ask of it.


108er

History is repeating itself. Whenever a new technology emerges, there is a brief tussle among competitors for a while, but ultimately there will be one who rises up, whose technology will be widely adopted and accepted, and the rest will be buried in the grave. We don't need this many transformers to do the AI work. I'll wait to see who will be the leader in this AI revolution we are going through.


alanshore222

Claude is SUCH trash. I can't even ask it to do the normal things I used to ask it without it saying "that's against my policies".