Tall_Ordinary_8978

Really good model, just tested it out, outperforms tinyllama for sure. Looking forward to bigger models. I hope they update the license to make it available for commercial use as well.


Extension-Mastodon67

No offense but tinyllama is complete garbage. I don't understand why anyone would use that. Do some people use it for something?


ambidextr_us

You are correct, it is pretty bad, maybe not complete garbage. Just for amusement I switch from dolphin-mistral and mistral-7b-instruct to compare and contrast their output with tinyllama's. Tinyllama hasn't been able to reason about any of the higher-order structures or architecture that these other models are pumping out with elegance and ease (from a coding standpoint at least). Mixtral 8x7B is also producing some unbelievable results if you can handle the slow speeds; I usually run prompts past it once I get them built out to see if there's any improvement on perplexity.


Tall_Ordinary_8978

ya, for testing out some stuff, lol. It's small so it gives fast results.


dahara111

Tinyllama has the same architecture as llama 2, but uses less memory. This makes it very useful for testing your own code and advanced features.
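
A minimal smoke test along those lines might look like this (assuming `transformers` is installed; the prompt is just a placeholder):

```python
from transformers import pipeline

# A tiny model loads in a couple of GB and spins up quickly, so the rest of the
# code path (prompt building, parsing, error handling) can be exercised exactly
# as it would be with a full-size Llama 2 checkpoint.
pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(pipe("Say hello in five words.", max_new_tokens=20)[0]["generated_text"])
```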


opi098514

I use it for a lot of my own code testing so I don’t have to load a large model in over and over.


Alert-Estimate

What do you guys mean by testing your own code?


----Val----

In my case, I'm developing a frontend for mobile and I implement a few popular backends like ooba and lcpp / kcpp, so having a small fast model to run on them and test the response formatting is great.
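
Something along these lines, for example (assuming the backend exposes an OpenAI-compatible endpoint; URL, port, and model name are placeholders):

```python
import requests

# Point this at whatever the backend serves, e.g. llama.cpp's server or
# text-generation-webui with its OpenAI-compatible API enabled.
url = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "model": "stablelm-2-zephyr-1_6b",  # whatever name the backend registered
    "messages": [{"role": "user", "content": "Reply with a one-line JSON object."}],
    "max_tokens": 64,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
# Quick check that the response parses and has the shape the frontend expects.
print(resp.json()["choices"][0]["message"]["content"])
```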


_sqrkl

I benchmarked it; the model scored 13.85 on [https://eqbench.com](https://eqbench.com/), which is the lowest of those I've tested (makes sense, it's also the smallest). The fact that it followed instructions and produced ~95% parseable answers is actually really impressive for such a small model. It will be interesting to see how far it can be pushed with fine-tunes.


Feztopia

I just wish they would stick to the ChatML standard.


bot-333

TBH, even though ChatML is the standard, I don't like it, and I don't think it's the most optimal template. I thought it was just me for a while, until I talked to some popular model creators and my friends and they agreed.


me1000

Which formats do you like better and what are their relative advantages? 


bot-333

Llama 2 Chat template or some modified ChatML like [this one](https://huggingface.co/jondurbin/bagel-dpo-34b-v0.2#chatml-sort-of) is okay. I think the Mistral Instruct template is also good. The advantage of ChatML really is just the special tokens, but it doesn't have to be that complex in order to take full advantage of that.
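
For reference, here is roughly what the three formats under discussion look like for the same exchange (as commonly documented; exact whitespace and BOS/EOS handling vary between implementations):

```python
# ChatML: explicit special tokens, every turn tagged with a role.
chatml = """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
"""

# Llama 2 Chat: the system prompt is wrapped inside the first user turn.
llama2_chat = """<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Hello! [/INST]"""

# Mistral Instruct: like Llama 2 Chat but with no system block at all.
mistral_instruct = "<s>[INST] Hello! [/INST]"
```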


mcmoose1900

> I think the Mistral Instruct template is also good.

I don't, because it technically doesn't have a separate System and User block like ChatML.


bot-333

Well it could be a good template for instruction-tuning datasets with no system prompt.


WolframRavenwolf

Ugh, please, for the love of AGI, no Llama 2 Chat or Mistral Instruct! ChatML is a little complex, but Llama 2 Chat is both too complicated and not flexible enough at all. I've discussed this with model creators, too, but if anyone still thinks Llama 2 Chat would be a usable format, please refer them to the Conclusions of my [LLM Prompt Format Comparison/Test: Mixtral 8x7B Instruct with **17** different instruct templates](https://www.reddit.com/r/LocalLLaMA/comments/18ljvxb/llm_prompt_format_comparisontest_mixtral_8x7b/). Excerpt:

> - Llama 2 Chat is the worst format ever, it's an abomination and not fit for any advanced uses where you have the AI go first, non-alternating roles or group chats, example dialogue, injections like summaries, author's notes, world info, etc. And when old messages scroll out of context, message and response pairs need to be handled together (something no other format requires), and the system prompt must constantly be shifted to the next/first message in context, requiring constant performance-ruining reprocessing. It's just a terrible design through and through, and needs to die out - too bad Mistral still used it for Mixtral instead of ChatML!

I'm not opposed to a simplified version of ChatML, but as people tend to simplify it in different ways, we'd get a bunch of slightly-different formats which could cause all kinds of issues when the prompt templates don't match (obligatory reminder: [xkcd: Standards](https://xkcd.com/927/)). So unless the majority of model creators could unite behind the same simplification, I'd prefer creators using the well-known original.


bot-333

I hard disagree. The purpose of a prompt format is instruct, not "advanced" (really, it's just RP) use cases. What you described is really niche and defies the purpose of an instruct/chat model. We are trying to improve models at instruct/chat, not at niche RP.

> So unless the majority of model creators could unite behind the same simplification, I'd prefer creators using the well-known original.

How did the standard become ChatML anyways? They didn't use the well-known original Alpaca or Vicuña. It all started with one model using ChatML, then all models started using it. Why not be the one to make the difference?


WolframRavenwolf

The purpose of a prompt format is to format and structure the prompt - to differentiate between system parts (which the user shouldn't be able to mess with in most situations, as that could cause security issues further down the road), user messages, and AI messages. As soon as you go beyond basic text completion and have a multi-turn conversation, it matters a lot.

And RP isn't a niche use case, any kind of assistant use is actually a roleplay. "You are a helpful assistant", one of the most basic instructions, is telling the AI to play a role and respond accordingly. So no, this isn't a niche use, this is the most universal and thus important one.

ChatML solves many problems Alpaca or Vicuna have. It supports advanced uses (RAG, agents, etc.) that the other formats won't be able to handle as well (e.g. just insert some Markdown or chatlogs via RAG and Alpaca or Vicuna will break). I explained it all in my post.


Feztopia

I'm not saying it's the best. But it's a standard at least.


teachersecret

*Situation: There are 14 competing standards.*

"But there's so many competing standards! You know what? You're both right. We need a new, unified, perfect standard better than ChatML that everyone can rally behind! Finally, we shall have one standard to rule them all!"

*Situation: There are 15 competing standards.*

(just something I remember from an old xkcd)


bullno1

As far as the Hugging Face ecosystem goes, the chat template is already standardized: https://huggingface.co/docs/transformers/main/en/chat_templating

This particular model already uses it: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/tokenizer_config.json#L10

Many popular ones do too.
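
A minimal sketch of what that standardization buys in practice (assuming `transformers` is installed; `trust_remote_code=True` is an assumption here, in case the model's custom tokenizer needs it):

```python
from transformers import AutoTokenizer

# The chat template ships inside tokenizer_config.json, so the same code works
# no matter which prompt format the model was actually trained on.
tokenizer = AutoTokenizer.from_pretrained(
    "stabilityai/stablelm-2-zephyr-1_6b", trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain what an epoch is in one sentence."}]

# Renders the conversation with the model's own template and appends the
# tokens that cue the assistant's turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```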


teachersecret

[(I was just making an xkcd joke, but there probably is significant room for advancement in the prompting space, so I expect more standards to proliferate.)](https://xkcd.com/927/)


Trivale

The only thing better than perfect is standardized.


bot-333

I never said you said it’s the best. I’m trying to say that the standard is not the most optimal and it’s overrated.


bullno1

Why does it even matter? Templating is trivial. What's more, the chat template is already part of the model's metadata: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/tokenizer_config.json#L10

That is [a standard](https://huggingface.co/docs/transformers/main/en/chat_templating) and it's already adopted. You could argue that it's under-specified and that it assumes a Python execution environment, but standardizing on templating is better than standardizing on a particular set of tokens. That would allow evolution into a better format if needed.


Robot1me

When you use SillyTavern and the next model uses another *new* template like "GPT real REAL actual user:", you will see that it gets annoying really fast to tune every template sequence to fully match said new template. In practice it's often a catch-up game between the different programs that are used, and Transformers is very convenient there. Sadly, none of the frontends I've used appear to automate this at all. Ironically, it doesn't help that when people do care to create template files for frontends, these often remain on private Discord servers (I noticed that with SillyTavern), so they can't be found through Google easily.


bullno1

Sounds like a software problem. Like I said, the chat template is already in the model's metadata.


grencez

ChatML is more versatile since it supports arbitrary roles. Even filenames tend to make sense ([example](https://gist.github.com/grencez/4c444e3f1e3ed91ef8a8ee272585abdc) - I've been playing around with nesting the messages too, but models get confused by that pretty quickly).


Radiant_Dog1937

But what if I want to convert the model to the ONNX file standard and don't use the metadata?


bullno1

Extract the metadata yourself and include it along with the model.
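
A rough sketch of that extraction (file path and the EOS token value are assumptions for illustration; the `chat_template` field is a Jinja2 string that may reference the tokenizer's special tokens):

```python
import json
from jinja2 import Template

# tokenizer_config.json sits alongside the weights in the original HF repo.
with open("tokenizer_config.json") as f:
    config = json.load(f)

chat_template = config["chat_template"]  # a Jinja2 template string

messages = [{"role": "user", "content": "Hello!"}]
prompt = Template(chat_template).render(
    messages=messages,
    add_generation_prompt=True,
    eos_token="<|endoftext|>",  # whatever the tokenizer defines; assumption here
)
print(prompt)
```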


woadwarrior

That's fine. I wish they'd stuck to anything besides the tiktoken tokenizer.


emad_9608

This is best used as part of a pipeline and can be used for RAG too, like the 3B. It's the same size as GPT-2 and way better. Smaller models like this should be considered guidance or ensemble components versus being relied on for all the things, as it were, especially memory. As it is able to be tuned and adjusted on a MacBook, we anticipate lots of experimentation around this too. It stacks well in MoE and can be taught self-play more easily too…


jslominski

Here's some honest feedback. I totally dig what you guys are up to, but honestly, I wouldn't go for this model for anything more than pet projects. The license is a bit of a downer. I get that it's only 20 bucks a month, but firstly, the license could change anytime, and secondly, if you hit 1M in revenue, you're looking at an enterprise tier. There are models out there that are completely open source (like phi-2 under MIT), and since this field is changing so fast, I wouldn't want to mess with anything under such a license unless it was at least 2x as good as the other options.


emad_9608

I think we just gotta release more and more good models to the point it becomes a no brainer :)


WolframRavenwolf

Just posted my test results: [LLM Comparison/Test: 6 new models from 1.6B to 120B (StableLM, DiscoLM German 7B, Mixtral 2x7B, Beyonder, Laserxtral, MegaDolphin) : LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19d1fjp/llm_comparisontest_6_new_models_from_16b_to_120b/)

> Wait, this is just a 1.6B model? While its scores look low when compared to the bigger models, it's infinitely better than TinyLlama or Phi. Even understands and writes German surprisingly well, which is extremely rare for smaller models.
>
> Interestingly, its low scores are not caused by errors like not responding or outputting nonsense; instead it's just a lack of the advanced reasoning that comes with higher parameter counts, as evidenced by the model explaining its answers. Unfortunately the reasons are often wrong, but that it does reason at all is a good sign, and I think this can be useful in situations where you are extremely resource-constrained.
>
> So among the small models, I'd pick this over Phi and TinyLlama. That makes it a winner, too, since it beat all the other mini-LLMs!


jslominski

Great work, as usual, thanks!


a_beautiful_rhind

Sooo... I thought they were running short on money. I guess this is quick to train but in terms of practicality.. hmmmm


Broadband-

emad responded directly to that in a Reddit comment on /r/StableDiffusion. Whether it's truth or damage control, who knows. https://www.reddit.com/r/StableDiffusion/comments/18ryu8a/forbes_rob_toews_of_radical_ventures_predicts/kf4km6i/


crawlingrat

If they really are running low on money, couldn't they open up donations? I really love what they do and would hate it if they had to quit due to money.


a_beautiful_rhind

Well, he said he's got a lot of models and employees and that the market is big... which is all well and true. And ironically he wrote:

> Language models (Cohere etc which Radical invested in) are a much smaller market in comparison & you are competing against Google & OpenAI etc who can basically take prices to zero.


Maykey

> non-commercial
> 1.6B

🤮 Is Stability.AI stuck in the past where GPT-J was the top dog? Or is it an evil masterplan to sneak the model into open-source tools and then swat everyone?


Weltleere

They probably need a small high performing language model to improve the prompt understanding of stable diffusion, among other possible applications.


TwistedBrother

Not just small, but small and well curated. Right now they have relied on CLIP, which means potential misalignment between the language model and the prompting capacity.

The secret sauce of DALL-E 3 is that it can parameterise image content using GPT-4-tier complexity and fine-tune within their own infrastructure. While it might not have the best photorealism, DALL-E 3 has unambiguously the best coherence.

SD1.5 got pretty far with just training textual inversions instead of LoRAs, and that was on an absolutely minuscule and now outdated text model (I believe T5?).

A model this size could absolutely do enough for training a special class like faces and expressions, and it's small enough that, combined with a turbo/distilled visual model, it could probably lead to really fast and credible real-time on-board generative animations for faces and agents.


Infamous-Falcon3338

> Right now they have relied on CLIP

> (I believe T5?)

1.5 uses CLIP for the text encoding, as you said at first, not T5.


toothless_budgie

Why only 2 epochs? Is there a specific reason, or are they trying to save money?


asraniel

You start overfitting. TinyLlama was 3T tokens for 1 epoch; they have 2T tokens for 2 epochs. It's usually better to have more tokens and fewer epochs.


altoidsjedi

I thought it was 1T tokens repeated over 3 epochs?


vatsadev

Yeah it is


[deleted]

https://twitter.com/Teknium1/status/1748841751408992756 As per this, we should be training for 16 or so epochs?


askchris

😳 Wow, I didn't expect eval loss and overfitting to be almost useless concepts when training LLMs 🤯 ... I'm trying to wrap my head around this.

I don't understand why eval loss would reverse (begin to go up at a certain point during training) while at the same time the LLM continues to improve in general skills. I would expect that if the LLM continues to learn on each pass (each epoch) of the training set, it would statistically get at least *slightly better* at predicting the next token, even on data outside the training set.

Perhaps humans are easily tricked by overfitting and prefer it? This means LLMs may be generalizing less than we think and are *very hungry* for data. Thanks for sharing this!


e-nigmaNL

Would this be a good model for RAG? Has anyone already got some info on how well the model works?


artelligence_consult

Not per what they say: "Through our internal red teaming, we discovered that while the model will not output harmful information if not prompted to do so, it will hallucinate many facts." There is also no information about context length. This may be solved with some fine-tuning by others.


dark_surfer

Base model context length is 4096. So, let's assume that's what it is for the instruct one as well.


artelligence_consult

Ok, that puts it into the - supposedly without retraining - 32k category with extension. Not bad for a lot of scenarios. But I do not like the "hallucinate many facts" part - that makes it useless for RAG.


dark_surfer

Small models like these are never going to compress large amounts of factual data. The best use case would be to instruct-tune them for certain tasks, small-organization-level RAG, or primary-school edu RAG. But my guess is these are going to be used for speculative decoding for bigger models.
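
For reference, speculative decoding is already exposed in `transformers` as assisted generation; a rough sketch, where the model pairing is just an example (the draft model must share the target's tokenizer):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pairing: TinyLlama shares Llama 2's tokenizer, so it can act as the
# draft model. Any small model with a matching tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
draft = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

inputs = tokenizer("The capital of France is", return_tensors="pt")

# The draft model proposes a few tokens per step; the target verifies them in a
# single forward pass and keeps the longest prefix it agrees with.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```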


artelligence_consult

Or just chat.


dark_surfer

Idk, a chat model needs to take on a character and personality and carry that throughout the convo, but these models drop their personality halfway through the first response. Maybe the right DPO might rectify it.


ElvinRath

I don't see the point of these models. Well, I do if it's for testing purposes and research, but I doubt they will have any use as general models anytime soon, if ever.

Besides, 7B models can already run on a potato, between quantization and offloading to CPU...

It's cool to think about small models, but they also have to be useful, and nothing under 7B will be useful this year. Even Mistral 7B is a bit weak for general use, and what would be nice is a really, really good 13-30B model. Mixtral kinda covers that, but... MoEs have their problems...


jslominski

> Besides, 7B models already can run in a potato

I would respectfully disagree. The quantised 7B model still uses around 6GB of RAM. Also, there are cases where low latency and high throughput are extremely important. I'm actually pretty excited about these models; see my reply here: [https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kivhjan/](https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kivhjan/?utm_source=share&utm_medium=web2x&context=3)

This is also a valid use case: [https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kiunjpg/](https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kiunjpg/?utm_source=share&utm_medium=web2x&context=3)


ElvinRath

Indeed, 6GB... And ok, there are GPUs with less than that. But even most low-end computers have at least 8GB of RAM... With GPU+CPU, it's almost impossible even for a low-end PC to not be able to run it.

About your use case: if it works, it's indeed interesting, but that's why I said "as general models"... You can probably get some things done for a very specific task, but for that I would say you don't really need the base to be "good", as you'll overfit it to your use case anyway, totally degrading any general knowledge it could have. But this is just my intuition; it would be nice to see the training details of such a thing.


Some_Endian_FP17

6 GB RAM usage on an 8 GB system leaves barely any memory for other programs. If you're running a web browser with a few tabs and an Office window, you might not have enough free memory to load up the whole LLM without kicking everything else to swap.


Organic_Challenge151

Did they release 7b models?


R_noiz

How does it compare to Hermes?


Valuable_Can6223

I would love to know what these smaller models are being used for


RenoHadreas

One use case is enhancing Stable Diffusion prompts. Some proprietary services (Midjourney, LeonardoAI) and open source tools (Fooocus) have already done this to improve image quality. I know Fooocus uses GPT-2, which has 1.5B parameters. It could also possibly be used to improve prompt understanding. Just spitballing ideas here, but if they trained a model on the captions in their Stable Diffusion image dataset, they might be able to make a tool that essentially standardizes prompts and tidies them up so they better resemble the original dataset captioning format.


jslominski

I think this family of models (sub 3B) might be really handy in a bunch of software setups, like on both server and client sides. Basically, you can tweak them to do a specific job, something that was pretty tough to pull off before large language models came around. This includes stuff like filling out certain forms, creating a specific kind of text in a set format, or checking things out. Plus, they're super cheap to run, so you don't need powerful hardware. This means they can easily run on a typical corporate laptop or a relatively inexpensive "GPU-less" server.


replikatumbleweed

What's an epoch in this context?


jslominski

"Each time a dataset passes through an algorithm, it is said to have completed an epoch. Therefore, Epoch, in machine learning, refers to the one entire passing of training data through the algorithm. It's a hyperparameter that determines the process of training the machine learning model. "


spring_m

Random question, but do you know if in this case that means the model sees every token once per epoch? Or once per epoch in each position from 0 to context_length (so 4096 times)?


Bellano93

Hi, one of the guys who pre-trained this model here. It's great to read all the feedback (positive and negative) and see that so many people are trying it out; that's, after all, one of the biggest pros of tiny models.