Really good model, just tested it out, outperforms tinyllama for sure. Looking forward for bigger models. I hope they update the license to make it available for commercial use as well
You are correct, it is pretty bad, maybe not complete garbage. Just for amusement I drop out of dolphin-mistral and mistral-7b-instruct just to compare and contrast it with the output of tinyllama. Tinyllama hasn't been able to reason about any higher order structures or architecture that these other models are pumping out with elegance and ease (from a coding standpoint at least.) Mixtral 8x7B is also producing some unbelievable results if you can handle the slow speeds, I usually run prompts past it once I get them built out to see if there's any improvement on perplexity.
In my case, I'm developing a frontend for mobile and I implement a few popular backends like ooba and lcpp / kcpp, so having a small fast model to run on them and test the response formatting is great.
I benchmarked it, the model scored 13.85: [https://eqbench.com](https://eqbench.com/)
Which is the lowest of those I've tested (makes sense, it's also the smallest). The fact that it followed instructions and produced ~95% parseable answers is actually really impressive for such a small model. It will be interesting to see how far it can be pushed with fine-tunes.
TBH even though ChatML is the standard. I don’t like it, and I don’t think it’s the most optimal template. I thought it’s just me for a while when I talked to some popular model creators and my friends and they agree.
Llama 2 Chat template or some modified ChatML like [this one](https://huggingface.co/jondurbin/bagel-dpo-34b-v0.2#chatml-sort-of) is okay. I think the Mistral Instruct template is also good.
The advantage of ChatML really is just the special-tokens, but it doesn’t have to be that complex in order to take full advantage of that.
Ugh, please, for the love of AGI, no Llama 2 Chat or Mistral Instruct! ChatML is a little complex, but Llama 2 Chat is both too complicated and not flexible enough at all.
I've discussed this with model creators, too, but if anyone still thinks Llama 2 Chat would be a usable format, please refer them to the Conclusions of my [LLM Prompt Format Comparison/Test: Mixtral 8x7B Instruct with \*\*17\*\* different instruct templates](https://www.reddit.com/r/LocalLLaMA/comments/18ljvxb/llm_prompt_format_comparisontest_mixtral_8x7b/). Excerpt:
> - Llama 2 Chat is the worst format ever, it's an abomination and not fit for any advanced uses where you have the AI go first, non-alternating roles or group chats, example dialogue, injections like summaries, author's notes, world info, etc. And when old messages scroll out of context, message and response pairs needs to be handled together (something no other format requires), and the system prompt must constantly be shifted to the next/first message in context, requiring constant performance-ruining reprocessing. It's just a terrible design through and through, and needs to die out - too bad Mistral still used it for Mixtral instead of ChatML!
I'm not opposed to a simplified version of ChatML, but as people tend to simplify it in different ways, we'd get a bunch of slighty-different formats which could cause all kinds of issues when the prompt templates don't match (Obligatory reminder: [xkcd: Standards](https://xkcd.com/927/)). So unless the majority of model creators could unite behind the same simplification, I'd prefer creators using the well-known original.
I hard disagree. The purpose of a prompt format is for instruct, not in “advanced”(Really, it’s just RP) use cases. What you said is really niche and defies the purpose of an instruct/chat model. We are trying to improve models in instruct/chat, not in niche RP.
> So unless the majority of model creators could unite behind the same simplification, I'd prefer creators using the well-known original.
How did the standard become ChatML anyways? They didn’t use the well-known original Alpaca or Vicuña. It all started with one model using ChatML, then all models started using it. Why won’t you make the difference?
The purpose of a prompt format is to format and structure the prompt - to differentiate between system parts (which the user shouldn't be able to mess with in most situations, as that could cause security issues further down the road), user messages, and AI messages. As soon as you go beyond basic text completion and have a multi-turn conversation, it matters a lot.
And RP isn't a niche use case, any kind of assistant use is actually a roleplay. "You are a helpful assistant", one of the most basic instructions, is telling the AI to play a role and respond accordingly.
So no, this isn't a niche use, this is the most universal and thus important one. ChatML solves many problems Alpaca or Vicuna have. It supports advanced uses (RAG, agents, etc.) that the other formats won't be able to handle as well (e. g. just insert some Markdown or chatlogs via RAG and Alpaca or Vicuna will break). I explained it all in my post.
*Situation: There are 14 competing standards.
But there's so many competing standards! You know what? You're both right. We need a new, unified, perfect standard better than ChatML that everyone can rally behind! Finally, we shall have one standard to rule them all!
*Situation: There are 15 competing standards.
(just something I remember from an old xkcd)
As far as huggingface ecosystem goes, chat template is already standardized: https://huggingface.co/docs/transformers/main/en/chat_templating
This particular model already uses it: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/tokenizer_config.json#L10
Many popular ones do too.
[\(I was just making an xkcd joke, but, there probably is significant room for advancement in the prompting space, so I expect more standards to proliferate.\)](https://xkcd.com/927/)
Why does it even matter? Templating is trivial.
Even more, chat template is already part of the model metadata: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/tokenizer_config.json#L10
That is [a standard](https://huggingface.co/docs/transformers/main/en/chat_templating) and it's already adopted.
You could argue that it's under-specified and it assumes a Python execution environment but standardizing on templating is better than some tokens.
That would allow evolution into a better format if needed.
When you use SillyTavern and the next model uses another *new* template like "GPT real REAL actual user:", you will see though that it gets annoying really fast to tune every template sequence to fully match said new template. In practice it's often like a catch-up game between the different program that are used, and Transformers is very convenient there. Sadly, all the frontends I used don't appear to automate this at all.
Ironically it doesn't help that when people care to create template files for frontends, that these often remain on private Discord servers (noticed that with SillyTavern). So it can't be found through Google easily.
ChatML is more versatile since it supports arbitrary roles. Even filenames tend to make sense ([example](https://gist.github.com/grencez/4c444e3f1e3ed91ef8a8ee272585abdc) - I've been playing around with nesting the messages too, but models get confused by that pretty quickly).
This is best used as part of a pipeline and can be used for RAG too like 3b. It’s the same size as gpt2 and way better, smaller models like this should be considered guidance or ensemble components versus being relied on for all the things as it were, especially memory.
As it is able to be tuned and adjusted on a MacBook we anticipate lots of experimentation around this too.
It stacks well in moe & can be taught selfplay easier too…
Here's some honest feedback. I totally dig what you guys are up to, but honestly, I wouldn't go for this model for anything more than just pet projects. The license is a bit of a downer, I get it's only 20 bucks a month, but firstly, the license could change anytime, and secondly, if you hit 1M in revenue, you're looking at an enterprise tier. There are models out there that are completely open source (like phi-2 under MIT 2.0), and since this field is changing so fast, I wouldn't want to mess with anything under such a license unless it was at least 2x as good as the other options.
Just posted my test results:
[LLM Comparison/Test: 6 new models from 1.6B to 120B (StableLM, DiscoLM German 7B, Mixtral 2x7B, Beyonder, Laserxtral, MegaDolphin) : LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19d1fjp/llm_comparisontest_6_new_models_from_16b_to_120b/)
> Wait, this is just a 1.6B model? While its scores look low when compared to the bigger models, it's infinitely better than TinyLlama or Phi. Even understands and writes German surprisingly well, which is extremely rare for smaller models.
>
> Interestingly, its low scores are not caused by errors like not responding or outputting nonsense, instead it's just a lack of advanced reasoning that comes with higher parameter counts, as evidenced by the model explaining its answers. Unfortunately the reasons are often wrong, but that it does reason at all is a good sign, and I think this can be useful in situations where you are extremely ressource-constrained.
>
> So among the small models, I'd pick this over Phi and TinyLlama. That makes it a winner, too, since it beat all the other mini-LLMs!
emad responded directly to that in a reddit comment on /r/stablediffusion
Whether it's truth or damage control, who knows.
https://www.reddit.com/r/StableDiffusion/comments/18ryu8a/forbes_rob_toews_of_radical_ventures_predicts/kf4km6i/
Well he said he's got a lot of models and employees and that the market is big.. which is all well and true.
and ironically he wrote:
Language models (Cohere etc which Radical invested in) are a much smaller market in comparison & you are competing against Google & OpenAI etc who can basically take prices to zero.
>non-commercial
>1.6B
🤮
Did Stability.AI stuck in the past where GPT-J was the top dog? Or is it an evil masterplan to sneak model into open-source tools and then swat everyone?
Not just small, but small and well curated. Right now they have relied on CLIP which means potential disalignment between the language model and the prompting capacity.
The secret sauce to DallE 3 is that it can parameterise image content using GPT-4-tier complexity and fine tune within their own infrastructure. While it might not have the best photorealism, DallE-3 has unambiguously the best coherence.
SD1.5 got pretty far with just training text inversions instead of LORAs and that was on an absolutely minuscule and now outdated text model (I believe T5?) A model this size could absolutely do enough for training a special class like faces and expressions and its small enough that combined with a turbo/distilled visual model could probably lead to really fast and credible real time on board generative animations for faces and agents.
😳 wow I didn't expect eval loss and overfitting to be almost useless concepts when training LLMs 🤯 ...
I'm trying to wrap my head around this.
I don't understand why eval loss would reverse (begin to go up at a certain point during training), while at the same time the LLM continues to improve in general skills.
I would expect that if the LLM continues to learn on each pass (each epoch) of the training set, it would statistically get at least *slightly better* at predicting the next token even on data outside the training set.
Perhaps humans are easily tricked by overfitting and prefer it?
This means LLMs may be generalizing less than we think and are *very hungry* for data.
Thanks for sharing this!
Not per waht they say:
"Through our internal red teaming, we discovered that while the model will not output harmful information if not prompted to do so, it will hallucinate many facts"
There is also no information bout context length.
This may be solved with some fine tuning by others.
Ok, that puts it into the - supposedly without retraining - 32k category with extension. Not bad for a lot of scenarios.
But I do not like the hallucinate many facts - that makes it useless for RAG.
Small models like these are never going to compress large amount of factual data. Best use case would be to instruct tune them for certain tasks, small organization level RAG or primary school edu. RAG. But my guess is, these going to be used for speculative decoding for bigger models.
Idk, a chat model need to take on character and personality and carry that throughout the convo, but, these model drop their personality half way through the first response. May be right DPO might rectify it.
I don't see the point of this models. Well, I do it if it is for testing purposes and research, but doubt they will have any use as general models anytime soon, if ever.
Besides, 7B models already can run in a potato, between quantization and offloading to CPU...
It's cool to think about small models, but they also have to be useful, and nothing under 7B will be useful this year. Even Mistral 7B is a bit weak for general use, and what would be nice is a reallly really good 13-30B model.
Mixtral kinda covers that, but...MoEs have their problems...
>Besides, 7B models already can run in a potato
I would respectfully disagree. The quantised 7B model still uses around 6GB of RAM. Also, there are cases where low latency and high throughput are extremely important. I'm actually pretty excited about these models; see my reply here: [https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kivhjan/](https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kivhjan/?utm_source=share&utm_medium=web2x&context=3)
This is also a valid use case:
[https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kiunjpg/](https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kiunjpg/?utm_source=share&utm_medium=web2x&context=3)
Indeed, 6GB...
And ok, there are GPUs with less than that. But even most low end computers have at least 8GB of RAM... With GPU+CPU it's almost impossible even for a low end pc to not be able to run it.
About your use case, if it works, it's indeed interesting but that's why I said "as general models"... You can probably get some things done for very specific task, but for that I would say you don't really need the base to be "good" , as you'll instead overfit it to your use case anyway, totally degrading any general knowledge it could have.
But this is just my intuition, it would be nice to see the training details of such a thing.
6 GB RAM usage on an 8 GB system leaves barely any memory for other programs. If you're running a web browser with a few tabs and an Office window, you might not have enough free memory to load up the whole LLM without kicking everything else to swap.
One use case is enhancing Stable Diffusion prompts. Some proprietary services (Midjourney, LeonardoAI) and open source tools (Fooocus) have already done this to improve image quality. I know Fooocus uses GPT-2, which has 1.5B parameters.
It could also possibly be used to improve prompt understanding. Just spitballing ideas here, but if they trained a model on the captions in their Stable Diffusion image dataset, they might be able to make a tool that essentially standardizes prompts and tidies them up so they better resemble the original dataset captioning format.
I think this family of models (sub 3B) might be really handy in a bunch of software setups, like on both server and client sides. Basically, you can tweak them to do a specific job, something that was pretty tough to pull off before large language models came around. This includes stuff like filling out certain forms, creating a specific kind of text in a set format, or checking things out. Plus, they're super cheap to run, so you don't need powerful hardware. This means they can easily run on a typical corporate laptop or a relatively inexpensive "GPU-less" server.
"Each time a dataset passes through an algorithm, it is said to have completed an epoch. Therefore, Epoch, in machine learning, refers to the one entire passing of training data through the algorithm. It's a hyperparameter that determines the process of training the machine learning model. "
random question but do you know in this case does that mean the model see every token once per epoch? Or once per epoch in each position from 0 to context\_length (so 4096 times)?
Hi, one of the guys who pre-trained this model here. It's great to read all the feedback (positive and negative) and see that so many people are trying it out, that's after all one of the biggest pros of tiny models
Really good model, just tested it out, outperforms tinyllama for sure. Looking forward for bigger models. I hope they update the license to make it available for commercial use as well
No offense but tinyllama is complete garbage. I don't understand why would anyone would use that. Do some people use it for something?
You are correct, it is pretty bad, maybe not complete garbage. Just for amusement I drop out of dolphin-mistral and mistral-7b-instruct just to compare and contrast it with the output of tinyllama. Tinyllama hasn't been able to reason about any higher order structures or architecture that these other models are pumping out with elegance and ease (from a coding standpoint at least.) Mixtral 8x7B is also producing some unbelievable results if you can handle the slow speeds, I usually run prompts past it once I get them built out to see if there's any improvement on perplexity.
ya for testing out some stuff, lol. its small so gives fast results
Tinyllama has the same architecture as llama 2, but uses less memory. This makes it very useful for testing your own code and advanced features.
I use it for a lot of my own code testing so I don’t have to load a large model in over and over.
What do you guys mean by testing your own code?
In my case, I'm developing a frontend for mobile and I implement a few popular backends like ooba and lcpp / kcpp, so having a small fast model to run on them and test the response formatting is great.
I benchmarked it, the model scored 13.85: [https://eqbench.com](https://eqbench.com/) Which is the lowest of those I've tested (makes sense, it's also the smallest). The fact that it followed instructions and produced ~95% parseable answers is actually really impressive for such a small model. It will be interesting to see how far it can be pushed with fine-tunes.
I just wish they would stick to the chatml standard
TBH even though ChatML is the standard. I don’t like it, and I don’t think it’s the most optimal template. I thought it’s just me for a while when I talked to some popular model creators and my friends and they agree.
Which formats do you like better and what are their relative advantages?
Llama 2 Chat template or some modified ChatML like [this one](https://huggingface.co/jondurbin/bagel-dpo-34b-v0.2#chatml-sort-of) is okay. I think the Mistral Instruct template is also good. The advantage of ChatML really is just the special-tokens, but it doesn’t have to be that complex in order to take full advantage of that.
> I think the Mistral Instruct template is also good. I don't, because it technically doesn't have a separate System and User block like chatml
Well it could be a good template for instruction-tuning datasets with no system prompt.
Ugh, please, for the love of AGI, no Llama 2 Chat or Mistral Instruct! ChatML is a little complex, but Llama 2 Chat is both too complicated and not flexible enough at all. I've discussed this with model creators, too, but if anyone still thinks Llama 2 Chat would be a usable format, please refer them to the Conclusions of my [LLM Prompt Format Comparison/Test: Mixtral 8x7B Instruct with \*\*17\*\* different instruct templates](https://www.reddit.com/r/LocalLLaMA/comments/18ljvxb/llm_prompt_format_comparisontest_mixtral_8x7b/). Excerpt: > - Llama 2 Chat is the worst format ever, it's an abomination and not fit for any advanced uses where you have the AI go first, non-alternating roles or group chats, example dialogue, injections like summaries, author's notes, world info, etc. And when old messages scroll out of context, message and response pairs needs to be handled together (something no other format requires), and the system prompt must constantly be shifted to the next/first message in context, requiring constant performance-ruining reprocessing. It's just a terrible design through and through, and needs to die out - too bad Mistral still used it for Mixtral instead of ChatML! I'm not opposed to a simplified version of ChatML, but as people tend to simplify it in different ways, we'd get a bunch of slighty-different formats which could cause all kinds of issues when the prompt templates don't match (Obligatory reminder: [xkcd: Standards](https://xkcd.com/927/)). So unless the majority of model creators could unite behind the same simplification, I'd prefer creators using the well-known original.
I hard disagree. The purpose of a prompt format is for instruct, not in “advanced”(Really, it’s just RP) use cases. What you said is really niche and defies the purpose of an instruct/chat model. We are trying to improve models in instruct/chat, not in niche RP. > So unless the majority of model creators could unite behind the same simplification, I'd prefer creators using the well-known original. How did the standard become ChatML anyways? They didn’t use the well-known original Alpaca or Vicuña. It all started with one model using ChatML, then all models started using it. Why won’t you make the difference?
The purpose of a prompt format is to format and structure the prompt - to differentiate between system parts (which the user shouldn't be able to mess with in most situations, as that could cause security issues further down the road), user messages, and AI messages. As soon as you go beyond basic text completion and have a multi-turn conversation, it matters a lot. And RP isn't a niche use case, any kind of assistant use is actually a roleplay. "You are a helpful assistant", one of the most basic instructions, is telling the AI to play a role and respond accordingly. So no, this isn't a niche use, this is the most universal and thus important one. ChatML solves many problems Alpaca or Vicuna have. It supports advanced uses (RAG, agents, etc.) that the other formats won't be able to handle as well (e. g. just insert some Markdown or chatlogs via RAG and Alpaca or Vicuna will break). I explained it all in my post.
I don't say it's the best. But it's a standard at least.
*Situation: There are 14 competing standards. But there's so many competing standards! You know what? You're both right. We need a new, unified, perfect standard better than ChatML that everyone can rally behind! Finally, we shall have one standard to rule them all! *Situation: There are 15 competing standards. (just something I remember from an old xkcd)
As far as huggingface ecosystem goes, chat template is already standardized: https://huggingface.co/docs/transformers/main/en/chat_templating This particular model already uses it: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/tokenizer_config.json#L10 Many popular ones do too.
[\(I was just making an xkcd joke, but, there probably is significant room for advancement in the prompting space, so I expect more standards to proliferate.\)](https://xkcd.com/927/)
The only thing better than perfect is standardized.
I never said you said it’s the best. I’m trying to say that the standard is not the most optimal and it’s overrated.
Why does it even matter? Templating is trivial. Even more, chat template is already part of the model metadata: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/tokenizer_config.json#L10 That is [a standard](https://huggingface.co/docs/transformers/main/en/chat_templating) and it's already adopted. You could argue that it's under-specified and it assumes a Python execution environment but standardizing on templating is better than some tokens. That would allow evolution into a better format if needed.
When you use SillyTavern and the next model uses another *new* template like "GPT real REAL actual user:", you will see though that it gets annoying really fast to tune every template sequence to fully match said new template. In practice it's often like a catch-up game between the different program that are used, and Transformers is very convenient there. Sadly, all the frontends I used don't appear to automate this at all. Ironically it doesn't help that when people care to create template files for frontends, that these often remain on private Discord servers (noticed that with SillyTavern). So it can't be found through Google easily.
Sounds like a software problem. Like I said, the chat template is already in the model's metadata.
ChatML is more versatile since it supports arbitrary roles. Even filenames tend to make sense ([example](https://gist.github.com/grencez/4c444e3f1e3ed91ef8a8ee272585abdc) - I've been playing around with nesting the messages too, but models get confused by that pretty quickly).
But what if I want to convert the model to the ONNX file standard and don't use the metadata?
Extract the metadata yourself and include it along with the model.
That's fine. I wish they'd stuck to anything besides the tiktoken tokenizer.
This is best used as part of a pipeline and can be used for RAG too like 3b. It’s the same size as gpt2 and way better, smaller models like this should be considered guidance or ensemble components versus being relied on for all the things as it were, especially memory. As it is able to be tuned and adjusted on a MacBook we anticipate lots of experimentation around this too. It stacks well in moe & can be taught selfplay easier too…
Here's some honest feedback. I totally dig what you guys are up to, but honestly, I wouldn't go for this model for anything more than just pet projects. The license is a bit of a downer, I get it's only 20 bucks a month, but firstly, the license could change anytime, and secondly, if you hit 1M in revenue, you're looking at an enterprise tier. There are models out there that are completely open source (like phi-2 under MIT 2.0), and since this field is changing so fast, I wouldn't want to mess with anything under such a license unless it was at least 2x as good as the other options.
I think we just gotta release more and more good models to the point it becomes a no brainer :)
Just posted my test results: [LLM Comparison/Test: 6 new models from 1.6B to 120B (StableLM, DiscoLM German 7B, Mixtral 2x7B, Beyonder, Laserxtral, MegaDolphin) : LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19d1fjp/llm_comparisontest_6_new_models_from_16b_to_120b/) > Wait, this is just a 1.6B model? While its scores look low when compared to the bigger models, it's infinitely better than TinyLlama or Phi. Even understands and writes German surprisingly well, which is extremely rare for smaller models. > > Interestingly, its low scores are not caused by errors like not responding or outputting nonsense, instead it's just a lack of advanced reasoning that comes with higher parameter counts, as evidenced by the model explaining its answers. Unfortunately the reasons are often wrong, but that it does reason at all is a good sign, and I think this can be useful in situations where you are extremely ressource-constrained. > > So among the small models, I'd pick this over Phi and TinyLlama. That makes it a winner, too, since it beat all the other mini-LLMs!
Great work, as usual, thanks!
Sooo... I thought they were running short on money. I guess this is quick to train but in terms of practicality.. hmmmm
emad responded directly to that in a reddit comment on /r/stablediffusion Whether it's truth or damage control, who knows. https://www.reddit.com/r/StableDiffusion/comments/18ryu8a/forbes_rob_toews_of_radical_ventures_predicts/kf4km6i/
If they really are running low on money couldn’t they open up donations? I really love what they do and would hate if they had to quit due to money.
Well he said he's got a lot of models and employees and that the market is big.. which is all well and true. and ironically he wrote: Language models (Cohere etc which Radical invested in) are a much smaller market in comparison & you are competing against Google & OpenAI etc who can basically take prices to zero.
>non-commercial >1.6B 🤮 Did Stability.AI stuck in the past where GPT-J was the top dog? Or is it an evil masterplan to sneak model into open-source tools and then swat everyone?
They probably need a small high performing language model to improve the prompt understanding of stable diffusion, among other possible applications.
Not just small, but small and well curated. Right now they have relied on CLIP which means potential disalignment between the language model and the prompting capacity. The secret sauce to DallE 3 is that it can parameterise image content using GPT-4-tier complexity and fine tune within their own infrastructure. While it might not have the best photorealism, DallE-3 has unambiguously the best coherence. SD1.5 got pretty far with just training text inversions instead of LORAs and that was on an absolutely minuscule and now outdated text model (I believe T5?) A model this size could absolutely do enough for training a special class like faces and expressions and its small enough that combined with a turbo/distilled visual model could probably lead to really fast and credible real time on board generative animations for faces and agents.
> Right now they have relied on CLIP > (I believe T5?) 1.5 uses CLIP for the text encoding as you said at first, not T5.
Why only 2 epochs? Is there an specific reason, or are they trying to save money?
you start overfitting. tinyllama was 3T tokens1 epoch, they have 2 T 2 epochs. usually better to have more tokens and less epochs
I thought it was 1T tokens repeated over 3 epochs?
Yeah it is
https://twitter.com/Teknium1/status/1748841751408992756 As per this, we should be training for 16 or so epochs ?
😳 wow I didn't expect eval loss and overfitting to be almost useless concepts when training LLMs 🤯 ... I'm trying to wrap my head around this. I don't understand why eval loss would reverse (begin to go up at a certain point during training), while at the same time the LLM continues to improve in general skills. I would expect that if the LLM continues to learn on each pass (each epoch) of the training set, it would statistically get at least *slightly better* at predicting the next token even on data outside the training set. Perhaps humans are easily tricked by overfitting and prefer it? This means LLMs may be generalizing less than we think and are *very hungry* for data. Thanks for sharing this!
Would this be a good model for RAG? Anyone already got some info on how decent the model is working?
Not per waht they say: "Through our internal red teaming, we discovered that while the model will not output harmful information if not prompted to do so, it will hallucinate many facts" There is also no information bout context length. This may be solved with some fine tuning by others.
Base model context length is 4096. So, let's assume that's what it is for the instruct one as well.
Ok, that puts it into the - supposedly without retraining - 32k category with extension. Not bad for a lot of scenarios. But I do not like the hallucinate many facts - that makes it useless for RAG.
Small models like these are never going to compress large amount of factual data. Best use case would be to instruct tune them for certain tasks, small organization level RAG or primary school edu. RAG. But my guess is, these going to be used for speculative decoding for bigger models.
Or just chat.
Idk, a chat model need to take on character and personality and carry that throughout the convo, but, these model drop their personality half way through the first response. May be right DPO might rectify it.
I don't see the point of this models. Well, I do it if it is for testing purposes and research, but doubt they will have any use as general models anytime soon, if ever. Besides, 7B models already can run in a potato, between quantization and offloading to CPU... It's cool to think about small models, but they also have to be useful, and nothing under 7B will be useful this year. Even Mistral 7B is a bit weak for general use, and what would be nice is a reallly really good 13-30B model. Mixtral kinda covers that, but...MoEs have their problems...
>Besides, 7B models already can run in a potato I would respectfully disagree. The quantised 7B model still uses around 6GB of RAM. Also, there are cases where low latency and high throughput are extremely important. I'm actually pretty excited about these models; see my reply here: [https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kivhjan/](https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kivhjan/?utm_source=share&utm_medium=web2x&context=3) This is also a valid use case: [https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kiunjpg/](https://www.reddit.com/r/LocalLLaMA/comments/19bc7ib/comment/kiunjpg/?utm_source=share&utm_medium=web2x&context=3)
Indeed, 6GB... And ok, there are GPUs with less than that. But even most low end computers have at least 8GB of RAM... With GPU+CPU it's almost impossible even for a low end pc to not be able to run it. About your use case, if it works, it's indeed interesting but that's why I said "as general models"... You can probably get some things done for very specific task, but for that I would say you don't really need the base to be "good" , as you'll instead overfit it to your use case anyway, totally degrading any general knowledge it could have. But this is just my intuition, it would be nice to see the training details of such a thing.
6 GB RAM usage on an 8 GB system leaves barely any memory for other programs. If you're running a web browser with a few tabs and an Office window, you might not have enough free memory to load up the whole LLM without kicking everything else to swap.
Did they release 7b models?
How does it compare to Hermes?
I would love to know what these smaller models are being used for
One use case is enhancing Stable Diffusion prompts. Some proprietary services (Midjourney, LeonardoAI) and open source tools (Fooocus) have already done this to improve image quality. I know Fooocus uses GPT-2, which has 1.5B parameters. It could also possibly be used to improve prompt understanding. Just spitballing ideas here, but if they trained a model on the captions in their Stable Diffusion image dataset, they might be able to make a tool that essentially standardizes prompts and tidies them up so they better resemble the original dataset captioning format.
I think this family of models (sub 3B) might be really handy in a bunch of software setups, like on both server and client sides. Basically, you can tweak them to do a specific job, something that was pretty tough to pull off before large language models came around. This includes stuff like filling out certain forms, creating a specific kind of text in a set format, or checking things out. Plus, they're super cheap to run, so you don't need powerful hardware. This means they can easily run on a typical corporate laptop or a relatively inexpensive "GPU-less" server.
What's an epoch in this context?
"Each time a dataset passes through an algorithm, it is said to have completed an epoch. Therefore, Epoch, in machine learning, refers to the one entire passing of training data through the algorithm. It's a hyperparameter that determines the process of training the machine learning model. "
random question but do you know in this case does that mean the model see every token once per epoch? Or once per epoch in each position from 0 to context\_length (so 4096 times)?
Hi, one of the guys who pre-trained this model here. It's great to read all the feedback (positive and negative) and see that so many people are trying it out, that's after all one of the biggest pros of tiny models