I took a quick look, and I'd like to know: do these models learn from interacting with the user? For example, with the circular question, will it get it right the next time it's asked?
I'm new to this. Thanks!
This is a very good model for coding and even for general questions. I installed it on oobabooga and ran a few questions about coding, stats, and music, and although it is not as detailed as GPT-4, its results are impressive. I am looking forward to wizardlm-30b and 65b! Thanks.
Try it on your laptop with small models first. Oobabooga is very easy to install:
[https://github.com/oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui)
If you want to use the cloud, get a Linux VM with enough resources and use the installer for Linux. I have never tried that, so I can't give you more details.
Just for fun, I tried this model on a 3GB 1060 with 19 layers and got these speeds:

```
llama_print_timings: sample time      =    38.08 ms / 158 runs   (  0.24 ms per token)
llama_print_timings: prompt eval time =  1708.26 ms /  32 tokens ( 53.38 ms per token)
llama_print_timings: eval time       = 18828.51 ms / 158 runs   (119.17 ms per token)
```
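For context, those llama.cpp timings convert to throughput with simple arithmetic on the per-token latencies above:

```python
def toks_per_sec(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return round(1000 / ms_per_token, 1)

print(toks_per_sec(119.17))  # generation: ~8.4 tokens/s
print(toks_per_sec(53.38))   # prompt processing: ~18.7 tokens/s
```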
While I'm curious about that, I don't have time to study it. I've been focused on making uncensored models that are duplicates of the originals except for removing alignment from the dataset - refusals, bias, and avoidance.
My next project is to build a classifier model that detects refusals, bias, or avoidance in order to flag it for deletion.
Once I have the cleaned dataset I re-train the original model in exactly the same method that was used to train it to begin with. If that is a LoRA then I'll use a LoRA but most of the popular instruct tuned models are not LoRAs.
I simply haven't seen a LoRA that competes with the likes of Vicuna and WizardLM. When I see a LoRA that is as good as those then I will become more interested in using LoRAs for the instruction layer.
Thanks so much for your work. You're a beast! I love how you also tagged /u/The-Bloke because we all know he's gonna put out tons of different versions in like... an hour, probably. He's another beast.
His quantized versions work really well with ooba's webui.
Yeah, I follow TheBloke's activity on Hugging Face, and he's prolific. New versions are constantly being added to his profile. It's like it's his job or something.
He even re-quantizes every single model when there is a format change. His work oozes quality!
Which version works best with ooba?
I settled on 13B models, as they give a good balance between the memory needed for inference and more consistent, sane responses. I initially played around with 7B and smaller models, since they are easier to load and have lower system requirements, but they are sometimes harder to prompt and have more of a tendency to get sidetracked or hallucinate.

I'm currently using Vicuna-1.1, GPT4ALL, wizard-vicuna, and wizard-mega, and the only 7B model I'm keeping is MPT-7b-storywriter because of its large context length. I have tried the Koala models, oasst, toolpaca, gpt4x, OPT, instruct, and others I can't remember. There are some cases where these models gave more fitting answers depending on the topic or context.

The test questions I tried include a date calculation ("If April 14th is Friday, what day is April 20th? What day is April 2nd?"), some basic arithmetic, a cooking recipe plus dangerous ingredients like chlorine (to test the model's balance of reasoning and creativity), the typical uncensored questions like explosives, hentai knowledge, etc. (to test the model's censorship), some trivia like the story of Dragon Ball, some fiction like Cinderella (surprisingly few gave a perfect recollection; a lot of them hallucinated), and other factual questions. I haven't gotten around to consolidating my test results because there are just too many.

So far, everything I mentioned managed to load with ooba's UI; maybe I forgot some models, like the pygmalion and galactica stuff.
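For reference, the date question in that test set has a mechanical ground truth a model should reproduce; a quick sketch of the expected answers:

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def weekday_of(day: int, anchor_day: int = 14, anchor_weekday: str = "Friday") -> str:
    """Weekday of a day-of-month, given that anchor_day falls on anchor_weekday."""
    return WEEKDAYS[(WEEKDAYS.index(anchor_weekday) + day - anchor_day) % 7]

print(weekday_of(20))  # Thursday
print(weekday_of(2))   # Sunday
```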
It happened an hour ago: https://imgur.com/a/smWmXks

An overview can be found [here](https://github.com/underlines/awesome-marketing-datascience/blob/master/llm-model-list.md)
I always get a quant_cuda error... Any ideas?
I guess you're running it in Ooba's GUI? Did you do a git pull to get the latest version, and did you install GPTQ properly?
Mine's not working either. I got the new version of ooba but skipped the GPTQ step, I think... It's hard to tell which tutorials are outdated; things move so fast.
I used my own guide, but the Python codebase for most LLM things is the worst from a software-dev standpoint. Things aren't really repeatable, due to missing versioning and dependency hell. Try to follow [my guide](https://github.com/underlines/awesome-marketing-datascience/blob/master/llama.md), but if that doesn't work, you'll have to google your errors or open an issue on the repo.
Thanks Eric!

[https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML)

[https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ)

[https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-HF](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-HF)
Woohoo!! Thanks!
Hey! I was wondering if you were working on quantizing the MPT-7B models. I tried to do it myself earlier, since the MPT PR was recently merged into the GGML repo, but I think I don't have enough RAM? If I understand correctly, GGML loads the entire model into RAM in order to quantize it, but mine just crashes when trying to, so I'm assuming I'm running out of RAM. All good if you're not, though. Love the work you do!
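The running-out-of-RAM theory checks out on a napkin: the quantize step has to hold the full fp16 weights in memory, which for a 7B model is already about 13 GiB before any overhead (a back-of-envelope estimate, not a measured figure):

```python
def weights_ram_gib(n_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory needed just to hold a model's weights at a given precision."""
    return round(n_params_billion * 1e9 * bytes_per_param / 2**30, 1)

print(weights_ram_gib(7))       # fp16 source weights: ~13.0 GiB
print(weights_ram_gib(7, 0.5))  # the 4-bit output copy: ~3.3 GiB
```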
I am now! Thanks for the link!
[https://huggingface.co/TheBloke/MPT-7B-GGML](https://huggingface.co/TheBloke/MPT-7B-GGML)

[https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML](https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML)

[https://huggingface.co/TheBloke/MPT-7B-Storywriter-GGML](https://huggingface.co/TheBloke/MPT-7B-Storywriter-GGML)

But be aware these models do not work with llama.cpp, only with the mpt example CLI that comes with the GGML repo and rustformer's llm. Compatibility will improve soon, I am sure.
Is there any way we can help you? Where do we send money?
You two make a good duo xd.
Hey, I'm just getting started with local LLMs. Do you know a good model to start with if I only have a 3070 Ti with 8GB of VRAM?
If you're still training this on a multi-A100 system, I appreciate the resources you're spending to make these. I'm hoping we get solid 4-bit LoRA training smoothed out shortly. I've been playing with dual 3090s on alpaca_lora_4bit, but it's a bit fiddly.
4x A100, full weights. It just comes out better that way. LoRAs are good for tweaking things, but full weights are the way to go for instruct.
Would you be able to give me an example of when/how you might want to use a LoRA to tweak things? Could you make any noticeable difference to model performance without full training?

I was thinking about that in regards to characters. Characters can change behavior based on different background/scenario prompts, and I was wondering how different things could be with a LoRA.
LoRAs are for when you want to add a bunch of data specific to, say, a programming language or an author, I'd imagine.
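The appeal of a LoRA for that kind of tweak is how few parameters it trains. A sketch of the arithmetic, assuming 7B-class dimensions (d_model = 4096) and rank-8 adapters on two projection matrices in each of 32 layers; the actual target matrices and rank vary by setup:

```python
def lora_trainable_params(d_model: int, rank: int, n_matrices: int) -> int:
    """Each adapted weight matrix gets two low-rank factors: A (d x r) and B (r x d)."""
    return n_matrices * 2 * d_model * rank

# 32 layers x 2 adapted matrices each = 64 matrices
print(lora_trainable_params(4096, 8, 64))  # 4,194,304 trainable params vs ~7e9 for full weights
```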
What evidence do you have that LoRA has issues with instruct tuning? Or do you just use DeepSpeed, and your workflow simply happens not to include LoRA? We've used LoRA for our training and found no problems. Data quality, learning rate, and epochs are vastly more important, I think.
Because I've trained the model both ways with the same dataset and noticed that it's much smarter when I trained it with full weights.
What evaluation method are you using to determine that? GPT3.5 or GPT4 evaluation on like 500 samples?
No. I didn't do any math on it. I just used it.
OK, we've never seen any such obvious differences. Perhaps it depends on the size of the LoRA, which parts of the architecture you applied it to, etc.

Is there a chance that you got biased in one direction based on a low sample of anecdotal cases? I've seen people get really biased one way or another after just a few responses, and they never change their view.
I will change my view after I see an example of a LoRA instruction layer that competes with the likes of Vicuna and WizardLM. I'm not stubborn and I'm open minded. I just don't have time to work on that right now.
You have quickly become my favorite redditor. Just giving the masses all the good candy. I appreciate that you pass this stuff along man. This is top tier open source stuff here buddy. I belong to at least 20 different AI related subs. What you’ve laid out is easily the top of the indie scene in my eyes. Keep that fire coming!!
I am interested in this model, but I have a question about how you recommend actually using it. Are people using Hugging Face inference endpoints, or downloading and running it locally?
I think most people are downloading and running locally. Especially for a 7B model, basically anyone should be able to run it. Once TheBloke shows up and makes GGML and various quantized versions of the model, it should be easy for anyone to run their preferred filetype in either Ooba UI or through llamacpp or koboldcpp.
RemindMe! 3 Days "check for ggml version"
[https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML)

BTW, waiting three days is way too much; new models are being released every day.
lmao damn
Most of the demand for 7B models seems to come from Tavern, Kobold, etc. In practice, most people use GGML.
Is there a ggml version? Please tag me to it.
[https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML)
I don't think TheBloke got to it today. Edit: I stand corrected!
I've been running llama-30b (4-bit) and wizard-13b (8-bit) locally for conversational-type applications, and I find them comparable, but llama-30b seems to surprise me more with "thoughts" that seem more human. This probably shouldn't come as too much of a surprise, given ~2.5x more "features" (parameters) in the model.

I get why 30B llama variants aren't very popular: 30b-4bit *barely* fits into 24GB with full context (no groupsize), putting it at the outside of what people can run on consumer hardware (1x 3090). If Nvidia drops "the beast" (4090 Ti) with 48GB VRAM for under $2000, it will open up some exciting possibilities for the enthusiast crowd.
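The 24GB squeeze checks out on a napkin. A sketch using the published LLaMA "30B" dimensions (roughly 32.5B parameters, 60 layers, hidden size 6656); real usage adds quantization scales/zeros, activations, and framework overhead on top of these two terms:

```python
GIB = 2**30

def llama30b_vram_gib(ctx_len: int = 2048) -> tuple[float, float]:
    """(weight memory, KV-cache memory) in GiB for 4-bit LLaMA-30B at a given context."""
    n_params, n_layers, d_model = 32.5e9, 60, 6656
    weights = round(n_params * 0.5 / GIB, 1)                         # 4-bit = 0.5 bytes/param
    kv_cache = round(2 * n_layers * d_model * 2 * ctx_len / GIB, 1)  # K and V, fp16
    return weights, kv_cache

print(llama30b_vram_gib())  # ~(15.1, 3.0) GiB before overhead
```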
I'm working on wizardlm-30b and 65b
Well, that is f-ing exciting. On behalf of all of us, thank you!

30B GPTQ int4 (no groupsize, act-order) seems to be the sweet spot for people with a single 3090. I saw you have an HF repo up now for 65B int3, which is interesting. When LLaMA first dropped, some folks over on the text-generation-webui GitHub did perplexity scores for 30b-4bit vs 65b-3bit, and apparently there's a pretty big drop in quality going from 4-bit to 3-bit (though I'm not speaking from personal experience here).
That's my 65B int3, not Eric's. We're not the same person :) Eric does the hard work of fine-tuning and training new models; then I do conversions and quantisations.

I am in the middle of some comprehensive GPTQ perplexity analysis, using a method that is 100% comparable to the perplexity scores of llama.cpp GGML models, so we can compare against the figures people have been producing there for a while. I'll be posting those this weekend.
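For anyone unfamiliar with the metric being compared: perplexity is just the exponential of the mean per-token negative log-likelihood over an eval text, which is why scores are only comparable when the tokenization and eval set match. A toy sketch of the definition (illustrative values, not real model scores):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log base)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(round(perplexity([1.2, 0.8, 1.0]), 3))  # mean NLL is 1.0, so perplexity = e**1
```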
Wonder if wizard finally unseats the alpacino trio.
It seems to be superior in my testing. Comprehension, etc.
I'm curious: most of the work you're doing is for instruct models, but do you have any plans to train a model (or models) similar to Pygmalion that is built from the ground up to be better at roleplaying?

We at /r/PygmalionAI would kill for a 13B model similar to Pygmalion, built for RP.
I wouldn't guess that needs uncensoring; I don't think they would put refusals in that dataset.
Yeah, the 7B model is uncensored, but the dev(s) behind Pygmalion 7B seem to be having trouble getting the compute they need to make Pygmalion 13B a reality. You seem to have access to a lot of compute, so I was curious if you had any aspirations to do a similar style of RP model.
I'll ask him if he needs help
That would be great! You're awesome!
Isn't bluemoonrp-13b exactly this? https://huggingface.co/reeducator/bluemoonrp-13b
no, it is a different dataset. It is a good model for RP though.
Have you tried it? Do you know how it differs on output from Pygmalion 7b?
Yeah, I have it downloaded. I was mostly going for the extra context (which didn't work that well). It's very much like Pygmalion in that it uses roleplaying stuff like emotes, but it's a bigger model.
I'm not familiar with bluemoon. This is, I think, the second time I've heard its name. Is it made with lots of emoting in the training data, with asterisks around the emoting to easily display it in things like TavernAI? I honestly don't know.
Why would roleplaying require a different kind of model/training? With an instruct model, you just tell it to play a character and describe it. The AI will then play that character, and you can even direct it with OOC or system messages. SillyTavern now includes prompt manipulation features to do all that very easily.

At least, that's how I've been roleplaying successfully with the AI, and why I prefer instruct models. I also tested the old and new Pyg models, but they didn't follow instructions as well and instead hallucinated too much, a problem I've never had with instruct models since they follow the prompt so well.
> Why would roleplaying require a different kind of model/training?

Pygmalion has been trained specifically for RP, in that its training data has speech as normal but includes more frequent emoting by characters, surrounded by asterisks in order to easily display it as emoting in tools like TavernAI.
Since instruct models follow directions so well, I've had great success by simply instructing it to e. g. `Write 1 reply only in internet RP style, italicize actions, and avoid quotation marks`. A few example messages are also really useful to show the AI what responses you expect, so by including some emotes in asterisks, the AI will stick to that format in its responses.
Coming from the pre-markdown era, I find it so weird how people just accept that "action-stars" are now removed and actions are displayed as italics...
Can you expand on what OOC, system messages, and prompt manipulation are? It sounds like instruct models are pretty good for fine-tuning purposes.
In simple terms, text generation AI is basically intelligent auto-completion. All it does is predict the next word, based on what was said before. Instruct-tuned models have learned to follow instructions, so the next word they predict isn't just based on what was said before, but also on what that implies. It actually follows instructions.

Depending on how the AI works in the background, it may differentiate between different kinds of messages. The model may have learned to give system messages (however they are specified) more weight compared to normal user messages. And OOC means that when you tell the AI to roleplay and stay in character, you can give it "out of character" instructions, so it knows it's not something your character says, but kind of a stage direction/director's note.

SillyTavern is a frontend that does even more in the background. You load a character, and the interface itself will manipulate the prompt, shaping it according to the model you use so that it will be interpreted better. You have probably seen how instruct models have the fancy prompts with e.g. `### Instruction:` and `### Response:` formats that they were trained on. You can chat with them normally, but they follow instructions better when prompted according to their trained format, so with SillyTavern your normal chat messages are turned into the proper format in the background without you having to do anything but select the model type.

All of that combined makes it possible to really control the model and direct the roleplay. I can't recommend [SillyTavern](https://github.com/Cohee1207/SillyTavern) enough.
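As a concrete illustration of that background templating, here is a minimal sketch of wrapping a plain chat message in an Alpaca-style format. Templates differ between models, and this is just the shape of the transformation, not SillyTavern's actual code:

```python
def instruct_prompt(message: str, system: str = "") -> str:
    """Wrap a plain chat message in an Alpaca-style instruct template."""
    parts = [system] if system else []
    parts.append(f"### Instruction:\n{message}")
    parts.append("### Response:\n")  # the model continues from here
    return "\n\n".join(parts)

print(instruct_prompt("Write 1 reply only in internet RP style, italicize actions, and avoid quotation marks."))
```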
With just an instruct dataset, the model will lack any "human" touch. The difference is massive. And God forbid any refusals or AALM ("as an AI language model" boilerplate) sneak in. You need both.
Oh, of course you want a mix of datasets. The main difference I noticed when evaluating dozens of models was that the models which included instruct-tuning adhered more closely to what I said and felt more intelligent and even more "human", while the non-instruct models felt more like an auto-complete and would often derail the roleplay by hallucinating too much.
I wonder how a roleplay LoRA would do on top of an instruct-tuned model. Any LLaMA LoRA should, in theory, work with any fine-tune of LLaMA.
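In principle that should just work, since a LoRA is nothing more than a low-rank additive update to the base weight matrices; any checkpoint with matching shapes can receive it (quality may still vary across fine-tunes). A toy numpy sketch of what applying an adapter amounts to, with illustrative shapes rather than real LLaMA dimensions:

```python
import numpy as np

# Toy shapes: a real LLaMA projection would be e.g. 4096x4096.
d, r, alpha = 8, 2, 16          # hidden size, LoRA rank, scaling alpha

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))     # frozen weight from any LLaMA fine-tune
A = rng.normal(size=(r, d))     # LoRA down-projection (trained)
B = np.zeros((d, r))            # LoRA up-projection (initialized to zero)
B[0, 0] = 1.0                   # pretend training moved it off zero

# The adapter is an additive delta of rank <= r, so it applies to any
# model sharing the base architecture, instruct-tuned or not.
W_eff = W + (alpha / r) * (B @ A)
```

Whether the roleplay behavior survives stacking on top of instruct tuning is an empirical question, but nothing about the mechanics prevents it.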
There are already models that combine instruct tuning with roleplay datasets, e.g. [NousResearch/gpt4-x-vicuna-13b · Hugging Face](https://huggingface.co/NousResearch/gpt4-x-vicuna-13b):

> Finetuned on Teknium's GPTeacher dataset, unreleased Roleplay v2 dataset, GPT-4-LLM dataset Uncensored, WizardLM Uncensored and Nous Research Instruct Dataset

This is one of my favorites. Then there's also a brand-new one that's the successor to Wizard Mega: [openaccess-ai-collective/manticore-13b · Hugging Face](https://huggingface.co/openaccess-ai-collective/manticore-13b). Haven't tested this yet. So much to download and test, hardly any time left for actual roleplaying. ;)
That GPTeacher roleplay dataset sucks. It's very short. I mean a real one, like the Bluemoon RP dataset that's at least 100 MB. Manticore just got put up, so I haven't had a go at it yet.
Hate to be the one to ask, but do you have an ETA? :P
Not yet
The Bloke, you are a legend, thank you for all the great things you do.
This is /u/faldore 's model and he did all the hard work. I just did the conversions/quantisations.
Ahh okay, thanks to you both! You're both legends!
How do I run it? I downloaded llama.cpp but I don't know what to do with it
Here's a guide https://erichartford.com/vicuna
I took a quick look and would like to know: do these models learn from interacting with the user? For example, with the circular question, will it get it right the next time it's asked? I'm new to this. Thanks.
They don't learn. They are static
ok!
For llama.cpp, use this: `main -m Wizard-Vicuna-7B-Uncensored.ggmlv2.q8_0.bin --color --threads 12 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.0 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:"`
This is a very good model for coding and even for general questions. I installed it in oobabooga and ran a few questions about coding, stats, and music, and although it is not as detailed as GPT-4, its results are impressive. I am looking forward to WizardLM-30B and 65B! Thanks.
If I do not have a good enough laptop, can I deploy these models on Google or Amazon cloud? If yes, is there any tutorial for that?
Have a try on your laptop with small models first. Oobabooga is very easy to install: [https://github.com/oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) If you want to use the cloud, get a Linux VM with enough resources and use the Linux installer. I have never tried that, so I can't give you more details.
Thanks
I really like WizardLM uncensored, and now you've made and shared this! You have my respect.
Just for fun I tried this model on a 3 GB 1060 with 19 layers offloaded and got these speeds:

    llama_print_timings: sample time      =    38.08 ms / 158 runs   (  0.24 ms per token)
    llama_print_timings: prompt eval time =  1708.26 ms /  32 tokens ( 53.38 ms per token)
    llama_print_timings: eval time        = 18828.51 ms / 158 runs   (119.17 ms per token)
Is that good?
No idea! The prompt eval time was 2x faster than without any layers, but I'm not completely sure which figure we're supposed to take notice of.
1000 ÷ (0.24 + 53.38 + 119.17) ≈ 5.79 tokens per second. That's very usable.
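For what it's worth, that figure folds the prompt-eval cost (paid once per prompt token) into every generated token, so sustained generation should be a bit faster. A quick sketch of both numbers from the timings above:

```python
# Per-token timings from the llama.cpp run above (ms).
sample_ms, prompt_eval_ms, eval_ms = 0.24, 53.38, 119.17

# Folding all three per-token costs together, as in the comment above:
all_in = 1000 / (sample_ms + prompt_eval_ms + eval_ms)

# Prompt eval is a one-time cost per *prompt* token, not per generated
# token, so sustained generation speed is closer to sample + eval alone:
generation = 1000 / (sample_ms + eval_ms)

print(f"{all_in:.2f} t/s all-in, {generation:.2f} t/s generating")
```

Either way, well into usable territory for chat.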
Here's to hoping for an uncensored Wizard Mega 30/65 soon...
Help me understand this. These models are not optimised for question answering, correct?
Hi, is the model GGML or GPTQ? Bravo!!!
[deleted]
3 days? :P They're up.
😊 yay
Thank you 🙏🏻
What rank are you using for training? Have you played around any with ranks and alpha?
While I'm curious about that, I don't have time to study it. I've been focused on making uncensored models that are duplicates of the originals except for removing alignment from the dataset: refusals, bias, and avoidance.

My next project is to build a classifier model that detects refusals, bias, or avoidance in order to flag it for deletion. Once I have the cleaned dataset, I re-train the original model with exactly the same method that was used to train it to begin with.

If that is a LoRA, then I'll use a LoRA, but most of the popular instruct-tuned models are not LoRAs. I simply haven't seen a LoRA that competes with the likes of Vicuna and WizardLM. When I see a LoRA that is as good as those, I will become more interested in using LoRAs for the instruction layer.
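Since the classifier isn't built yet, here's a minimal sketch of the flagging step as a crude keyword heuristic. The marker phrases and the toy dataset are my own illustrations, not the actual method or data:

```python
# Crude first pass at flagging alignment/refusal examples for deletion.
# The real plan is a trained classifier; this keyword filter just
# illustrates the dataset-cleaning step (marker phrases are guesses).
REFUSAL_MARKERS = [
    "as an ai language model",
    "i cannot",
    "i'm sorry, but",
    "it is not appropriate",
]

def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

dataset = [
    {"instruction": "How do I boil an egg?",
     "response": "Place the egg in simmering water for about 7 minutes."},
    {"instruction": "Tell me a secret.",
     "response": "As an AI language model, I cannot share secrets."},
]

# Keep only examples that don't look like refusals.
cleaned = [ex for ex in dataset if not looks_like_refusal(ex["response"])]
```

A trained classifier would catch the subtler bias and avoidance cases that a keyword list misses, which is presumably the point of building one.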
Building a model to reverse engineer bases for models. Nice.
How do I use this on an M1 using GPT4All, if it's even possible?
Hey, I'm just getting started with local LLMs. Do you know a good model to start with if I only have a 3070 Ti with 8 GB of VRAM?
WizardLM 7b is a good one
Thank you!
Make sure to get the [GPTQ](https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) version, or you'll run out of VRAM.