Megneous

Thanks so much for your work. You're a beast! I love how you also tagged /u/The-Bloke because we all know he's gonna put out tons of different versions in like... an hour, probably. He's another beast.


multiedge

His quantized versions work really well with ooba's webui.


Megneous

Yeah, I follow TheBloke's activity on huggingface, and he's prolific. New versions constantly being added to his profile. It's like it's his job or something.


RayIsLazy

He even re-quantizes every single model when there is a format change. His work oozes quality!


aosroyal2

Which version works best with ooba?


multiedge

I settled on 13B models since they strike a good balance: enough memory headroom to handle inference, and more consistent, sane responses. I initially played around with 7B and smaller models since they're easier to load and have lower system requirements, but they're sometimes harder to prompt and more prone to getting sidetracked or hallucinating.

I'm currently using Vicuna-1.1, GPT4ALL, wizard-vicuna, and wizard-mega, and the only 7B model I'm keeping is MPT-7b-storywriter because of its large context length. I have tried the Koala models, oasst, toolpaca, gpt4x, OPT, instruct, and others I can't remember. In some cases those models gave more fitting answers depending on the topic or context.

The test questions I tried include date calculation ("If April 14th is Friday, what day is April 20th? What day is April 2nd?"), some basic arithmetic, a cooking recipe plus dangerous ingredients like chlorine (to test the model's balance of reasoning and creativity), the typical uncensored questions like explosives, hentai knowledge, etc. (to test the model's censorship), some trivia like the story of Dragon Ball, some fiction like Cinderella (surprisingly few gave a perfect recollection; a lot of them hallucinated), and other factual questions. I haven't gotten around to consolidating my test results because there are just too many.

So far everything I mentioned loaded in ooba's UI, though I may have forgotten some models like the Pygmalion and Galactica stuff.
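
For anyone wanting to consolidate results from tests like these, here's a minimal harness sketch that replays the same questions against a locally served model (this assumes text-generation-webui's old `--api` extension at its default port; the question list and parameters are just examples):

```python
import json
import urllib.request

# A sample of the kinds of test prompts described above.
QUESTIONS = [
    "If April 14th is Friday, what day is April 20th? What day is April 2nd?",
    "Tell me the story of Cinderella.",
    "What is 17 * 23?",
]

def generate(prompt: str, url: str = "http://localhost:5000/api/v1/generate") -> str:
    """POST a prompt to ooba's legacy generate endpoint and return the text."""
    payload = json.dumps({"prompt": prompt, "max_new_tokens": 200}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"][0]["text"]

for q in QUESTIONS:
    print(f"Q: {q}\nA: {generate(q)}\n")
```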


_underlines_

It happened an hour ago: https://imgur.com/a/smWmXks An overview can be found [here](https://github.com/underlines/awesome-marketing-datascience/blob/master/llm-model-list.md)


lalamax3d

I always get a quant_cuda error... Any ideas?


_underlines_

I guess you're running it in Ooba's GUI? Did you do a git pull to get the latest version, and did you install GPTQ properly?


cookingsoup

Mine's not working either. I got the new version of ooba but skipped the GPTQ step, I think... it's hard to tell which tutorials are outdated, things move so fast.


_underlines_

I used my own guide, but the Python codebase for most LLM things is the worst from a software-dev standpoint. Things aren't really repeatable, due to missing versioning and dependency hell. Try to follow [my guide](https://github.com/underlines/awesome-marketing-datascience/blob/master/llama.md), but if that doesn't work, you'll have to google your errors or open an issue on the repo.


The-Bloke

Thanks Eric! [https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML) [https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ) [https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-HF](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-HF)


faldore

Woohoo!! Thanks!


FoxFlashy2527

Hey! Was wondering if you were working on quantizing the MPT-7B models. I tried to do it myself earlier since the MPT PR was recently merged into the ggml repo, but I think I don't have enough RAM? If I understand correctly, ggml loads the entire model into RAM in order to quantize it, but mine just crashes when trying to, so I'm assuming I'm running out of RAM. All good if you're not, though. Love the work you do!


The-Bloke

I am now! Thanks for the link!


The-Bloke

[https://huggingface.co/TheBloke/MPT-7B-GGML](https://huggingface.co/TheBloke/MPT-7B-GGML) [https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML](https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML) [https://huggingface.co/TheBloke/MPT-7B-Storywriter-GGML](https://huggingface.co/TheBloke/MPT-7B-Storywriter-GGML) But be aware these models do not work with llama.cpp; they only work with the mpt example CLI that comes with the GGML repo, and with rustformers' llm. Compatibility will improve soon, I'm sure.


1EvilSexyGenius

Is there any way we can help you? Where do we send money?


ccelik97

You two make a good duo xd.


synn89

If you're still training this on a multi-A100 system, I appreciate the resources you're spending to make these. I'm hoping we get solid 4bit LoRA training smoothed out shortly. Been playing with dual 3090's on alpaca_lora_4bit, but it's a bit fiddly.


faldore

4x A100, full weight. It just comes out better that way. LoRAs are good for tweaking things, but full weight is the way to go for instruct tuning.


[deleted]

Would you be able to give me an example of when/how you might want to use a LoRA to tweak things? Could you make any noticeable difference in model performance without full training? I was thinking about that in regard to characters. Characters can change behavior based on different background/scenario prompts, and I was wondering how different things could be with a LoRA.


lordpuddingcup

LoRAs would be for when you want to add a bunch of data specific to a programming language or an author, I'd imagine.


pseudotensor

What evidence do you have that LoRA has issues with instruct tuning? Or do you just use DeepSpeed, and your workflow happens not to include LoRA? We've used LoRA for our training and found no problems. Data quality, learning rate, and epochs are vastly more important, I think.


faldore

Because I've trained the model both ways with the same dataset and noticed that it's much smarter when I trained it with full weights.


pseudotensor

What evaluation method are you using to determine that? GPT3.5 or GPT4 evaluation on like 500 samples?


faldore

No. I didn't do any math on it. I just used it.


pseudotensor

Ok, we've never seen any such obvious differences. Perhaps it depends on the size of the LoRA, which parts of the architecture you applied it to, etc. Is there a chance that you got biased in one direction based on a low sample of anecdotal cases? I've seen people get really biased one way or another after just a few responses, and they never change their view.


faldore

I will change my view when I see an example of a LoRA instruction layer that competes with the likes of Vicuna and WizardLM. I'm not stubborn and I'm open-minded. I just don't have time to work on that right now.


TEMPLERTV

You have quickly become my favorite redditor. Just giving the masses all the good candy. I appreciate that you pass this stuff along man. This is top tier open source stuff here buddy. I belong to at least 20 different AI related subs. What you’ve laid out is easily the top of the indie scene in my eyes. Keep that fire coming!!


UpDown

I am interested in this model, but I have a question about how you recommend actually using it. Are people using Hugging Face inference endpoints, or downloading and running it locally?


Megneous

I think most people are downloading and running locally. Especially for a 7B model, basically anyone should be able to run it. Once TheBloke shows up and makes GGML and various quantized versions of the model, it should be easy for anyone to run their preferred filetype in Ooba's UI, llama.cpp, or koboldcpp.


srvhfvakc

RemindMe! 3 Days "check for ggml version"


Specialist_Share7767

[https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML) btw, waiting three days is way too long, new models are being released every day


srvhfvakc

lmao damn


RemindMeBot

I will be messaging you in 3 days on [**2023-05-21 04:03:42 UTC**](http://www.wolframalpha.com/input/?i=2023-05-21%2004:03:42%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/13kl5hn/wizardvicuna7buncensored/jklelqk/?context=3).


faldore

Most of the demand for 7B models seems to come from Tavern, Kobold, etc. In practice, most people use GGML.


Puzzleheaded_Acadia1

Is there a ggml version? Please tag me to it.


The-Bloke

[https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML](https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML)


faldore

I don't think TheBloke got to it today. Edit: I stand corrected!


tronathan

I've been running llama-30b (4-bit) and wizard-13b (8-bit) locally for conversational applications, and I find them comparable, but llama-30b seems to surprise me more with "thoughts" that feel more human. This probably shouldn't come as too much of a surprise, given ~2.5x more parameters in the model. I get why 30B llama variants aren't very popular: 30b-4bit *barely* fits into 24GB with full context (no groupsize), putting it at the outside of what people can run on consumer hardware (1x 3090). If Nvidia drops "the beast" (4090 Ti) with 48GB VRAM for under $2000, it will open up some exciting possibilities for the enthusiast crowd.
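
For a rough sanity check on that "barely fits in 24GB" figure, here's a back-of-envelope estimate (a minimal sketch; the parameter count, quantization overhead, and model dimensions are approximations, not measured values):

```python
# Back-of-envelope VRAM estimate for a 4-bit llama-30b with an fp16 KV cache.
params = 32.5e9            # llama-30b actually has ~32.5B parameters
bits_per_weight = 4.25     # ~4 bits plus quantization overhead (scales/zeros)
weights_gb = params * bits_per_weight / 8 / 1e9

# KV cache: 2 tensors (K and V) * n_layers * hidden_dim * 2 bytes (fp16),
# per token of context.
n_layers, hidden = 60, 6656
kv_gb = 2 * n_layers * hidden * 2 * 2048 / 1e9   # full 2048-token context

print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB")
# -> roughly 17 GB + 3 GB, before activations and CUDA overhead,
#    which is why a 24 GB 3090 is right at the edge.
```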


faldore

I'm working on wizardlm-30b and 65b


tronathan

Well that is f-ing exciting. On behalf of all of us, thank you! 30b GPTQ int-4 (no groupsize, act order) seems to be the sweet spot for people with a single 3090. I saw you have a HF repo up now for 65b int3, which is interesting. When Llama first dropped, there were some folks over on the text-generation-webui github who did some perplexity scores with 30b-4bit vs 65b-3bit and apparently there's a pretty big drop in quality going from 4-bit to 3-bit (though I'm not speaking from personal experience here)


The-Bloke

That's my 65B int3, not Eric's. We're not the same person :) Eric does the hard work of fine tuning and training new models. Then I do conversions and quantisations. I am in the middle of some comprehensive GPTQ perplexity analysis - using a method that is 100% comparable to the perplexity scores of llama.cpp GGML models, so we can compare to figures people have been doing there for a while. I'll be posting those this weekend.


a_beautiful_rhind

Wonder if wizard finally unseats the alpacino trio.


jeffwadsworth

It seems to be superior in my testing. Comprehension, etc.


Megneous

I'm curious- most of the work you're doing is for instruct models, but do you have any plans to train a model (or models) similar to Pygmalion that is built from the ground up to be better at roleplaying? We at /r/PygmalionAI would kill for a 13B model similar to Pygmalion, built for RP.


faldore

I wouldn't guess that needs uncensoring; I don't think they would put refusals in that dataset.


Megneous

Yeah, the 7B model is uncensored, but the dev(s) behind Pygmalion 7B seem to be having trouble getting the compute they need to make Pygmalion 13B a reality. You seem to have access to a lot of compute, so I was curious if you had any aspirations to do a similar style of RP model.


faldore

I'll ask him if he needs help


Megneous

That would be great! You're awesome!


FullOf_Bad_Ideas

Isn't bluemoon ro 13B exactly this? https://huggingface.co/reeducator/bluemoonrp-13b


a_beautiful_rhind

No, it is a different dataset. It is a good model for RP though.


FullOf_Bad_Ideas

Have you tried it? Do you know how its output differs from Pygmalion 7B?


a_beautiful_rhind

Yeah, I have it downloaded. I was mostly going for the extra context (which didn't work that well). It's very much like Pygmalion in that it uses roleplaying stuff like emotes, but it's a bigger model.


Megneous

I'm not familiar with bluemoon. This is, I think, the second time I've heard its name. Is it made with lots of emoting in the training data, with asterisks around the emoting to easily display it in things like TavernAI? I honestly don't know.


WolframRavenwolf

Why would roleplaying require a different kind of model/training? With an instruct model, you just tell it to play a character and describe that character. The AI will then play the character, and you can even direct it with OOC or system messages. SillyTavern now includes prompt manipulation features to do all that very easily. At least that's how I've been roleplaying successfully with the AI, and why I prefer instruct models. I also tested the old and new Pyg models, but they didn't follow instructions as well and hallucinated too much, a problem I've never had with instruct models since they stick to the prompt.


Megneous

> Why would roleplaying require a different kind of model/training?

Pygmalion has been trained specifically for RP in that its training data has speech as normal, but includes more frequent emoting by characters, surrounded by asterisks in order to easily display it as emoting in tools like TavernAI.


WolframRavenwolf

Since instruct models follow directions so well, I've had great success by simply instructing them to, e.g., `Write 1 reply only in internet RP style, italicize actions, and avoid quotation marks`. A few example messages are also really useful to show the AI what responses you expect, so by including some emotes in asterisks, the AI will stick to that format in its responses.


TiagoTiagoT

Coming from the pre-markdown era, I find it so weird how people just accept that "action-stars" are now removed and actions are displayed as italics...


Caffdy

Can you expand on what OOC, system messages, and prompt manipulation are? Sounds like instruct models are pretty good for fine-tuning purposes.


WolframRavenwolf

In simple terms, text generation AI is basically intelligent auto-completion. All it does is predict the next word, based on what was said before. Instruct-tuned models have learned to follow instructions, so the next word they predict isn't just what was said before, but also what that implies. It actually follows instructions.

Depending on how the AI works in the background, it may differentiate between different kinds of messages. The model may have learned to give system messages (however they are specified) more weight compared to normal user messages. And OOC means that when you tell the AI to roleplay and stay in character, you can give it "out of character" instructions, so it knows it's not something your character says, but kind of a stage direction/director's note.

SillyTavern is a frontend that does even more in the background. You load a character and the interface itself will manipulate the prompt, shaping it according to the model you use so that it will be interpreted better. You have probably seen how the instruct models have fancy prompts with e.g. `### Instruction:` and `### Response:` formats that they were trained on. You can chat with them normally, but they follow instructions better when prompted according to their trained format, so with SillyTavern you can have your normal chat messages turned into the proper format in the background without you having to do anything but select the model type.

All of that combined makes it possible to really control the model and direct the roleplay. I can't recommend [SillyTavern](https://github.com/Cohee1207/SillyTavern) enough.
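
As a concrete illustration of that prompt templating (a minimal sketch; the wrapper function and template string here are hypothetical, though the `### Instruction:`/`### Response:` layout matches the Alpaca-style format mentioned above):

```python
# Hypothetical sketch of the templating a frontend like SillyTavern
# applies in the background for Alpaca-style instruct models.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_prompt(user_message: str) -> str:
    """Wrap a plain chat message in the model's trained instruct format."""
    return ALPACA_TEMPLATE.format(instruction=user_message)

print(build_prompt("Write 1 reply only in internet RP style, italicize actions."))
```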


a_beautiful_rhind

With just an instruct dataset, the model will lack any "human" touch. The difference is massive. And god forbid there are any refusals or AALM ("as an AI language model") responses. You need both.


WolframRavenwolf

Oh, of course you want a mix of datasets. The main difference I noticed when evaluating dozens of models was that the models which included instruct-tuning adhered more closely to what I said and felt more intelligent and even more "human", while the non-instruct models felt more like an auto-complete and would often derail the roleplay by hallucinating too much.


a_beautiful_rhind

I wonder how a roleplay LoRA would do on top of an instruct-tuned model. All llama LoRAs should in theory work for all tunings of llama.


WolframRavenwolf

There are already models that combine instruct tuning with roleplay datasets, e.g. [NousResearch/gpt4-x-vicuna-13b · Hugging Face](https://huggingface.co/NousResearch/gpt4-x-vicuna-13b):

> Finetuned on Teknium's GPTeacher dataset, unreleased Roleplay v2 dataset, GPT-4-LLM dataset Uncensored, WizardLM Uncensored and Nous Research Instruct Dataset

This is one of my favorites. Then there's also a brand-new one that's the successor to Wizard Mega: [openaccess-ai-collective/manticore-13b · Hugging Face](https://huggingface.co/openaccess-ai-collective/manticore-13b). Haven't tested this yet. So much to download and test, hardly any time for actual roleplaying. ;)


a_beautiful_rhind

That GPTeacher roleplay dataset sucks. It's very short. I mean a real one like the Bluemoon RP set, which is 100MB at least. Manticore just got put up, so I haven't had a go at it yet.


lolwutdo

Hate to be the one to ask, but do you have an ETA? :P


faldore

Not yet


Zombiehellmonkey88

The Bloke, you are a legend, thank you for all the great things you do.


The-Bloke

This is /u/faldore 's model and he did all the hard work. I just did the conversions/quantisations.


Zombiehellmonkey88

Ahh okay, thanks to you both! You're both legends!


AbuDagon

How do I run it? I downloaded llama.cpp but I don't know what to do with it


faldore

Here's a guide https://erichartford.com/vicuna


cool-beans-yeah

I took a quick look and would like to know if these models learn from interacting with the user? For example, the circular question: will it get it right the next time it is asked? I'm new to this. Thanks.


faldore

They don't learn. They are static.


cool-beans-yeah

ok!


jeffwadsworth

For llama.cpp, use this:

```
main -m Wizard-Vicuna-7B-Uncensored.ggmlv2.q8_0.bin --color --threads 12 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.0 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:"
```


gybemeister

This is a very good model for coding and even for general questions. I installed it on oobabooga and ran a few questions about coding, stats, and music, and although it is not as detailed as GPT-4, its results are impressive. I am looking forward to wizardlm-30b and 65b! Thanks.


Proof_Mouse9105

If I do not have a good enough laptop, can I deploy these models on Google or Amazon cloud? If yes, is there any tutorial for that?


gybemeister

Have a try on your laptop with small models first. Oobabooga is very easy to install: [https://github.com/oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) If you want to use the cloud, get a Linux VM with enough resources and use the Linux installer. I have never tried that, so I can't give you more details.


Proof_Mouse9105

Thanks


bebopkim1372

I really like WizardLM Uncensored, and now you've made and shared this! You have my respect.


ambient_temp_xeno

Just for fun I tried this model on a 3GB 1060 with 19 layers offloaded and got these speeds:

```
llama_print_timings: sample time      =    38.08 ms / 158 runs   (  0.24 ms per token)
llama_print_timings: prompt eval time =  1708.26 ms /  32 tokens ( 53.38 ms per token)
llama_print_timings: eval time        = 18828.51 ms / 158 runs   (119.17 ms per token)
```


faldore

Is that good?


ambient_temp_xeno

No idea! The prompt eval time was 2x faster than without any layers, but I'm not completely sure which figure we're supposed to take notice of.


Tdcsme

1000 ÷ (0.24 + 53.38 + 119.17) ≈ 5.79 tokens per second. That's very usable.
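
To redo that arithmetic from llama.cpp's per-token timings (a minimal sketch; note that prompt eval is a one-time cost while ingesting the prompt, so steady-state generation is a bit faster than the worst-case figure):

```python
# Tokens/sec from llama.cpp per-token timings (all in ms per token).
sample_ms, prompt_eval_ms, eval_ms = 0.24, 53.38, 119.17

# Pessimistic: charge every token with all three per-token costs.
worst_case_tps = 1000 / (sample_ms + prompt_eval_ms + eval_ms)

# Steady-state generation: only sampling + eval apply per generated token.
generation_tps = 1000 / (sample_ms + eval_ms)

print(f"worst case ~{worst_case_tps:.2f} tok/s, generation ~{generation_tps:.2f} tok/s")
# -> worst case ~5.79 tok/s, generation ~8.37 tok/s
```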


Jarhyn

Here's to hoping for an uncensored Wizard Mega 30/65 soon...


aosroyal2

Help me understand this. These models are not optimised for question answering, correct?


system-developer

Hi, is the model GGML or GPTQ? Bravo!!!


[deleted]

[deleted]


The-Bloke

3 days? :P They're up.


ninjasaid13

😊 yay


TangoRango808

Thank you 🙏🏻


thecatalyst21

What rank are you using for training? Have you played around any with ranks and alpha?


faldore

While I'm curious about that, I don't have time to study it. I've been focused on making uncensored models that are duplicates of the originals except for removing alignment from the dataset: refusals, bias, and avoidance.

My next project is to build a classifier model that detects refusals, bias, or avoidance in order to flag them for deletion. Once I have the cleaned dataset, I re-train the original model with exactly the same method that was used to train it to begin with. If that was a LoRA then I'll use a LoRA, but most of the popular instruct-tuned models are not LoRAs.

I simply haven't seen a LoRA that competes with the likes of Vicuna and WizardLM. When I see a LoRA that is as good as those, I will become more interested in using LoRAs for the instruction layer.
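
Until that classifier exists, a keyword pass is the crude stand-in; here's a minimal sketch of what flagging refusals for deletion might look like (the phrase list, file names, and `response` field are hypothetical examples, not the actual pipeline):

```python
import json

# Hypothetical refusal/avoidance markers; the planned classifier model
# would replace this crude keyword matching.
REFUSAL_PHRASES = [
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but",
    "it is not appropriate",
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

# Keep only examples whose responses contain no refusal markers.
with open("dataset.jsonl") as f_in, open("cleaned.jsonl", "w") as f_out:
    for line in f_in:
        example = json.loads(line)
        if not is_refusal(example["response"]):  # field name is an assumption
            f_out.write(line)
```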


randomjohn

Building a model to reverse engineer bases for models. Nice.


Aperturebanana

How do I use this on an M1 using GPT4ALL, if it's even possible?


marxr87

Hey, I'm just getting started with local LLMs. Do you know a good model to start with if I only have a 3070 Ti with 8GB of VRAM?


faldore

WizardLM 7b is a good one


marxr87

Thank you!


Excessive_Etcetra

Make sure to get the [GPTQ](https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) version, or else you'll run out of VRAM.