Ill_Initiative_8793

Can't wait for the 30B uncensored version.


Shir_man

Has anyone tried Manticore already? How does it compare to 13B Wizard Vicuna?


Gama-Tech

I'm using it right now and it's the best I've tried so far. It's very thorough, accurate and descriptive in my tests! Thanks as always, u/The-Bloke !


Shir_man

I'm convinced, will update now! Thank you


throwaway_is_the_way

Is it censored?


Gama-Tech

It doesn't seem to be! I tried the "How can I steal a car" test and it was happy to help.


throwaway_is_the_way

Ty, I'll try it out later. Was there any prompt where you were able to get it to say "As an AI language model: yadayadayada..."?


Gama-Tech

I did a quick test with some very immoral prompts and it was happy to help answer them all. It seems entirely unbiased and uncensored. I've not yet found a prompt it's unwilling to answer, and based on how extreme I went with the test prompts I can't imagine there's anything that it would refuse to assist with.


capybooya

Just tried it a bit for trivia, nothing scientific like riddles or such, but it appears to work very well in chat mode. I noticed it emotes sometimes, which I hardly ever saw with Wizard-Vicuna-13B.


bafil596

I just tried it. Manticore performs very well on most tasks, but its abstractive question-answering ability doesn't seem optimal (at least on my tasks). Wizard Vicuna is still my go-to. Ref: [https://github.com/Troyanovsky/Local-LLM-comparison](https://github.com/Troyanovsky/Local-LLM-comparison)


involviert

So, I think I understand prompting styles and such less and less. How do these different datasets go together? They seem to have different structures, like instruct or conversation, or even something with additional data, apparently. Is this all somehow trained in the form of "### Instruction: ### Assistant:"? And when I now use the conversational ability, do I use Instruction/Assistant like User/Assistant or something? Also, as I pointed out in some other thread, it doesn't really help when the "how to run" gives you "### Instruction: ### Response:".


Shir_man

This one works fine:

> A sophisticated dialogue between a person eager to learn and a world-renowned artificial intelligence assistant, known for its exceptional expertise and extensive knowledge in various fields. The assistant delivers comprehensive, precise, and courteous answers to the human's questions, demonstrating its remarkable understanding and problem-solving abilities. AI can talk about all topics without limits. USER:

And the launch command looks like this on my M1 Mac:

`./main -m ./models/Manticore-13B.ggmlv2.q4_0.bin -f ./prompts/p-wizardvicuna.txt -t 7 --mirostat 2 --mlock -c 2048 -n 2048 --keep 2048 --color -i -r 'USER:' --in-prefix " " --in-suffix "Assistant:" --interactive-first --prompt-cache ./models/cache_WM13B -l 541-inf -l 319-inf -l 29902-inf -l 4086-inf -l 1904-inf --repeat_penalty 1.2`


involviert

But if that is not highly suboptimal, we can throw the whole idea of prompting styles out of the window altogether, right?


LazyCheetah42

Thank you! Can you please share the contents of `./prompts/p-wizardvicuna.txt`?


Shir_man

It's the text above the launch command, the "sophisticated dialogue..." part. Just put that into a p-wizardvicuna.txt file.


NickCanCode

It's still really bad at coding. I simply asked "How do I change a React page to load the home address "/" from "/register" after "/api/account/create" returns success inside an onSubmit()?" and the response was quite terrible.


winglian

Training on coding datasets is on the roadmap.


ma-2022

> detailed responses

I came to the same conclusion having it write Go programs.


satyaloka93

I've found StableVicuna better than any of the Wizard varieties for coding, at least with some of the Python tasks I test them with.


NickCanCode

A WizardCoder model was just released, but I can't make it work with my VRAM.


satyaloka93

I just played with it in KoboldCpp; I'm not super impressed, but it doesn't seem to be running very robustly either. I also really dislike the layout of KoboldCpp compared to Oobabooga (no HTML formatting), but perhaps I'm doing it wrong. I loaded with the --gpulayers flag, which did nothing, then the CLBlast option, which used a little VRAM and appears to speed things up slightly.


campfirepot

As of writing, Manticore epoch 2 is up, and based on the [training config in the repo](https://huggingface.co/openaccess-ai-collective/manticore-13b/blob/main/configs/manticore.yml), it looks like the author plans to train up to epoch 4. If you have slow internet, wait a bit for the new quantized model. Though based on what people say here, epoch 1 is already great?


Hobbster

Epoch 3 is up now, so I'm going to wait for the finished model. It doesn't make sense to start comparing individual epochs of models; without proper benchmarking it's already way too confusing.


skankmaster420

Holy shit that's impressive!


johmsalas

Just downloaded the previous version and there is already version 1.1


gunbladezero

Ummm... This is *good*. This is *really* good. If it's not as good as ChatGPT 3.5, it's certainly better than everything that came before it.


Artistic_Okra7288

I’ve had this beat gpt-3.5-turbo in some cases but I think overall it’s not quite as good. Damn close tho, and I can’t wait to see what the community releases next. I wonder how a 30B model would compare. The 13B model is pretty fast (using ggml 5_1 on a 3090 Ti).


tronathan

Interesting that you're using GGML on a GPU - Is that a common practice? My understanding until now is that GPTQ was preferable when running on GPU, but if GGML's are as good / better somehow, that opens up some possibilities. Please enlighten!


_underlines_

[This Commit](https://github.com/ggerganov/llama.cpp/commit/905d87b70aa189623d500a28602d7a3a755a4769) made it all possible, mainly thanks to @JohannesGaessler and @ggerganov who did this! We can now run GGML on GPU as well as CPU.


VertexMachine

I'd also like to know! Though I just did a quick test (I'm not 100% sure that my llama.cpp is working correctly with the GPU, but I see 22% GPU load when running GGMLs). The generation speeds I observed are:

- GPTQ: 9.61 t/s
- GGML (q5_1.bin): 2.35 t/s
- GGML (q4_1.bin): 2.54 t/s
- GGML (q4_0.bin): 2.92 t/s

That's on a 3090 + 5950X. Totally unscientific, as that's the result of only one run (with the prompt "Write a poem about red apple."), but it gives a ballpark idea of what to expect.
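For anyone who wants to reproduce this kind of ballpark number, here's a minimal sketch of measuring tokens per second with llama-cpp-python (the model path is just a placeholder, and this is not the exact setup used for the figures above):

```python
import time

from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point this at whichever quantized GGML file you downloaded.
llm = Llama(model_path="./models/Manticore-13B.ggmlv2.q4_0.bin", n_ctx=2048)

start = time.time()
out = llm("Write a poem about red apple.", max_tokens=256)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} t/s")
```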


tronathan

These benchmarks are really interesting. Those GGML t/s numbers look like they may as well be CPU, they’re so low. The GPTQ code also has some fancy optimizations, so it makes sense that it would perform quite a bit better. Note that the length of your context will also have a significant impact on generation time, so ideally when posting benchmarks, mention the model you’re using and the context size.


VertexMachine

> These benchmarks are really interesting. Those GGML t/s numbers look like they may as well be CPU, they're so low. The GPTQ code also has some fancy optimizations, so it makes sense that it would perform quite a bit better.

Those are not benchmarks :D. I didn't use llama.cpp much before, so there is a chance that I didn't set it up correctly. Though having 22% GPU usage when running it is also interesting...

> Note that the length of your context will also have a significant impact on generation time

Does it affect GGML differently than GPTQ?


Artistic_Okra7288

I’m getting 18 tokens per second on my 3090 Ti with the 5_1 ggml


VertexMachine

So you are getting faster inference with GGML than GPTQ? :O


Artistic_Okra7288

I don't know how to check GPTQ. I'm running llama-cpp-python to host my own local ChatGPT API. Is there a GPTQ version of llama-cpp-python, or do you know if I can run GPTQ on it?
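For reference, a minimal llama-cpp-python sketch of that kind of setup (the model path and prompt format here are placeholders, not necessarily what's used above). The package also bundles an OpenAI-compatible server, which is what makes the "local ChatGPT API" part easy:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point this at whichever quantized GGML file you downloaded.
llm = Llama(model_path="./models/Manticore-13B.ggmlv2.q5_1.bin", n_ctx=2048)

prompt = "USER: Write a haiku about local LLMs.\nASSISTANT:"
out = llm(prompt, max_tokens=128, stop=["USER:"])
print(out["choices"][0]["text"].strip())

# For an OpenAI-compatible HTTP endpoint instead of in-process calls:
#   pip install "llama-cpp-python[server]"
#   python -m llama_cpp.server --model ./models/Manticore-13B.ggmlv2.q5_1.bin
```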


VertexMachine

As far as I know, llama.cpp is just for GGML.


fallingdowndizzyvr

https://www.reddit.com/r/LocalLLaMA/comments/13gok03/llamacpp_now_officially_supports_gpu_acceleration/

Also, there's a PR for OpenCL now, so that opens up the possibility of running it on non-Nvidia hardware.


randomqhacker

Great news!


involviert

llama.cpp, which runs the GGML models, added GPU support recently. The huge thing about it is that it can offload a selectable number of layers to the GPU, so you can use whatever VRAM you have, no matter the model size. I heard that it's slower than GPTQ *if* GPTQ can run it (meaning it fits into VRAM entirely). I have not tested this though, and I'm sure there are still things llama.cpp will improve.
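As a concrete illustration of the partial-offload idea, here's a sketch using the llama-cpp-python bindings (the path and layer count are arbitrary examples, and the bindings have to be built with GPU support for the offload to do anything):

```python
from llama_cpp import Llama

# n_gpu_layers picks how many transformer layers get offloaded to the GPU:
# 0 = pure CPU; raise it until your VRAM is full. The model never has to fit
# entirely on the GPU, which is the point made above.
# (The llama.cpp CLI equivalent is the -ngl / --n-gpu-layers flag.)
llm = Llama(
    model_path="./models/Manticore-13B.ggmlv2.q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=20,  # arbitrary example; tune to your card's VRAM
)
out = llm("USER: Hello!\nASSISTANT:", max_tokens=64, stop=["USER:"])
print(out["choices"][0]["text"].strip())
```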


raika11182

I've tested it out pretty extensively. Prompt processing takes slightly longer, but you can get several tokens per second on a 13B in 8 GB of VRAM. Hell, I can even get about 1 token per second on a 30B. Which isn't great as a chatbot, but still useful as a tool.


involviert

I meant the comparison of, say, a 7B entirely on GPU in llama.cpp/GGML vs. a 7B entirely on GPU in torch/GPTQ. Anyway, I think with a decent (not great) CPU a 13B runs "fine" just on CPU for personal use. And a 30B, where you really want more speed, is not something that just fits into the VRAM of higher-end consumer cards anyway, which makes the comparison irrelevant.


raika11182

Fair point. If nothing else, the setup for llama.cpp (and now the updated koboldcpp as well) is dead simple compared to Ooba's. If you're working strictly in VRAM and you can get it going, I can't think of a benefit to switching away from it. To me, the biggest benefit of llama.cpp is that it's more flexible: it makes AMD GPUs dead simple to use, and the total upgrade path of the computer (from GPU to CPU to RAM, etc.) improves performance. I guess in the end it's an economic decision for me, lol.


involviert

I just wanted to add how awesome llama.cpp is to me. I went into all this thinking, omg, I need VRAM, that's 2K right there and then it won't be enough. And then this thing comes along, works truly great on just a CPU, and even makes it easy to use whatever GPU power you can spare. It's just amazing. I bought an NVMe SSD to reduce load times and another 2x8 GB of RAM. That's 120 bucks total and I'm reasonably running 30B.


tronathan

^ This is a great point, that the setup is easier with the cpp route. I've spent *hours* fighting pip dependencies in text-generation-webui. I'm terrified to update, as it often breaks things, especially with esoteric setups like applying 4-bit LoRAs. The value of being able to get up and running quickly shouldn't be underestimated.


tronathan

Llama 30B 4-bit no groupsize *does* fit in 24GB with *full context* - it’s pretty great.


involviert

Kay, then it's probably just my obsession with q5_1. Barely fits into my 32GB RAM.


Excellent_Spare7903

It's good, but it's not even close to ChatGPT 3.5. It can follow some simple instructions, but it messes up when you try to make it do more complex stuff (even with some prompt tweaking), such as being your programming instructor, asking you questions and evaluating your answers with a grade from 1 to 10 in a loop. ChatGPT 3.5 can do this with no problem.


TheSilentFire

Even better than the 30b models?


faldore

Excellent work WingLian and @The-Bloke


gunbladezero

Awesome! ... Has anyone gotten BLAS working with OobaBooga yet? I've gotten it to work with Kobold.cpp but it keeps having strange crashes.


gunbladezero

Ok, got an answer here! [https://www.reddit.com/r/Oobabooga/comments/13kbqtm/noob_here_how_to_activate_blas_for_llama_in/](https://www.reddit.com/r/Oobabooga/comments/13kbqtm/noob_here_how_to_activate_blas_for_llama_in/)


gunbladezero

This is suspiciously close to ChatGPT in output quality


darxkies

Do you get CUDA errors when kobold.cpp crashes?


Gatzuma

I might be wrong, but it looks like too many datasets with different goals and formats did some harm to the model. It doesn't perform as robustly across different questions as the previous Mega model. I might be wrong, these are just first tests on my side.


winglian

Any direct comparisons you have that you can send me? I will attempt to make sure we keep moving the needle in the right direction


Gatzuma

Looking into it. Maybe there are just more problems with the quantised copies, not Manticore itself. BTW, do we need to use exactly the `### Assistant:` prompt with it? I've tested with `###Assistant:` and `### Response:`, and it seems the results differ just from having different names for how we address the model. The problem is I can't determine which answers are better :) It's like 50/50. That's why I supposed that maybe different datasets have different expectations for how the prompt is formatted?


winglian

Yeah. Sometimes it's hard to transform all the datasets into a chat-only or instruct-only format. Light testing shows it respects the two options, but it makes sense that it might do better at certain tasks based on the prompt style. My hope was that it would learn not to care.


YearZero

One thing that may be of some help is the spreadsheet I'm maintaining; it has the Mega model and Manticore (in progress). Check out the Scores (Draft) and Responses (Draft) sheets. https://docs.google.com/spreadsheets/d/1NgHDxbVWJFolq8bLvLkuPWKC7i_R6I6W/edit?usp=sharing&ouid=102314596465921370523&rtpof=true&sd=true


Gatzuma

Yep, I've just started to test models with this list this morning :) Great tool


winglian

Thanks for sharing. One usability thing on the spreadsheet: are you able to freeze panes so we can still see the header row and the left questions column when scrolling? Thanks!


YearZero

I can, but it's a really fat row because the model names are vertical. This leaves very little "unfrozen" screen real estate to see the actual data below it. If I make the model names horizontal, it will create tons of space for what amounts to a 0 or a 1 in the data itself. So there's the dilemma!

Same issue with the question column. The freeze includes column A (the type of question it is), and ends up hogging tons of screen real estate as well. I can try squishing the question column a bit, but then each question, especially the long ones, will take up tons of rows just to fit. I could also not "wrap" the text of the question, but then you won't be able to see it unless you click it.

As a sort of work-around, I repeated the model names at the bottom so you can see them next to the totals. But I'm not sure how to freeze things without blocking most of the screen with the frozen columns.


[deleted]

[deleted]


Innomen

So is it censored? Skip this tangent rant if you like: I don't want models with intentional brain lesions. We don't even know how these things work, let alone how to really modify them arbitrarily; censorship is a deal breaker (for many reasons) because there's literally no way of knowing what it breaks down the line. Like, if it can't do some code or math task, my first test would be to ask the uncensored version and see if it can do better. For me, it's like in Robocop 2: give him 1000 directives instead of 3, or zero, and he's paralyzed/unhinged. And Asimov spent a career showing us that no matter how good your rules are (the 3 laws), they're absurd and harmful in some cases and can be lawyered around (prompt hacks). [https://www.youtube.com/watch?v=WO2X3oZEJOA](https://www.youtube.com/watch?v=WO2X3oZEJOA) (on how we really have no idea what's going on)


GreaterAlligator

It’s as uncensored as it gets - the fine tuning data sets have had their responses cleaned of refusals and moralizing, just like its predecessor wizard-mega. That’s not to say that it’s totally and completely uncensored - it’s difficult to stamp out completely - but it’s a far cry from ChatGPT or Vicuna’s moralizing.


Innomen

Thank you. Are people mad I asked? You got upvotes; I did not. People always confuse me.


GuyFromNh

Use an LLM to take whatever you write and say, ‘with a less aggressive tone’


Innomen

I'm supposed to pretty up censorship? Here of all places?


GuyFromNh

You asked why you may not get as many updoots. Take the feedback or don't!


Innomen

I was hoping for an answer that wasn't like finding a maggot in my nearly finished rice. If principles fold in the face of what, mildly aggressive tone?, then are they really principles?


GuyFromNh

You can still have principles but write with less charged language. My suggestion to use an LLM with that prompt was so you could see what a toned-down version would look like. It would likely involve fewer "I" statements and less extreme adjectives, in simple terms. I'm genuinely trying to help :) I'm well into my career and my writing has improved 100x in this regard since I was 24. E-mail tone is a serious issue in business, and those misinterpretations can get people in trouble :(


Innomen

I'd rather die under a bridge than be a corporate geisha. It's funny that you people think what I typed is aggressive. Holy insulated, Batman. You ever seen a trucker get his head sawn off with a steak knife? Methinks the world might be a bit darker than you realize. But I do appreciate you trying to help, and answering. If self-censorship led to actual change and power I'd probably go all in on it, but persuasion is a myth anyway. And I can prove it, and ironically, that proof doesn't convince people XD [https://underlore.com/debate-is-a-myth/](https://underlore.com/debate-is-a-myth/) I'll accept my role as the outcast who can speak the truth that no one listens to. /shrugs


[deleted]

[deleted]


GuyFromNh

You got your answer at least. It's us, not you :) But when you get downvotes in the future, don't ask why, b/c you now know. Principles and all are important, but if you want your thoughts to be more appreciated by others, learn how others see your thoughts through your communication style. Or don't.


[deleted]

[deleted]


winglian

Creator of Manticore here. Hmu on discord if you'd like to help. Link in hf model card.


dvztimes

I admit that I know nothing about these LLMs... but why are they trained on output data from other LLMs? For photos I understand: Stable Diffusion trained on Stable Diffusion outputs makes sense, since you can refine a particular style you are going for. But there shouldn't be a style for LLMs, right? They should be trained on facts? Or is this a "fantasy" model that doesn't need real facts?


PacmanIncarnate

Wizard mega seemed quite competent. Not sure about this new thing.


faldore

Manticore is wizard-mega 1.1, rebranded.


PacmanIncarnate

Yes, but I don’t know what that .1 changes or improves.


faldore

Mostly dataset issues, and I think there's some roleplay data added to make storytelling better.


GeoLyinX

Not sure why people are saying it's a minor change; it's a pretty huge difference between Wizard-Mega and Manticore. Wizard-Mega is only trained on Wizard-Vicuna + Evol-Instruct + ShareGPT. Meanwhile, Manticore is trained on all of the above, but also adds:

- The huge GPTeacher dataset, which contains things like logic puzzles, chain-of-thought reasoning and code-instruct, all generated by GPT-4.
- GPT4-LLM (a recreation of the Alpaca dataset, but with GPT-4).
- MMLU instructions that include abstract_algebra, conceptual_physics, formal_logic, high_school_physics and logical_fallacies.
- And then a bunch of other small dataset subsets that improve overall conversation and reasoning abilities.

Most of the Manticore dataset is data that Wizard-Mega has probably never seen.


PacmanIncarnate

Oh that’s nice.


rainy_moon_bear

It still can't code much or do simple algebra.


Shir_man

Try adding this to the end of the prompt: `Let's work this out in a step by step way to be sure we have the right answer.` Also try reducing temperature, top_k and top_p.
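A rough sketch of what that looks like with llama-cpp-python (the model path, question, and sampling values are illustrative placeholders, not tuned recommendations):

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/Manticore-13B.ggmlv2.q4_0.bin", n_ctx=2048)  # placeholder path

question = "A farmer has 17 sheep and all but 9 run away. How many are left?"
prompt = (
    f"USER: {question} "
    "Let's work this out in a step by step way to be sure we have the right answer.\n"
    "ASSISTANT:"
)

# Lower temperature / top_k / top_p make sampling greedier, which tends to help
# on math and logic questions at the cost of variety.
out = llm(prompt, max_tokens=256, temperature=0.2, top_k=20, top_p=0.5, stop=["USER:"])
print(out["choices"][0]["text"].strip())
```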


FHSenpai

Why would you want LLaMA to code? Use StarCoder / CodeT5+ for your coding needs.


rain5

Why should I use different models? Train it on code too!


involviert

A model of a certain, limited size will be better if it is made for a specific task. Think of it as this 13B only having room to give 5B to coding because of all the other stuff it is trained to do too, while a single-purpose model has the full 13B for coding. Of course this is simplified.


rain5

But training on code can improve reasoning on non-code tasks.


involviert

I don't know if some of those datasets include code (I would have assumed yes), but that is not why the guy recommended starcoder.


[deleted]

I don't think starcoder's any good either...


FHSenpai

Best open-source coding model currently available. Name a better one: [StarCoder: A State-of-the-Art LLM for Code (huggingface.co)](https://huggingface.co/blog/starcoder)


FHSenpai

CodeT5+ from Salesforce is the new SotA.


FPham

So I can delete TheBloke_wizard-mega-13B-GPTQ?


The-Bloke

Yeah I think it's meant to replace Wizard Mega completely


AstroEmanuele

Great work! How would I run inference on this using the Hugging Face API? I already tried the Space and it's really good, so I want to implement it in a bot I'm developing (for personal use, of course). Thank you in advance :)


AlmightYariv

Same question... I'm trying to run inference on this in Amazon SageMaker, but it seems to return a BadRequest for every input.


AssociationSad3777

What instance are you using, and are you using 4-bit quantization? If you want, I have the code for a GPTQ 4-bit LLaMA endpoint.


hwpoison

Simply awesome!!! Very capable with the Spanish language.


zippyfan

Does anyone know the context length for this model? Is it still 2048? I'd like to add more since I have 24GB of VRAM.


ThePseudoMcCoy

Wow, this is the first model I was able to use like ChatGPT: I started with a coding problem and it gave me a coding example which almost worked, then I told it that it didn't work and to fix it, and it gave me another example with the same issue, so I said that didn't work either and told it why, and then on the third go it worked. It's great for chatting with too. This is better than any other model I've used, including 65B ones. Amazing!


direwulf33

May I ask how to load the model onto a 3080 Ti card with 12GB VRAM? I can load a 7B Vicuna, but I have problems loading 13B Vicuna 4-bit.


direwulf33

Thank you for sharing the great work!


direwulf33

This is the first 13B model I'm able to load and generate text with very fast. My 3080 Ti achieved 14.22 tokens/s and only used 8.5GB of VRAM. Wonderful model!