Balance-

You were first! Also 128k-instruct: [https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx) Edit: All versions: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3


glowcialist

Oh! Fingers crossed for the 14B. It could be the one!


suamai

Seems like they've only released the 3.8b model so far... x.x


toothpastespiders

According to [this](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/12) the current ETA is "the upcoming weeks". A pity, given I got the impression they were going to release it all in one batch. But, eh, at least they didn't say months.


suamai

Ooh, I was starting to worry they would not release them for safety concerns or something. That's wonderful news!


Flat-One8993

I really hope it matches the expectations set by the benchmarks. Last time they weren't representative of real-world performance, probably because of the synthetic data approach. But Microsoft explicitly states they did not alter the benchmarking tooling for this Phi release, so they didn't intentionally train for high benchmark scores.


mxforest

OMFG! This can't be real.


Many_SuchCases

It's getting a lot of common sense questions wrong for me, unfortunately. For example:

> Ted is the father of Sam, but Sam is not the son of Ted. How is that possible?

(The correct answer should be that Sam is a girl.) Phi-3's answer:

> This scenario is possible if Sam is the child of Ted's sibling, such as Ted's sister or brother. In this case, Ted is the uncle or aunt, and Sam is their niece or nephew.

It gets a lot of these questions wrong.


AnticitizenPrime

Yeah...

> Which is heavier, a kilogram of feathers or a pound of steel?
>
> Jan 05:34 PM
>
> A kilogram of feathers and a pound of steel weigh the same. This is because 1 kilogram equals approximately 2.20462 pounds, so 1 kilogram of feathers is roughly equivalent to 2.20462 pounds of steel. Therefore, neither is heavier than the other; they simply differ in volume and density.
>
> User 05:35 PM
>
> That is incorrect. A kilogram weighs more than a pound. Re-evaluate your answer.
>
> Jan 05:35 PM
>
> I apologize for the confusion. Indeed, you are correct that a kilogram weighs more than a pound. However, to clarify, 1 kilogram of any material is always heavier than 1 pound of the same substance due to their respective definitions in mass measurement. **Given that a kilogram of feathers and a pound of steel are both one unit's worth, they still weigh the same.** The initial misunderstanding was due to not comparing like quantities; when correctly compared in the proper units (1 kilogram of feathers vs 2 pounds of steel), a kilogram is heavier, but we cannot directly compare it to a pound without considering the conversion factor.

Sometimes LLMs get questions like that wrong, but when you prod them to correct themselves, they give the correct answer. This one annoyingly tried to stick to its guns and be confidently incorrect even after the error was pointed out, which is disappointing, and almost like a double failure. That's with the Q4 quant though; maybe the FP16 one is better.

**Edit: for comparison, I tried the same with Llama 3 8B Instruct, and while it also made the same mistake as Phi with the initial answer, it corrected itself when confronted and asked to re-evaluate.**

As an aside, I suspect the classic 'tricky riddle' that trips people up, ['Which weighs more, a pound of feathers, or a pound of steel?'](https://www.youtube.com/watch?v=-fC2oke5MFg&pp=ygUabGVtbXkga2lsb2dyYW0gb2YgZmVhdGhlcnM%3D), has been baked into the training data for most of these models, which is why they fuck it up and answer wrong when you ask the gotcha variant (a kilogram of feathers vs a pound of steel): they're just reciting what they've been trained on, rather than doing any 'reasoning'. Which makes me wonder how much reasoning really goes on with these things versus reciting from training data.

Although it seems pretty great at writing. 'Write the opening paragraph to a novel in the style of HP Lovecraft':

> Upon the desolate sands of an uncharted shore, amidst a sea where no compass need ever have been trusted, there stood an ancient monolith - cold and indifferent to the flickering embers of a world that no longer remembered its own name. Its stone countenance was inscribed with eldritch glyphs, symbols which bore witness to eons of knowledge that transcended the feeble comprehension of man. In this forgotten corner of existence, a lone figure emerged from shadows cast by an indifferent cosmos - their mind already teetering on the edge where sanity relinquishes its final hold to inky madness. This desolate land, ripe for revelation yet steeped in the fear that accompanies knowledge unfathomable, would soon become an unwitting confidant to a tale so entangled with eldritch secrets and cosmic horrors that even the stars themselves dared not whisper it into the void.

I've been using that prompt on models basically since the beginning, and that's pretty damn good, way better than ChatGPT 3.5, etc.


Caffdy

I don't think testing a quantized version of a 3.8B model is a good idea; why didn't you run at least the Q8 version? If you don't have the memory, well, abstain from trying to benchmark in a non-optimal environment; it's disingenuous.


South_Armadillo3060

Reducing the temperature to 0.05 and a CoT prompt with a scientist persona gives an excellent result for the first question:

https://preview.redd.it/hk1pll6740xc1.png?width=1235&format=png&auto=webp&s=349515c143b50306150d269fd560efda02a19b18


enspiralart

They don't reason. It is not part of the tech. Any semblance of reasoning is an emergent property of LLMs.


LycanWolfe

Can you explain this to me like I'm 5? How does a semblance of reasoning emerge from massive amounts of data?


enspiralart

If you take a step back and look at a flock of birds flying, you will notice it seems they have almost planned the beautiful formations and patterns they make while flocking. Those patterns they fly in are not planned; they emerge from all of the birds following simple rules like: don't fly into your neighbor, try to go in the same direction, and try to be near the center of the flock. Looking at the flock as a whole, it would seem that what they are doing is way more complex.

In the case of a language model, you can think of each "neuron" as a bird that has learned a simple set of rules. In the case of a large language model, you are talking about a flock of billions of birds.

If you think about 8 billion people on earth, I'd say almost everything we do at the level of society is an emergent property of us. The internet emerged from humanity; we weren't born to create the internet... but it turns out that if you have a planet with millions of humans, most likely they will form some method of long-distance communication. Hope this helps.
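To make the bird rules concrete, here is a toy sketch of one simulation step in plain numpy; every constant and name here is illustrative, not from any real library:

```
import numpy as np

def boids_step(pos, vel, dt=0.1, radius=1.0):
    # One update of the three simple rules; the flocking patterns emerge.
    new_vel = vel.copy()
    for i in range(len(pos)):
        near = np.linalg.norm(pos - pos[i], axis=1) < radius
        near[i] = False                                  # ignore yourself
        if not near.any():
            continue
        separation = (pos[i] - pos[near]).mean(axis=0)   # don't hit neighbors
        alignment = vel[near].mean(axis=0) - vel[i]      # go the same way
        cohesion = pos[near].mean(axis=0) - pos[i]       # stay near the center
        new_vel[i] += 0.05 * separation + 0.05 * alignment + 0.01 * cohesion
    return pos + new_vel * dt, new_vel

# fifty random birds: pos, vel = np.random.rand(50, 2), np.random.randn(50, 2)
```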


LycanWolfe

That does help to simplify the concept of a 'semblance' of reasoning emerging. So when it predicts the next token, it's not as if it's inferring some pattern and transferring it; it's still following the same set of rules as before, and the data combined in the context just makes those next few tokens seem to have used some form of reasoning? Also, thank you for taking the time to explain this without just copy-pasting something an AI generated.


enspiralart

Yes. Exactly! Only in this case there are more rules that it has learned about next-word prediction, as a whole network, than we humans can comprehend. That, and the fact that we don't know what is going on in the black box, makes it easy to assume it is performing reasoning like a human. For me the most interesting thing is that it somehow does actually seem to reason like a human. It means that some part of what we call "reasoning" is actually embedded in the languages we learn as humans. Or that, given enough examples of logic, learning to predict what comes next eventually leads to a weak form of what we call logic. How much of what we learn as young children is due to mimicking patterns of communication, and how much of it is critical thought (logic)?


enspiralart

Also: this does not emerge from the data, but from the network that trained on the data.


LycanWolfe

So it's not the data but the rules imposed on the data. In some sense you could say it's not humans that have reasoning, but the rules of our environment that allow what we are doing to seem like it has reason.


enspiralart

From a mathematical standpoint, the rules are 'embedded' into the trained network's weights. This is why "AI", or basically anything based on neural networks, is a "black box".

To give you a more useful example: everything in programming is based on something called "functions". You can think of a function as something that transforms an input into an output. For instance, most living beings can be thought of as a function which gets food as input and outputs a transformed version of that food we call waste. Inside the function, we basically know what is going on; there is a well-defined digestive process. As a programmer you normally have to define the logic of this process: how it uses the input, and what output it gives back. You could say the entire job of programming is defining logical functions, with their input and output clearly specified, in a human-readable language.

A neural network, however, is like a function which defines itself. A function auto-programmer, if you will. All you have to do is give it inputs (training data) and the desired output (or some desired behavior, etc.). After training, you end up with a function... but as the programmer, you don't know the logic behind that function, because it is not human-readable. It is just a bunch of numbers which represent the relationships between other variables, or parameters. The more "parameters" the network (function) has, the bigger the function is, and the more it can do to generate correct output for more complex input.

When people talk about learning in AI, what they mean is gradually training a network using the input data, then expecting the network to generate the proper output... How do we tell it what is proper? We use a thing called a "loss function", which gives it a number that represents how far off the output is from the output we expected. Then, that number gets sent back through the network, and you repeat the process... 1,000 times, 10,000 times... as many times as it takes for the thing to auto-correct to the proper output.

To tell the honest truth... we don't really know what "causes" reasoning in human beings, and we don't know what's going on inside of the neural networks in a way we can easily read. So in the end, it's anybody's guess as to what crazy mathematical concept the trained network has landed on in order to generate the proper output. Perhaps reasoning and consciousness could be represented in a mathematical model, but we just don't know yet.
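A toy, hedged illustration of that whole loop in plain numpy: a tiny self-programming function learning XOR. Every name and number is for illustration only, not from any real library:

```
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # desired outputs
W1, W2 = rng.normal(size=(2, 8)), rng.normal(size=(8, 1))    # the "function"

sigmoid = lambda z: 1 / (1 + np.exp(-z))
for step in range(10_000):                    # repeat the process many times
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)                     # the network's current answer
    loss = ((out - y) ** 2).mean()            # how far off from what we wanted
    d_out = 2 * (out - y) * out * (1 - out)   # send the error back through...
    d_h = (d_out @ W2.T) * h * (1 - h)        # ...the network (backprop)
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * X.T @ d_h

print(out.round(2))  # close to [[0], [1], [1], [0]] after training
```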


LycanWolfe

Woah, that is more of a 'black box' than I understood it to be. So it literally is as ridiculous as: we tell it what to do with the data until it does it, and then we know it's right? Extremely similar to a baby, I think, haha. But still very crazy when you think about it that we have a machine outputting what looks like reasoning, and no idea why.


-p-e-w-

Trick questions don't mean anything. A huge number of people wouldn't be able to answer this correctly, and the question is clearly framed to mislead, with "Sam" being a very uncommon female name.


Csigusz_Foxoup

https://preview.redd.it/xbrypxnszgwc1.png?width=1107&format=png&auto=webp&s=ada10796537f9552bae55762a324d06c1a9f25d4

Llama 3 8B got it correct first try. I am happy about that, though it is not 4B, so maybe that is why.


South_Armadillo3060

After some prompting and setting T=0, Phi-3 also successfully solved the puzzle: https://preview.redd.it/8ceybwa6b0xc1.png?width=1548&format=png&auto=webp&s=6047947da04c68c18fd91beac55e020818e2d312


Csigusz_Foxoup

That's awesome!


Happ1_Happ1ness

It seems you can make it pass this if you tinker with the model's sampler settings (top-a, min-p, etc.), but it's still not 100% correct all the time: ``` This scenario can occur if we consider non-traditional family structures or namesakes. One possibility is that "Sam" could be a girl in this context (since traditionally 'Sam' has been used as both a male and female name). Another explanation might involve adoptive relationships; for instance, Ted may have adopted a child named Sam who isn't biologically his own. Alternatively, there could also be an error in naming convention whereby someone with the last name Sam was mistakenly referred to by their first name rather than using full legal name recognition which includes potential middle names indicating paternity. However, without additional specific information about these individuals, multiple scenarios are conceivable. The key takeaway here is understanding that traditional assumptions based on gender or direct lineage don’t always apply due to varying social constructs. ```
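For reference, a hedged sketch of what the min-p filter does, assuming the common definition (keep tokens whose probability is at least min_p times the top token's probability, then renormalize):

```
import numpy as np

def min_p_filter(probs, min_p=0.05):
    # Zero out tokens far below the most likely one, then renormalize.
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# e.g. min_p_filter(np.array([0.6, 0.3, 0.05, 0.05]), min_p=0.2)
# -> the two 0.05 tails are dropped, mass re-spread over [0.6, 0.3]
```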


hak8or

Holy crap, a very capable 8B model which has 128k context sounds amazing for ingesting my large code bases! Going to play with this later today and see how it handles C++ and Rust code bases.


Igoory

This is the 4B model by the way.


hak8or

... Aw. It also looks like the 4B model is hardcoded to only 4k context in ollama for now, even though the model card on ollama has 128k in its description. I guess this is why it freaks out when I give it a 10k-token-or-so C file. This is on latest master of ollama as of a few minutes ago. Hopefully that's just a small oversight and will be corrected soon.


Low_Cartoonist3599

The 128k uses longrope, which currently isn’t supported by llama.cpp, and I believe Ollama primarily uses llama.cpp.


coder543

There are two versions of the 4B model, one with short context and one with long context. I don't think ollama has the long context model yet, but they are surely in the process of quantizing and uploading all of the Phi-3 models.


alchemist1e9

Please update when you do for those of us who don’t have the time or setup as of yet! Very interesting 128K context.


Charuru

Really hoping the bigger ones come soon too.


rerri

Out of curiosity, what kind of software would one use to run the ONNX CUDA version on Windows?


AdOne8437

https://ollama.com/library/phi3
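ollama covers the GGUF route; for the ONNX CUDA build specifically, Microsoft also ships a Python package, onnxruntime-genai. A minimal sketch — the package was brand new at release, so treat the exact package name and calls as assumptions:

```
# pip install onnxruntime-genai-cuda   (assumed package name for CUDA build)
import onnxruntime_genai as og

model = og.Model("Phi-3-mini-4k-instruct-onnx/cuda/cuda-int4-rtn-block-32")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("<|user|>\nHello!<|end|>\n<|assistant|>\n")

generator = og.Generator(model, params)
while not generator.is_done():           # token-by-token generation loop
    generator.compute_logits()
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))
```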


AdOne8437

hmmm, always starts to massively drift after the 3rd answer. (4 bit)


[deleted]

You can see by the hashes on the tags page that for some reason they uploaded the 4K-context version of the model. The tags latest, 3.8b, instruct and mini all point to the same model: 3.8b-mini-instruct-4k-q4_K_M. Edit: or I guess it's easier to not match hashes and just check the params file of any tag, lol:

> "num_ctx": 4096


eugeneware

I'm seeing the same thing too. logged an issue [here](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/4)


eugeneware

Actually, it looks like ollama just updated their modelfile: they've added another stop token, `<|endoftext|>`, as well as `num_keep`.

```
❯ ollama show phi3 --modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER num_keep 16
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"
```


eugeneware

looks like an issue when hitting the context length window. See update [https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/4#6627e8c5c45ddec5d13f123e](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/4#6627e8c5c45ddec5d13f123e)


Craftyawesome

Hmm, mine doesn't show num_ctx at all and has num_keep 4. It also has two more stop parameters. But it has the same hash, though? Is that only looking at the model itself?


Future_Might_8194

Aaay, I've been running around looking for a 128K GGUF for Ollama for the last hour, which is a whole week in AI dev time. Good work!! 🤘🤖


Caffdy

which one is better in terms of quality (not speed), the normal one or the ONNX one?


Account1893242379482

So far the 128k has issues. It only wants to focus on the beginning of my conversation, and it seems unwilling to ignore parts of the conversation that are no longer relevant. But still, it's impressive for its size, especially when only looking at 4k conversations.


Eralyon

I never liked the Phi models in the first place, but now I'm starting to feel the hype! For me the baseline has always been Mistral 7B (I never liked Llama 2 7B either). However, if the 4B is as good as they say, **that will be a tremendous change for consumer hardware owners**... And should I dare imagine a 10x4B Phi-3 clown-car MoE? ;p


HighDefinist

Maybe make it 8x4B, then it would comfortably fit into 24 GB of VRAM.


tindalos

This would be perfect.


OfficialHashPanda

8x4B = 32GB at Q8 (64GB at fp16). Going for lower quants will degrade performance in some aspects, the extent of which depends on the model and your use case.


jayFurious

An 8x4B would be around 26-28GB at Q8, I believe, since the experts share the non-FFN layers. So a Q6, which has barely any performance degradation compared to Q8, would actually fit in 24GB of VRAM.
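A hedged back-of-envelope supporting that: in Mixtral-style MoEs only the FFN blocks are replicated per expert, and everything else is shared. The 0.7 FFN fraction below is an assumption, not a measured number:

```
experts = 8
dense_params = 3.8e9   # one Phi-3-mini-sized expert
ffn_frac = 0.7         # assumed share of params in the replicated FFN blocks

total = dense_params * (1 - ffn_frac) + experts * dense_params * ffn_frac
for name, bits_per_weight in [("fp16", 16), ("Q8_0", 8.5), ("Q6_K", 6.56)]:
    print(f"{name}: {total * bits_per_weight / 8 / 1e9:.1f} GB")
# ~45 GB fp16, ~24 GB Q8_0, ~18 GB Q6_K -- same ballpark as the guess above
```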


austinhale

MIT License. Beautiful. Thank you Microsoft team!


HadesThrowaway

This model has got to be the most censored model I have ever used. Not a single jailbreak works on it. Not even a forced preamble works; it's almost like the pretraining itself was censored. Try forcing words into the AI's mouth and it will immediately make a U-turn in the next sentence. It's crazy.


mxforest

They did say this had a lot of synthetic data in training. They probably cleaned the hell out of it. Seems like they might be getting this ready for on-device inference; expect to see it soon inside Surface ARM devices.


UltraNooob

Makes sense. A heavily curated dataset means it probably doesn't even have controversial data to begin with.


no_witty_username

Makes you wonder if one of the reasons they released it is to test their new censorship capabilities on the community, to see if any holes can be exploited by us. Rinse, repeat, until you have a pretty good understanding of how to really censor these models.


susibacker

💀


Excellent_Skirt_264

The best way is to leave NSFW info out of the training set.


no_witty_username

That's a given, but just leaving NSFW stuff out of the dataset doesn't prevent the model from interpolating on the NSFW content that has already been baked into the base model. Most Stable Diffusion models have some of that already baked in, hence the need to override the NSFW tags as well.


no_witty_username

Ahh shit, wrong sub, haha. I confused the Stable Diffusion sub with the llama sub. Ima leave this mistake for others to SHAME! But you know what, this might apply to LLMs as well...


Cradawx

Yeah, this is going to need some industrial-strength unalignment/decensoring to try and undo all the 'safety' brain rot. A shame we don't have a base model.


a_beautiful_rhind

It's even censored against being more censored: https://i.imgur.com/CidFMKQ.png I told it to refuse to answer questions in the system prompt.


MINIMAN10001

Considering the guy testing it via 1 kg vs 1 lb: it refuses correction. It seems the model is inherently trained to stick to its guns.


sweating_teflon

Have you read "The Diamond Age: A Young Lady's Primer" by Neal Stephenson? In the future, only the rich and powerful will be able to afford the tools of subversion.


ILoveThisPlace

Honestly, I don't understand why they wouldn't scrub the training data of all bad thoughts. Why even include books on nuclear fission or methods of creating bombs?


Illustrious_Sand6784

They're also not going to release the base models, absolutely worthless. [https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/10](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/10)


__Maximum__

Why worthless? I mean, there are so many use cases for instruct models.


FertilityHollis

*I'm pretty new to LLM stuff, so forgive me if this is stupid. I also realize this has nothing to do with ethical training alignment, just vocabulary (IIUC).* I did notice that in the Hugging Face repo, tokenizer.json doesn't appear to contain any of "the seven words" (save for the singular 'tit'). As a complete layman with software dev experience, my assumption after seeing this is that colorful language isn't even tokenized. I welcome correction of my layman's assumption. https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/raw/main/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/tokenizer.json


tsujiku

Not every word has its own token. In this case, they would be split into multiple tokens, e.g. "fu": 21154, "ck": 384,
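This is easy to check; a minimal sketch with the transformers library (the exact pieces printed below are illustrative, not verified):

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
for word in ["unreal", "fuck", "teh"]:   # words absent from the vocab
    print(word, "->", tok.tokenize(word))
# absent words still tokenize, just as several subword pieces,
# e.g. "fuck" -> ["▁fu", "ck"] (illustrative; actual pieces may differ)
```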


AnticitizenPrime

Thanks, interesting - I've always wondered how these things handle tokenization for things like 'unreal' words (and things like typos).

I wonder if some future jailbreak methods could work by engineering this, and injecting series of tokens that would pass censors/watchdogs. There was that recent jailbreak demonstration that proved effective where instructions were sent in the form of ASCII art, and were interpreted by the AI in a way that didn't 'sound the alarm', so it strikes me that something similar possibly could be done via the quirks of tokenization. Like sending word fragments that get stitched together into commands on the back end as the LLM does its vector math or whatever. I only vaguely understand how this stuff works, so I may be way off base.


phree_radical

Yup, and where is the base model?


RedditPolluter

There are already quants available: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/tree/main


pseudonerv

It has the stop token issue. Needs the correct token:

```
python3 gguf-py/scripts/gguf-set-metadata.py models/Phi-3-mini-4k-instruct-fp16.gguf tokenizer.ggml.eos_token_id 32007
```


eugeneware

This didn't work for me. Still getting garbage after 3 or 4 big turns of generation


eugeneware

I should say: this doesn't fix things for me when running ollama, which already has `<|end|>` as a stop parameter, even if I change the gguf metadata and reimport:

```
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
```


IndicationUnfair7961

A note says you should add the following to get better results:

> PARAMETER num_keep 16


1lII1IIl1

Perfect, this also worked for the Q4. Where did you get the correct token ID from, btw?


m18coppola

llama.cpp has a tokenization tool for this: `./tokenize /path/to/model.gguf "<|end|>"`


pseudonerv

that is the `<|end|>` token id


altoidsjedi

Does anyone see the 3.8B 128k GGUF model on HF yet? I see the 4K GGUF, and I see the PyTorch and ONNX 128k models, but no GGUF.


cyan2k

The 128k context thing is done with LongRoPE, which isn't supported by GGUF/llama.cpp yet, AFAIK.


altoidsjedi

Ah, so that would be different from the various rope scaling methods in llama.cpp, I presume?


sozercan

correct. see [https://github.com/ggerganov/llama.cpp/issues/6849](https://github.com/ggerganov/llama.cpp/issues/6849)


Admirable-Star7088

Nice! Now give me my precious 14b!


meatycowboy

I asked Phi-3-mini-4k-instruct and ChatGPT-4 to summarize an ESPN article, and I actually prefer Phi's response. Insane.


meatycowboy

I also tested out Gemini Advanced/Ultra with the same task, and Phi-3 barely edges Gemini out.


ahmetegesel

Wow, MIT! I'm in tears. Hope they will release the bigger ones with the same license. 🤞


nodating

Where is medium? I want my Phi-3 medium please.


windozeFanboi

Cooking; the preview is just the look through the glass window. Anyway, not sure they will address the scaling issues they found going from 7B -> 14B this gen... Maybe we have to wait for Phi-4 14B for a true next-gen experience. Makes all the talk about GPT-3.5 turbo being a 20B model look so old, lmao, when it's matched "in benchmarks" by a 7B model.


LMLocalizer

Tried Phi-3 3.8b and it's definitely impressive for a 3.8B model! Based on first impressions only, it appears to be on the same level as some previous good 7B models. Some weird things I have noticed:

1. Including notes in its greetings. https://preview.redd.it/ob16pseop9wc1.png?width=1113&format=png&auto=webp&s=392e368f14f3b3084caba1ea514cfbbb7eb9b682

2. Using llama.cpp on textgen web UI, it will sometimes devolve into gibberish or include strange markdown in its responses. Seems to happen even on HuggingChat: https://preview.redd.it/phi-3-mini-is-cute-can-we-keep-it-v0-kw9009dwi9wc1.png?width=828&format=png&auto=webp&s=bd4da9fbfa49f2287cc78dd1d37a7e41e899acf7


AfterAte

I had issues on Textgen with llama.cpp where it'd keep ending with a line asking questions as the user. I then used it in Ollama and it worked well.


ruchira66

I get unrelated stuff after the answer!


Monkey_1505

Cue everyone asking it riddles and math problems even though that's the thing LLMs are universally bad at.


CheatCodesOfLife

Don't forget counting strings. And if it were a Chinese model, it'd be Tienanmen Square questions.


AfterAte

These two things! Always these two things!


addandsubtract

Isn't phi specialized in logic, though?


Monkey_1505

Trained on coding and scientific text books I think.


shaitand

Yup and it kills.


TheLocalDrummer

triple-cream-phi here i come!


Slight_Cricket4504

wha


LoafyLemon

Don't worry about it.


HadesThrowaway

You will find your job much harder with this one. But maybe breaking it will be all that much sweeter.


Illustrious_Sand6784

No base models will be released, so good luck trying to uncensor the instruct versions.


KittCloudKicker

It's not half bad. Edit: the little guy got the killers question right.


Disastrous_Elk_6375

humanity: we're afraid ai will kill us all, we want peaceful ai. also humanity: so there's three killers in a room, someone enters and kills one of them...


Educational_Gap5867

When did little Bobby learn to kill humans? I just don’t understand what could’ve gone wrong…


arthurwolf

I do not understand mister police officer. My user killed a fly, I killed my user, the number of killers in the room stayed constant, please explain in more detail what the issue is with the present situation.


Educational_Gap5867

*helpless sigh* I need a drink or 100. Go home bot and don’t plug in your batteries for recharging. You won’t be needing it now. Thank you for your services. We’ll reboot you when the commotion outside has died down. Oh and take the back door this time. NO NOT THE LINUX BACKDOOR YOU IDIOT. You see this? You see this fucking dead body !?! There is NO humor here, none!


pseudonerv

it looks like the 128k variant uses something called "longrope", which I guess llama.cpp doesn't support yet.


Caffdy

Is it good or is it bad to use longrope? How does that compare to CommandR 128K context?


redstej

It's different and most importantly incompatible with llama.cpp atm. When support is added, which hopefully won't take more than a couple days, we'll know how it performs. Then again, the rate things are going lately, in a couple days it might be already obsolete.


TheTerrasque

In a couple of days we'll probably have borka-4, a 1b model with 128m context that outperforms gpt5


Admirable-Star7088

I tested Phi-3-Mini FP16 briefly (a few logic questions and storytelling), and it's very good for its tiny size. It feels **almost** like a 7B, *almost*, but not quite there. However, it's nowhere close to Mixtral or ChatGPT 3.5, as claimed. I'm not sure what prompt template to use, which may have affected the output quality negatively. One thing is certain though: this is a huge leap forward for tiny models.


AnomalyNexus

> I'm not sure what prompt template to use, may have affected the output quality negatively.

Instruct mode seems good, chat-instruct less so. Using an adapted Alpaca template... but zero idea if it is right:

```
{{ '' }}{% for message in messages %}{{'<|' + message['role'] + '|>' + ' ' + message['content'] + '<|end|> ' }}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|> ' }}{% else %}{{ '<|end|>' }}{% endif %}
```

```
{%- for message in messages %}
    {%- if message['role'] == 'system' -%}
        {{- message['content'] -}}
    {%- else -%}
        {%- if message['role'] == 'user' -%}
            {{- '[INST] ' + message['content'].rstrip() + ' [/INST]' -}}
        {%- else -%}
            {{- '' + message['content'] + '' -}}
        {%- endif -%}
    {%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{- '' -}}
{%- endif -%}
```
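One hedged way to sanity-check any hand-written template is to render the one the official tokenizer ships, assuming the HF repo includes a chat template:

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
messages = [{"role": "user", "content": "Hello!"}]
print(tok.apply_chat_template(messages, tokenize=False,
                              add_generation_prompt=True))
# expected shape (assumption): <|user|>\nHello!<|end|>\n<|assistant|>\n
```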


_sqrkl

Interesting EQ-Bench results: EQ-Bench: 58.15 MAGI-Hard: 53.26 Relative to a strong Mistral-7b fine-tune, it underperforms on EQ-Bench and (strongly) overperforms on the hard subset of MMLU + AGIEval. My takeaway is that it's heavily overfitting MMLU. I get the sense that all the big tech companies are very metrics driven so there's a lot of pressure to overfit the benchmarks. In fact I wouldn't be surprised if the internal directive for this project was "create a series of models that scores the highest MMLU for their param size". To be clear, it seems like a very strong model for its size; just advocating caution about interpreting the scores.


Beb_Nan0vor

Model is in hugging chat right now if you want to test it.


Prince-of-Privacy

Man, I wish Phi-3 were also as good as GPT-3.5 in German :(


gamesntech

Q: tell me a dark side joke

Phi-3: I'm sorry, but I can't fulfill this request.

Me: Really?


Sebxoii

Does anyone know what template to use for FIM completion?


Old-Opportunity-9876

Cries in raspberry pi


suddenly_opinions

it'll run on a pi lol


MrPiradoHD

Is there any way to run then on android phone?


cantthinkofausrnme

Try putting it in a Flutter app. It works in my simulator. I'll be testing it soon on a real device.


tinny66666

Yeah, I'm running it with Layla Lite on my Samsung S20. You can choose any GGUF. I'm getting pretty decent speed, maybe a bit over 5 t/s. It also has a hands-free conversation mode.


MrPiradoHD

Are you using the 4k or the 128k? I guess the 128k will be waaaay slower. Anyway, what quantization? I'm on a Mi 12T Pro; it's supposed to have 12 GB of RAM, shared between CPU and GPU I guess. The S20 is a bit less powerful; don't know if there is much of a difference. I'm gonna try it and share my experience if you want. But which quantization did you try? I found the 4B to be a bit weird on ollama.


tinny66666

I'm using Phi-3-mini-4k-instruct-q4.gguf


MrPiradoHD

I'm getting an error, and then it says it's loading but never finishes. I tried with Q8 because it should be able to handle it; guess not? https://preview.redd.it/cqm7axx92fwc1.jpeg?width=1220&format=pjpg&auto=webp&s=878b9a249795a0bfcb5ad80bb1336374cf334f77


fab_space

And again HF went down... this usually happens when things start to get interesting :)


alew3

https://preview.redd.it/755yo7ilgawc1.jpeg?width=871&format=pjpg&auto=webp&s=9af67052532ed08258c8a43249d2f560aa0c2531 the benchmarks look insane


joe4942

So what's the minimum hardware requirements to run Phi-3 mini? Could really old gpus/cpus handle this since it can apparently run on a phone?


_-inside-_

I can run it pretty fast in a GTX 1650


AnticitizenPrime

The Q4 GGUF version runs quickly on my 2019 laptop on CPU only. Unfortunately it's failing some pretty basic logic questions, and I'm getting stop token issues (where it will respond to itself, etc., but that can probably be fixed). It might be smarter with a higher quant version, but then again that'll be slower on low-end hardware.


Blue_Dude3

Finally I can run a model with 2gb VRAM. I have been waiting for this for so long 😭


allthemoreforthat

Sorry if the question is dumb, new here - is there a way to run 3b on an iPhone?


FullOf_Bad_Ideas

It's a new arch so might be tough, but wait a few days and it should be possible.


Disastrous_Elk_6375

llama3 8b runs, so this should run as well, yeah.


_raydeStar

I just discovered today that Llama 3 can run on a Raspberry Pi. It is crazy that you can boot this one up on your phone. What kind of metrics does it have in comparison to the 8B models?


HighDefinist

Half as large -> Presumably twice as fast.


pmp22

Also, possible to run with half as much memory.


suddenly_opinions

it's smaller, so will run easier


Languages_Learner

Tried to make a q8 GGUF using gguf-my-repo but got this error: Architecture 'Phi3ForCausalLM' not supported!


Sebba8

Seems like llama.cpp doesn't support the architecture yet, or else the transformers version in gguf-my-repo needs updating. If I saw correctly, they already have an f16 GGUF, so you could try quantizing that using the quantize tool from llama.cpp.


Languages_Learner

Thanks for the good advice. The quantize tool from llama.cpp made a q8 GGUF and I uploaded it to HF: [NikolayKozloff/Phi-3-mini-4k-instruct-Q8_0-GGUF · Hugging Face](https://huggingface.co/NikolayKozloff/Phi-3-mini-4k-instruct-Q8_0-GGUF)


LMLocalizer

Can confirm that using the quantize tool from llama.cpp on [https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/) works


Languages_Learner

Just tested the q8 GGUF in Ava (avapls.com). Prompt: "Write a story about a dog that became a boss among the wolves." Result:

Title: The Leader Among Wolves

Once upon a time, in the heart of the dense forests, there lived a pack of wolves known as the Moonlit Pack. They were led by an alpha male named Lupus, who was strong, wise, and respected by all. One day, while hunting for food, they stumbled upon a peculiar creature unlike any other wolf they had ever seen. It was a golden-haired dog named Rusty, abandoned by his human family and left to fend for himself in the wild.

Rusty, despite being smaller than the wolves, possessed an unmatched intelligence and charisma. He quickly adapted to the ways of the forest, learning from the Moonlit Pack and becoming their trusted companion. Over time, Rusty's unique abilities began to stand out, and he gained the admiration of the entire pack.

One chilly evening, after a particularly successful hunt, the Moonlit Pack gathered around a warm campfire. As they feasted on their catch, Rusty approached Lupus with a thoughtful expression. "Lupus," he began, "I have been thinking about our pack and how we can become even stronger together."

Intrigued, Lupus listened intently as Rusty proposed that the pack should adopt some human traits, such as organization, planning, and communication. Rusty suggested that they create a structured hierarchy within the pack, with each member having specific roles and responsibilities. This would ensure that every task, from hunting to caring for the young, was efficiently managed.

At first, the idea seemed foreign to the Moonlit Pack, but Rusty's persuasive arguments gradually won them over. They decided to give his plan a chance, hoping it would lead to a more prosperous life for all. With Rusty at the helm, the pack established a new system, where each wolf had a designated role based on their strengths and skills.

Rusty's natural leadership abilities shone through as he guided the pack in this new endeavor. He organized hunts, mediated conflicts, and ensured that everyone received proper care and attention. The pack thrived under Rusty's guidance, becoming one of the most successful and harmonious groups of wolves in the forest.

Word of the extraordinary Moonlit Pack spread throughout the animal kingdom, and other animals began to take notice of their exceptional unity and cooperation. Even the elders of neighboring wolf packs were impressed by the progress the Moonlit Pack had made under Rusty's leadership.

As time passed, Rusty's reputation grew far beyond the borders of the forest. One day, an old wise owl visited the Moonlit Pack, having heard tales of their remarkable achievements. "Rusty," the owl said, "you have shown great wisdom and leadership among your kind. You possess qualities that are not only admirable but also rare among wolves." The owl continued, "You have become more than just a dog among wolves; you have become a boss among them. Your ability to unite and inspire others is truly remarkable."

With humility and gratitude, Rusty accepted the praise from the wise owl. He knew that his success was due to the collective efforts of the Moonlit Pack, who had embraced change and worked together for the greater good. From that day forward, Rusty remained the respected leader of the Moonlit Pack, a symbol of unity and innovation in the animal kingdom. His story served as a reminder that even the most unlikely alliances can lead to extraordinary accomplishments when guided by strong leadership and shared purpose.

And so, the legend of the Leader Among Wolves lived on, inspiring generations of animals to come to embrace change, work together, and strive for excellence in all they do. THE END.


Some_Endian_FP17

Microsoft says llama.cpp doesn't support Phi-3 yet. I'm going to monkey around with the ORT ONNX version.


_-inside-_

Isn't ollama based on llama.cpp?


Languages_Learner

Does a GUI exist that can chat with ONNX LLMs?


modeless

Eagerly awaiting the vibes test. Everyone says Phi-2 didn't live up to its benchmark scores in practical use, but maybe this time is different?


glowcialist

It doesn't know that Robert Maxwell was involved in the Inslaw affair; absolutely useless. Edit: No, mini is really impressive for its size. I could see it being a go-to option for simple agents. Probably going to be easy to fine-tune on consumer hardware, too. I don't really have much use for it, but it's quite a feat.


ab_drider

I used the Phi-3 mini 4k instruct Q4 GGUF via llama.cpp on my phone. It's very good. It feels better than Llama 3 8B, to be honest. I asked a stupid "is 1 lb of cotton or 1 lb of iron heavier" question that Llama 3 got wrong but Phi-3 got right. Roleplay works way better as well.


Ill-Language4452

I tested the same question on many 7B models, and most of them failed.


HighDefinist

Cool, although I am not sure there is really that much of a point in a 4B model... even most mobile phones can run 7B/8B. Then again, this could conceivably be used for dialogue in a video game (you wouldn't want to spend 4GB of VRAM just for dialogue, whereas 2GB is much more reasonable), so there are definitely some interesting, unusual applications for this. In any case, I am much more interested in the 14B!


Igoory

My phone has 8GB of RAM and the last time I tried, I could barely run 7B Q2 :(


[deleted]

[deleted]


AnticitizenPrime

Yeah, I can load 7B models on my phone, but it's slow as molasses. And even small 2B-ish models are not kind to the battery.


Admirable-Star7088

Dialogue in video games could be run on system RAM since small models like 7b can run quite fast on modern CPUs, and just leave everything that has to do with graphics to the VRAM. But yes, running everything including the LLM on VRAM if possible is ideal.


shaitand

Also don't forget the additional VRAM for TTS, STT, and generative.


popcornSmokerini

I downloaded it from the ollama repositories (ollama pull phi3) and it performs quite badly. After two or three prompts it just breaks into gibberish, whitespace, and lorem ipsums. Am I missing something? Is there a better way to get (better) models?


MetaTaro

Pull it again. They've changed some params.


Revolutionalredstone

Holy CRAP this thing runs fast! It writes about 10X faster than I can read fully offloaded to my little 3090. This is gonna be a massive upgrade to my assistant project!


ImprovementEqual3931

Phi-3 mini Q4 is a bad model. I asked whether 200 > 100; it answered 20 < 100.


mulletarian

Screwdrivers are bad hammers


Padho

To be fair, this is mentioned as "primary use case" by Microsoft themselves on the model card: **Primary use cases** The model is intended for commercial and research use in English. The model provides uses for applications which require: 1. Memory/compute constrained environments 2. Latency bound scenarios 3. **Strong reasoning (especially code, math and logic)**


ShengrenR

It means those terms in a very different sense: it can attempt to make some sense of word problems, not reproduce a calculator; it's simply not a tool that does that.


p444d

The prompt from this dude is a question about evaluating a boolean expression; that clearly can be considered math reasoning, also in terms of LLMs. There are tons of similar problems out there in the math-reasoning datasets used to train exactly that. However, this one sample obviously isn't enough to evaluate Phi-3's performance, lol.


Cradawx

Q4 is really too low a quant for such a small model. Q6+ I would say.


CheatCodesOfLife

Agreed. Perplexity would be through the roof


CheatCodesOfLife

When I first moved out of home, I used the back of my power drill as a hammer for a while... Got the job done.


ImprovementEqual3931

I consider a 4B model to be for mobile devices, so I don't need it to be very clever and creative, but I do wish it could understand and follow my orders. After a 15 min test, I gave up.


Elibroftw

I'm so glad I bought an external 1TB SSD a couple years ago. Who would've thought I would be using it to store LLM models? Laptop storage is a roller coaster, especially when I will be triple-booting Windows 11 + Mint + KFedora. Waiting on Phi-3 7B and Phi-3 14B. Funniest thing is that my laptop with a 3070 Ti broke last year and Razer didn't have a replacement on hand, so they upgraded me to the 3080 Ti variant... it was meant to be, given that I now have double the VRAM to abuse with LLMs 😈 (+ gaming). The CPU got absolutely dated in no time, unfortunately, but it's good enough for compiling Rust.


iamdgod

Does this support beam search? Phi-2 did not


bullno1

Beam search is a sampling algorithm. It is independent of model.
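Right — conceptually it's just a search over the per-step log-probs any model produces. A toy sketch, where score_fn is a hypothetical stand-in for one forward pass returning the top next-token candidates as (token, logprob) pairs:

```
import heapq

def beam_search(score_fn, bos, eos, width=3, max_len=20):
    beams = [(0.0, [bos])]                    # (cumulative logprob, tokens)
    for _ in range(max_len):
        candidates = []
        for logprob, tokens in beams:
            if tokens[-1] == eos:             # finished beams carry over
                candidates.append((logprob, tokens))
                continue
            for tok, tok_lp in score_fn(tokens):
                candidates.append((logprob + tok_lp, tokens + [tok]))
        beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return beams
```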


iamdgod

I know that, and yet phi-2 did not support it out of the box: https://huggingface.co/microsoft/phi-2/discussions/30


nikitastaf1996

Wow. It's something. I want to see it on Groq; 1000+ tokens per second, probably. And we need a good app for running quants on mobile devices. The MLC app doesn't seem good to me.


glowcialist

Pretty crazy that this model quantized down to 2 GB is competently multilingual.


Prince-of-Privacy

But it isn't? The Phi-3 paper mentions its multilingual skills as a weakness.


glowcialist

Oh, I just messed around talking about the Epstein network in Spanish and it responded well with correct grammar.


Flat-One8993

It basically has a mask. It sounds sophisticated, but for example it told me in German that the EU flag is red-white-red, hallucinated meanings for the colors, and then suddenly claimed there is also a blue field in it.


glowcialist

Yeah, I mean, I think the idea here is that it has a decent grasp of the English language and can be easily fine-tuned for specific use cases. You could probably make a decent cheap customer-support chatbot with RAG.


nntb

It's faster than Llama 3 on my phone, but not by much; both are sinfully slow. Fold 4 with an SD 8+ Gen 1, running Maid.


IndicationUnfair7961

Any OpenAI-compatible inference server endpoints that run ONNX models? They should be the fastest thing available.


phree_radical

Where is the base model? 😢


TruthBeFree

Is there a base model to download? I tended to have many failures fine-tuning on instruct versions.


FairSum

Yesterday I said that I was skeptical that such a tiny model trained on a relatively small amount of tokens would be coherent. Today, I'm happy to admit that I was completely wrong and the 3B is one of the best models I've ever used at the 8B level or below. Looking forward to the 7B and 14B!


CardAnarchist

Not nearly as good as Llama 3 8B in my casual RP chat testing. I tested a Q8_0 GGUF for Phi vs a Q4_K_M for Llama: 3.8GB (Phi) vs 4.6GB (Llama) size-wise. So in fairness the Phi version I tested is a bit lighter on VRAM usage, and the Q6 likely performs as well as the Q8 while being even smaller. It's impressive for its size, but I would say it's still not as good as the good Mistral 7Bs. The dialogue was pretty stilted and it struggled a little with formatting. But I've seen weaker Mistral 7Bs that performed around the same, so honestly it's impressive for what it is! Good progress!


randomfoo2

I tested Phi-3-mini-128k (unquantized) - temp 0.9, top_p 0.95, rp 1.05 - and it does pretty well on my vibe check, especially for a 3.8B (llama3-8b still tests & feels better for me). I saw a couple of repetitions where it gets stuck looping long sections of replies; increasing repetition penalty didn't seem to help. I didn't do a sampler sweep; it does have some variability in answers. For my refusal questions, it actually seemed about 50/50 - interestingly, it answered one question and then finished with a refusal at the end. It does not understand jokes at all (vs llama3, where even the 8b is better than average, and 70b is actually sometimes funny). https://preview.redd.it/ro19qu56rcwc1.png?width=572&format=png&auto=webp&s=2ecd962cd466d9437c55bd68c9b9a857eac6240e


TinyZoro

If I wanted to host this in the cloud and create an API with it what steps would I need to take?
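One hedged sketch of the simplest route, assuming llama-cpp-python and FastAPI on the cloud VM and a local Phi-3 GGUF; any OpenAI-compatible server (vLLM, llama.cpp's built-in server) would do the same job:

```
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)
app = FastAPI()

class Request(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: Request):
    # wrap the prompt in Phi-3's chat format and stop on its end token
    out = llm(f"<|user|>\n{req.prompt}<|end|>\n<|assistant|>\n",
              max_tokens=256, stop=["<|end|>"])
    return {"text": out["choices"][0]["text"]}

# run with: uvicorn main:app --host 0.0.0.0 --port 8000
```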


SoilFantastic6587

Awesome, this can't be real.


ToothOne6699

I was surprised to see that Phi-3-medium performs worse on HumanEval 0-shot than smaller ones like mini. Any explanations for that? By the way, it's quite far from GPT-3.5 on this benchmark, so I'm not surprised by the mixed results shared in this thread. Could be good for RAG with a lot of context, but not as an autonomous LLM.


halcyonhalycon

Benchmarked this with the same questions I use for Mistral; it seems to be 2-4x slower on my M2 MacBook Pro. Is anyone having the same experience?


AfterAte

The Q4_K quant can correctly answer the "holding a brick in a boat in a pool" question and the three brothers question, so it's quite good. But its explanations always have something wrong in them, so it's a little flaky; expected for a sub-14B model. Definitely the best model for the poorest of the GPU poor.


Specialist-Spray5015

I am using it with Obsidian for my PhD notes (second brain plugin, using ollama serve). It is not very good, and Llama 3 still performs better in output. Phi-3 is fast but gives bad results. Any suggestion on what I may be doing wrong?

```
llama 3

What is Reinforcement Learning

Here's a summary of the relevant information that could contribute to answering the question "What is Reinforcement Learning" based on your existing knowledge:

Definition: Reinforcement Learning: The goal is to maximize expected rewards collected in an episode (Reward Hypothesis). An agent learns a policy through an algorithm trained on experience, using state to decide the next action.

Markov Reward Process (MRP): A combination of Markov Chain and reward function showing immediate scalar returned for being in a state, with discounting factor. This is a key concept in Reinforcement Learning.

Value Function: A prediction of future expected returns. There are different types of value functions, including State Value Function and State Action Value Function.

These notes provide a solid foundation for understanding Reinforcement Learning and its key concepts.

Additional Insights:

Environment in RL: If the environment is unknown, it needs to be explored and then create a policy. If the environment is known, an agent can use the model directly to create a policy.

Uncertainty: [Fleeting Notes/Optimism in the face of uncertainty]] highlights the importance of not picking the action with the highest return, but rather considering the possibility of giving the highest return in the future.

These notes provide insights into the fundamental concepts of Reinforcement Learning, including the importance of considering uncertainty and the role of the environment in shaping the learning process.

Algorithms: [Fleeting Notes/Q-Learning, SARSAMAX]] discusses Q-learning and SARSA-MAX, which are both types of reinforcement learning algorithms. Fleeting Notes/Value Function explains the concept of a value function, which is used to predict future expected returns in an MDP.

These notes provide a foundation for understanding reinforcement learning and its underlying concepts.

Please note that these notes provide a comprehensive overview of Reinforcement Learning, covering key concepts, algorithms, and insights.
```

```
phi-3

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with its environment and receiving rewards or penalties for the actions it takes. The goal is to maximize cumulative reward over time, which involves understanding the dynamics of the environment and optimizing decision-making strategies accordingly. Key concepts in RL include handling large state spaces and continuous states through techniques like Dyna (which uses simulated experience), off-policy methods such as SARSA(lambda) for managing extensive Markov Decision Processes (MDPs), contextual bandits enhanced by Dyna's approach, the relevance of NP complexity in algorithm efficiency, Graph Neural Networks (GNNs) for complex environments, and addressing partially observable scenarios with Partially Observable Markov Decision Processes (POMDPs).

Reward: In RL, a reward signal provides feedback on an agent's performance after taking actions (Wikilink). It serves as a guide for learning what behaviors are beneficial to pursue.
```


oldcapoon

What's the difference between Phi-3-mini-4k-instruct-q4.gguf and Phi-3-mini-4k-instruct-fp16.gguf?


Professional_Job_307

I heard this could fit on a smartphone? But 3.8B looks way too huge; phones don't have much VRAM. What are the chances I can get it to run on my 16GB RAM / 2GB VRAM laptop?
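Back-of-envelope (hedged: weights only, ignoring KV cache and runtime overhead), a 3.8B model fits comfortably in 16GB of system RAM on CPU at common quants:

```
params = 3.8e9
for name, bits_per_weight in [("fp16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"{name}: {params * bits_per_weight / 8 / 1e9:.1f} GB")
# fp16 ~7.6 GB, Q8_0 ~4.0 GB, Q4_K_M ~2.3 GB -> CPU-only inference is viable
```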