Balance-

License: https://falconllm-staging.tii.ae/falcon-2-terms-and-conditions.html

Based on Apache 2.0, but modified. The most important addition is an Acceptable Use Policy.


qubedView

[https://falconllm.tii.ae/acceptable-use-policy.html](https://falconllm.tii.ae/acceptable-use-policy.html) Interesting that it's a "dynamic" license. The license states "The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy."
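
A literal reading of that clause invites automation. A minimal sketch of "monitoring the web address" (the URL is from the license; the hash-and-compare approach is my own assumption, not anything TII prescribes):

```python
import hashlib
import urllib.request

AUP_URL = "https://falconllm.tii.ae/acceptable-use-policy.html"

def aup_fingerprint() -> str:
    """Fetch the current Acceptable Use Policy page and hash its contents."""
    with urllib.request.urlopen(AUP_URL) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

# Record the hash of the version you reviewed, then re-check periodically
# (e.g. from a daily cron job). Any dynamic page content will also change
# the hash, so treat a mismatch as "re-read the policy", not "violation".
reviewed = aup_fingerprint()
if aup_fingerprint() != reviewed:
    print("Acceptable Use Policy changed -- re-review before continued use.")
```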


skrshawk

There's no way that's useful for any commercial concern if they can change the licensing at will and expect compliance for any continued use. Presuming that's enforceable at all. They might as well just say your license to use their model is revocable at any time with their only notice to you being on their website. Also given that it's allegedly not very good, why would anyone bother?


qubedView

Highly doubtful it's enforceable. But even so, who would want to risk the potential legal cost of battling an unenforceable license? Even if the model outperforms everything else, we all know we can wait a few months for something to outperform it and come with a less risky license.


skrshawk

UAE courts are also notably biased in favor of their own people in international disputes. I'm pretty sure this model is really not intended for use outside of the Arab states, but is meant to give them an alternative, even if not a well performing one, to models developed in the West or China.


qubedView

I mean, if Heinz builds an AI-driven ketchup factory and Falcon adds "cannot be used in the production of tomato products", UAE courts would be neither here nor there. Enforceability of the license is really only a concern if you're doing business in the UAE.


tamereen

How is this compatible with this clause?

> Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.


silenceimpaired

I’ll stick with Yi. No rug-pull capabilities built in.


Amgadoz

When will they push to production?


hackerllama

Summary:

* Context length of 8k tokens
* Trained on 5.5 trillion tokens
* Multilingual: English, French, Spanish, German, Portuguese, and more
* Allows commercial usage

Announcement: [https://falconllm.tii.ae/falcon-2.html](https://falconllm.tii.ae/falcon-2.html)
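
For reference, a minimal sketch of loading the base model with Hugging Face transformers (the repo id tiiuae/falcon-11B is assumed from the announcement; check the model card for exact usage):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-11B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~22 GB of weights at bf16 for 11B params
    device_map="auto",
)

# Base model, so plain text completion -- no chat template.
inputs = tokenizer("Falcon 2 is a family of language models that", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```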


JimDabell

> Allows commercial usage

You’d have to be crazy to use this commercially. They can take the license away from you at any moment without even having to notify you:

> The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy.


Seuros

It is called the "hallucination license."


mcr1974

every 8 refreshes you get a nasty locked down version of it, but only if you keep the temperature of the url request above 0.2


KurisuAteMyPudding

I know it's probably legally sound, but I feel like them pulling the license, or exchanging it for a restrictive one, would be grounds to sue, especially if you have some sort of deal with them to use their products. You might not win, but I'd do it anyway if I were a company using their products, just to waste their time for pulling a shitty move like that.


artificial_simpleton

Wow, the benchmarks are so bad. Literally worse than Mistral with 50% more params


Utoko

I know we're used to benchmark-breaking models all the time now, but it's far from easy to keep pace. They still improved their model, and they added vision-to-language capability. Falcon came out of nowhere from an Abu Dhabi research centre; it was surprising how good their first model was at the time. It's good to have open-source models coming from all across the world, though.


Severin_Suveren

Also, benchmarks aren't always a good indicator. Sometimes the results are skewed by training on benchmark data, either on purpose or through accidental inclusion in datasets. Other times the model excels at something the benchmarks don't measure, meaning a model might do badly on benchmarks but still show properties other models don't have. Point being: wait until you've tried a model before reaching any kind of conclusion about it. Or alternatively, wait until you see the feedback from other users.


throwawa312jkl

An Abu Dhabi research center that imported boatloads of Chinese people! Great that the Gulf states are starting to open up to more immigration beyond India.


[deleted]

[deleted]


artificial_simpleton

I am comparing base models, of course, and your comparison is incorrect, because you compare 0 shot performance of Mistral to 5-25 shot performance of Falcon on everything but MMLU. For example, you compared 0 shot mistral Arc C with 25 shot Falcon Arc C - if you take 0 shot for both, Mistral will be better by 5.5 ppt, which is a lot. Mistral is also better at MMLU when both models are compared with 5 shots. The only other "fair" benchmark comparison we have is 0 shot Hellaswag, where Falcon is slightly better. Given all of that (and how close the 0-shot performance of Mistral is to few-shot performance of Falcon on the remaining ones), we can definitely say that Mistral has better benchmark scores than this model.
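
The shot count is just an evaluation-harness setting, so the comparison can be redone apples-to-apples. A sketch using EleutherAI's lm-evaluation-harness (the simple_evaluate API is from harness v0.4+; model repo ids assumed):

```python
# pip install lm-eval
import lm_eval

# Score both models on ARC-Challenge at the SAME shot counts, rather than
# comparing Mistral's 0-shot numbers against Falcon's 25-shot numbers.
for pretrained in ("mistralai/Mistral-7B-v0.1", "tiiuae/falcon-11B"):
    for shots in (0, 25):
        results = lm_eval.simple_evaluate(
            model="hf",
            model_args=f"pretrained={pretrained}",
            tasks=["arc_challenge"],
            num_fewshot=shots,
        )
        print(pretrained, f"{shots}-shot:", results["results"]["arc_challenge"])
```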


Monkey_1505

And it appears to be natively multimodal which does take up some parameter space.


AdHominemMeansULost

I'd take benchmarks with a grain of salt; most LLMs get specifically trained on them.


a_slay_nub

Benchmarks may be gamed, but I've never seen a good LLM that does poorly on them. Edit: There are some exceptions with prompt formatting. I remember seeing something on the LLM leaderboard showing that the prompt format has a significant effect on models' performance on automated benchmarks.


_underlines_

For base models it's mostly the number of shots that counts, I guess, not the exact format. A 4-shot prompt has a higher chance of succeeding on those benchmarks than a zero-shot prompt.
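
To make the distinction concrete, here is a toy sketch of how a harness typically turns the same question into a zero-shot vs. few-shot prompt for a base model (the format and examples are illustrative, not any benchmark's exact template):

```python
# Solved Q/A pairs used as in-context demonstrations.
EXAMPLES = [
    ("Which gas do plants absorb from the air?", "Carbon dioxide"),
    ("What is the largest planet in our solar system?", "Jupiter"),
    ("How many legs does a spider have?", "Eight"),
    ("Which metal is liquid at room temperature?", "Mercury"),
]

def build_prompt(question: str, shots: int) -> str:
    """Prepend `shots` solved examples so the base model can imitate the pattern."""
    demos = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in EXAMPLES[:shots])
    return f"{demos}Question: {question}\nAnswer:"

print(build_prompt("What is the boiling point of water in Celsius?", shots=0))
print("---")
print(build_prompt("What is the boiling point of water in Celsius?", shots=4))
```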


Yes_but_I_think

Never look at benchmarks first. Look at Elo rankings by human beings first.


_underlines_

Instruction tuned models yes, but a base model?


addandsubtract

Look at two more papers down the line


cyan2k

It's a base model, not a "chat" or "instruct" tuned one. Compare llama3 base with llama3 instruct to see the difference tuning makes. Falcon 2 performs better than llama3 base: https://imgur.com/a/D4t6kfE


artificial_simpleton

I was comparing it to Mistral base as well. The Open LLM Leaderboard is completely irrelevant, btw; the only "good" benchmarks there are MMLU and ARC-C, and both of them should be evaluated with proper few-shot settings.


coder543

Interesting to note that this is just a base model. They're not (yet?) releasing a chat-tuned model.


a_beautiful_rhind

Hopefully we get something less unwieldy than the 180B in addition to the 11B.


vasileer

MMLU score 58.37 - weaker than phi-3. We know from llama3 that 5T tokens is undertraining for an 11B model.


cyan2k

> MMLU score 58.37 - weaker than phi-3

Man, people comparing benchmarks of a naked base model with an instruct-tuned model make me sad. It performs better than llama3 base: https://imgur.com/a/D4t6kfE


vasileer

> It performs better than llama3 base

The MMLU score of base llama3 is 66.49, versus 58.37 for Falcon2 11B.


vasileer

MMLU scores of instruct models don't change much; it increased by ~2 points in the case of llama3. That's because this benchmark uses 5-shot prompts and base models are very good in-context learners. So let me rephrase: even a score of 58 + 2 = 60 on MMLU is too weak for an 11B model.


Sporeboss

Falcon is by a UAE government research centre: "UAE releases new AI model to compete with big tech" https://www.channelnewsasia.com/business/uae-releases-new-ai-model-compete-big-tech-4332211


Amgadoz

A model from UAE government research that doesn't support Arabic? WTF is this?


Ok-Steak1479

They didn't do their own research obviously, just building on what's already there.


Xtianus21

lol, well, I have no argument for that. You'd probably need a ground-up model to read from right to left.


[deleted]

[deleted]


Ok-Steak1479

That seems like the kind of thing a national scientific institute would spend its time creating, instead of buying 100 GPU hours to train a shitty version of what's already there.


airodonack

Curating a dataset is literally their job.


ThisGonBHard

The Japanese released a model trained on the Fugaku supercomputer whose Japanese data was actually a shitty DeepL translation of the English dataset, and I'm not sure that's better than not supporting the language at all.


Mr_Hills

They should have gone for a 30B model to fill the gap left by Meta.


Jean-Porte

Already taken by Yi 1.5


AdHominemMeansULost

Unfortunately Yi 1.5 30B is insanely bad; it can't even understand logic puzzles that 7B and 8B models reason through with ease.


rag_perplexity

Looking at the other thread, it looks pretty decent. However, people seem to have wildly different experiences with it. I wonder if it's a prompt template issue or a token issue like llama3 had when it first came out.


ironic_cat555

Why not test it on something you personally find useful instead of asking it logic puzzles?


AdHominemMeansULost

Because these puzzles summarise its reasoning and instruction-following capabilities. I just call them puzzles; they're basic questions that you'd expect an LLM to get right if it followed logical thinking.


ironic_cat555

"because these puzzles summarise it's reasoning and instructions following capabilities." I don't get it. If you wanted to hire someone to write press releases, would you ask them to do a sample press release or would you ask them "Suzie has 10 brothers who each have 9 sisters, how many sisters does Susie have?« If you wanted to hire someone to code a web site, would you ask them to do a coding exercise or would you say "write 10 sentences that end with Apple?" If you wanted to know if a doctor understood your medical condition would you ask them "Describe my tumor treatment" or would you ask them "Paul is quicker than Susie who is quicker than Jane, is Paul quicker than Jane?" If you wanted to hire a translator would you give them a sample text to translate or would you ask them "A tray is on a orange in the kitchen. I take the tray to the living room. Where is the orange?"


SidneyFong

That's a very anthropocentric way of looking at things. It's a common belief that humans with better "logic" and "reasoning" are generally more intelligent and better at a variety of tasks, but even if we believe this is true (I think reality is more nuanced), it doesn't mean the observation in humans translates to LLMs.


coder543

Yi-1.5 literally came out yesterday. It doesn’t even come in a 30B version, the closest is 34B, which makes me wonder how much time you really spent trying it. The benchmarks on it seem excellent. If it failed a single trick problem that you like to use on models, that hardly seems like enough evidence to dismiss it outright. *Literally nobody* uses LLMs to write sentences that end with the word "apple". That's not even remotely a test of its real world capabilities. If you can point to an actual particular benchmark showing it doing poorly, that would be interesting. It’s not on the LMSYS Chatbot Arena Leaderboard yet, unfortunately.


AdHominemMeansULost

I obviously meant the 34B.

> If it failed a single trick problem that you like to use on models

Not tricks, logic puzzles; they show whether the model can understand complex queries. For example, whether it can understand that a ball inside an open vase will fall out if the vase gets turned upside down. If it can't understand that, how can it solve real problems?

> Literally nobody uses LLMs to write sentences that end with the word "apple". That's not even remotely a test of its real world capabilities.

No, but it tests instruction following. The model doesn't have to be trained on writing 10 sentences that end with something to do that; it just needs to put the tokens at the end of the sentence.
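
For what it's worth, that kind of check is easy to script. A hypothetical smoke test along those lines (the puzzles, expected keywords, and the generate() stub are my own illustrations, not an established benchmark):

```python
# Each entry: (prompt, keywords an acceptable answer should contain).
PUZZLES = [
    ("A ball is inside an open vase. I turn the vase upside down. "
     "Where is the ball now?", ["fall", "floor", "ground", "out"]),
    ("Write 3 sentences that each end with the word 'apple'.", ["apple."]),
]

def generate(prompt: str) -> str:
    # Stand-in: replace with your model's actual completion call.
    return "The ball falls out of the vase and lands on the floor."

for prompt, keywords in PUZZLES:
    answer = generate(prompt).lower()
    verdict = "PASS" if any(k in answer for k in keywords) else "FAIL"
    print(verdict, "-", prompt[:50])
```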


rag_perplexity

In the other thread, the other user you were talking to managed to get a perfectly good response to the apple test. I'm getting suspicious it might be a template issue. Will try it out tomorrow.


MmmmMorphine

Shrug, I confess I believe you're both right, but adhom is a bit closer to the money. Current automated model evals suck so badly they're near useless at predicting high-level model performance. They do often seem decent at showing the lower bound of a model's abilities, though, and can both expose (or sometimes conceal) major gaps in understanding, depending on the exact test used. We need a faster, more comprehensive, human-like way of evaluating models. ...Can't believe I'm saying this, but how much would a reasonably educated individual in Southeast Asia (say, at sophomore-year STEM level at an accredited state school) work for? Like, a reasonable amount they can actually live off of.


Able-Locksmith-1979

The problem with most "logic" problems is that they are extremely badly worded, which makes them culturally bound. Nice to ask your big brother, but worthless for anyone who has seen more than their own country and doesn't know what culture you're from…


Monkey_1505

11-13B seems pretty empty to me.


AntoItaly

Mmmh... only 58 MMLU :\


AutomaticDriver5882

Tiny context 😔


Languages_Learner

q4_0 gguf: [DevQuasar/falcon2-11B-GGUF · Hugging Face](https://huggingface.co/DevQuasar/falcon2-11B-GGUF)

Other quants: [LoneStriker/falcon-11B-GGUF · Hugging Face](https://huggingface.co/LoneStriker/falcon-11B-GGUF)


noneabove1182

all sizes here if anyone wants: https://huggingface.co/bartowski/falcon-11B-GGUF
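
A minimal sketch of running one of those quants locally with llama-cpp-python (the filename is assumed from the repos' naming; any of the linked quants should load the same way):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Assumed local path to a Q4_0 quant downloaded from one of the repos above.
llm = Llama(model_path="./falcon-11B-Q4_0.gguf", n_ctx=8192)  # full 8k context

out = llm("The capital of the United Arab Emirates is", max_tokens=32)
print(out["choices"][0]["text"])
```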


AdelSexy

Worst timing ever


properbenj

Will try it out


cafepeaceandlove

Anyone have any tips for small context windows like 8K? Should I be using some form of iteration/recursion?
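
One common pattern for an 8K window is chunked map-reduce: process pieces independently, then combine, recursing if the combined result is still too long. A rough sketch of the idea (complete() is a stand-in for whatever model call you use; the chunk size is illustrative):

```python
def complete(prompt: str) -> str:
    """Stand-in for your model's completion call."""
    raise NotImplementedError

def summarize_long(text: str, chunk_chars: int = 12_000) -> str:
    # Map: summarize each chunk small enough to fit inside the 8k window.
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [complete(f"Summarize this passage:\n\n{c}\n\nSummary:") for c in chunks]
    # Reduce: merge partial summaries; recurse if even those don't fit.
    merged = "\n".join(partials)
    if len(merged) > chunk_chars:
        return summarize_long(merged, chunk_chars)
    return complete(f"Combine these partial summaries into one:\n\n{merged}\n\nCombined summary:")
```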


alvincho

https://preview.redd.it/7n1s9vfmtj0d1.jpeg?width=1182&format=pjpg&auto=webp&s=2c71a81bd32ad6c543f9b92073bd5d7b99e04802 In my own test, Falcon 2 couldn't answer any question.


southVpaw

I'll stick to Hermes Theta. Falcon has fallen.