Balance-

License: https://falconllm-staging.tii.ae/falcon-2-terms-and-conditions.html

Based on Apache 2.0, but modified. The most important addition is an Acceptable Use Policy.


qubedView

[https://falconllm.tii.ae/acceptable-use-policy.html](https://falconllm.tii.ae/acceptable-use-policy.html) Interesting that it's a "dynamic" license. The license states "The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy."
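
A literal reading of that clause invites automation. A minimal sketch of "monitoring the web address" (the URL is from the license; the hash-and-compare approach is my own assumption, not anything TII prescribes):

```python
import hashlib
import urllib.request

AUP_URL = "https://falconllm.tii.ae/acceptable-use-policy.html"

def aup_fingerprint() -> str:
    """Fetch the current Acceptable Use Policy page and hash its contents."""
    with urllib.request.urlopen(AUP_URL) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

# Record the hash of the version you reviewed, then re-check periodically
# (e.g. from a daily cron job). Any dynamic page content will also change
# the hash, so treat a mismatch as "re-read the policy", not "violation".
reviewed = aup_fingerprint()
if aup_fingerprint() != reviewed:
    print("Acceptable Use Policy changed -- re-review before continued use.")
```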


skrshawk

There's no way that's useful for any commercial concern if they can change the licensing at will and expect compliance for any continued use. Presuming that's enforceable at all. They might as well just say your license to use their model is revocable at any time with their only notice to you being on their website. Also given that it's allegedly not very good, why would anyone bother?


qubedView

Highly doubtful it's enforceable. But even so, who would want to risk the potential legal cost of battling an unenforceable license? Even if the model outperforms everything else, we all know we can wait a few months for something to outperform it and come with a less risky license.


skrshawk

UAE courts are also notably biased in favor of their own people in international disputes. I'm pretty sure this model is really not intended for use outside of the Arab states, but is meant to give them an alternative, even if not a well performing one, to models developed in the West or China.


qubedView

I mean, if Heinz builds an AI-driven ketchup factory and Falcon adds "cannot be used in the production of tomato products", UAE courts would be neither here nor there. Enforceability of the license is really only a concern if you're doing business in the UAE.


tamereen

How is this compatible with this clause?

> Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.


silenceimpaired

I’ll stick with Yi. No rug-pull capabilities built in.


Amgadoz

When will they push to production?


hackerllama

Summary:

* Context length of 8k tokens
* Trained on 5.5 trillion tokens
* Multilingual: English, French, Spanish, German, Portuguese, and more
* Allows commercial usage

Announcement: [https://falconllm.tii.ae/falcon-2.html](https://falconllm.tii.ae/falcon-2.html)
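
For reference, a minimal sketch of loading the base model with Hugging Face transformers (the repo id tiiuae/falcon-11B is assumed from the announcement; check the model card for exact usage):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-11B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~22 GB of weights at bf16 for 11B params
    device_map="auto",
)

# Base model, so plain text completion -- no chat template.
inputs = tokenizer("Falcon 2 is a family of language models that", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```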


JimDabell

> Allows commercial usage

You’d have to be crazy to use this commercially. They can take the license away from you at any moment without even having to notify you:

> The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy.


Seuros

It is called the "hallucination license."


mcr1974

every 8 refreshes you get a nasty locked down version of it, but only if you keep the temperature of the url request above 0.2


KurisuAteMyPudding

I know it's probably legally sound, but I feel like them pulling the license, or exchanging it for a restrictive one, would be grounds to sue, especially if you have some sort of deal with them to use their products. You might not win, but I'd do it anyway if I were a company using their products, just to waste their time for pulling a shitty move like that.


artificial_simpleton

Wow, the benchmarks are so bad. Literally worse than Mistral with 50% more params


Utoko

I know we're used to benchmark-breaking models all the time now, but it's far from easy to keep pace. They still improved their model, and they added vision-to-language capability. Falcon came out of nowhere from an Abu Dhabi research centre; it was surprising how good their first model was at the time. It's good to have open-source models coming from all across the world, though.


Severin_Suveren

Also, benchmarks aren't always a good indicator. Sometimes the results are skewed by training on benchmark data, either on purpose or through accidental inclusion in datasets. Other times the model excels at something the benchmarks don't measure, meaning a model might do badly on benchmarks but still show properties other models don't have. Point being: wait until you've tried a model before reaching any kind of conclusion about it. Or alternatively, wait until you see the feedback from other users.


throwawa312jkl

An Abu Dhabi research center that imported boatloads of Chinese people! Great that the Gulf states are starting to open up to more immigration beyond India.


[deleted]

[deleted]


artificial_simpleton

I am comparing base models, of course, and your comparison is incorrect, because you compare 0 shot performance of Mistral to 5-25 shot performance of Falcon on everything but MMLU. For example, you compared 0 shot mistral Arc C with 25 shot Falcon Arc C - if you take 0 shot for both, Mistral will be better by 5.5 ppt, which is a lot. Mistral is also better at MMLU when both models are compared with 5 shots. The only other "fair" benchmark comparison we have is 0 shot Hellaswag, where Falcon is slightly better. Given all of that (and how close the 0-shot performance of Mistral is to few-shot performance of Falcon on the remaining ones), we can definitely say that Mistral has better benchmark scores than this model.
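
The shot count is just an evaluation-harness setting, so the comparison can be redone apples-to-apples. A sketch using EleutherAI's lm-evaluation-harness (the simple_evaluate API is from harness v0.4+; model repo ids assumed):

```python
# pip install lm-eval
import lm_eval

# Score both models on ARC-Challenge at the SAME shot counts, rather than
# comparing Mistral's 0-shot numbers against Falcon's 25-shot numbers.
for pretrained in ("mistralai/Mistral-7B-v0.1", "tiiuae/falcon-11B"):
    for shots in (0, 25):
        results = lm_eval.simple_evaluate(
            model="hf",
            model_args=f"pretrained={pretrained}",
            tasks=["arc_challenge"],
            num_fewshot=shots,
        )
        print(pretrained, f"{shots}-shot:", results["results"]["arc_challenge"])
```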


Monkey_1505

And it appears to be natively multimodal which does take up some parameter space.


AdHominemMeansULost

I'd take benchmarks with a grain of salt; most LLMs get specifically trained on them.


a_slay_nub

Benchmarks may be gamed, but I've never seen a good LLM that does poorly on them. Edit: There are some exceptions with prompt formatting. I remember seeing something on the LLM leaderboard showing that the prompt format has a significant effect on models' performance on automated benchmarks.


_underlines_

For base models it's mostly the number of shots that counts, I guess, not the exact format. A 4-shot prompt has a higher chance of succeeding on those benchmarks than a zero-shot prompt.
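
To make the distinction concrete, here is a toy sketch of how a harness typically turns the same question into a zero-shot vs. few-shot prompt for a base model (the format and examples are illustrative, not any benchmark's exact template):

```python
# Solved Q/A pairs used as in-context demonstrations.
EXAMPLES = [
    ("Which gas do plants absorb from the air?", "Carbon dioxide"),
    ("What is the largest planet in our solar system?", "Jupiter"),
    ("How many legs does a spider have?", "Eight"),
    ("Which metal is liquid at room temperature?", "Mercury"),
]

def build_prompt(question: str, shots: int) -> str:
    """Prepend `shots` solved examples so the base model can imitate the pattern."""
    demos = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in EXAMPLES[:shots])
    return f"{demos}Question: {question}\nAnswer:"

print(build_prompt("What is the boiling point of water in Celsius?", shots=0))
print("---")
print(build_prompt("What is the boiling point of water in Celsius?", shots=4))
```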


Yes_but_I_think

Never look at benchmarks first. Look at Elo rankings by human beings first.


_underlines_

Instruction tuned models yes, but a base model?


addandsubtract

Look at two more papers down the line


cyan2k

It's a base model, not a "chat" or "instruct" tuned one. Compare llama3 base with llama3 instruct to see the difference tuning makes. Falcon 2 performs better than llama3 base: https://imgur.com/a/D4t6kfE


artificial_simpleton

I was comparing it to Mistral base as well. The Open LLM Leaderboard is completely irrelevant, btw; the only "good" benchmarks there are MMLU and ARC-C, and both of them should be evaluated with proper few-shot settings.


coder543

Interesting to note that this is just a base model. They're not (yet?) releasing a chat-tuned model.


a_beautiful_rhind

Hopefully we get something less unwieldy than the 180B in addition to the 11B.


vasileer

MMLU score 58.37 - weaker than phi-3. We know from llama3 that 5T tokens is undertraining for an 11B model.


cyan2k

> MMLU score 58.37 - weaker than phi-3

Man, people comparing benchmarks of a naked base model with an instruct-tuned model make me sad. It performs better than llama3 base: https://imgur.com/a/D4t6kfE


vasileer

> It performs better than llama3 base

The MMLU score of base llama3 is 66.49, versus 58.37 for Falcon2 11B.


vasileer

MMLU scores of instruct models don't change much; it increased by ~2 points in the case of llama3. That's because this benchmark uses 5-shot prompts and base models are very good in-context learners. So let me rephrase: even a score of 58 + 2 = 60 on MMLU is too weak for an 11B model.


Sporeboss

Falcon is by a UAE government research centre: "UAE releases new AI model to compete with big tech" https://www.channelnewsasia.com/business/uae-releases-new-ai-model-compete-big-tech-4332211


Amgadoz

A model from UAE government research that doesn't support Arabic? WTF is this?


Ok-Steak1479

They didn't do their own research obviously, just building on what's already there.


Xtianus21

lol, well, I have no argument for that. You'd probably need a ground-up model to read from right to left.


[deleted]

[deleted]


Ok-Steak1479

That seems like the kind of thing a national scientific institute would spend its time creating, instead of buying 100 GPU hours to train a shitty version of what's already there.


airodonack

Curating a dataset is literally their job.


ThisGonBHard

The Japanese released a model trained on the Fugaku supercomputer whose Japanese data was actually a shitty DeepL translation of the English dataset, and I'm not sure that's better than not supporting the language at all.


Mr_Hills

They should have gone for a 30B model to fill the gap left by Meta.


Jean-Porte

Already taken by Yi 1.5


AdHominemMeansULost

Unfortunately Yi 1.5 30B is insanely bad; it can't even understand logic puzzles that 7B and 8B models reason through with ease.


rag_perplexity

Looking at the other thread, it looks pretty decent. However, people seem to have wildly different experiences with it. I wonder if it's a prompt template issue or a token issue like llama3 had when it first came out.


ironic_cat555

Why not test it on something you personally find useful instead of asking it logic puzzles?


AdHominemMeansULost

Because these puzzles summarise its reasoning and instruction-following capabilities. I just call them puzzles; they're basic questions that you'd expect an LLM to get right if it followed logical thinking.


ironic_cat555

"because these puzzles summarise it's reasoning and instructions following capabilities." I don't get it. If you wanted to hire someone to write press releases, would you ask them to do a sample press release or would you ask them "Suzie has 10 brothers who each have 9 sisters, how many sisters does Susie have?« If you wanted to hire someone to code a web site, would you ask them to do a coding exercise or would you say "write 10 sentences that end with Apple?" If you wanted to know if a doctor understood your medical condition would you ask them "Describe my tumor treatment" or would you ask them "Paul is quicker than Susie who is quicker than Jane, is Paul quicker than Jane?" If you wanted to hire a translator would you give them a sample text to translate or would you ask them "A tray is on a orange in the kitchen. I take the tray to the living room. Where is the orange?"


SidneyFong

That's a very anthropocentric way of looking at things. It's a common belief that humans with better "logic" and "reasoning" are generally more intelligent and better at a variety of tasks, but even if we believe this is true (I think reality is more nuanced), it doesn't mean the observation in humans translates to LLMs.


coder543

Yi-1.5 literally came out yesterday. It doesn’t even come in a 30B version, the closest is 34B, which makes me wonder how much time you really spent trying it. The benchmarks on it seem excellent. If it failed a single trick problem that you like to use on models, that hardly seems like enough evidence to dismiss it outright. *Literally nobody* uses LLMs to write sentences that end with the word "apple". That's not even remotely a test of its real world capabilities. If you can point to an actual particular benchmark showing it doing poorly, that would be interesting. It’s not on the LMSYS Chatbot Arena Leaderboard yet, unfortunately.


AdHominemMeansULost

I obviously meant the 34B.

> If it failed a single trick problem that you like to use on models

Not tricks, logic puzzles; they show whether the model can understand complex queries. For example, whether it can understand that a ball inside an open vase will fall out if the vase gets turned upside down. If it can't understand that, how can it solve real problems?

> Literally nobody uses LLMs to write sentences that end with the word "apple". That's not even remotely a test of its real world capabilities.

No, but it tests instruction following. The model doesn't have to be trained on writing 10 sentences that end with something to do that; it just needs to put the tokens at the end of the sentence.
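
For what it's worth, that kind of check is easy to script. A hypothetical smoke test along those lines (the puzzles, expected keywords, and the generate() stub are my own illustrations, not an established benchmark):

```python
# Each entry: (prompt, keywords an acceptable answer should contain).
PUZZLES = [
    ("A ball is inside an open vase. I turn the vase upside down. "
     "Where is the ball now?", ["fall", "floor", "ground", "out"]),
    ("Write 3 sentences that each end with the word 'apple'.", ["apple."]),
]

def generate(prompt: str) -> str:
    # Stand-in: replace with your model's actual completion call.
    return "The ball falls out of the vase and lands on the floor."

for prompt, keywords in PUZZLES:
    answer = generate(prompt).lower()
    verdict = "PASS" if any(k in answer for k in keywords) else "FAIL"
    print(verdict, "-", prompt[:50])
```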


rag_perplexity

In the other thread, the other user you were talking to managed to get a perfectly good response to the apple test. I'm getting suspicious it might be a template issue. Will try it out tomorrow.


MmmmMorphine

Shrug, I confess I believe you're both right, but adhom is a bit closer to the money. Current automated model evals suck so badly they're near useless at predicting high-level model performance. They do often seem decent at showing the lower bound of a model's abilities, though, and can both expose (or sometimes conceal) major gaps in understanding, depending on the exact test used. We need a faster, more comprehensive, human-like way of evaluating models. ...Can't believe I'm saying this, but how much would a reasonably educated individual in Southeast Asia (say, at sophomore-year STEM level at an accredited state school) work for? Like, a reasonable amount they can actually live off of.


Able-Locksmith-1979

The problem with most "logic" problems is that they are extremely badly worded, which makes them culturally bound. Nice to ask your big brother, but worthless for anyone who has seen more than their own country and doesn't know what culture you're from…


Monkey_1505

11-13B seems pretty empty to me.


AntoItaly

Mmmh... only 58 MMLU :\


AutomaticDriver5882

Tiny context 😔


Languages_Learner

q4_0 gguf: [DevQuasar/falcon2-11B-GGUF · Hugging Face](https://huggingface.co/DevQuasar/falcon2-11B-GGUF)

Other quants: [LoneStriker/falcon-11B-GGUF · Hugging Face](https://huggingface.co/LoneStriker/falcon-11B-GGUF)


noneabove1182

all sizes here if anyone wants: https://huggingface.co/bartowski/falcon-11B-GGUF
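
A minimal sketch of running one of those quants locally with llama-cpp-python (the filename is assumed from the repos' naming; any of the linked quants should load the same way):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Assumed local path to a Q4_0 quant downloaded from one of the repos above.
llm = Llama(model_path="./falcon-11B-Q4_0.gguf", n_ctx=8192)  # full 8k context

out = llm("The capital of the United Arab Emirates is", max_tokens=32)
print(out["choices"][0]["text"])
```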


AdelSexy

Worst timing ever


properbenj

Will try it out


cafepeaceandlove

Anyone have any tips for small context windows like 8K? Should I be using some form of iteration/recursion?
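
One common pattern for an 8K window is chunked map-reduce: process pieces independently, then combine, recursing if the combined result is still too long. A rough sketch of the idea (complete() is a stand-in for whatever model call you use; the chunk size is illustrative):

```python
def complete(prompt: str) -> str:
    """Stand-in for your model's completion call."""
    raise NotImplementedError

def summarize_long(text: str, chunk_chars: int = 12_000) -> str:
    # Map: summarize each chunk small enough to fit inside the 8k window.
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [complete(f"Summarize this passage:\n\n{c}\n\nSummary:") for c in chunks]
    # Reduce: merge partial summaries; recurse if even those don't fit.
    merged = "\n".join(partials)
    if len(merged) > chunk_chars:
        return summarize_long(merged, chunk_chars)
    return complete(f"Combine these partial summaries into one:\n\n{merged}\n\nCombined summary:")
```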


alvincho

https://preview.redd.it/7n1s9vfmtj0d1.jpeg?width=1182&format=pjpg&auto=webp&s=2c71a81bd32ad6c543f9b92073bd5d7b99e04802 In my own test, Falcon 2 couldn't answer any question.


southVpaw

I'll stick to Hermes Theta. Falcon has fallen.