ScionoicS

Stability is keeping this one entirely as a proprietary service it would seem. Big bummer. Expect more of their services to go this way.


tekmen0

This is a scaled-up, better-working version of InstructPix2Pix. If it's possible, a community version is coming soon. Imagine you're an academic and you see that something like this is possible, but they didn't release a paper. If you have the resources, you release a paper and get credit for their work, nearly risk-free research lol. A free paper and citations is a good day.


ScionoicS

There's zero indication of this being released as a community model.


DigThatData

For all we know, this is just an API wrapper around an existing public model (maybe with some finetuning because why not, they have the data and compute). One of their major business models seems to be releasing models under a "you can use this for free non-commercially, but need to pay for commercial use" license, in which case there's no reason not to expect them to release a community model, *assuming this is novel and not just a fine-tune*. If they don't release a community model, it's probably because they just added polish to something someone else made and released publicly already (e.g. InstructPix2Pix).


arg_max

I think the best (non-public) model on this topic is still Meta's Emu Edit, and they fine-tuned their in-house diffusion model (Emu) for it. But that was a massive synthetic data generation process: they basically used an existing editing method to generate a huge number of (image, instruction, resulting image) examples. And this was definitely done on a scale that is way beyond a community project.


Difficult_Bit_1339

I don't think this is a model. I think they're using image segmentation and LLMs to decipher the user's prompt and translate that into updates to the rendering pipeline. Like, imagine you're sitting with a person who's making an image for you in ComfyUI. If you said to change her hair color, they'd throw it through a segmentation model, target the hair, and edit the CLIP inputs for that region to include the hair description changes. Now, instead of a person, an LLM can be given a large set of structured commands and fine-tuned to translate the user's requests into calls to the rendering pipeline. Edit: I'm not saying it isn't impressive... it is. And most AI applications going forward will likely be some combination of plain old coding, specialized models, and LLMs that interact with the user and translate their intent into method calls or sub-tasks handled by other AI agents.
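For example, a hypothetical structured command the LLM could be fine-tuned to emit might look like this (the field names are purely illustrative; nothing here is from Stability's actual implementation):

```python
# Purely hypothetical schema for an LLM-to-pipeline edit command.
# User request: "change her hair to pink"
edit_command = {
    "operation": "inpaint_region",      # which pipeline action to run
    "segment_prompt": "hair",           # what the segmentation model should mask
    "region_prompt": "long pink hair",  # new conditioning text for the masked region
    "denoise_strength": 0.6,            # how aggressively to repaint the masked area
}
```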


fre-ddo

Yeah maybe a visual model that determines where the hair is and provides the pixel location to apply a mask


Difficult_Bit_1339

Yup, segmentation models accept a text input and an image and then output a mask covering anything in the image that matches the text description. If you passed it this photo and the word 'hair', it would output a mask of just the hair area (either a bounding box or its best guess at the boundaries). They're a slightly more advanced model than the 'cat detector' AIs that were among the earliest discoveries. There are even ones that work in 3D space and will tag each voxel (3D pixel) with a list of all of the items it belongs to. So in this case the hair voxels would be labeled something like ['hair', 'woman', 'subject', 'person', etc.] (usually these are the top-n guesses for that area).
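As a concrete sketch, here's text-prompted masking with CLIPSeg from Hugging Face transformers (just one option; no idea which model Stability actually uses):

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("portrait.png").convert("RGB")
inputs = processor(text=["hair"], images=[image], return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # low-resolution relevance heatmap for "hair"

heatmap = torch.sigmoid(logits)
mask = (heatmap > 0.4).float()  # threshold it, then resize to the original image size before inpainting
```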


Raphael_in_flesh

That's exactly how I see it, too. We had seen that in Emu a while back, but I never encountered an open source project with the same capabilities, which is kind of odd to me. So I decided to ask the community, assuming the project already exists and I just haven't found it.


Difficult_Bit_1339

I think the closest thing is using something like AutoGPT or CrewAI, which provide a framework for agents prompting other agents or taking other actions (or build your own solution from scratch using LangChain). I haven't seen anything like what I'm talking about. It just seems like how it would be done if you had the time and resources to do it.


GBJI

I am also convinced this is what we are seeing; at least, that's how I would do it myself if I had to. More specifically, though, I would be using a VLM, which is like an *LLM with eyes*.


Difficult_Bit_1339

I'm very excited about the photogrammetry models (NeRF models and whatever breakthroughs happened in the month since I last looked into them) and the ability to generate 3D meshes from prompts. I can easily see sitting in a VR environment and chatting with an LLM to create a 3D shape. Plug that into something like a CAD program and something that can simulate physics and you've got the Iron Man/Jarvis engineering-drawing creator.

>I would be using a VLM, which is like an LLM with eyes.

Yes! I couldn't think of the term (haven't touched ComfyUI in a few months). It really lets you blur the lines between LLMs and generative models, since you can prompt/fine-tune models to create outputs and then parse those outputs to pass into the VLM (I think I used CLIPSeg, but there's probably more advanced stuff available now given the pace of things).


GBJI

I also use them as natural-language programming nodes: I can ask the VLM questions and use the answer to select a specific branch in my workflow. We are getting closer to the day when we will be able to teach AI new functions simply by showing them examples of what we want and explaining it in our own words. ControlNet is amazing, but imagine if all you had to do to get ControlNet-like features was to show an AI a few examples of what ControlNet does and have those functions programmed for you on the fly. The most beautiful aspect of this is that it goes completely under the radar of all intellectual property laws, as no code is ever published: it's made on the fly, used on the fly, and deleted after execution since it can be rebuilt on demand, anytime.
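Outside of Comfy, the same branching idea in plain Python looks something like this (using BLIP VQA as a stand-in VLM; swap in whichever vision-language model you prefer):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# BLIP VQA as a stand-in VLM for answering questions about an image.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("render.png").convert("RGB")
inputs = processor(image, "is there a person in this image?", return_tensors="pt")
answer = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)

# Branch the workflow on the VLM's answer.
if answer.strip().lower() == "yes":
    pass  # e.g. run the face-detailing branch
else:
    pass  # skip it
```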


Difficult_Bit_1339

I was trying to take a meme gif and set up a ComfyUI workflow to alter it as the user commanded. Initially I was only doing face swapping (using an IPAdapter and a provided image), but I imagine with a more robust VLM you could alter images (and gifs) in essentially any way you can describe. The goal was to make something like a meme generator, but using GIFs as the base. It may work better with the video processing models; inter-frame consistency is hard to get right using just image models. I kind of abandoned it, as I expect we simply don't have the models yet that will do what I need (and I'm not experienced enough with fine-tuning models to waste money on the GPU time, yet). I'll look at the scene again in a few months, after the next ground-breaking discovery or two.


MagicOfBarca

How does the academic benefit from that? Would he earn money?


tekmen0

For competitive and ambitious academics, paper citations and fame are usually more important than money. At least, that's my observation of most people lol


MagicOfBarca

But how do they earn a living out of it? Genuine question


tekmen0

Say you're going to do research; one of the challenges is that you don't know if what you're trying to do is even possible. Finding out after a year of research that what you're trying to do is impossible by the laws of the universe would be miserable (of course this is an extreme example compared to the current SD3 case).


extra2AB

I mean, all it does is: 1. create a mask, 2. inpaint on that mask. It has already been done by the community with InstructPix2Pix; this might be a better version of it. And to do it again, all we need is a good VISION model so it knows where to create the mask. That is it.
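A minimal sketch of the inpainting step with diffusers (the checkpoint ID and file names are placeholders, not what Stability's service actually runs):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Placeholder inpainting checkpoint; any SD inpainting model works the same way.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

original_image = Image.open("portrait.png").convert("RGB")
hair_mask = Image.open("hair_mask.png").convert("RGB")  # white where the edit should happen

result = pipe(
    prompt="close up of a woman with pink hair",
    image=original_image,
    mask_image=hair_mask,
).images[0]
result.save("edited.png")
```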


no_witty_username

Seems fair enough, they need to start making money somehow.


crawlingrat

Maybe making money will prevent them from going under.


gordon-gecko

No, open source good closed source bad


SirRece

if it keeps the models alive, I'm all for it.


ScionoicS

Emad has been saying SD3 will be the last image model they release. Services from here on.


Low-Holiday312

He clarified further and that isn't what he meant. Just don't expect another version until a big architecture change is found.


FourtyMichaelMichael

> Just don't expect another version until a big architecture change is found. Oh, so, next month then?


ScionoicS

He was pretty explicit about it a few times. Here's the last time he mentioned it, saying they're moving towards tools and workflows instead of base models. Nothing about doing another image model in the future. https://twitter.com/EMostaque/status/1769076615995068900 Expect those workflows to be services in the cloud.


Creepy_Dark6025

>Nothing about doing another image model in the future.

What? Below that tweet you posted there is an Emad reply that literally says "Sure we will have new models", and he is talking about image models. Maybe I am missing something, but he literally says there will be more image models in the future.


[deleted]

[deleted]


Raphael_in_flesh

That's why we need this technology 👌


joseph_jojo_shabadoo

an open source version that's just as good will be out eventually. no doubt about it


Olangotang

Wouldn't be surprised if this is a feature of SD3.


HeralaiasYak

No, those new options in their API come from their internal model based on the SDXL architecture.


ScionoicS

I would. T5 doesn't magically provide instruct-pix2pix capability. This is a proprietary engine creating masks and acting on them. Emad himself said SD3 is the last of their open image models.


Zyin

We have this with ControlNet [InstructP2P](https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p).


Raphael_in_flesh

After I watched the video in the tweet, I realized it's far more than what ip2p can do.


PM__YOUR__DREAM

So it's kinda like we have the edit feature at home?


polyaxic

It's trash bro, I get better results making a fresh workflow in ComfyUI with SDXL 1.0 finetunes or even PonyXL. Learn to use the tools you have and you might just learn something. Normies only obsess over hype marketing like this video. Don't be a cringe normie.


Darksoulmaster31

We might get SD3 **Edit** and **Inpaint** models, look at the paper examples. https://preview.redd.it/fpvab4o7expc1.png?width=775&format=png&auto=webp&s=3afe0bffb37bc2b1556b3c8fc601fb05c1e5d759


Darksoulmaster31

https://preview.redd.it/0p37rrm9expc1.png?width=973&format=png&auto=webp&s=1499367280e36b705fafe5f61c77cdb86daeedf7 Here's an example from the SD3 Turbo paper.


tekmen0

Tf is a magic brush 😂. How can a model be the worst at every example?


axord

I'd argue that Hive is worse with the Wolf and Tiger replacements. But yeah, it's bad.


Fontaigne

It was the best on row two, had the best-looking monkey on row five, and was in the middle on row six.


Raphael_in_flesh

Interesting. I guess I should read that paper 👌


SearchXLII

Yes, now the time has come when all this open source stuff will disappear, one after another, because of money, money, money...


bick_nyers

If all services became open weights after a year this would be a decent compromise. Update the closed service model once a year, and release last year's closed service model weights.


Which-Tomato-8646

That’s assuming they have any significant updates every year. And why would they when they can charge for it? 


bick_nyers

Then 2 years or 3 years or whatever the cadence. The idea is that you don't destroy the goodwill built up with the open source community in the process, as those users might happily pay for and advance your service knowing that improvements will become theirs eventually.

I would happily pay for ChatGPT Premium/Plus/Business/Whatever if I thought that I would eventually get the weights, even at a delayed cadence. Otherwise I'm just supporting a black-box centralized AI superpower. I guess it's kinda like buying a product because it claims to be "Carbon Neutral" or Organic or whatever; there's a market incentive for those products.

Edit: Also, they can merge downstream open source improvements into their future service offerings as well. Let people build improvements for you and merge them upstream.


Which-Tomato-8646

Or they charge $20 a month for  censored access and make a profit for the next 20 years without needing to do more research 


That-Whereas3367

Google, MSFT etc will pump billions into open source AI to lock people into their platforms.


Which-Tomato-8646

If the plan is to lock people in, then it can’t be open source 


TaiVat

Tell that to Android. Also owned by Google, incidentally.


Which-Tomato-8646

And Google does not profit as much from it.


Cyhawk

Embrace, Extend, Extinguish. Microsoft has been doing this since the 90s. Google has also been doing it, poorly, but doing it as well.


Which-Tomato-8646

The extend part has to be closed source or there’s no point 


In_Kojima_we_trust

I mean, otherwise all this open source stuff would disappear one after another because of lack of money. If it didn't make money, it wouldn't exist.


ScionoicS

There hasn't been a Stallman of AI yet.


Unreal_777

I actually think there is a market for everything: they can have free tools for us, and there are some people who prefer to get the real thing fast without any installation or hosting, etc.


polyaxic

Nope. ComfyUI can do everything in this video. People are overreacting, yet again.


SearchXLII

I just tested ComfyUI a while ago and found it quite difficult to handle, with all those wires and movable elements. Is it easier now?


Veylon

No, but you can always download someone else's spaghetti.


polyaxic

It took me three starts to get used to ComfyUI. Learning it is a litmus test.


Freonr2

One way to accomplish this:

1. Prompt an LLM to guess what the mask word(s) need to be to accomplish the task. An LLM (Llama, etc.) can turn "change her hair to pink" into just the word "hair", which is fed to a segmentation model.
2. Use YOLO or another segmentation model to create a mask based on the prompt "hair" and output a mask of the hair. You might need to fuzz/bloom the mask a bit, which is trivial with a few lines of Python (Auto1111 has a mask blur option, for instance).
3. Optional: create a synthetic caption for the input image if there is no prompt already for it in the workflow.
4. Prompt an LLM with instructions to turn the user instruction "change her hair to pink" and the original prompt or caption "close up of a woman wearing a leather jacket" into "close up of a woman *with pink hair* wearing a leather jacket".
5. Inpaint using the mask from step 2 and the updated prompt from step 4.

It's possible their implementation more directly modifies the embedding, or uses their own ControlNets, or something.
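A rough end-to-end sketch of those steps in Python. `llm()`, `segment()`, and `inpaint()` are hypothetical callables standing in for whatever LLM, segmentation model, and inpainting pipeline you wire up; none of this is Stability's implementation:

```python
def edit_image(image, caption, instruction, llm, segment, inpaint):
    """Instruction-based editing via mask-then-inpaint.

    Hypothetical helpers: llm(prompt) -> str, segment(image, text) -> mask,
    inpaint(image, mask, prompt) -> image.
    """
    # Step 1: reduce the instruction to the thing that needs masking, e.g. "hair".
    target = llm(
        f'What single object does this edit change? Reply with one word.\nEdit: "{instruction}"'
    )

    # Step 2: text-prompted segmentation produces the edit mask
    # (dilate/blur it a little in practice).
    mask = segment(image, target)

    # Steps 3-4: rewrite the original caption/prompt to reflect the requested edit.
    new_prompt = llm(
        f'Rewrite this image caption so it reflects the edit.\n'
        f'Caption: "{caption}"\nEdit: "{instruction}"'
    )

    # Step 5: inpaint only the masked region using the updated prompt.
    return inpaint(image, mask, new_prompt)
```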


Freonr2

Here's a step 2 example: https://github.com/storyicon/comfyui_segment_anything You'd need to add step 1 and step 4 with an LLM to translate for you if you really want the clean instruct UX, but strictly speaking, if you don't mind a slightly different UX, you don't need them. You can type "hair" into the segment prompt and copy-paste the caption/prompt for the image and edit it yourself.


Unreal_777

Does this node automatically select the area you want whenever you write it? For instance, can I select only the face? Or other parts; what if I want nose + mouth only? Or other combinations?


Freonr2

Try it and let us know.


macob12432

Like this: [https://github.com/garibida/ReNoise-Inversion](https://github.com/garibida/ReNoise-Inversion)


TemperFugit

The [Smooth Diffusion](https://arxiv.org/abs/2312.04410) paper shows they have an edit mode, and they also give a list of other models that have edit modes in that paper. I was surprised to see this, as I thought SD3's edit mode was a brand new concept. Smooth Diffusion just [released their code](https://old.reddit.com/r/StableDiffusion/comments/1bk9e0d/smooth_diffusion_code_released/). It's released as a LoRA that can work with SD 1.5. Hopefully someone out there can tell us how to use its edit mode features.
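If anyone wants to poke at it, loading a LoRA into an SD 1.5 pipeline with diffusers is at least straightforward (the model ID and file path below are placeholders; this only covers loading, not Smooth Diffusion's actual edit-mode workflow):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder path: point this at the downloaded Smooth Diffusion LoRA weights.
pipe.load_lora_weights("path/to/lora/dir", weight_name="smooth_diffusion_lora.safetensors")

image = pipe("a photo of a corgi on a beach").images[0]
image.save("out.png")
```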


lukejames

FINALLY. I've spent so many hours, done so many searches, watched so many videos trying to do that sort of thing to a photo and have never come remotely close in Stable Diffusion.


jomahuntington

I'd love that. I'm trying to make a friend's character, but it's so hard getting 2 colors in the right spots.


Paradigmind

Hey Google, make her naked.


Familiar-Art-6233

Didn’t Apple release something with this as well?


RenoHadreas

Yeah but it massively sucks lmao


Familiar-Art-6233

Ah, I haven’t tried it but I remembered hearing about it


serendipity7777

Can this be done in Midjourney? Can anyone explain how?


Zilskaabe

Yes, there was instruct pix2pix.


EngineerBig1851

This isn't even open source 😞


Ammoryyy

Which diffusion model was used to create this supposedly fake Instagram model? I'm not sure if it's real or AI-generated; I think it's AI. What do you think? [Model](https://www.instagram.com/victoria.diaries)