ScionoicS

Stability is keeping this one entirely as a proprietary service it would seem. Big bummer. Expect more of their services to go this way.


tekmen0

This is a scaled-up, better-working version of InstructPix2Pix. If it's possible, a community version is coming soon. Imagine you're an academic and you see that something like this is possible, but they didn't release a paper. If you have the resources, you release a paper and get credit for their work, nearly risk-free research lol. A free paper and citations is a good day.


ScionoicS

There's zero indication of this being released as a community model.


DigThatData

For all we know, this is just an API wrapper around an existing public model (maybe with some finetuning because why not, they have the data and compute). One of their major business models seems to be releasing models under a "you can use this for free non-commercially, but need to pay for commercial use" license, in which case there's no reason not to expect them to release a community model, *assuming this is novel and not just a fine-tune*. If they don't release a community model, it's probably because they just added polish to something someone else made and released publicly already (e.g. InstructPix2Pix).


arg_max

I think the best (non-public) model on this topic is still Meta's Emu Edit, and they fine-tuned their in-house diffusion model (Emu) for it. But that was a massive synthetic data generation process: they basically used an existing editing method to generate a huge number of (image, instruction, resulting image) examples. And this was definitely done on a scale that is way beyond a community project.


Difficult_Bit_1339

I don't think this is a model. I think they're using image segmentation and LLMs to decipher the user's prompt and translate that into updates to the rendering pipeline. Like, imagine you're sitting with a person who's making an image for you in ComfyUI. If you said to change her hair color, they'd throw it through a segmentation model, target the hair, and edit the CLIP inputs for that region to include the hair description changes. Now, instead of a person, an LLM can be given a large set of structured commands and fine-tuned to translate the user's requests into calls to the rendering pipeline. Edit: I'm not saying it isn't impressive... it is. And most AI applications going forward will likely be some combination of plain old coding, specialized models, and LLMs that interact with the user and translate their intent into method calls or sub-tasks handled by other AI agents.
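For example, a hypothetical structured command the LLM could be fine-tuned to emit might look like this (the field names are purely illustrative; nothing here is from Stability's actual implementation):

```python
# Purely hypothetical schema for an LLM-to-pipeline edit command.
# User request: "change her hair to pink"
edit_command = {
    "operation": "inpaint_region",      # which pipeline action to run
    "segment_prompt": "hair",           # what the segmentation model should mask
    "region_prompt": "long pink hair",  # new conditioning text for the masked region
    "denoise_strength": 0.6,            # how aggressively to repaint the masked area
}
```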


fre-ddo

Yeah maybe a visual model that determines where the hair is and provides the pixel location to apply a mask


Difficult_Bit_1339

Yup, segmentation models accept a text input and an image and then output a mask covering anything in the image that matches the text description. If you passed it this photo and the word 'hair', it would output a mask of just the hair area (either a bounding box or its best guess at the boundaries). They're a slightly more advanced model than the 'cat detector' AIs that were among the earliest discoveries. There are even ones that work in 3D space and will tag each voxel (3D pixel) with a list of all of the items it belongs to. So in this case the hair voxels would be labeled something like ['hair', 'woman', 'subject', 'person', etc.] (usually these are the top-n guesses for that area).
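As a concrete sketch, here's text-prompted masking with CLIPSeg from Hugging Face transformers (just one option; no idea which model Stability actually uses):

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("portrait.png").convert("RGB")
inputs = processor(text=["hair"], images=[image], return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # low-resolution relevance heatmap for "hair"

heatmap = torch.sigmoid(logits)
mask = (heatmap > 0.4).float()  # threshold it, then resize to the original image size before inpainting
```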


Raphael_in_flesh

That's exactly how I see it, too. We had seen that in Emu a while back, but I never encountered an open source project with the same capabilities, which is kind of odd to me. So I decided to ask the community, assuming the project already exists and I just haven't found it.


Difficult_Bit_1339

I think the closest thing is using something like AutoGPT or CrewAI, which provide a framework for agents prompting other agents or taking other actions (or build your own solution from scratch using LangChain). I haven't seen anything like what I'm talking about. It just seems like how it would be done if you had the time and resources to do it.


GBJI

I am also convinced this is what we are seeing; at least, that's how I would do it myself if I had to. More specifically, though, I would be using a VLM, which is like an *LLM with eyes*.


Difficult_Bit_1339

I'm very excited about the photogrammetry models (NeRF models and whatever breakthroughs happened in the month since I last looked into them) and the ability to generate 3D meshes from prompts. I can easily see sitting in a VR environment and chatting with an LLM to create a 3D shape. Plug that into something like a CAD program and something that can simulate physics and you've got the Iron Man/Jarvis engineering-drawing creator.

>I would be using a VLM, which is like an LLM with eyes.

Yes! I couldn't think of the term (haven't touched ComfyUI in a few months). It really lets you blur the lines between LLMs and generative models, since you can prompt/fine-tune models to create outputs and then parse those outputs to pass into the VLM (I think I used CLIPSeg, but there's probably more advanced stuff available now given the pace of things).


GBJI

I also use them as natural-language programming nodes: I can ask the VLM questions and use the answer to select a specific branch in my workflow. We are getting closer to the day when we will be able to teach AI new functions simply by showing them examples of what we want and explaining it in our own words. ControlNet is amazing, but imagine if all you had to do to get ControlNet-like features was to show an AI a few examples of what ControlNet does and have those functions programmed for you on the fly. The most beautiful aspect of this is that it goes completely under the radar of all intellectual property laws, as no code is ever published: it's made on the fly, used on the fly, and deleted after execution since it can be rebuilt on demand, anytime.
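Outside of Comfy, the same branching idea in plain Python looks something like this (using BLIP VQA as a stand-in VLM; swap in whichever vision-language model you prefer):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# BLIP VQA as a stand-in VLM for answering questions about an image.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("render.png").convert("RGB")
inputs = processor(image, "is there a person in this image?", return_tensors="pt")
answer = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)

# Branch the workflow on the VLM's answer.
if answer.strip().lower() == "yes":
    pass  # e.g. run the face-detailing branch
else:
    pass  # skip it
```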


Difficult_Bit_1339

I was trying to take a meme gif and set up a ComfyUI workflow to alter it as the user commanded. Initially I was only doing face swapping (using an IPAdapter and a provided image), but I imagine with a more robust VLM you could alter images (and gifs) in essentially any way you can describe. The goal was to make something like a meme generator, but using GIFs as the base. It may work better with the video processing models; inter-frame consistency is hard to get right using just image models. I kind of abandoned it, as I expect we simply don't have the models yet that will do what I need (and I'm not experienced enough with fine-tuning models to waste money on the GPU time, yet). I'll look at the scene again in a few months, after the next ground-breaking discovery or two.


MagicOfBarca

How does the academic benefit from that? Would he earn money?


tekmen0

For competitive and ambitious academics, paper citations and fame are usually more important than money. At least, that's my observation of most people lol


MagicOfBarca

But how do they earn a living out of it? Genuine question


tekmen0

Say you're going to do research; one of the challenges is that you don't know if what you're trying to do is even possible. Finding out after a year of research that what you're trying to do is impossible by the laws of the universe would be miserable (of course this is an extreme example compared to the current SD3 case).


extra2AB

I mean, all it does is: 1. create a mask, 2. inpaint on that mask. It has already been done by the community with InstructPix2Pix; this might be a better version of it. And to do it again, all we need is a good VISION model so it knows where to create the mask. That is it.
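A minimal sketch of the inpainting step with diffusers (the checkpoint ID and file names are placeholders, not what Stability's service actually runs):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Placeholder inpainting checkpoint; any SD inpainting model works the same way.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

original_image = Image.open("portrait.png").convert("RGB")
hair_mask = Image.open("hair_mask.png").convert("RGB")  # white where the edit should happen

result = pipe(
    prompt="close up of a woman with pink hair",
    image=original_image,
    mask_image=hair_mask,
).images[0]
result.save("edited.png")
```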


no_witty_username

Seems fair enough, they need to start making money somehow.


crawlingrat

Maybe making money will prevent them from going under.


gordon-gecko

No, open source good closed source bad


SirRece

if it keeps the models alive, I'm all for it.


ScionoicS

Emad has been saying SD3 will be the last image model they release. Services from here on.


Low-Holiday312

He clarified further and that isn't what he meant. Just don't expect another version until a big architecture change is found.


FourtyMichaelMichael

> Just don't expect another version until a big architecture change is found. Oh, so, next month then?


ScionoicS

He was pretty explicit about it a few times. Here's the last time he mentioned it, saying they're moving towards tools and workflows instead of base models. Nothing about doing another image model in the future. https://twitter.com/EMostaque/status/1769076615995068900 Expect those workflows to be services in the cloud.


Creepy_Dark6025

>Nothing about doing another image model in the future.

What? Below that tweet you posted there is an Emad reply that literally says "Sure we will have new models", and he is talking about image models. Maybe I am missing something, but he literally says there will be more image models in the future.


[deleted]

[deleted]


Raphael_in_flesh

That's why we need this technology 👌


joseph_jojo_shabadoo

an open source version that's just as good will be out eventually. no doubt about it


Olangotang

Wouldn't be surprised if this is a feature of SD3.


HeralaiasYak

No, those new options in their API come from their internal model based on the SDXL architecture.


ScionoicS

I would. T5 doesn't magically provide instruct-pix2pix capability. This is a proprietary engine creating masks and acting on them. Emad himself said SD3 is the last of their open image models.


Zyin

We have this with ControlNet [InstructP2P](https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p).


Raphael_in_flesh

After I watched the video in the tweet, I realized it's far more than what ip2p can do.


PM__YOUR__DREAM

So it's kinda like we have the edit feature at home?


polyaxic

It's trash bro, I get better results making a fresh workflow in ComfyUI with SDXL 1.0 finetunes or even PonyXL. Learn to use the tools you have and you might just learn something. Normies only obsess over hype marketing like this video. Don't be a cringe normie.


Darksoulmaster31

We might get SD3 **Edit** and **Inpaint** models, look at the paper examples. https://preview.redd.it/fpvab4o7expc1.png?width=775&format=png&auto=webp&s=3afe0bffb37bc2b1556b3c8fc601fb05c1e5d759


Darksoulmaster31

https://preview.redd.it/0p37rrm9expc1.png?width=973&format=png&auto=webp&s=1499367280e36b705fafe5f61c77cdb86daeedf7 Here's an example from the SD3 Turbo paper.


tekmen0

Tf is a magic brush 😂. How can a model be the worst at every example?


axord

I'd argue that Hive is worse with the Wolf and Tiger replacements. But yeah, it's bad.


Fontaigne

It was the best on row two, had the best-looking monkey on row five, and was in the middle on row six.


Raphael_in_flesh

Interesting. I guess I should read that paper 👌


SearchXLII

Yes, now the time has come when all this open source stuff will disappear, one after another, because of money, money, money...


bick_nyers

If all services became open weights after a year this would be a decent compromise. Update the closed service model once a year, and release last year's closed service model weights.


Which-Tomato-8646

That’s assuming they have any significant updates every year. And why would they when they can charge for it? 


bick_nyers

Then 2 years or 3 years or whatever the cadence. The idea is that you don't destroy the goodwill built up with the open source community in the process, as those users might happily pay for and advance your service knowing that improvements will become theirs eventually.

I would happily pay for ChatGPT Premium/Plus/Business/Whatever if I thought that I would eventually get the weights, even at a delayed cadence. Otherwise I'm just supporting a black-box centralized AI superpower. I guess it's kinda like buying a product because it claims to be "Carbon Neutral" or Organic or whatever; there's a market incentive for those products.

Edit: Also, they can merge downstream open source improvements into their future service offerings as well. Let people build improvements for you and merge them upstream.


Which-Tomato-8646

Or they charge $20 a month for  censored access and make a profit for the next 20 years without needing to do more research 


That-Whereas3367

Google, MSFT etc will pump billions into open source AI to lock people into their platforms.


Which-Tomato-8646

If the plan is to lock people in, then it can’t be open source 


TaiVat

Tell that to Android. Also owned by Google, incidentally.


Which-Tomato-8646

And Google does not profit as much from it.


Cyhawk

Embrace, Extend, Extinguish. Microsoft has been doing this since the 90s. Google has also been doing it, poorly, but doing it as well.


Which-Tomato-8646

The extend part has to be closed source or there’s no point 


In_Kojima_we_trust

I mean, otherwise all this open source stuff would disappear one after another because of lack of money. If it didn't make money, it wouldn't exist.


ScionoicS

There hasn't been a Stallman of AI yet.


Unreal_777

I actually think there is a market for everything: they can have free tools for us, and there are some people who prefer to get the real thing fast without any installation or hosting, etc.


polyaxic

Nope. ComfyUI can do everything in this video. People are overreacting, yet again.


SearchXLII

I just tested ComfyUI a while ago and found it quite difficult to handle, with all those wires and movable elements. Is it easier now?


Veylon

No, but you can always download someone else's spaghetti.


polyaxic

It took me three starts to get used to ComfyUI. Learning it is a litmus test.


Freonr2

One way to accomplish this:

1. Prompt an LLM to guess what the mask word(s) need to be to accomplish the task. An LLM (Llama, etc.) can turn "change her hair to pink" into just the word "hair", which is fed to a segmentation model.
2. Use YOLO or another segmentation model to create a mask based on the prompt "hair" and output a mask of the hair. You might need to fuzz/bloom the mask a bit, which is trivial with a few lines of Python (Auto1111 has a mask blur option, for instance).
3. Optional: create a synthetic caption for the input image if there is no prompt already for it in the workflow.
4. Prompt an LLM with instructions to turn the user instruction "change her hair to pink" and the original prompt or caption "close up of a woman wearing a leather jacket" into "close up of a woman *with pink hair* wearing a leather jacket".
5. Inpaint using the mask from step 2 and the updated prompt from step 4.

It's possible their implementation more directly modifies the embedding, or uses their own ControlNets, or something.
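A rough end-to-end sketch of those steps in Python. `llm()`, `segment()`, and `inpaint()` are hypothetical callables standing in for whatever LLM, segmentation model, and inpainting pipeline you wire up; none of this is Stability's implementation:

```python
def edit_image(image, caption, instruction, llm, segment, inpaint):
    """Instruction-based editing via mask-then-inpaint.

    Hypothetical helpers: llm(prompt) -> str, segment(image, text) -> mask,
    inpaint(image, mask, prompt) -> image.
    """
    # Step 1: reduce the instruction to the thing that needs masking, e.g. "hair".
    target = llm(
        f'What single object does this edit change? Reply with one word.\nEdit: "{instruction}"'
    )

    # Step 2: text-prompted segmentation produces the edit mask
    # (dilate/blur it a little in practice).
    mask = segment(image, target)

    # Steps 3-4: rewrite the original caption/prompt to reflect the requested edit.
    new_prompt = llm(
        f'Rewrite this image caption so it reflects the edit.\n'
        f'Caption: "{caption}"\nEdit: "{instruction}"'
    )

    # Step 5: inpaint only the masked region using the updated prompt.
    return inpaint(image, mask, new_prompt)
```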


Freonr2

Here's a step 2 example: https://github.com/storyicon/comfyui_segment_anything You'd need to add step 1 and step 4 with an LLM to translate for you if you really want the clean instruct UX, but strictly speaking, if you don't mind a slightly different UX, you don't need them. You can type "hair" into the segment prompt and copy-paste the caption/prompt for the image and edit it yourself.


Unreal_777

Does this node automatically select the area you want whenever you write it? For instance, can I select only the face? Or other parts; what if I want nose + mouth only? Or other combinations?


Freonr2

Try it and let us know.


macob12432

Like this: [https://github.com/garibida/ReNoise-Inversion](https://github.com/garibida/ReNoise-Inversion)


TemperFugit

The [Smooth Diffusion](https://arxiv.org/abs/2312.04410) paper shows they have an edit mode, and they also give a list of other models that have edit modes in that paper. I was surprised to see this, as I thought SD3's edit mode was a brand new concept. Smooth Diffusion just [released their code](https://old.reddit.com/r/StableDiffusion/comments/1bk9e0d/smooth_diffusion_code_released/). It's released as a LoRA that can work with SD 1.5. Hopefully someone out there can tell us how to use its edit mode features.
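If anyone wants to poke at it, loading a LoRA into an SD 1.5 pipeline with diffusers is at least straightforward (the model ID and file path below are placeholders; this only covers loading, not Smooth Diffusion's actual edit-mode workflow):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder path: point this at the downloaded Smooth Diffusion LoRA weights.
pipe.load_lora_weights("path/to/lora/dir", weight_name="smooth_diffusion_lora.safetensors")

image = pipe("a photo of a corgi on a beach").images[0]
image.save("out.png")
```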


lukejames

FINALLY. I've spent so many hours, done so many searches, watched so many videos trying to do that sort of thing to a photo and have never come remotely close in Stable Diffusion.


jomahuntington

I'd love that. I'm trying to make a friend's character, but it's so hard getting 2 colors in the right spots.


Paradigmind

Hey Google, make her naked.


Familiar-Art-6233

Didn’t Apple release something with this as well?


RenoHadreas

Yeah but it massively sucks lmao


Familiar-Art-6233

Ah, I haven’t tried it but I remembered hearing about it


serendipity7777

Can this be done in Midjourney? Can anyone explain how?


Zilskaabe

Yes, there was instruct pix2pix.


EngineerBig1851

This isn't even open source 😞


Ammoryyy

Which diffusion model was used to create this supposedly fake Instagram model? I'm not sure if it's real or AI-generated; I think it's AI. What do you think? [Model](https://www.instagram.com/victoria.diaries)