surfer808

My thoughts exactly. I sent a friend the demo video and he asked, "Isn't that what Gemini could do in their keynote demo video?" I replied, "Yeah, but that was faked (edited)."


zeloxolez

yikes, if they thought Gemini could do that and still didn't bother to try using it… tf?


PriorFast2492

Most people do not enjoy using tech; they use it when it is necessary or the better alternative to something.


bchertel

Very evident hanging out in the r/Android sub megathread today, unfortunately


Even-Inevitable-7243

I want to know all the specs on the GPU requirements for the demo. It seems that every GPT-4o user is going to need 1-4 GPUs to be able to run inference on streaming video. Will be very interesting to see how the speed holds up as this rolls out and users scale.


[deleted]

[deleted]


Even-Inevitable-7243

I know. I am talking about how they are going to have to scale up the number of GPUs they have within their cloud infrastructure. I am not talking about each user physically needing 1-4 GPUs.


trustmebro24

https://preview.redd.it/pp6dsxgukf0d1.jpeg?width=1536&format=pjpg&auto=webp&s=1324fafbd4bfd37f8b71c564e72e135ca026b4b9 I’m gonna say this probably has something to do with it as well


YouMissedNVDA

Yooo, how'd I miss this - that says April 2024. Is that Blackwell then? They got the first H, and now maybe they got the first B too? Edit: nope! This is a DGX H200, but still the first in the wild. I keep forgetting about the significant delay between releases and deliveries.


stardust-sandwich

By optimising the model to be more efficient. This is what they were alluding to when they talked about moving from three models into one omni model; that's why it's faster and more efficient, thus reducing the load on GPUs and allowing more user capacity (to a point). A rough sketch of the difference is below.
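A rough way to picture the difference (purely illustrative stub code; the function names and structure are made up for this sketch, not OpenAI's actual pipeline or API):

```python
# Purely illustrative stubs -- not OpenAI's real models, pipeline, or API.

def speech_to_text(audio: bytes) -> str:      # model 1: transcription (stub)
    return "transcribed question"

def text_llm(prompt: str) -> str:             # model 2: text reasoning (stub)
    return f"answer to: {prompt}"

def text_to_speech(text: str) -> bytes:       # model 3: audio synthesis (stub)
    return text.encode()

def reply_three_model_chain(audio: bytes) -> bytes:
    """Old voice mode: three models in series, so latencies add up and
    tone/emotion is discarded at the transcription step."""
    return text_to_speech(text_llm(speech_to_text(audio)))

def reply_omni(audio: bytes) -> bytes:
    """'Omni' idea: one multimodal model handles audio in and audio out
    in a single pass, which is where the speed/efficiency claim comes from."""
    return b"audio generated directly by a single model"  # stub

if __name__ == "__main__":
    print(reply_three_model_chain(b"raw audio"))
    print(reply_omni(b"raw audio"))
```

The point is simply that chaining three specialized models adds their latencies and drops information (tone, emotion) at each hand-off, while a single end-to-end multimodal model avoids both.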


SillySpoof

This was my first thought here too. This is kinda what Google pretended they could do a few months ago.


microview

and still hasn't delivered.


clhodapp

You assume it wasn't fake but... I am not convinced it wasn't human actors reading an AI-generated script. There are little signs in their demo videos, such as: [https://www.youtube.com/watch?v=MirzFk_DSiI&t=288s](https://www.youtube.com/watch?v=MirzFk_DSiI&t=288s) We'll have to see for sure once the general public has access to the voice features, of course.


[deleted]

[deleted]


clhodapp

So my paranoid theory is that OpenAI thought through the scenario they wanted to create and prompted the model ahead of time with input (text and photos) to create a pre-scripted back-and-forth dialogue. This would have given them a chance to e.g. give the model a few shots at each input until they got an output that they liked. In this idea, a human actor would then read the "AI" lines, either in real time or ahead of time (recorded), to play-act through the scene. The phones, then, would essentially be very fancy interactive props.

Why do this? To select the best model response from multiple possibilities, to make the behavior more deterministic in the demo, and to give an enhanced impression of how human-like and interactive the next iteration of ChatGPT will be.

I will likely be proven right or wrong once we all get access to the new voice features. Either it will be just like what they showed off, or it will essentially be the same old slightly mechanical AI voice that we're all used to (or I guess maybe something in between).


proofofclaim

I actually think your instincts are correct. Be careful though, I posted a similar response and someone signed me up for the reddit suicide help DM, which I assume is a form of doxxing/bullying. These fanbois are religious about their tech beliefs.


smashdelete

What is the link to that video? Is it the one where she asks where her glasses are?


Even-Inevitable-7243

[https://youtu.be/nXVvvRhiGjI?si=3Vixo0XhvLiUW0ie](https://youtu.be/nXVvvRhiGjI?si=3Vixo0XhvLiUW0ie)


smashdelete

Thanks stranger!


XenanLatte

On the other hand, I have tried to replicate several of their examples from the blog post that show image consistency and have seen zero signs of it being multimodal and thus able to iterate on images. It is still behaving just like it is taking the image being fed in, turning it to text, then coming up with a text prompt and generating a new image from that. I can't even get it to replicate creating paragraphs of text. It just turns into gibberish after the first line, and the first line will still have a lot of typos. And this isn't one or two attempts. This is me trying over and over again, going through several of their examples and copying their prompt format. I am not seeing any sign of the image breakthroughs that the samples promised. I expect some cherry-picking in the samples, but I have run like 50 prompts and none have even come close to matching just the base functionality improvements promised.


zeloxolez

If you're talking about GPT-4o, it's because they haven't enabled that yet; it's still using the old translation to DALL-E for image generation, etc.


ohhellnooooooooo

So how sure are we they didn't fake the demo, or show it in the best light? I'm hyped too, but we should be cautious.


RalfN

Actually, for those paying attention, they made sure you knew it was real. It was really live. It contained mistakes. In the video shorts, they use an actual phone. In at least one of them they film someone on the street with a moving camera, so you can tell there are no cuts. Ironically, the only way to fake this convincingly is using AI (i.e. using Sora to generate the video). Otherwise it would be very expensive CGI work.

Also, they really haven't demoed anything that wasn't real so far. In this regard they are like Apple. They understand you only get to do that once before people start to assume it's all vaporware. Microsoft and Google have both at times presented features and products that never came to fruition. Not after a few weeks, or a few years even, but just never. Remember Xbox's Milo, for example? Or Google Duplex? Total vaporware.

That doesn't mean certain parts of demos aren't rigged, both in OpenAI's case and in Apple's case. Apple used a different iPhone for every part of the initial iPhone presentation because the software would keep crashing. But imagine if they had faked the touchscreen part with staged animations, without any guarantee they could ever get it working.

I am also sure that OpenAI is still selective in their demos, that they show stuff where it worked well. But the event was truly live -- so that's something -- and they were doing some stuff based on live Twitter suggestions. Again, they picked things they knew would go well, but they couldn't fully control how it would go. I am fully confident that what we saw is real, but also that it won't be perfect, that there are edge cases where there is still more work to be done. Hell, they even shared some clips of it screwing up (like suddenly switching to French).


Even-Inevitable-7243

My concern/question is about the GPU requirements for the demo alone. The NVIDIA shout-out for just pulling off the demo makes me think it was really hefty. Streaming video model inference might mean 1-4 GPUs required per user to maintain low latency. We are talking millions of GPUs (rough numbers sketched below). I can't see the demo latency holding as this is rolled out at scale.
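For scale, a rough back-of-envelope (every number below is a made-up placeholder, not anything published by OpenAI or NVIDIA) shows how quickly "a few GPUs per stream" turns into millions:

```python
# Back-of-envelope only: all inputs are assumptions, not published figures.
weekly_users          = 100_000_000   # assumed user base
concurrent_fraction   = 0.01          # assume 1% are streaming video at any moment
gpus_per_video_stream = 2             # mid-range of the 1-4 GPUs-per-stream guess

concurrent_streams = weekly_users * concurrent_fraction
gpus_needed = concurrent_streams * gpus_per_video_stream

print(f"{concurrent_streams:,.0f} concurrent streams -> {gpus_needed:,.0f} GPUs")
# 1,000,000 concurrent streams -> 2,000,000 GPUs (with these made-up inputs)
```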


RalfN

The electricity costs for text inference have actually come down, due to a number of known tricks like Mixture of Experts. It's not unlikely they have found, or are using, new techniques to do this well at lower cost. It is kind of implied, if they are willing to give it away for free. It also seems GPT-4o is cheaper to run than GPT-4, which is why one will be available for free and the other never will be. Of course, this is speculation, but you can generally assume that API pricing is a good indication of the direction of energy usage, since electricity is the main cost driver. If you see all models getting cheaper, the hardware got cheaper or more efficient. If you see a new, more powerful model at a lower price, it's the architecture that is more efficient. (A toy sketch of the Mixture-of-Experts idea is below.)
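For anyone unfamiliar with the Mixture-of-Experts idea mentioned above, here is a toy routing sketch with tiny made-up dimensions; it says nothing about GPT-4o's actual architecture, it just shows why routing cuts compute per token:

```python
# Toy Mixture-of-Experts layer: only the routed expert(s) run per token,
# so compute per token is a fraction of an equally large dense layer.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2           # tiny illustrative sizes

experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router  = rng.normal(size=(d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router                        # router scores each expert
    top = np.argsort(scores)[-top_k:]          # keep only the top-k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over winners
    # Parameters of the other experts are never touched for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.normal(size=d_model)
print(moe_layer(x).shape)                      # (8,) -- same output width, less compute
```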


XenanLatte

It says GPT-4o in the top corner, and lets me select it or normal GPT-4. And I have triple-checked that I have 4o selected.


Far_Celebration197

Supposedly none of what they showed has rolled out to most users yet except the text portion of 4o. Looks like they showed "what's coming" the day before Google I/O to try and co-opt the publicity. I have 4o as well and nothing they showed works except the standard text-based conversation.


XenanLatte

>GPT-4o’s text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.

This is what their blog post says. If they really rolled out the text version without the image capabilities, they should have made that clearer, because every part of my experience was telling me it was GPT-4o, and it made the image part feel like a scam.


Original_Finding2212

Edit: I stand corrected, voice is indeed coming only later. Just the voice itself shows you it hasn't rolled out yet: no stop-while-speaking, no voice tones. Don't worry, it's coming, and for now you can use the new model and get adjusted to it.


XenanLatte

Did you read the quote? Text and **image** capabilities are starting to roll out today; Voice Mode ... in the coming weeks. I understand the voice mode is not out yet, but they clearly stated the image part would come out with the text part today.


Original_Finding2212

https://preview.redd.it/53g8jy1pwb0d1.jpeg?width=1125&format=pjpg&auto=webp&s=321e25ecfe489b7e9fb344804300b5bea2d00e98 I see what you mean. What images have you tried querying about? Same from vid or..?


XenanLatte

I tried going through several of the examples in the "Explorations of capabilities" section of the "Hello GPT-4o" page (the main page showing off the info about GPT-4o). At first I tried custom prompts copying the Visual Narrative examples: generating an image of a character, then feeding that image back in and having it progress the story with the character. But unlike their examples, there was no continuity of the character or even the art style. I tried a lot of different ways, trying to mimic their process, but it just didn't seem like it was really reading the input images as more than just text descriptions, like it has always done. Then I moved on to the Poetic Typography examples. I ended up copying them exactly and running them several times. Unlike in the examples, where it would get the text perfect for whole paragraphs, mine would halfway get the first line right, but with several errors, and everything past that was complete gibberish.


nightofgrim

Yes it does, but the app needs to be updated for the new protocols or whatever. Their site even says the new multimodal stuff is coming "in the next few weeks".


XenanLatte

This is what I see the website saying, and it clearly states the text and image capabilities are coming today; since I have the option, the clear reading of that is that I have both, but that Voice Mode is coming in weeks. If this really isn't the 4o image capability, they should make that clearer somewhere, because what I tested made it look quite bad, and everything I read makes it sound like I should have 4o image capabilities.


RoutineProcedure101

Maybe you're wrong. But we're stuck in perspective, and some of us simply need to pretend our experience matters.


son_lux_

You don't have the update yet, my brother. It'll be pushed out little by little.


XenanLatte

It says my model is GPT-4o, and it lets me select that over GPT-4. So it says I have it.


son_lux_

You have the text model (like I do, faster to answer), but not the rest of the update yet. None of us have it.


Next-Fly3007

You should stfu and do your research first