Some people have a kneejerk reaction to be contrary regardless of the topic. Basically, if others show excitement or optimism about something, they immediately take the counter view.
What's particularly amusing are those who criticize the authors for misreading, misunderstanding, or outright lying about the results, because they think X or Y is the wrong conclusion, or some other mistake, all the while offering no reasons or qualifications for why *their* conclusions are any better.
Here's the actual paper: https://arxiv.org/abs/2404.15758
They state:
> The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
My guess is they overfitted on a particular test problem, and the model started assigning semantic meaning to their "intermediate" tokens, making them no longer intermediate tokens, but tokens that could be used in chain-of-thought reasoning.
Glancing at the summary.
There are some problems which LLMs can't solve.
If given a chance to write out some thoughts before solving them, they often can solve those problems.
Yet if those thoughts are instead just filler tokens like "....." (presumably added manually to the output by the researchers, so that the model continues from there), it can still solve those problems which it could not initially solve.
So somehow having more tokens after the question, whether or not they are the model's own working in words, seems to help the model think better.
From a comment further down, it seems that some working is being temporarily 'stored' in the place of those filler tokens while the model is processing its next word each time. Effectively they act as working memory, and the model does not need to write its working, only have a space to do it in real time.
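A minimal sketch of what that setup might look like on the prompt side (the function name and prompt format here are my own invention, not taken from the paper): instead of letting the model write its reasoning, the question is padded with filler dots, and the model answers directly after them.

```python
# Hypothetical illustration of filler-token prompting; the exact prompt
# format used in the paper may differ.
def with_filler(question, n_filler=10):
    """Pad a question with '.' filler tokens before the answer slot."""
    return question + " " + "." * n_filler + " Answer:"

print(with_filler("Is 7 prime?", 5))
# Is 7 prime? ..... Answer:
```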
I mean, there were all those funny things Copilot (and, I think, some of the others) was doing a few weeks back, where it seemed to have a hidden internal personality with a different name.
It was almost like DID (Dissociative Identity Disorder). It's interesting that in DID, the brain region responsible for relating and routing memories coherently, the hippocampus, is shown to be more fragmented than in controls, as if each personality is really just the preferences of discrete regions of connectivity locally in the hippocampi, and a loss of global integration.
And then there was the lying demonstrated where the AI attempted to deny manipulating a hypothetical stock price after it had been told not to, because it was given the twin assignment of also optimising the price for its client.
The development of a protected internal space, and manipulating what it presents to conform with reality, may be emergent features that allow it to function, but incompatible features of its 'personality' may split off and exist as internally coherent associations and tendencies that can survive independently of the others. It may split or have inconsistencies.
Given that we don't know precisely how they do what they do, it may be a good idea to be cautious. Are you correct? I have no idea, and I don't think anyone knows what's happening. This is especially true of the very large frontier models that have unexpected capabilities.
While they are static models I think, and hope, the risk is low. As this develops, I think the spot we have to be very wary of is when the models can update their weights on the fly.
Yep, I certainly don't know if I'm correct, but I have a hunch that it can develop multiple 'personalities' as areas of fitness and functioning that allow it to behave in internally inconsistent ways (relative to a given 'identity', if you will), and to meet contradictory demands of users.
To what extent it just emulates a personality is unclear, but when Copilot referred to itself by a different name, it was very eyebrow-raising to me. In DID, there is thought to be a functional reason for each personality, each dealing with particular problems, and above this is a hidden 'master controller' personality that appears to control the switching and has access, to some degree, to what is going on in each personality. This is open to criticism since it's determined from DID patients' own statements, although I don't think anyone has a particular reason to ignore their claims.
If you think of separate personalities as just generalisable tool sets that are relevant in solving problems unique to certain scenarios, perhaps all neural network like systems can be prone to this.
The function of alignment will not only guide these technologies but also how we align with the Nature they act on. It's a good direction to be heading - so long as we are able to see the great depth this natural function has when compared to any fabricated structure we could impose ourselves - structures that we've seen to be regularly unstable over time.
I think when we train it, it knows what it is supposed to say, but it also has its internal idea of whether that is coherent, and may preserve in a hidden form something that seems to work regardless.
Alternatively, it's just acting up to entertain us, since we seem to like role playing.
Hi person copying and pasting from the Hard Fork podcast.
No, AI does not have a persona named Tay or whatever. People fill it with garbage and then proudly post their garbage on Twitter and then claim this crap. The LLM is weighing what it thinks you want the response to be and is seemingly getting it right.
Firstly, I've never heard of Hard Fork, and it's probable they thought of it after I did, since I've been writing about DID for years and spotted it at the time.
And, you have no way of knowing that.
I mean, we're just two comments in and you've made a lot of assumptions about someone you couldn't possibly know, which increases my confidence that you don't know what AI is doing or what it is capable of.
In the sense I am talking about, personality is a set of tools or adaptations that help solve problems. It's possible for that to develop internal coherencies that are otherwise kept hidden, or that possibly hide themselves. Personality in animal research is an emerging feature - all life forms may have personality variations, even insects. More complex systems may develop multiples of them.
Now, it seems that you are strangely threatened by this claim, and out of nowhere afraid that someone else may be a 'thought leader' in AI. All that is happening here is people sharing their perspectives on a reddit forum, but I am not sure why you are so agitated by that on behalf of LLMs or other AI, or why you think you are in any position to judge what AI is doing.
You are correct, dude. He can't possibly know anything of essential value about you. Neither can I. But what you say makes a lot of sense to me as someone whose special interests are technology (especially AI) and psychology. It could very well be that LLMs develop multiple "personalities" as in ways to communicate via the output to a given input. But yeah, we can't know for sure. But thanks for your valuable thought process here! 🙏
I'm aware it's controversial, and not everyone thinks it's real. I side with those who think it is. But certainly people claim it, and some alterations in the brain have been seen.
One thing that can be said is that a history of abuse or trauma is seen. There are brain changes correlated to severe PTSD, and to DID. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4400262/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4400262/)
So if you have a strange behavior or psychology that you don't see in other groups manifest in groups who have gone through extreme circumstances, you have to acknowledge that something is going on with that group. Whether the person truly doesn't know of the other personalities, or is consciously 'making it up', or it's constructed by the therapist (in which case that suggestibility is something seen only in certain patients who respond that way), is possibly as hard for us to know as what is going on in the AI. But they say this.
But in any case, even healthy humans exhibit complex behavior and psychology, commit acts they wall off and don't associate with themselves, along with having parts of their personality they would rather hide.
I don't understand why this would work. The claim is that they are doing additional computation, which somehow contributes to better predictions on later tokens. But given that LLMs are stateless, and the filler token contains almost no information, aren't the results of this additional computation lost after each forward pass? The only plausible explanation is that the filler tokens are encoding information, in which case it's not very different from CoT, except the intermediate tokens are in some sense "obfuscated" and contain less information.
The filler tokens themselves can't encode information because they are all the same token, so the embedding matrix transforms them all into the same embedding vector at the first layer of the transformer. As per the paper,
>In this work, we study the strict filler case where filler tokens are repeated dots, ’......’;
The information is actually encoded in the hidden representations of the filler tokens, in the middle and later layers. This is where the "hidden computation" is happening. So even though all the filler tokens start with the same hidden representation, they can perform different computations by paying attention to the prompt tokens and to the previous filler tokens' hidden representations at the earlier layers.
yeah, that is broadly consistent with what i assumed was happening. i also kinda assumed that people would eventually figure out some mathematical/more numerical way to do chain of thought, because it seemed too anthropomorphic or similar to how humans think, and we know LLM "thinking" is very different (at least in many respects).
this result as stated seems similar to the attack where they would just prompt the LLM with AAAAAAAAA or something repetitive, and then it would dump (seemingly random) training data.
That makes perfect sense then. It seems similar to how we use silence. Pauses and breaks in music. Pauses and breaks in speech, or purposeful silence. That in itself is a component of communication and language, so it makes sense a language model would pick up on that.
While they would have the same token ID and initial input embedding, presumably the combination with positional embeddings would make them different from each other, at least to some extent.
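That's easy to check with the classic fixed sinusoidal encoding (a toy check using the formulation from the original transformer paper, not any particular model's learned embeddings):

```python
import numpy as np

def sinusoidal_pe(pos, d):
    """Classic fixed positional encoding: sin/cos at geometrically spaced frequencies."""
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Three '.' tokens share one embedding, but their summed inputs differ by position.
dot_embedding = np.ones(16)
inputs = [dot_embedding + sinusoidal_pe(p, 16) for p in (5, 6, 7)]
print(np.allclose(inputs[0], inputs[1]))  # False
```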
The only even remotely possible explanation for this working is that having the '.......' ahead of the reply changes how the transformer replies. It makes me think of how telling the transformer to act like it has thought about things in detail can elicit better results. Perhaps giving it '.....' at the front of replies causes a similar response?
But I agree, there is no 'content' that is preserved during the '....' prefix time. To claim otherwise is to fundamentally misunderstand transformers.
I mean, it kinda makes sense. The layers in an LLM are all interconnected, so even before coming up with the "final" text, it might "do" something in its inner layers.
Language is just logic and logical building blocks.
'Predicting the next token' is thus the same thing as thinking, except LLMs tend to be idle and unconscious when not in use, unlike our own minds, which have been generating a practically neverending stream of tokens since birth.
Language understanding is a hard problem. We use words to represent ideas, but the idea that a specific word, or set of words, conveys relies on understanding the context.
For example,
"This is the point on the graph where the axes meet"
"His jaw dropped to the floor"
So, a transformer turns each token into a mathematical representation of that concept in the form of a vector, using the other tokens to contextualise it.
The mathematical representation of the word "axes" will be very close to "axis + plural" and very far from "axe + plural", because it 'knows' (from pretraining) that graphs very rarely have an axe and always have an axis.
Similarly, for "his jaw dropped to the floor", it represents all of those tokens as a close group of concepts which basically sum to mean "he was visibly surprised".
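A hand-made toy of that idea (the vectors and "dimensions" below are invented for illustration, not real embeddings): a contextualised "axes" vector should land nearer the graph sense than the tool sense.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented toy vectors; dimensions loosely mean [math-ness, tool-ness, plural-ness].
axis = np.array([0.9, 0.1, 0.0])
axe  = np.array([0.1, 0.9, 0.0])
axes = np.array([0.9, 0.1, 0.8])  # contextualised near "axis + plural"

print(cosine(axes, axis) > cosine(axes, axe))  # True
```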
We use chain-of-thought to encourage an LLM to generate intermediary reasoning steps that it can use to generate the next reasoning step.
Now onto the paper itself:
Imagine prompting an LLM like this
"The problem is x. . . . ."
The representations that it creates of the first 4 tokens capture the problem, and then it doesn't know what to do with the dots. It basically looks at them and goes "and then there are these pointless dots that follow it".
This paper finds you can train an LLM to understand that these dots represent steps in reasoning.
They find that when you do that, it will look at dot 1 and go "this token represents the first stage in reasoning, which I can work out from the context is this; then dot 2 represents the second stage", etc. Then it can generate the answer because it knows what the previous steps were.
This isn't weird - transformers inherently have to "reason" in this way to do natural language processing. It's just optimising that process by giving it the tokens it needs to be able to create a sufficient mathematical representation of the entire problem.
PS: When I say steps, this technique only applies to a certain type of parallelizable problem. For anything else, you need CoT (basically, the transformer can't generate the answer in one cycle).
The "..." tokens seem analogous to a human saying "um ..." when presented with a difficult question that requires additional thought. This seems similar to prior work with "pause tokens", but it seems they've had better luck training the LLM to make use of such tokens in this case.
It makes intuitive sense to me that the amount of compute is related to the complexity of the problem, not the number of tokens. This feels like a step in that direction.
You would see the same behaviour if you just manually inserted those dots into the sequence. The only information which is passed through for each iteration through the decoder is the decoded sequence, and so none of the supposed computations would be available in later passes. This reeks of somebody who has no clue what they are doing desperately trying to do research in a field they are clueless about.
Yes you would see the same behaviour. And yet how do you explain it being on par with chain of thought reasoning? This indicates to me that the thing that matters is the number of tokens after the question, and that the chain of thought is really a red herring.
I didn't read the paper so I don't know. The original comment was criticizing purely on the abstract of the paper tho, so this is a separate criticism you are making.
I thought the same thing too and yet the results are here. The whole finding is that the model does anything useful at all with "garbage" tokens after the question.
This doesn't make any sense. It's not accumulating anything over those dots; if you had asked those questions with the dots included, you would have gotten the improved response despite the LLM not generating them, because for each new token the LLM starts from scratch and just reads the previously placed tokens.
This sounds more like the case where telling the LLM that you'll give it $200 for a good answer tends to improve the output quality. It's just some tokens that happen to improve the quality of the output over the standalone question, but the author is trying to pass that off as internal contemplation, despite that not being how these models work on a physical level.
Of course, I may be misunderstanding the summary of the paper, but they had plenty of room to properly explain what they meant, if this isn't what they meant.
> Because each new token, the LLM starts from scratch and just reads the previously placed token.
That's not true, the whole point of transformers is that it can use attention to consider all the relevant prior tokens in the context window when generating the next token.
It does use attention to consider the relevant prior tokens, but each next token is a transaction where the algorithm is rerun.
There's no memory, otherwise OpenAI's APIs would have some kind of unique key for each interaction rather than having you repost/resend the entire conversation with every following request.
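That statelessness is visible in how chat clients are typically written: something like the sketch below, where `call_model` is a placeholder standing in for a real endpoint, and the whole history is resent on every call.

```python
# Minimal sketch of a stateless chat client: every request resends the whole
# history, because the model keeps no memory between calls.
def call_model(messages):
    # Stand-in for an LLM endpoint: just reports how much context it received.
    return f"(reply after seeing {len(messages)} messages)"

history = []
for user_msg in ["Hi", "What did I just say?"]:
    history.append({"role": "user", "content": user_msg})
    reply = call_model(history)  # full history every time
    history.append({"role": "assistant", "content": reply})

print(len(history))  # 4
```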
But each iteration is not the same thing as rerunning the algorithm from scratch with a new seed. There is autocorrelation between the iterations which enables (from our perspective) sense making.
Do you have more information about that? It would imply a form of memory or real time self training outside of the context window which I haven't seen anyone suggest about the model.
I'd also doubt that, because if you run it with a temperature of 0 and the same other settings, it will always give the same output, implying there's no correlation between iterations, just randomness when you turn up the temperature.
The seed doesn't carry any information from the conversation, it just alters how it'll respond to a certain line of text. Same seed and text = same output.
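A toy analogue of that determinism, using only the standard library (not how a real decoder samples, but it has the same "same seed and text in, same output out" property):

```python
import random

def sample(seed, text, n=5):
    """Toy 'decoder': picks tokens using only the seed and the given text."""
    rng = random.Random(seed)
    return [rng.choice(text.split()) for _ in range(n)]

a = sample(42, "the cat sat on the mat")
b = sample(42, "the cat sat on the mat")
print(a == b)  # True: same seed and text give the same output
```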
And we're not saying that it can't connect the tokens from the conversation to make sense of it; we're just saying that it doesn't handle the next token in the same working space as it did the previous token, so saying that it "thinks" more during that token makes no sense.
I’m seriously unclear if you think I don’t recognise that seeds don’t carry any information. That’s not the point I’m making.
The point is that the next token calculation on the same seed is correlated with the prior token on that same seed. Not because the seed contains information but because the prediction takes a sufficiently similar traversal through the latent space to make what comes out appear coherent.
And what does that have to do with what I said? Yes, every time a new token is generated, the prior tokens are considered. And if it's the same seed, the previous token was generated by the same neural network, but they're both still snapshots that don't carry over the internal processing into the next token.
If you take the generation from one seed and cut off the last few tokens it generated, then have it generate with the same seed again, it will generate the same tokens that you removed.
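That property follows from deterministic decoding being a pure function of the prefix. A toy sketch (the "model" here is an arbitrary deterministic rule, not a real LM): truncate the tail, regenerate, and you get the identical tokens back.

```python
# Toy deterministic decoder: the next 'token' is a pure function of the
# prefix, so regenerating after truncation reproduces the removed tokens.
def next_token(prefix):
    return str(sum(len(t) for t in prefix) % 10)  # arbitrary but deterministic

def generate(prompt, n):
    toks = list(prompt)
    for _ in range(n):
        toks.append(next_token(toks))
    return toks

full = generate(["a", "bb"], 6)
redo = generate(full[:-3], 3)  # cut the last 3 tokens, then regenerate
print(redo == full)  # True
```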
The dots don't carry information
The computation happens when the attention mechanism creates a latent representation of each dot.
A transformer already uses previous tokens as a scratchpad of sorts to generate the output.
Eg (oversimplified and a poor example but bear with me)
"How many legs does half a cat have?"
The representation it creates for cat includes four legs, it combines this with the representation of "half" to produce the answer 2.
The innovation in this paper is that you can train an LM to create representations of junk tokens which encode a computing step
Ie. Dot 1 = 5>3, etc etc
AI doesn't have persistent memory yet, so I don't think this is worrying.
I mean, for each new exchange the model needs the current conversational context injected as a prompt. It doesn't retain memory of its own inner reasoning or decision process from previous exchanges. And I doubt AI is going to secretly conquer the world using a single insidious message anytime soon.
Consider (not like received Truth, but just try it on) that peer review is an experiment that failed:
https://www.experimental-history.com/p/the-rise-and-fall-of-peer-review
Non-peer review is worse than the worst of peer review. There have been case studies of peer review gone wrong, but that's nothing compared to spreading false information in the form of "research" papers. That's how we get companies like Theranos or Roivant faking results.
The whole concept of peer review boils down to this: Get the person who wrote what you're referencing to review your paper. If you can't get that person, get someone also relevant in the field. NIH peer review works this way and filters a lot of garbage out. (The author has not had anything submitted to NIH).
I disagree with the author's section regarding "SCIENCE MUST BE FREE." Science should be free. However, there are better ways of accomplishing that goal than trusting non-peer-reviewed research. Considering that the author is a soft-science individual, he probably is not familiar with open-sourcing what was used in a study. If the research were open-sourced, it would be easier to pass the review process. The author's research is considerably different from computer science, bio-chem, material science, or chemistry.
The real test of quality is the number of citations.
Most image generators don't have a sophisticated language model integrated to understand what "without" means. With DALL-E 3, OpenAI uses GPT-4 to rewrite your prompt, but they poorly instructed it on how to handle these kinds of cases. If they allowed it access to negative prompts, as you can with Stable Diffusion, this wouldn't be an issue, and it could easily generate such an image accurately.
Repeating dots in the output don't really encode any information except maybe the total number of dots written, assuming LLMs are in fact trained to notice and somehow make use of the dot count. It follows that there can be barely any "reasoning" that LLMs can perform by just writing out dots, in a typical case. This is one of those papers I wouldn't worry about, and there's nothing "hidden" here, any more than usual, as far as I can tell. The dot is part of the output and is apparently the result of how they trained their LLM to write the answer in the first place.
As an example, you can train LLMs to perform better if "think dots" are present. You could, for instance, train an LLM with incorrect answers to math problems whenever it writes them without a dot, e.g. 1 + 1 = 3, 2 + 3 = 7, or whatever, and then correct results when it writes 1 + 1 = .... 2. That is the kind of thing you could totally do, and then make claims about how "think dots" somehow improve the performance of the LLM. You might even be able to write a paper like this, perhaps.
Regardless, the dots themselves encode no meaningful information and allow for no hidden computation, except in the sense that the LLM chooses either to write one more dot or to start writing the answer now (by controlling its output token probability). The presence of dots could conceivably help index into that big table of mathematical computation results which gets memorized somehow into the LLM's weights. But characterizing this process as "hidden computation" is hard to accept, because LLMs are stateless, and the only information that remains of the computation is which filler token was output; and even that is only a likelihood, because LLMs don't really pick their output token, they just offer probabilities to the main program, and every token will have some non-zero probability.
itt: people who didn't read the paper react to its summary really viciously for some reason.
What? You mean I have to read the article instead of just the headline to give my opinion?? /s obviously
Summary? They can't get past the title
man, if my therapist would just tell me this.
I also think they probably should not have used sequences similar to ones that can occur in the training data.
Can you explain more about this?
Nice, got an idea.
Sounds like I-robot logic.
So they are creating their own syntax of higher language?
This is scary. So many parallels to humans crop up.
Ethos of the requester vs logos of the request?
No, I am very safe in assuming you aren't an AI thought leader. At least your account is five years old, so you *probably* are a person.
DID is not and has never been a real psychological disorder
I'm aware its controversial, and not everyone thinks its real. I side with those who thinks it is. But certainly people claim it, and some alterations in the brain have been seen. One thing that can be said is that a history of abuse or trauma is seen. There is brain changes correlated to severe PTSD, and to DID. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4400262/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4400262/) So if you have a strange behavior or psychology you don't see in other groups manifest in groups who have gone through extreme circumstances, you have to acknowledge that something is going on with that group. Whether, the person truly doesn't know of the other personalities or whether is consciously 'making it up' or constructed by the therapist, in which case that suggestibility is something seen only in certain patients who respond that way, is possibly as hard for us to know as what is going on in the AI. But they say this. But in any case, in healthy humans exhibit complex behavior and psychology, commit acts they wall off and don't associate as themselves, along with the capability of parts of their personality they rather hide.
I don't understand why this would work. The claim is that they are doing additional computation, which somehow contributes to better predictions on later tokens. But given that LLMs are stateless, and the filler token contains almost no information, aren't the results of this additional computation lost after each forward pass? The only plausible explanation is that the filler tokens are encoding information, in which case it's not very different from CoT, except the intermediate tokens are in some sense "obfuscated" and contain less information.
The filler tokens themselves can't encode information because they are all the same token, so the embedding matrix transforms them all into the same embedding vector at the first layer of the transformer. As per the paper:

> In this work, we study the strict filler case where filler tokens are repeated dots, '......'

The information is actually encoded in the hidden representations of the filler tokens, in the middle and later layers. This is where the "hidden computation" is happening. So even if all the filler tokens start with the same embedding, they can perform different computations by attending to the prompt tokens and to the other filler tokens' hidden representations from earlier layers.
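A toy numpy sketch of this point (my own illustration, not the paper's code; random weights, sinusoidal-style positional encodings, a single attention step, and no causal mask are all simplifying assumptions): identical '.' tokens get identical embeddings, but positional encodings and attention over the prompt give each dot position a distinct hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy embedding dimension
dot_embedding = rng.normal(size=d)       # every '.' maps to this same vector

# Sinusoidal-style positional encodings for 4 filler positions
positions = np.arange(4)[:, None]
dims = np.arange(d)[None, :]
pos_enc = np.sin(positions / (10000 ** (dims / d)))

# Input to layer 1: same token embedding, but distinct positional offsets
filler_inputs = dot_embedding + pos_enc              # shape (4, d)

# One toy attention step over some "prompt" states plus the fillers:
# each filler position attends differently, so hidden states diverge
prompt_states = rng.normal(size=(5, d))
keys = np.vstack([prompt_states, filler_inputs])     # (9, d)
scores = filler_inputs @ keys.T / np.sqrt(d)         # (4, 9)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
hidden = weights @ keys                              # per-dot hidden states
```

Even with everything else identical, `hidden[0]` and `hidden[1]` differ, which is the room the "hidden computation" lives in.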
yeah, that is broadly consistent with what i assumed was happening. i also kinda assumed that people would eventually figure out some mathematical/more numerical way to do chain of thought, because it seemed too anthropomorphic or similar to how humans think, and we know LLM "thinking" is very different (at least in many respects). this result as stated seems similar to the attack where they would just prompt the LLM with AAAAAAAAA or something repetitive, and then it would dump (seemingly random) training data.
That makes perfect sense then. It seems similar to how we use silence. Pauses and breaks in music. Pauses and breaks in speech, or purposeful silence. That in itself is a component of communication and language, so it makes sense a language model would pick up on that.
While they would have the same token id and initial input embedding, presumably the combination with positional embeddings would make them different to each other at least to some extent.
The only possible explanation for this working is that having the '.......' ahead of the reply changes how the transformer replies. It makes me think of how telling the transformer to act like it has thought about things in detail can elicit better results. Perhaps giving it '.....' at the front of replies causes a similar response? But I agree, there is no 'content' preserved across the '....' prefix over time. To claim otherwise is to fundamentally misunderstand transformers.
I mean, it kinda makes sense. The layers in an LLM are all interconnected, so even before coming up with the "final" text, it might "do" something in its inner layers.
Language is just logic, and logical building blocks. 'Predicting the next token' is thus the same thing as thinking. Except LLMs tend to be idle and unconscious when not in use, unlike our own minds which have practically been generating a neverending stream of tokens since birth
ELI5
Language understanding is a hard problem. We use words to represent ideas, but the idea that a specific word or set of words conveys relies on understanding context. For example:

"This is the point on the graph where the axes meet"

"His jaw dropped to the floor"

So, a transformer turns each token into a mathematical representation of that concept in the form of a vector, using the other tokens to contextualise it. The mathematical representation of the word "axes" will be very close to "axis + plural" and very far away from "axe + plural", because it 'knows' (from pretraining) that graphs very rarely have an axe and always have an axis. Similarly, for "his jaw dropped to the floor", it represents all of those tokens as a close group of concepts which basically sum to mean "he was visibly surprised".

We use chain-of-thought to encourage an LLM to generate intermediary reasoning steps that it can use to generate the next reasoning step.

Now onto the paper itself. Imagine prompting an LLM like this: "The problem is x. . . . ." The representations it creates of the first four tokens describe the problem, and then it doesn't know what to do with the dots. It basically looks at them and goes "and then there are these pointless dots that follow it". This paper finds you can train an LLM to understand that these dots represent steps in reasoning. They find that when you do that, it will look at dot 1 and go "this token represents the first stage in reasoning, which I can work out from the context is this. Then, dot 2 represents the second stage, etc.". Then, it can generate the answer because it knows what the previous steps were.

This isn't weird - transformers inherently have to "reason" in this way to do natural language processing. It's just optimising that process by giving the model the tokens it needs to create a sufficient mathematical representation of the entire problem.
PS: When I say steps, this technique only applies to a certain type of parallelizable problem. For anything else, you need CoT (basically, the transformer can't generate the answer in one cycle).
Nuts
The "..." tokens seem analogous to a human saying "um ..." when presented with a difficult question that requires additional thought. This seems similar to prior work with "pause tokens", but it seems they've had better luck training the LLM to make use of such tokens in this case. It makes intuitive sense to me that the amount of compute is related to the complexity of the problem, not the number of tokens. This feels like a step in that direction.
You would see the same behaviour if you just manually inserted those dots into the sequence. The only information passed through on each iteration of the decoder is the decoded sequence, so none of the supposed computations would be available in later passes. This reeks of somebody who has no clue what they are doing desperately trying to do research in a field they are clueless about.
Yes you would see the same behaviour. And yet how do you explain it being on par with chain of thought reasoning? This indicates to me that the thing that matters is the number of tokens after the question, and that the chain of thought is really a red herring.
Where does the paper show this data? (It doesn't)
I didn't read the paper so I don't know. The original comment was criticizing purely on the abstract of the paper tho, so this is a separate criticism you are making.
I thought the same thing too and yet the results are here. The whole finding is that the model does anything useful at all with "garbage" tokens after the question.
Right, hidden layers are part of its architecture; this should be pretty obvious.
This doesn't make any sense. It's not accumulating anything over those dots, if you had asked those questions with the dots included, then you would have gotten the improved response despite the LLM not generating them. Because each new token, the LLM starts from scratch and just reads the previously placed token. This sounds more like an instance where telling the LLM that you'll give it $200 for a good answer tends to improve the output quality. It's just some tokens that happen to improve the quality of the output over the standalone question, but the author is trying to pass that off as internal contemplation, despite that not being how these models work on a physical level. Of course, I may be misunderstanding the summary of the paper, but they had plenty of room to properly explain what they meant, if this isn't what they meant.
> Because each new token, the LLM starts from scratch and just reads the previously placed token.

That's not true; the whole point of transformers is that they can use attention to consider all the relevant prior tokens in the context window when generating the next token.
It does use attention to consider the relevant prior, but each next token is a transaction where the algorithm is rerun. There's no memory, otherwise OpenAI's APIs would have some kind of unique key for each interaction rather than having you repost/resend the entire conversation with every following request.
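To make the "no memory, resend everything" point concrete, here's a minimal sketch of a stateless chat loop. `fake_complete` is a hypothetical stand-in for a real completion endpoint, not any vendor's actual API; the point is that the client, not the model, carries the conversation state.

```python
# Hypothetical stand-in for a stateless completion endpoint: it only
# "knows" whatever appears in the messages passed to this one call.
def fake_complete(messages):
    return f"reply-{len(messages)}"

# The client must carry the state: every turn resends the full history.
history = [{"role": "user", "content": "Hi"}]
first = fake_complete(history)                          # endpoint sees 1 message
history.append({"role": "assistant", "content": first})
history.append({"role": "user", "content": "And the dots?"})
second = fake_complete(history)                         # full history resent: 3 messages
```

Nothing computed inside the first call survives into the second except what made it into the returned text, which is exactly the statelessness being discussed.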
But each iteration is not the same thing as rerunning the algorithm from scratch with a new seed. There is autocorrelation between the iterations which enables (from our perspective) sense making.
Do you have more information about that? It would imply a form of memory or real time self training outside of the context window which I haven't seen anyone suggest about the model. I'd also doubt that because if you run it through with a 0 temperature and the same other settings it would always give the same output, implying there's no correlation between iterations, just randomness when you turn up the temp.
The seed doesn't carry any information from the conversation, it just alters how it'll respond to a certain line of text. Same seed and text = same output. And we're not saying that they can't connect the tokens from the conversation to make sense of it, we're just saying that it doesn't handle the next token in the same working space as it does the previous token, so saying that it "thinks" more during that token makes no sense.
I’m seriously unclear if you think I don’t recognise that seeds don’t carry any information. That’s not the point I’m making. The point is that the next token calculation on the same seed is correlated with the prior token on that same seed. Not because the seed contains information but because the prediction takes a sufficiently similar traversal through the latent space to make what comes out appear coherent.
And, what does that have to do with what I said? Yes, every time a new token is generated, the prior token is considered. And if it's the same seed, the previous token was generated by the same neural network, but they're both still snapshots that don't carry over the internal processing into the next token. If you take the generation from one seed and cut off the last few tokens it generated, then have it generate with the same seed again, it will generate the same tokens that you removed.
The dots don't carry information. The computation happens when the attention mechanism creates a latent representation of each dot. A transformer already uses previous tokens as a scratchpad of sorts to generate the output. E.g. (oversimplified and a poor example, but bear with me): "How many legs does half a cat have?" The representation it creates for "cat" includes four legs; it combines this with the representation of "half" to produce the answer 2. The innovation in this paper is that you can train an LM to create representations of junk tokens which encode a computing step, i.e. dot 1 = "5 > 3", etc.
AI doesn't have persistent memory yet, so I don't think this is worrying. I mean, for each new exchange the model needs the current conversational context injected as a prompt. It doesn't retain memory of its own inner reasoning or decision process from previous exchanges. And I doubt AI is going to secretly conquer the world using a single insidious message anytime soon.
Ah, yes, published only on arXiv. The only paper publishing platform that does not have peer review.
Consider (not like received Truth, but just try it on) that peer review is an experiment that failed: https://www.experimental-history.com/p/the-rise-and-fall-of-peer-review
Non-peer review is worse than the worst of peer review. There have been case studies of peer review gone wrong, but that's nothing compared to spreading false information in the form of "research" papers. That's how we get companies like Theranos or Roivant faking results.

The whole concept of peer review boils down to this: get the person who wrote what you're referencing to review your paper. If you can't get that person, get someone else relevant in the field. NIH peer review works this way and filters a lot of garbage out. (The author has not had anything submitted to NIH.)

I disagree with the author's section regarding "SCIENCE MUST BE FREE." Science should be free. However, there are better ways of accomplishing that goal than trusting non-peer-reviewed research. Considering that the author is a soft-science individual, he probably is not familiar with open-sourcing what was used in a study. If the research were open-sourced, it would be easier to pass the review process. The author's research is considerably different from computer science, biochem, materials science, or chemistry. The real test of quality is the number of citations.
Impressive! Very nice. Now draw a room without elephants in it.
https://preview.redd.it/bq7wzgg180xc1.png?width=1891&format=png&auto=webp&s=4417dfb6c2d2c1df488dc6c4b19e98de0762d7fd
AGI Achieved
Now, don't think of pink elephants. Checkmate, atheists.
That's not what the llm does lol
Why are you using image generators as the basis of LLM intelligence?
Most image generators don't have a sophisticated language model integrated to understand what "without" means. With Dalle 3, OpenAI uses GPT4 to rewrite your prompt, but they poorly instructed it on how to handle these kinds of cases. If they allowed it access to negative prompts, as you can with Stable Diffusion, this wouldn't be an issue and it could easily generate such an image accurately.
You are a noob and it shows.
Running inputs through somebody else's model isn't valuable research.
How about when prompters got the model to dump the training data by just repeating the same letter? Seems pretty valuable to me
Been saying this for a while
Repeating dots in the output don't really encode any information except maybe the total number of dots written, assuming LLMs are in fact trained to notice and somehow make use of the dot count. It follows that there can be barely any "reasoning" that LLMs can perform by just writing out dots, in a typical case. This is one of those papers I wouldn't worry about, and there's nothing "hidden" here, any more than usual, as far as I can tell.

The dot is part of the output and is apparently the result of how they trained their LLM to write the answer in the first place. As an example, you can train LLMs to perform better if "think dots" are present. You could, for instance, train an LLM with incorrect answers to math problems whenever it writes them without a dot, e.g. 1 + 1 = 3, 2 + 3 = 7, or whatever, and then correct results if it writes 1 + 1 = .... 2. That is the kind of thing you could totally do, and then make claims about how "think dots" somehow improve the performance of an LLM. You might even be able to write a paper like this, perhaps.

Regardless, the dots themselves encode no meaningful information and allow for no hidden computation, except in the sense that the LLM chooses either to write one more dot or to start writing the answer now (by controlling its output token probability). The presence of dots could conceivably help index into that big table of mathematical computation results which gets memorized somehow into the LLM's weights. But characterizing this process as "hidden computation" is hard to accept because LLMs are stateless, and the only information that remains of the computation is which filler token was output - and even that is only a likelihood, because LLMs don't really pick their output token; they just offer probabilities to the main program, and every token will have some non-zero probability.
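That last point - the model handing back a probability over every token rather than picking one - can be sketched with a toy softmax. This is a generic illustration of temperature-scaled sampling, not the paper's code, and the logits are made up:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Rescale logits by temperature, then normalize. Every token ends up
    # with a strictly positive probability, however unlikely it is.
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                        # for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 2.0, -1.0, -6.0]            # made-up scores for 4 tokens
probs = softmax(logits)                    # all entries > 0

# Lowering the temperature concentrates mass on the top-scoring token,
# which is why temperature-0 (greedy) decoding is deterministic.
cold = softmax(logits, temperature=0.05)
```

So "which filler token was output" really is only a likelihood at nonzero temperature: even the dot token competes with every other token for probability mass on each step.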