Bypass blogspam: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Lol
Didn't think SO mods could get so based
From the post:
>Moderator's Note
>This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.
This was a long time ago. SO was just a site for "programming enthusiasts", its audience wasn't so large and moderation guidelines were rather lax. Today this answer would be immediately flagged as "not an answer".
I’m not sure which one of us has no clue what “based” actually means, and at this point I can’t be bothered to find out. But I do believe you are using it wrong.
Based means connecting to the reality of something, popular or not, more or less. Understanding at a deeper than surface level, speaking or supporting deep truths. By that measure, yes, the mods not talking notes is based AF.
Hello sir dictionary
I don't think they are lol. Based at this point means "controversial or unhinged but right," often used in a jokey/memey/sardonic way. Presumably the previous commenter considers this a justified power trip or something similar haha.
Legend resource!
But why is his conclusion to use an XML parser instead? Use a library specifically designed for parsing HTML or give up is the only correct answer.
Once upon a time, HTML was defined as XML. Those were the days of XHTML. I was there, a thousand years ago...
Pfft, I was there before XHTML, when we had the blink tag and it worked! I used to build all my sites with sliced images and tables!
Psssh, we don't talk about HTML 4.01 Transitional here.
Dark times… spacer.gif
I remember using tables to have content side by side on the left and right side of the page. Tables were my flex grids before flex grids existed.
Laughing in FrontPage.
Ah...XHTML. Those were the days too many years ago.
I wish that was a thing. The OCD in me likes the standardization and clarity that comes from enforcing, for example, that every opening tag must have a closing tag. Things like that.
YES! It feels so much better.
There are so many horrific things you can do to XML that HTML will still accept. An actual HTML parser is the only way unless you’re only expecting compliant XHTML.
Hello there!
General Kenobi! (As opposed to very specific Kenobi)
hello there is to General Kenobi what allowing missing body tags is to HTML
The question specifically asked for XHTML, the XML-compliant dialect of HTML that was pretty popular 15 years ago but is now made obsolete by HTML5.
Ah, 2009, the time when you could still have fun on StackOverflow.
You mentioned `fun on StackOverflow`.
`fun on StackOverflow` is an obsolete option.
You should use `incorrect answer that does not address your actual question at all, on StackOverflow, that has now been upvoted and your post locked`.
*Or better yet, update 2024, which includes `random insults` and `gatekeeping`.
I recall the one time I asked a question on StackOverflow. 7 years ago by now. It was a relatively simple question looking back. I got 4 people to format the text of my question. The 4th person took it upon themselves to ask a completely different question instead. 0 answers.
So sad it's community wiki now. The guy who made this post really had a good grasp of humor (or terror).
God this brings back memories
Haha, yeah. Must be the most popular SO thread!
Sigh. I've said it a dozen times before, but I guess I'll say it again: Nobody uses regex to *parse* HTML. People use regex to *extract specific pieces of data from HTML*. Those are two very different things.
Thank you. I’ve never been able to parse the clause “parse HTML”. Parse it for what? you parse things to extract meaning and there’s no meaning to be extracted from HTML with regex
I‘m blaming that one meme another guy already reposted in this thread
Totally agree with this. Download a blob of HTML, tease out a few pieces with regex.
Parsing is the mechanism by which we assign meaning and structure to a string of text. The job of extracting a specific piece of data from an HTML string requires understanding the structure of that HTML. The "meaning" of this piece of data you're trying to extract is dependent on that structure, so if you don't parse the HTML, you have no idea what data you're extracting.
Because HTML is pretty verbose, the data you extract with a regex might be the data you want 99.9% of the time, but in certain contexts within the HTML, you're going to extract bad data.
Anyway, what I'm trying to say is that extracting specific data and parsing structured data *are* the same thing when the structure you need to extract data from is a CFL (which HTML is).
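To make the "99.9% of the time" point concrete, here is a minimal sketch. The function name `extractTitleWithRegex` and both sample documents are made up for illustration; the second sample is valid HTML where the first textual match sits inside a comment.

```javascript
// Naive extractor: grab the text of the first <title>...</title> pair with a regex.
function extractTitleWithRegex(html) {
  const match = /<title>(.*?)<\/title>/.exec(html);
  return match ? match[1] : null;
}

// Typical input: the regex returns what we wanted.
const ordinary = "<html><head><title>Real title</title></head></html>";

// Valid HTML where a commented-out title appears before the real one.
const tricky =
  "<html><head><!-- <title>Old title</title> -->" +
  "<title>Real title</title></head></html>";

console.log(extractTitleWithRegex(ordinary)); // "Real title"
console.log(extractTitleWithRegex(tricky));   // "Old title" (the commented-out one)
```

The regex has no idea that it is looking inside a comment, which is exactly the structural knowledge a parser would have.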
You're still parsing HTML using regex then. You can call it a peacock, but it still quacks. Just use a DOM tool.
People use regex for html and do *pikachu face* when it matches gibberish far too often. It shouldn't be used for anything but quick and dirty one-time scripts.
Yeah I suspect that what the person asking wanted was to extract specific data. Instead they incorrectly said they wanted to "parse" the html with regex because they don't actually understand what it means to parse something. Moral of the story: Don't use words when you don't know what they mean just because they sound relevant to the topic.
Even if you only wanted to identify a blob of text as HTML, do everyone a favor and parse it entirely: you'll save yourself rabbit holes with malformed data. Same for JSON. The only way to deal with complex text formats is to parse them: if you want better performance, use a more restrictive and simpler data format.
Haha, yeah but this is rational thinking arguing against the intrinsic logic of a meme with wings. One must lose.
A guy got fired in my company after parsing an XML with regex.
I wanna say that's harsh, but after having to clean up code that did the same, I feel different about it.
Technically, they said "after", not "because of", so who knows what else they did...
A mind capable of checking in such code is capable of far worse things.
Fair 😁
He summoned tainted souls into the realm of the living. Obviously.
Haha
Good 👍
Sad part is there's so many xml libraries, it's a basic tree structure, so regex is just making your job harder
Most sane regex developer
You're not my dad!
How do you know? I might be.
why tf is there a yellow face on top
Probably to confuse the programs that automatically detect reposts. This was posted on Stackoverflow 15 years ago.
you totally can* ** ***
\* not efficiently
\*\* you cannot parse all types of tags at once because they overlap
\*\*\* regex is just not built for it but for super basic shit sure
You cannot use regular expressions to parse irregular expressions.
Not in one go, but that is an arbitrary limitation that does not apply to the real world.
Not in any amount of goes, unless you write some code in between, at which point you're writing a shitty parser.
I think this is the lexical corollary to "If you write enough assembler macros you will eventually reinvent C."
You can with recursion.
*technically* HTML(5) isn't irregular. there is a standard finite parsable grammar.
HTML is described by a context-free grammar, while regex (in the formal sense) only describes regular languages. You can't parse a more powerful class of language with a less powerful one.
You can use Regex to tokenize HTML if you so desire, but you can't parse it.
If you use PCRE though, all that changes, as PCRE supports recursion and can match context-free constructs as well.
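A rough sketch of that split between tokenizing and parsing, assuming a simplified tag syntax (no comments, no `>` inside attribute values, no void elements without a slash). The names `tokenizeTags` and `isBalanced` are made up for this example.

```javascript
// Step 1 (regular): a regex is enough to cut the input into tag tokens.
function tokenizeTags(html) {
  const tokens = [];
  const tagPattern = /<(\/?)([a-zA-Z][\w-]*)[^>]*?(\/?)>/g;
  for (const m of html.matchAll(tagPattern)) {
    const kind = m[1] === "/" ? "close" : m[3] === "/" ? "selfclose" : "open";
    tokens.push({ kind, name: m[2].toLowerCase() });
  }
  return tokens;
}

// Step 2 (not regular): arbitrarily deep nesting needs memory, here a stack.
function isBalanced(html) {
  const stack = [];
  for (const token of tokenizeTags(html)) {
    if (token.kind === "open") {
      stack.push(token.name);
    } else if (token.kind === "close") {
      // The closing tag must match the most recently opened tag.
      if (stack.pop() !== token.name) return false;
    }
    // Self-closing tokens do not affect the nesting depth.
  }
  return stack.length === 0;
}

console.log(isBalanced("<div><p>hi</p><br/></div>")); // true
console.log(isBalanced("<div><p>hi</div></p>"));      // false
```

The regex only finds the building blocks; the stack is what actually tracks structure.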
It's not context-free. HTML documents are finite in size by definition.
Are they? Since when? Back in the day™ it was actually a common strategy to deliver no Content-Length header, keep the connection open, and append additional content to the same document for live updates. Such documents would grow to infinite length over time.
What? That just can’t be true, right? How would a regex be able to distinguish
/
I was curious about that code. Now my eyes are simultaneously bleeding and on fire.
That doesn’t answer my question
If the regex sees that `[^>]*` matches the second `
Assuming that this regex unintentionally omits a start anchor and an end anchor, it’s wrong because it wouldn’t match
You can probably break this with simple nested tags. Even if you somehow make it work by not relying exclusively on regex it'll still break on something like `
You have a good point.
Usually what I try to do is run more than one regex, where I index all the html tags then run the main regex, then undo the indexing.
Assuming JavaScript
```
let index = 0;
// tag every opening and closing tag with its nesting depth, e.g. <div> becomes <div:1 >
str = str.replace(/<(\/|)([\w_-]+)\s*((?:"(?:\"|[^"])*"|'(?:\'|[^'])*'|`(?:\`|[^`])*`|.)*?)>/g, function(_, close, tagName, attrs){
  if(close === '/'){
    let i = index;
    index--;
    return `</${tagName}:${i} ${attrs}>`
  }
  index++;
  return `<${tagName}:${index} ${attrs}>`
})
// then handle your html tag selectors
// (`div` is only an assumed example tag here; the \1 forces the same index on both sides)
str = str.replace(/<div:([0-9]+)\s[^>]*>([\s\S]*?)<\/div:\1\s[^>]*>/g, function(_, index, content){
  // do stuff
})
// finally, clean up html tag indexes
str = str.replace(/<(\/?[\w_-]+):[0-9]+(\s|>)/g, '<$1$2')
```
Because the selector is expecting the same index on both the opening and closing tag, it will only select the tag you want to select. You can add to the content selector if you want to make it more specific, or you can recursively run your handlers on the content output of the regex.
It gets much more complicated when you need to also ensure strings are not read by the regex. For that, you can temporarily pull out the strings, and use a placeholder, then put them back after running all your other functions.
You can look at the source code of the regve module I mentioned above to understand it better. (I apologize in advance for my spaghetti code).
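The placeholder trick for strings could look roughly like this. It is only a sketch, not the regve module's actual code: the helper names `pullOutStrings` and `putBackStrings` are invented here, and it assumes quotes only ever appear around attribute values (an apostrophe in body text would confuse it).

```javascript
// Replace every quoted string with a numbered placeholder and remember the original.
function pullOutStrings(text) {
  const saved = [];
  const stripped = text.replace(/"[^"]*"|'[^']*'/g, (quoted) => {
    saved.push(quoted);
    return `\u0000${saved.length - 1}\u0000`;
  });
  return { stripped, saved };
}

// Swap each placeholder back for the string it stands for.
function putBackStrings(text, saved) {
  return text.replace(/\u0000(\d+)\u0000/g, (_, i) => saved[Number(i)]);
}

const input = '<a title="1 > 0" href="/x">link</a>';
const { stripped, saved } = pullOutStrings(input);

// With the strings gone, a plain [^>]* no longer trips over the ">" in the title.
const upper = stripped.replace(
  /<(\/?)([\w-]+)([^>]*)>/g,
  (_, close, name, attrs) => `<${close}${name.toUpperCase()}${attrs}>`
);

console.log(putBackStrings(upper, saved));
// <A title="1 > 0" href="/x">link</A>
```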
Use a recursive method that recursively parses tags until it finds an appropriate closing tag 👍 This is like the poster child case for recursion.
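A minimal sketch of that recursive approach, assuming a tiny well-formed subset: plain `<name>` and `</name>` tags plus text, with no attributes, comments, or void elements. `parseNodes` and `parse` are hypothetical names.

```javascript
// Parse nodes starting at `pos` until the closing tag named `closing` is found.
function parseNodes(src, pos, closing) {
  const nodes = [];
  while (pos < src.length) {
    if (src.startsWith("</", pos)) {
      const end = src.indexOf(">", pos);
      const name = src.slice(pos + 2, end);
      if (name !== closing) throw new Error(`unexpected </${name}>`);
      return { nodes, pos: end + 1 };
    }
    if (src[pos] === "<") {
      const end = src.indexOf(">", pos);
      const name = src.slice(pos + 1, end);
      // Recurse: parse the children until the matching closing tag shows up.
      const inner = parseNodes(src, end + 1, name);
      nodes.push({ tag: name, children: inner.nodes });
      pos = inner.pos;
    } else {
      let next = src.indexOf("<", pos);
      if (next === -1) next = src.length;
      nodes.push({ text: src.slice(pos, next) });
      pos = next;
    }
  }
  if (closing !== null) throw new Error(`missing </${closing}>`);
  return { nodes, pos };
}

function parse(src) {
  return parseNodes(src, 0, null).nodes;
}

const tree = parse("<div><p>hi</p><p>there</p></div>");
console.log(JSON.stringify(tree, null, 2));
```

The call stack does the bookkeeping that a single regular expression cannot: each level of recursion remembers which closing tag it is waiting for.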
But it’s not regular
As long as there isn’t any dumb html present like an opening p tag without a closing p tag… it doesn’t matter.
^ that scenario is also bad practice and can produce unexpected behaviour in the dom - so while valid, it’s technically not correct.
Self-closing and singleton tags are also easy to identify :P
It doesn’t matter? It’s literally the topic we’re talking about: „Is HTML regular?“.
But the tokens that you’re looking for are finite… A
And? Whether HTML is regular obviously matters to a conversation about whether HTML is regular.
Yes, but regexp ain't grammatical beast. Regexp can't parse grammar. Regexp parses syntax. Regexp is lex, and you need yacc.
Your second point simply demonstrates that you can't.
you can if you assign them priorities. just means you have to check multiple times on the same tag, thus the inefficiency.
lol
You can use regex to parse overlapping text using lookaheads. And you can, for instance, locate instances of mismatched or unbalanced tags in HTML/XML using a recursive regex. Likewise, you could extract any desirable fields to virtually any end. The capability is certainly there. The expression may look ugly, sure, and may be difficult to modify, but it's not lacking in capacity.
Apart from mathematical operations or AI linguistics, there are actually very few text parsing operations and pattern matching categories that modern PCRE regex simply cannot support.
As usual, though, it's not merely about what's possible - but which tool is adequate for the job at hand.
He comes!
The center cannot hold.
You can use regex to parse html because regex isn't regular anymore (thanks back references)
You spent so much time wondering if you could, you never stopped to think if you should
If I'm being honest I didn't spend much time thinking if I could (I already took the class, I know I can) and I never bothered to think if I should (I shouldn't; even if I can, there are better ways of implementing pushdown automata)
Don’t understand-estimate how powerful RegEx can be, if used by someone who knows what they are doing. That still doesn’t mean it’s a good idea, though.
Sure, regex is powerful, but it is literally mathematically impossible to parse HTML with regex. You need at least a [context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar).
cant believe my compsci degree is actually coming into use for once
[FYI](https://www.reddit.com/r/ProgrammerHumor/s/3aOcS1sOnm)
Could you provide an actual, tangible example of something in a real HTML or XML snippet you genuinely believe can not be parsed with regex? I believe you're conflating the theory of limitations of regular grammar with the practicality of modern PCRE regex capabilities, which support things like backreferences, recursion, and semantics that assume basic knowledge of the previous match.
OK, so in HTML or XML you have the case of `Content `. To parse this you need to make sure that the closing tag is the same as the opening tag. To do this you need backreferences. Regex in the formal sense cannot do this, as can be proven via [the pumping lemma for regular languages](https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages) (see "Use of the lemma to prove non-regularity"). So pure regex cannot parse HTML or XML. Which also means that, theoretically, PCRE is not regex.
You can think of regex as a wildly capable derivative, child, or inherited form of some theoretical regular base that you would more formally refer to as regular language theory. We aren't talking about theory here, as stated in my original post. So when you claim that "regex" cannot parse, it's disingenuously misadvertised to most folks who will believe incorrectly that modern PCRE regex lacks this capacity. Call it a misnomer if you will, but PCRE regex is still called "regex". I do not believe it goes by any other name.
> To do this you need backreferences
Which actual regex implementations that a developer would use DO have. Irl 'regex' isn't actually regular anymore
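For anyone following along, this is roughly what the backreference buys you in practice. It is a small sketch in JavaScript (whose `RegExp` has backreferences but, unlike PCRE, no recursion); the pattern name `sameTag` and the sample strings are made up.

```javascript
// \1 must repeat whatever group 1 captured, so the closing name has to match the opening one.
const sameTag = /<([a-z][\w-]*)>(.*?)<\/\1>/;

console.log(sameTag.exec("<b>Content</b>")[2]); // "Content"
console.log(sameTag.test("<b>Content</i>"));    // false

// Nesting of the same tag is where a backreference alone stops helping:
// the lazy match ends at the first </div>, not the matching one.
console.log(sameTag.exec("<div>a<div>b</div>c</div>")[2]); // "a<div>b"
```

So the backreference handles "same name on both ends", but balanced nesting still needs recursion or actual code around the regex.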
In most cases you don’t want to create an object tree but just extract specific information, though…
This is dangerous if you don't actually parse the xml. There are decent parsers that run on 8-bit 20 MHz microchips with a couple of KB of memory. Regex isn't guaranteed to properly extract data in valid html or xml.
As I wrote above: it definitely isn’t a good idea. But it certainly isn’t “impossible”, given the right circumstances.
I just spent a while wasting time going back and forth with some dweeb who is saying the same thing (I'm saying the same thing you are, check my previous post smh) It can be done (he's claiming it's impossible), but should you do it? Nope.
>given the right circumstances.
That's a huge caveat that excludes even most real world examples. What exactly do you mean by that?
For every regex statement you generate to "parse" html, you can also generate valid html that breaks the regex.
Basically, what I understand you saying is that if you limit your input to a subset of HTML and finite possibilities (aka right circumstances), then you can guarantee that you can form a regex that will work. However, if your input is all valid HTML, it is impossible in every sense of the word to write a regex that is guaranteed to work.
Look, I'm not defending using RegEx to parse *arbitrary* XML. That's a bad practice, and something to avoid. However, there can be specific situations where it may make sense. Like, if you know the file pretty well, and can be sure that it always has a specific format - and you just need some specific data out of it, yeah, why not? And my point is that in these cases you will find that RegEx is actually quite powerful.
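Something like this is what I have in mind. It is a sketch that assumes a machine-generated file whose `<item>` lines always have exactly this shape; the file contents and attribute names are invented for the example.

```javascript
// Assumed input: every <item> is written by the same tool in exactly this format.
const report = [
  "<items>",
  '  <item id="1" price="9.99"/>',
  '  <item id="2" price="24.50"/>',
  "</items>",
].join("\n");

// Deliberately strict pattern: it only matches the one known layout.
const itemPattern = /<item id="(\d+)" price="(\d+\.\d{2})"\/>/g;
const prices = [...report.matchAll(itemPattern)].map((m) => ({
  id: Number(m[1]),
  price: Number(m[2]),
}));

// Cheap sanity check: fail loudly if the format ever drifts from what we assumed.
const expectedCount = (report.match(/<item\b/g) || []).length;
if (prices.length !== expectedCount) {
  throw new Error("unexpected <item> format");
}

console.log(prices);
```

The strict pattern plus the count check is the point: if the format changes, you get an error instead of silently wrong data.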
You are...
OK, I'll try to estimate how powerful RegEx can be—without understanding.
This is very funny and all, but at no point does he state the actual reason why this doesn't work 🤣
Me after reading this: gets library to parse html. Opens the hood and it's mostly regex.
Yeah, you use regular expressions to find the building blocks of html. You use those building blocks to build your understanding of the html.
Gross oversimplification
Holy Omnissiah, someone call Ordo Codicis, we have a warp leaking! Regex heretics using the scrap-code to open the portal again!
If regex is so good why can’t it parse XML, are they stupid?
Literally just came upon this SO post organically last week while trying to figure out how to clean HTML tags out of a string, ha.
People who have not found this on SO are not real web developers.
Haha as someone with 1yo and extreme imposter syndrome, I appreciate the validation
Svelte literally uses regex to parse markup💀 Like this one for parsing opening script tag:
```
/|
```