RegEx match open tags except XHTML self-contained tags?

Question

Pooja asked in Web Development 7 Jun, 2024
Kusum edited 14 Jul, 2024

I need to match all of these opening tags:

<p>
<a href="foo">

But not these:

<br />
<hr class="foo" />

I came up with this and wanted to make sure I've got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

Find a less-than, then
Find (and capture) a-z one or more times, then
Find zero or more spaces, then
Find any character zero or more times, greedy, except /, then
Find a greater-than

Do I have that right? And more importantly, what do you think?
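
One way to check the pattern is to run it against the sample tags. The sketch below uses Python's re module; the helper name tag_name and the last sample (an opening tag whose attribute value contains a slash) are assumptions added for illustration and are not part of the original question.

```python
import re

# Pattern copied verbatim from the question.
OPEN_TAG = re.compile(r'<([a-z]+) *[^/]*?>')

def tag_name(text):
    """Return the captured tag name, or None when the pattern does not match."""
    match = OPEN_TAG.search(text)
    return match.group(1) if match else None

samples = [
    '<p>',                 # should match
    '<a href="foo">',      # should match
    '<br />',              # should not match
    '<hr class="foo" />',  # should not match
    '<a href="/home">',    # illustrative extra case: opening tag with a slash in an attribute
]

for sample in samples:
    print(sample, '->', tag_name(sample))
```

In Python's syntax, *? is the lazy (non-greedy) form of *. The four tags from the question behave as intended, but the extra sample is also rejected, because [^/] refuses any slash before the closing >, including one inside an attribute value.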

2 Answers

Nadira · Answer 1 · 2024-06-07T19:39:47+0000

While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use regexes for parsing a limited, known set of HTML.

If you have a small set of HTML pages that you want to scrape data from and then load into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I took from the Parliament's website. This was a limited, one-time job.

Regexes worked just fine for me, and were very fast to set up.
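
As a rough sketch of that kind of one-off scrape: the markup, class names, and sample rows below are invented for illustration and are not the Parliament's actual page. The approach only works because every row is assumed to follow exactly the same fixed layout.

```python
import re

# Hypothetical markup, invented for this example; a real page will differ.
PAGE = (
    '<tr><td class="name">Jane Citizen</td>'
    '<td class="party">Example Party</td>'
    '<td class="district">Sampleton</td></tr>\n'
    '<tr><td class="name">John Smith</td>'
    '<td class="party">Other Party</td>'
    '<td class="district">Testville</td></tr>\n'
)

# One capture group per cell; [^<]* keeps each capture inside its own cell.
ROW = re.compile(
    r'<td class="name">([^<]*)</td>'
    r'<td class="party">([^<]*)</td>'
    r'<td class="district">([^<]*)</td>'
)

def extract_rows(html):
    """Return a list of (name, party, district) tuples found in html."""
    return ROW.findall(html)

for name, party, district in extract_rows(PAGE):
    print(name, '|', party, '|', district)
```

If the page changes its layout, adds attributes, or nests other tags inside a cell, the pattern silently stops matching, which is why this suits a limited, one-time job rather than general HTML.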

Nadira · Answer 2 · 2024-06-07T19:41:20+0000

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context-free grammar) and a regular expression describes a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse HTML or XML in general with a regular expression.

But many will try, and some will even claim success, but only until others find the fault and totally mess you up.
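
A small sketch of where the difference shows up in practice is nesting. The input string and the helper outer_element below are assumptions made up for this illustration; it covers only bare div tags and is not a general HTML parser.

```python
import re

# Illustrative input with one element nested inside another.
NESTED = '<div>outer <div>inner</div> tail</div>'

# A lazy regex stops at the first closing tag, so it cuts the outer element short.
regex_span = re.search(r'<div>.*?</div>', NESTED).group(0)

def outer_element(text):
    """Return the first complete <div> element by counting nesting depth."""
    depth = 0
    start = None
    for match in re.finditer(r'<(/?)div>', text):
        if match.group(1) == '':
            # Opening tag: remember where the outermost element begins.
            if depth == 0:
                start = match.start()
            depth += 1
        elif depth > 0:
            # Closing tag: the element is complete when depth returns to zero.
            depth -= 1
            if depth == 0:
                return text[start:match.end()]
    return None

print(regex_span)
print(outer_element(NESTED))
```

The depth counter is the part a regular grammar cannot express, since the nesting can be arbitrarily deep. The regex here only finds individual tags; the matching of opens to closes happens in ordinary code.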
