RegexOne - Learn Regular Expressions - Problem 4: Matching HTML

Problem 4: Matching HTML

If you are looking for a robust way to parse HTML, regular expressions are usually not the answer, because HTML pages on the internet today are fragile: common mistakes such as missing end tags, mismatched tags, or an unclosed attribute quote would all derail a perfectly good regular expression. Instead, you can use libraries like Beautiful Soup or html5lib (both Python) or phpQuery (PHP), which not only parse the HTML but also let you walk the DOM quickly and easily.
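To make the parser-based approach concrete, here is a minimal sketch. It assumes Python and uses the standard library's html.parser module in place of the third-party libraries named above, so it runs without installing anything. The TagCollector class and opening_tags function are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        self.tags.append(tag)


def opening_tags(html):
    """Return the names of all opening tags found in the given HTML string."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


# The </span> end tag is missing here, yet both opening tags are reported.
print(opening_tags("<div>Hello <span>world</div>"))  # ['div', 'span']
```

The parser reports tags as events and does not require the markup to be balanced, which is why the missing end tag does not derail it.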

That said, there are often times when you want to quickly match tags and tag content in an editor, and if you can vouch for the input, regular expressions are a good tool for this. As you can see in the examples below, some things you might want to be careful about are odd attributes that contain extra escaped quotes, and nested tags.

Go ahead and write regular expressions for the following examples.

Exercise 4: Capturing HTML Tags

Task	Text	Capture Groups
capture	<a>This is a link</a>	a
capture	<a href='https://regexone.com'>Link</a>	a
capture	<div class='test_style'>Test</div>	div
capture	<div>Hello <span>world</span></div>	div

Solution

It is a best practice to use a proper library to parse HTML, but to find simple tag names, you can use the expression <(\w+).
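As a quick check of this expression, the sketch below runs it over the four exercise strings using Python's re module. The choice of Python and the helper name first_tag_name are assumptions made for this illustration.

```python
import re

# The solution pattern: a literal "<" followed by one or more word characters.
TAG_NAME = re.compile(r"<(\w+)")

samples = [
    "<a>This is a link</a>",
    "<a href='https://regexone.com'>Link</a>",
    "<div class='test_style'>Test</div>",
    "<div>Hello <span>world</span></div>",
]


def first_tag_name(text):
    """Return the name of the first opening tag in text, or None."""
    match = TAG_NAME.search(text)
    return match.group(1) if match else None


print([first_tag_name(s) for s in samples])  # ['a', 'a', 'div', 'div']

# Closing tags are skipped because "/" is not a word character,
# but every opening tag is found, including the nested one.
print(TAG_NAME.findall(samples[3]))  # ['div', 'span']
```

Using search returns only the outermost tag, which is what the exercise asks for. findall shows that the same pattern also picks up nested opening tags.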

You can also capture tag contents with >([\w\s]*)<, or even attribute values with ='([\w://.]*)', although neither is the goal of this problem.
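The two optional patterns can be tried the same way. This is a sketch, assuming Python's re module; the constant names CONTENT and ATTR_VALUE are made up for the example, and the patterns are copied as written above.

```python
import re

# Tag contents: word characters and whitespace between ">" and "<".
CONTENT = re.compile(r">([\w\s]*)<")
# Attribute values in single quotes: word characters, ":", "/" and ".".
ATTR_VALUE = re.compile(r"='([\w://.]*)'")

print(CONTENT.findall("<a>This is a link</a>"))
# ['This is a link']

print(ATTR_VALUE.findall("<a href='https://regexone.com'>Link</a>"))
# ['https://regexone.com']

print(ATTR_VALUE.findall("<div class='test_style'>Test</div>"))
# ['test_style']

# Nested tags split the content into pieces, including an empty string
# for the gap between "</span>" and "</div>".
print(CONTENT.findall("<div>Hello <span>world</span></div>"))
# ['Hello ', 'world', '']
```

The last result shows why nested tags need care: the pattern has no notion of which tag a piece of text belongs to, so the content comes back fragmented.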


Lesson Notes

	abc…	Letters
	123…	Digits
	\d	Any Digit
	\D	Any Non-digit character
	.	Any Character
	\.	Period
	[abc]	Only a, b, or c
	[^abc]	Not a, b, nor c
	[a-z]	Characters a to z
	[0-9]	Numbers 0 to 9
	\w	Any Alphanumeric character
	\W	Any Non-alphanumeric character
	{m}	m Repetitions
	{m,n}	m to n Repetitions
	*	Zero or more repetitions
	+	One or more repetitions
	?	Optional character
	\s	Any Whitespace
	\S	Any Non-whitespace character
	^…$	Starts and ends
	(…)	Capture Group
	(a(bc))	Capture Sub-group
	(.*)	Capture all
	(abc|def)	Matches abc or def