Problem 4: Matching HTML

If you are looking for a robust way to parse HTML, regular expressions are usually not the answer, because HTML pages on the internet today are fragile: common mistakes such as missing end tags, mismatched tags, or an unclosed attribute quote would all derail a perfectly good regular expression. Instead, you can use libraries like Beautiful Soup or html5lib (both Python) or phpQuery (PHP), which not only parse the HTML but also let you walk the DOM quickly and easily.
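To illustrate why a parser copes better with sloppy markup, here is a minimal sketch. It uses Python's standard-library html.parser as a stand-in for the libraries named above (chosen only so the example has no dependencies). The class name TagCollector and the sample input are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records opening tag names and text runs."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.text.append(data)


collector = TagCollector()
# Deliberately sloppy input: the <span> is never closed.
collector.feed("<div>Hello <span>world</div>")
collector.close()

print(collector.tags)  # ['div', 'span']
print(collector.text)  # ['Hello ', 'world']
```

The parser still reports both tags and both text runs even though the markup is malformed, whereas a regular expression that expects a matching </span> would fail here.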

That said, there are often times when you want to quickly match tags and tag content in an editor, and if you can vouch for the input, regular expressions are a good tool for the job. As you can see in the examples below, some things to watch out for are odd attributes that have extra escaped quotes, and nested tags.

Go ahead and write regular expressions for the following examples.

Exercise 4: Capturing HTML Tags
Task     Text                                   Capture Groups
capture  <a>This is a link</a>                  a
capture  <a href=''>Link</a>                    a
capture  <div class='test_style'>Test</div>     div
capture  <div>Hello <span>world</span></div>    div

It is best practice to use a proper library to parse HTML, but to find simple tag names, you can use the expression <(\w+).
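As a quick check, the tag-name expression can be tried against the four exercise strings. This is a sketch using Python's re module. The variable names are illustrative only.

```python
import re

# Matches '<' followed by one or more word characters; the group is the tag name.
# Closing tags such as </a> are skipped because '/' is not a word character.
TAG_NAME = re.compile(r"<(\w+)")

samples = [
    "<a>This is a link</a>",
    "<a href=''>Link</a>",
    "<div class='test_style'>Test</div>",
    "<div>Hello <span>world</span></div>",
]

# search() returns the first match, which is the outermost opening tag.
first_tags = [TAG_NAME.search(s).group(1) for s in samples]
print(first_tags)  # ['a', 'a', 'div', 'div']

# findall() returns every opening tag, including nested ones.
all_tags = TAG_NAME.findall(samples[3])
print(all_tags)  # ['div', 'span']
```

Note that in the nested example the first match is the outer div, which is the capture the exercise asks for.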

You can also capture tag contents with >([\w\s]*)<, or even attribute values with ='([\w://.]*)' if desired (though that is not the goal of this problem).
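The content and attribute expressions can be tried the same way. In this sketch the variable names are illustrative, and the link with an example.com URL is a made-up input that is not part of the exercise.

```python
import re

# Text between '>' and '<', limited to word characters and whitespace.
CONTENT = re.compile(r">([\w\s]*)<")
# A single-quoted attribute value made of word characters, ':', '/' and '.'.
ATTR_VALUE = re.compile(r"='([\w://.]*)'")

contents = CONTENT.findall("<a>This is a link</a>")
print(contents)  # ['This is a link']

attr_values = ATTR_VALUE.findall("<div class='test_style'>Test</div>")
print(attr_values)  # ['test_style']

# Hypothetical input, not one of the exercise strings.
url_values = ATTR_VALUE.findall("<a href='http://example.com/page'>Link</a>")
print(url_values)  # ['http://example.com/page']
```

Both patterns rely on the input being tidy: the content pattern stops at punctuation inside the text, and the attribute pattern only handles single-quoted values.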
