RegexOne - Learn Regular Expressions - Problem 8: Parsing and extracting data from a URL

Problem 8: Parsing and extracting data from a URL

When working with files and resources over a network, you will often come across URIs and URLs, which can be parsed and worked with directly. Most standard libraries have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or a larger corpus of text, you can use regular expressions to pull information out of their structured format quite easily.

URIs, or Uniform Resource Identifiers, are a representation of a resource that is generally composed of a scheme, host, port (optional), and resource path, in that order, as in the example below.

http://regexone.com:80/page

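The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes the location of the resource at that source. In the example above, the scheme is http, the host is regexone.com, the port is 80, and the path is /page.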
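For comparison with the regular-expression approach used in this problem, here is a minimal sketch of how a standard library breaks the example URI into the same components. It assumes Python and its urllib.parse module; the variable name parts is only illustrative.

```python
from urllib.parse import urlsplit

# Split the example URI from this lesson into its components.
parts = urlsplit("http://regexone.com:80/page")

print(parts.scheme)    # http
print(parts.hostname)  # regexone.com
print(parts.port)      # 80 (an int, or None when no port is given)
print(parts.path)      # /page
```

A parser like this is usually the better choice when you already have a clean URL in hand. Regular expressions become useful when the URLs are buried in surrounding text.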

In the exercise below, try to extract the protocol, host, and port of all the resources listed.

Exercise 8: Extracting data from URLs

Task	Text	Capture Groups
capture	ftp://file_server.com:21/top_secret/life_changing_plans.pdf	ftp file_server.com 21
capture	https://regexone.com/lesson/introduction#section	https regexone.com
capture	file://localhost:4040/zip_file	file localhost 4040
capture	https://s3cur3-server.com:9999/	https s3cur3-server.com 9999
capture	market://search/angry%20birds	market search

Solution

We have to match each of the three components:

The protocols in our list are all alphanumeric, so they can be matched using (\w+)://
The hosts can contain non-alphanumeric characters like the dash or the period, so we will have to specifically include those characters using ://([\w\-\.]+)
The port is an optional part of the URI and is preceded by a colon, so it can be matched using (:(\d+)), where the inner group captures just the digits

To put it all together, we then have the full regular expression (\w+)://([\w\-\.]+)(:(\d+))? to capture all the data we are looking for.
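The full expression can be tried against the exercise inputs with a short script. This is a minimal sketch assuming Python's re module; the names URL_PATTERN, urls, and extract are made up for illustration. Note that the port digits land in group 4, because group 3 is the outer group that also contains the colon.

```python
import re

# The full expression from the solution above.
URL_PATTERN = re.compile(r"(\w+)://([\w\-\.]+)(:(\d+))?")

# The five inputs from the exercise table.
urls = [
    "ftp://file_server.com:21/top_secret/life_changing_plans.pdf",
    "https://regexone.com/lesson/introduction#section",
    "file://localhost:4040/zip_file",
    "https://s3cur3-server.com:9999/",
    "market://search/angry%20birds",
]


def extract(url):
    """Return (protocol, host, port), or None if nothing matches.

    The port is None when the URL does not specify one.
    """
    match = URL_PATTERN.search(url)
    if match is None:
        return None
    # Group 3 is ":port" including the colon; group 4 is the digits only.
    return match.group(1), match.group(2), match.group(4)


for url in urls:
    print(extract(url))
```

If the extra group for the colon gets in the way, one possible variation is to make it non-capturing, (?::(\d+))?, so that the port becomes group 3 instead.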


Lesson Notes

	abc…	Letters
	123…	Digits
	\d	Any Digit
	\D	Any Non-digit character
	.	Any Character
	\.	Period
	[abc]	Only a, b, or c
	[^abc]	Not a, b, nor c
	[a-z]	Characters a to z
	[0-9]	Numbers 0 to 9
	\w	Any Alphanumeric character
	\W	Any Non-alphanumeric character
	{m}	m Repetitions
	{m,n}	m to n Repetitions
	*	Zero or more repetitions
	+	One or more repetitions
	?	Optional character
	\s	Any Whitespace
	\S	Any Non-whitespace character
	^…$	Starts and ends
	(…)	Capture Group
	(a(bc))	Capture Sub-group
	(.*)	Capture all
	(abc|def)	Matches abc or def