Problem 8: Parsing and extracting data from a URL

When working with files and resources over a network, you will often come across URIs and URLs, which can be parsed and worked with directly. Most standard libraries have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or a larger corpus of text, you can use regular expressions to pull information out of their structured format quite easily.

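A URI, or Uniform Resource Identifier, represents a resource and is generally composed of a scheme, a host, an optional port, and a resource path, in that order. In file://localhost:4040/zip_file, for example, the scheme is file, the host is localhost, the port is 4040, and the path is /zip_file.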

The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes where the resource is located at that source.
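As a point of comparison with the regular expression approach, here is a minimal sketch of reading the same components with a standard-library parser. It assumes Python and its urllib.parse module, which the text above does not name. The URL is one of the exercise inputs.

```python
from urllib.parse import urlsplit

# Split one of the exercise URLs into its components.
parts = urlsplit("file://localhost:4040/zip_file")

print(parts.scheme)    # file
print(parts.hostname)  # localhost
print(parts.port)      # 4040 (an int, or None when no port is given)
print(parts.path)      # /zip_file
```

A parser like this is usually the better choice when you already hold a single, well-formed URL. Regular expressions become useful when the URLs are embedded in surrounding text.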

In the exercise below, try to extract the protocol, host and port of all the resources listed.

Exercise 8: Extracting data from URLs
Task     Text                            Capture Groups
capture                                  ftp 21
capture                                  https
capture  file://localhost:4040/zip_file  file localhost 4040
capture                                  https 9999
capture  market://search/angry%20birds   market search

We have to match each of the three components:

  • The protocols in our list are all alphanumeric, so they can be matched using (\w+)://
  • The hosts can contain non-alphanumeric characters like the dash or the period, so we have to include those characters explicitly using ://([\w\-\.]+)
  • The port is an optional part of the URI and is preceded by a colon, so it can be matched using (:(\d+))?

Putting it all together, the full regular expression (\w+)://([\w\-\.]+)(:(\d+))? captures all the data we are looking for. Note that the port digits land in the fourth group, because the third group also includes the colon.
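The expression above can be tried out on a line of text with a short script. This is a sketch only: it assumes Python's re module, and the function name extract_urls and the sample log line are hypothetical, made up for illustration around two of the exercise URLs.

```python
import re

# The full expression from the text: scheme, host, optional ":port".
URL_PATTERN = re.compile(r"(\w+)://([\w\-\.]+)(:(\d+))?")

def extract_urls(text):
    """Return (scheme, host, port) for every URL-like match in text.

    extract_urls is a hypothetical helper name, not part of any library.
    """
    results = []
    for m in URL_PATTERN.finditer(text):
        # Group 3 is ":port" including the colon; group 4 is the digits only.
        scheme, host, port = m.group(1), m.group(2), m.group(4)
        results.append((scheme, host, int(port) if port else None))
    return results

# Hypothetical log line built from two of the exercise URLs.
log = "GET file://localhost:4040/zip_file then open market://search/angry%20birds"
print(extract_urls(log))
# [('file', 'localhost', 4040), ('market', 'search', None)]
```

When the port is absent, the optional group does not participate in the match and group 4 is None, which the sketch passes through unchanged.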

Solve the above task to continue on to the next problem, or read the Solution.