When working with files and resources over a network, you will often come across URIs and URLs, which can be parsed and worked with directly. Most standard libraries have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or in a larger corpus of text, you can use regular expressions to pull information out of their structured format quite easily.
URIs, or Uniform Resource Identifiers, are a representation of a resource that is generally composed of a scheme, a host, an optional port, and a resource path, appearing in that order.
The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes the location at the source for the resource.
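As a rough illustration of the standard-library route mentioned earlier, here is a minimal sketch that assumes Python and its `urllib.parse` module (the lesson itself does not name a language). The helper name `split_resource` is hypothetical and exists only for this example.

```python
from urllib.parse import urlsplit


def split_resource(url):
    """Return (scheme, host, port, path) for a URL string."""
    parts = urlsplit(url)
    # .hostname excludes the port; .port is an int, or None when no port is given.
    return parts.scheme, parts.hostname, parts.port, parts.path


print(split_resource("ftp://file_server.com:21/top_secret/life_changing_plans.pdf"))
# ('ftp', 'file_server.com', 21, '/top_secret/life_changing_plans.pdf')
```

This works well when you already hold a clean URL string; the regular-expression approach is more useful when the URL is buried inside other text.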
In the exercise below, try to extract the protocol, host and port of all the resources listed.
|capture||ftp://file_server.com:21/top_secret/life_changing_plans.pdf||ftp file_server.com 21|
|capture||file://localhost:4040/zip_file||file localhost 4040|
|capture||https://s3cur3-server.com:9999/||https s3cur3-server.com 9999|
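We have to match each of the three components: the scheme with (\w+) followed by the literal ://, the host with ([\w\-\.]+), and the optional port with (:(\d+))?, where the inner group holds just the digits.
Putting it all together, the full regular expression (\w+)://([\w\-\.]+)(:(\d+))? captures all the data we are looking for.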
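The expression can be tried out with a short script. This is a sketch that assumes Python's `re` module; the function name `extract_parts` and the sample log line are made up for illustration. Note that the pattern as written has four groups, because the outer optional group also captures the colon, so the code below discards that one.

```python
import re

# The lesson's pattern, unchanged: scheme, host, optional ":port".
URL_PATTERN = re.compile(r"(\w+)://([\w\-\.]+)(:(\d+))?")


def extract_parts(text):
    """Return a list of (scheme, host, port) tuples found in text.

    port is a string of digits, or None when the URL has no port.
    """
    results = []
    for match in URL_PATTERN.finditer(text):
        # Group 3 is ":port" including the colon; group 4 is the digits alone.
        scheme, host, _colon_and_port, port = match.groups()
        results.append((scheme, host, port))
    return results


# Hypothetical log line containing one of the exercise URLs.
log_line = "fetched ftp://file_server.com:21/top_secret/life_changing_plans.pdf in 3s"
print(extract_parts(log_line))
# [('ftp', 'file_server.com', '21')]
```

If exactly three groups are preferred, one option is to make the outer group non-capturing, as in `(\w+)://([\w\-\.]+)(?::(\d+))?`, which matches the same text.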