When working with files and resources over a network, you will often come across URIs and URLs, which can be parsed and worked with directly. Most standard libraries have classes to parse and construct these kinds of identifiers, but if you need to match them in logs or a larger corpus of text, you can use regular expressions to pull information out of their structured format quite easily.
URIs, or Uniform Resource Identifiers, are a representation of a resource that is generally composed of a scheme, host, port (optional), and resource path, which appear in that order in the example below.
http://regexone.com:80/page
The scheme describes the protocol to communicate with, the host and port describe the source of the resource, and the full path describes the location at the source for the resource.
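As a point of comparison, here is a minimal sketch of the standard-library route mentioned earlier, using Python's `urllib.parse.urlsplit` on the example URI above. The variable name `parts` is only illustrative.

```python
from urllib.parse import urlsplit

# Split the example URI into its components
parts = urlsplit("http://regexone.com:80/page")

print(parts.scheme)    # the scheme (protocol)
print(parts.hostname)  # the host
print(parts.port)      # the port as an integer, or None if the URI has no port
print(parts.path)      # the resource path
```

This works well when you already have a single, clean URI in hand. Regular expressions become more useful when the URIs are buried in surrounding text.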
In the exercise below, try to extract the protocol, host, and port of all the resources listed.
Task | Text | Capture Groups
capture | ftp://file_server.com:21/top_secret/life_changing_plans.pdf | ftp file_server.com 21
capture | https://regexone.com/lesson/introduction#section | https regexone.com
capture | file://localhost:4040/zip_file | file localhost 4040
capture | https://s3cur3-server.com:9999/ | https s3cur3-server.com 9999
capture | market://search/angry%20birds | market search
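Solution: We have to match each of the three components: the protocol with (\w+)://, the host with ([\w\-\.]+), and the optional port with (:(\d+))?.
Putting it all together, the full regular expression (\w+)://([\w\-\.]+)(:(\d+))? captures all the data we are looking for. Note that the port number itself lands in the fourth group, because the third group also includes the leading colon.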
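To see the expression in action, the following sketch applies it to the five URIs from the exercise using Python's `re` module. The names `URI_PATTERN` and `extract_parts` are hypothetical helpers introduced here for illustration, not part of the lesson.

```python
import re

# The full expression from the solution above
URI_PATTERN = re.compile(r"(\w+)://([\w\-\.]+)(:(\d+))?")

def extract_parts(text):
    """Return (scheme, host, port) for the first URI found in text, or None."""
    match = URI_PATTERN.search(text)
    if match is None:
        return None
    # Group 3 is ":port" with the colon; group 4 holds only the digits.
    # The port is None when the URI does not specify one.
    return match.group(1), match.group(2), match.group(4)

urls = [
    "ftp://file_server.com:21/top_secret/life_changing_plans.pdf",
    "https://regexone.com/lesson/introduction#section",
    "file://localhost:4040/zip_file",
    "https://s3cur3-server.com:9999/",
    "market://search/angry%20birds",
]

for url in urls:
    print(extract_parts(url))
```

Because the function uses `search` rather than a full-string match, it can also pick a URI out of a longer line of text, such as a log entry. If you would rather have exactly three groups, one option is to make the outer port group non-capturing, as in `(?::(\d+))?`.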