Regular expressions let us not only match text but also extract information for further processing. This is done by defining groups of characters and capturing them with the special metacharacters ( and ). Any subpattern inside a pair of parentheses is captured as a group. In practice, this can be used to extract information such as phone numbers or email addresses from all sorts of data.
Imagine, for example, that you had a command-line tool that lists all the image files you have in the cloud. You could use a pattern such as ^(IMG\d+\.png)$ to capture and extract the full filename. If you only wanted the filename without the extension, you could instead use ^(IMG\d+)\.png$, which captures only the part before the period.
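As a rough illustration, the two patterns can be compared in Python using the standard `re` module. The filenames below are made up for this sketch and simply stand in for whatever the listing tool would print; the helper name `capture` is also just an illustrative choice.

```python
import re

# Hypothetical filenames, standing in for the output of the listing tool.
filenames = ["IMG0001.png", "IMG0420.png", "notes.txt"]

# The group wraps the whole name, extension included.
full_name = re.compile(r"^(IMG\d+\.png)$")
# The group stops before the period, so the extension is matched but not captured.
stem_only = re.compile(r"^(IMG\d+)\.png$")


def capture(pattern, text):
    """Return the first captured group, or None if the text does not match."""
    match = pattern.match(text)
    return match.group(1) if match else None


full = [capture(full_name, name) for name in filenames]
stems = [capture(stem_only, name) for name in filenames]

print(full)   # group 1 of the first pattern for each filename
print(stems)  # group 1 of the second pattern for each filename
```

Both patterns match exactly the same filenames; the only difference is where the parentheses sit, and therefore what ends up in group 1.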
Go ahead and use a capture group to write a regular expression that captures only the filename (without the extension) of each PDF file below.
Task | Text | Capture Groups
capture | file_record_transcript.pdf | file_record_transcript
capture | file_07241999.pdf | file_07241999
skip | testfile_fake.pdf.tmp |
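Solution | We only want to capture lines that start with "file" and end with the extension ".pdf", so we can write a simple pattern that captures everything from the start of "file" up to the extension: ^(file.+)\.pdf$. The ^ and $ anchors are what cause testfile_fake.pdf.tmp to be skipped, since it neither starts with "file" nor ends with ".pdf".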
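One way to check the solution is to run it against the three lines from the exercise. This is a minimal sketch in Python using the standard `re` module; the variable names are illustrative only.

```python
import re

# The solution pattern: capture everything from "file" up to the ".pdf" extension.
pattern = re.compile(r"^(file.+)\.pdf$")

# The three lines from the exercise table.
lines = [
    "file_record_transcript.pdf",
    "file_07241999.pdf",
    "testfile_fake.pdf.tmp",
]

results = {}
for line in lines:
    match = pattern.match(line)
    # group(1) is the text inside the parentheses; None marks a skipped line.
    results[line] = match.group(1) if match else None

for line, captured in results.items():
    print(line, "->", captured)
```

The first two lines yield the filename without the extension, and the third yields None because the anchors prevent a match.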