Join Scott Simpson for an in-depth discussion in this video, Working with text: Regular expressions, part of Linux Tips Weekly.
- [Narrator] Patterns are an important part of working with text. And while humans are great at seeing and discovering patterns, computers need a little bit of help. Across many programming languages and in many scripting languages, like Bash on a Linux machine, it's useful to be able to tell a computer how to look for a given pattern. One way of representing patterns to a computer is by using regular expressions. The term "regular expression" can sound a little strange. The name comes from the mathematical roots of computer science, where formal language theory describes regular languages, which are languages that can be described by a regular grammar.
A regular grammar, more or less, describes a language according to a set of rules. Regular expressions let us write a statement that a computer can use to match patterns in text. Regular expressions, or regexes, as they're often called, are helpful in looking for certain types of information in text files, such as dates, email addresses, phone numbers, and any other kind of information that fits a particular pattern. You might see something like this in a list of email addresses and immediately recognize that it's not valid. But a program sending email using this list might not know any better, and would still try to compose a message and send it, only to get an error back from a mail transport agent.
We need to tell the system that a valid email address must not have a space in it, and that the top-level domain needs to be more than one character long. So we can write a rule to share our understanding of what a valid email address looks like with a computer and, generally speaking, we can describe most everyday text-based patterns with a regular expression. To write a regular expression, we can use a set of metacharacters, or characters that describe characters. And in some cases, we'll use literal characters as well. Some common metacharacters that we'll use in a regex are dot, question mark, plus, and star.
Dot, or the period character, represents one instance of any character except a newline. The question mark matches zero or one of the character or expression that precedes it. Plus matches one or more of whatever precedes it, and star, or asterisk, matches zero or more of whatever precedes it. You'll also see parentheses, used for grouping; curly braces, used to indicate how many of a given expression we want to match; and the caret and dollar sign, which represent the beginning and the end of a line of text. We can be a little more precise about the characters we want to match by specifying literal or explicit characters.
We can also set ranges of characters using square brackets. In a regular expression, letters are case sensitive by default. So lowercase a through z here would only match lowercase letters. And to match uppercase letters too, we'd need to add capital A through Z in there as well. This will match any of the unaccented Latin letters. These ranges also work with numbers. These make up most of the basics of writing a regex, though there are other elements you can use too. Let's take a look at using regex in the shell, using the grep command.
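To tie these building blocks back to the email example from earlier, here is a rough sketch of the two rules mentioned (no spaces, and a top-level domain longer than one character). The sample addresses and the variable names are made up for illustration, and the pattern is deliberately simple rather than a complete email validator. It uses grep with the -E option, which the episode turns to next.

```shell
# Hypothetical sample addresses; none of these come from the episode.
addresses='scott@example.com
bad address@example.com
someone@example.c'

# A deliberately simple rule, not a full email validator:
#   ^[^[:space:]@]+   one or more characters that are not whitespace or @
#   @                 a literal @
#   [^[:space:]@]+    the domain name, again without whitespace or @
#   \.                a literal dot (escaped, since a bare dot means "any character")
#   [A-Za-z]{2,}$     a top-level domain of at least two letters, then end of line
email_pattern='^[^[:space:]@]+@[^[:space:]@]+\.[A-Za-z]{2,}$'

# Lines that fit the rule
valid=$(printf '%s\n' "$addresses" | grep -E "$email_pattern")

# Lines that do not fit the rule (-v inverts the match)
invalid=$(printf '%s\n' "$addresses" | grep -Ev "$email_pattern")

echo "valid:"
echo "$valid"
echo "invalid:"
echo "$invalid"
```

Only the first address should come out as valid here: the second contains a space, and the third has a one-letter top-level domain.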
Grep is a powerful tool for matching text strings, and it's often used in conjunction with long text outputs, like file listings or log files, in order to search for or filter by certain information. Without any options, grep uses basic regular expressions, where dot, star, and square brackets work but plus, question mark, parentheses, and curly braces are treated as ordinary characters. With the dash capital E option (-E), we can use extended regular expressions and see what they match. Let's use a text file that I have here to experiment with some basics. Alright, grep dash capital E, and then in quotes I'll put T, period, and then the file name.
And this will match the letter T and one character after it. Or I can switch it around, writing dot T, and match any combination of a character followed by a T. (keyboard clicking) Adding a star into the mix, we can start to match a broader number of characters. This is T followed by zero or more of any character, which matches to the end of each line.
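The three patterns just described can be tried without the file from the episode. This sketch assumes a lowercase t (the transcript doesn't say which case was typed) and uses a made-up sample piped in through printf in place of the file name. The -o option, which prints only the matched part of a line, is available in GNU and BSD grep but isn't required by POSIX.

```shell
# Hypothetical sample text standing in for the file used in the episode.
sample='cat
to
a tidy line
dog'

# "t." needs a t followed by one more character,
# so "cat" (t at the very end of the line) is skipped.
t_then_any=$(printf '%s\n' "$sample" | grep -E 't.')

# ".t" needs some character before the t,
# so "to" (t at the very start of the line) is skipped.
any_then_t=$(printf '%s\n' "$sample" | grep -E '.t')

# "t.*" runs from the first t to the end of the line;
# -o prints only the matched part instead of the whole line.
t_to_end=$(printf '%s\n' "$sample" | grep -oE 't.*')

printf '%s\n' "$t_then_any" "---" "$any_then_t" "---" "$t_to_end"
```

The first two commands print whole matching lines, the way grep does by default. The third prints just the matched text, which makes it easier to see that the match reaches the end of each line.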
And if we add another explicit character, this gives us a finite end to the matching. So this will match any sequence that starts with T and runs through any number of characters up to the last T on the line, because the star matches as much as it can. I can use a range of characters too. (keyboard clicking) This will match any of the characters from T through Z.
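Here is a small sketch of those two patterns, again assuming lowercase letters and a made-up sample line rather than the file from the episode. Setting LC_ALL=C is an extra precaution that isn't mentioned in the video: it keeps the bracket range in plain ASCII order, since some locales sort letters differently.

```shell
# Hypothetical sample line; not taken from the episode's file.
line='the cat sat on a rug'

# "t.*t": the star takes as much as it can, so the match runs
# from the first t on the line to the last t on the line.
first_to_last_t=$(printf '%s\n' "$line" | grep -oE 't.*t')

# "[t-z]": any single character from t through z.
# With -o, each match is printed on its own line.
t_through_z=$(printf '%s\n' "$line" | LC_ALL=C grep -oE '[t-z]')

echo "$first_to_last_t"
echo "$t_through_z"
```

The first command should print "the cat sat", and the second should print the three t's and the u from "rug", one per line.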
And adding some curly braces, I can ask for only those strings of characters that match, say, two of whatever my expression asks for. In this case, groups of two consecutive letters from T through Z. That's a quick look, and that's it for this episode. Regular expressions are a huge topic, and they can get a lot more complex and a lot more powerful. Be sure to check out our courses that focus on regular expressions for a deeper dive. You'll find that an understanding of regexes will make a lot of tasks easier as you work with the command line or as a developer.
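As a follow-up to the curly-brace example in this episode, here is one way to try it. The sample line is made up, lowercase letters are assumed, and LC_ALL=C is added only to keep the range in ASCII order.

```shell
# Hypothetical sample line; not taken from the episode's file.
line='two extra yaks'

# "[t-z]{2}": exactly two consecutive characters, each from t through z.
# -o prints each matched pair on its own line.
pairs=$(printf '%s\n' "$line" | LC_ALL=C grep -oE '[t-z]{2}')

echo "$pairs"
```

This should print "tw" (from "two") and "xt" (from "extra"). The y in "yaks" is in range, but the a after it is not, so there is no pair to match there.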