Join Kevin Skoglund for an in-depth discussion in this video, "What are regular expressions?", part of Learning Regular Expressions.
Let's begin by answering the question, what are regular expressions? The name is opaque; it doesn't really give its meaning away. Regular expressions are all about text. Take just a moment and realize how much text is all around you in our modern digital world: email, news stories, text messages, stock market data, computer code, contacts in your address book, tags of people in photographs--all these things are text. Regular expressions are a tool that allows us to work with this text by describing text patterns.
So a regular expression is a set of symbols that describes a text pattern. Now, that's the singular. When we see it pluralized, what we're talking about is the formal language of these symbols, which needs to be interpreted by a regular expression processor. We'll talk a bit more about processors in just a moment, but the processor is what's going to use those symbols to allow us to match, search, and manipulate text. Now let's also take a moment and talk about what regular expressions are not. They're not a programming language. They may seem similar because they are a formal language with a defined set of rules that gets a computer to do what we want it to do.
Most programming languages use regular expressions, and programmers probably use them the most, but there are no variables and you can't add 1+1. It's not a programming language. What they are is symbols that describe a text pattern, and that's it. Frequently, you'll hear them called regex for short. Sometimes you'll see it written with a p at the end, as regexp, but that's really not that common; more often you see it without. You'll hear me say regex throughout this tutorial, and you'll even hear it pluralized as regexes. It's just a lot shorter and simpler to say than regular expressions, which is a bit of a mouthful.
Next let's talk about ways that you might use regular expressions to work with text. You might use them to test if a phone number has the correct number of digits, or if an email address is in a valid format. You could search a document for color spelt either with or without the U. You could search a document and replace all occurrences of Bob, Bobby, or "B." with Robert. You could count the number of times in a document that training is immediately preceded by the words "computer," "video," or "online," counting training only in those cases, only when one of those words precedes it.
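To make a few of those uses concrete, here's a minimal sketch in Python using its built-in re module. The sample text and the variable names are made up for illustration, and the symbols themselves are explained later in the course, so treat this as a preview rather than the one right way to write these patterns.

```python
import re

# Made-up sample text for illustration.
text = (
    "The color red and the colour blue. "
    "Bob, Bobby, and B. took online training and video training, "
    "but skipped the training manual."
)

# "u?" makes the letter u optional, so both spellings match.
spellings = re.findall(r"colou?r", text)

# Replace Bob, Bobby, or "B." with Robert. The longer name is listed
# first so "Bobby" is not matched as "Bob" with "by" left over.
renamed = re.sub(r"\b(?:Bobby|Bob|B\.)", "Robert", text)

# Count "training" only when one of the three words comes right before it.
training_count = len(re.findall(r"\b(?:computer|video|online) training\b", text))

print(spellings)
print(renamed)
print(training_count)
```

Notice that the last pattern skips "the training manual", because none of the three words comes immediately before training there.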
You could use it to convert a tab-delimited file into a comma-delimited file or to find duplicate words in a text. In each of these cases, we're going to use a regular expression to write up a description of what we're looking for using symbols. In the case of a phone number, that pattern might be three digits followed by a dash, followed by three digits and another dash, followed by four digits. Once we've defined our pattern then the regex processor will use our description to return matching results, or in the case of the test, to return true or false for whether or not it matched.
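Here's roughly what that phone number description looks like once it's written in symbols, again as a hedged sketch in Python. The names PHONE and is_phone_number and the sample strings are hypothetical, chosen just for this example, and the tab-to-comma line is deliberately naive: it doesn't handle fields that already contain commas or quotes.

```python
import re

# Three digits, a dash, three digits, another dash, four digits.
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")


def is_phone_number(s: str) -> bool:
    """Return True only if the whole string fits the pattern."""
    return PHONE.fullmatch(s) is not None


# Used as a test: true or false for whether it matched.
ok = is_phone_number("555-867-5309")
bad = is_phone_number("555-8675-309")

# Used as a search: return the matching results.
found = PHONE.findall("Call 555-867-5309 or 555-123-4567 today.")

# Naive tab-delimited to comma-delimited conversion.
csv_line = re.sub(r"\t", ",", "name\tcity\tzip")

# Find duplicate words: a word, a space, then the same word again (\1).
dupes = re.findall(r"\b(\w+) \1\b", "this is is a test of the the pattern")

print(ok, bad, found, csv_line, dupes)
```

The test uses fullmatch so that extra characters before or after the number make it fail, while the search uses findall to pull every number out of a longer piece of text.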
Now that word, matches, is a key word. We're going to be using it a lot. A regular expression matches text if it correctly describes the text. You can also flip it around and say that text matches a regular expression if it is correctly described by the expression. So you'll hear it both ways. Whether something matches your regex is the question we're going to be asking a lot. Does it match, does it not match? We're going to learn to write all these examples and more, but before we begin learning the symbols that are required to write these expressions, let's first take a look at the history of regular expressions and get set up with an environment where we can test them out.
- Creating flexible patterns using character sets
- Achieving efficiency when using repetition
- Understanding different types of search strategies
- Writing logical and efficient alternations
- Capturing groups and reusing them with backreferences
- Developing complex patterns with lookaround assertions
- Working with Unicode and multibyte characters
- Matching email addresses, URLs, dates, HTML tags, and credit card numbers
- Using search and replace to format a document