Regular expressions are tricky to write and to test. You can make it easier by installing the regex-posix library.
- [Instructor] The purpose of regular expressions is to represent a pattern that can be identified within some text data. In the context of data analysis, there are a couple of important uses for regular expressions. One, to validate fields to make sure that all values within a particular column adhere to a particular format. And two, to search fields based on a particular pattern. Word processors and editor applications have a find and replace feature. You submit a bit of text to identify within a larger bit of text and the desired replacement.
The application will replace all of the found text with the desired text. Many of these applications now include regular expression support. Rather than submitting an exact sequence of characters which need to be found, we submit a pattern. This pattern defines what is considered valid or not using regular expressions. So, regular expressions are a mini language. They aren't limited to the Haskell language. Once you understand regular expressions, you should be able to translate that knowledge into other programming languages. Each programming language may implement the mini language of regular expressions with slight variations.
So you will need to properly test your expressions when moving from language to language. In this section, we are going to look at understanding the mini language of regular expressions. That includes Dots and Pipes, Atoms and Atom Modifiers, and Character Classes. In the last two videos of this section we are going to use regular expressions in real applications such as looking at regular expressions with a CSV file and looking at regular expressions with a SQLite3 database.
So, in this video we are going to cover two basic bits of regular expression syntax and those are Dots and Pipes. So, to begin our video, we are going to install the regular expression library in Haskell and we are going to introduce the dot and the pipe syntax. So, let's get started. I'm going to move over to my virtual machine and let's find the terminal, and we need to begin by installing the library and that is done with cabal install regex-posix.
It'll take a moment for this library to install. Great, we're done. Let's create a new notebook and dive in. I'm going to begin a new notebook. I'm going to rename this regex learning, and we need to import the Text.Regex.Posix module. And that will give us access to the equals-tilde operator (=~), which is what we use to match regular expressions against a string.
Let's define a couple of strings in order to get us started. String one is going to be one fish two fish red fish blue fish, the title of a popular Dr. Seuss book that I like to use when teaching regular expressions. And our second is going to be a classic, the quick brown fox jumps over the lazy dog. Alright. Now that we have a couple of strings and we have our library imported we now have access to the equals-tilde operator (=~), which can be used to evaluate if a pattern exists in a string.
So, let's do a quick couple of examples. This is going to be a very simple does a string exist inside of another string. And so I can say str1 =~ "one" :: Bool. And what we're asking is, does the substring one exist inside of str1? And we can quickly see yes it does. We're going to do this again with string two. So, does the string one exist in string two? No, it doesn't.
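The import, the two strings, and the two substring checks described so far could be put together as a small standalone program. This is only a sketch: the video works in a notebook, and the names oneInStr1 and oneInStr2 are made up here for illustration.

```haskell
import Text.Regex.Posix ((=~))

-- The two sample strings from the video.
str1, str2 :: String
str1 = "one fish two fish red fish blue fish"
str2 = "the quick brown fox jumps over the lazy dog"

-- Asking (=~) for a Bool result only reports whether the pattern
-- matches somewhere in the string.
-- (oneInStr1 and oneInStr2 are hypothetical names for this sketch.)
oneInStr1, oneInStr2 :: Bool
oneInStr1 = str1 =~ "one"
oneInStr2 = str2 =~ "one"

main :: IO ()
main = do
  print oneInStr1  -- True: str1 begins with "one"
  print oneInStr2  -- False: "one" never appears in str2
```

In a notebook or in GHCi, the same check can be typed directly as str1 =~ "one" :: Bool, where the type annotation tells the operator what kind of result we want back.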
Alright. So let's go over our first bit of regular expression syntax, and that is the dot, also called the period. The dot matches any one character. So we know that the word one exists in string one. But what about a different expression? So if I say str1 =~ "o.e" :: Bool. And so what we're asking is, does the sequence o, followed by any one character, followed by e exist in string one? Well we already know one, O-N-E, exists there, so yes this should evaluate to True.
We can do the same thing again with str2 =~ "o.e" :: Bool. Does that sequence exist anywhere in the string? And that also is True because the letters O-V-E exist inside of the string, in the word over. I'm going to highlight ove, the match inside this string, and that's what's matching in string two in order to resolve this as True. The dot matches any single character.
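To see which text the dot pattern is picking up, one option is to ask the operator for a String result instead of a Bool, which returns the first matching piece of text. The following is a sketch under that assumption, with illustrative names (dotInStr1, firstDotMatch2, and so on) that do not come from the video.

```haskell
import Text.Regex.Posix ((=~))

str1, str2 :: String
str1 = "one fish two fish red fish blue fish"
str2 = "the quick brown fox jumps over the lazy dog"

-- Does "o", then any single character, then "e" occur in each string?
-- (All names below are hypothetical, chosen for this sketch.)
dotInStr1, dotInStr2 :: Bool
dotInStr1 = str1 =~ "o.e"
dotInStr2 = str2 =~ "o.e"

-- With a String result type, (=~) returns the first matched text.
firstDotMatch1, firstDotMatch2 :: String
firstDotMatch1 = str1 =~ "o.e"
firstDotMatch2 = str2 =~ "o.e"

main :: IO ()
main = do
  print dotInStr1          -- True
  print dotInStr2          -- True
  putStrLn firstDotMatch1  -- one
  putStrLn firstDotMatch2  -- ove
```

The second match comes from the word over, which is the text highlighted in the video.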
The second regular expression character that we would like to introduce in this video is the pipe. The pipe character is the vertical bar that appears over the Enter key on most keyboards. Many programming languages use the pipe to represent or, and regular expressions are no different. We can put a pipe between two expressions, and that means that either the first or the second expression is valid. So let's do a quick example. So str1 =~ "fish|fox" :: Bool. Does the word fish or fox appear in the string?
And we know the word fish appears several times in our first string. So yeah, this is going to be True. We could also do the same with str2 =~ "fish|fox" :: Bool. We know that the word fox appears in our second string, so of course this also results in True. We can look again at dog or cat in our first string, and that is False. And we can do the same with our second string.
So, neither dog nor cat appears in the first string, so that one results in False. The second string, though, ends with the word dog, so there the same pattern results in True. So in this video, we installed the regular expression library and we looked at two symbols within the regular expression syntax, the dot and the pipe. The dot represents any one character, and the pipe means that either of two expressions can match. I didn't demonstrate this, but we can also chain multiple expressions with pipe so that any one of the expressions in the pipe chain can match.
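The pipe examples, together with the chained form that was mentioned but not demonstrated, might look like this as a small program. It is a sketch only; the names fishOrFox1, dogOrCat1, and chained1 are invented for illustration, and the chained pattern dog|cat|fish is one possible example rather than something shown in the video.

```haskell
import Text.Regex.Posix ((=~))

str1, str2 :: String
str1 = "one fish two fish red fish blue fish"
str2 = "the quick brown fox jumps over the lazy dog"

-- Either alternative on its own is enough for a match.
-- (All names below are hypothetical, chosen for this sketch.)
fishOrFox1, fishOrFox2 :: Bool
fishOrFox1 = str1 =~ "fish|fox"
fishOrFox2 = str2 =~ "fish|fox"

-- Neither alternative occurs in the first string.
dogOrCat1 :: Bool
dogOrCat1 = str1 =~ "dog|cat"

-- Pipes can be chained; any one alternative in the chain may match.
chained1 :: Bool
chained1 = str1 =~ "dog|cat|fish"

main :: IO ()
main = do
  print fishOrFox1  -- True: "fish" appears in str1
  print fishOrFox2  -- True: "fox" appears in str2
  print dogOrCat1   -- False
  print chained1    -- True: the added "fish" alternative matches
```

Adding fish to the chain is what turns the False result for the first string into True.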
In our next video, we'll be looking at simple modifiers with regular expressions and understanding what an atom is.
Note: This course was created by Packt Publishing. We are pleased to host this training in our library.
- Data ranges, means, and medians
- Standard deviation
- SQLite3 command line
- Slices of data
- Regular expressions
- Atoms and modifiers
- Character classes
- Line plots of a single variable
- Plotting a moving average
- Feature scaling
- Scatter plots
- Normal distribution
- Kernel density estimation (KDE)