Easy-to-follow video tutorials help you learn software, creative, and business skills.Become a member

Using repetition efficiently

From: Using Regular Expressions

Video: Using repetition efficiently

The use of repetition metacharacters has a big impact on the efficiency of the regex parsing, mostly because now the matched string can be of an indeterminate length. So the parser can assume that one part of the expression, like a character set, equals one character. Now it's going to have to move back and forth along the string, trying many different possibilities. Let's look at an example. Let's say that we just have a simple word character class that's repeated one or more times, followed by a literal character s. Our string is just going to be We picked apples.

Using repetition efficiently

The use of repetition metacharacters has a big impact on the efficiency of the regex parsing, mostly because now the matched string can be of an indeterminate length. So the parser can assume that one part of the expression, like a character set, equals one character. Now it's going to have to move back and forth along the string, trying many different possibilities. Let's look at an example. Let's say that we just have a simple word character class that's repeated one or more times, followed by a literal character s. Our string is just going to be We picked apples.

So the parser starts at the beginning, and it says, all right, I have got a W. That's a word character. That works fine. It goes to the e, so it's still a word character. Remember it's greedy, so it's still trying to consume each of these things in that first expression, the word character. Then it gets to the space, and it says okay, that's not a word character and that's not an s, so it fails both tests; therefore this didn't work out. At that point it backtracks to the e. Then it says, all right, let's try this possibility: what if this is the first expression? That first expression matches--it's a word character--so then it goes to the s. Still no s.

So now it looks again, and it says, all right, the space is neither a word character or an s, so then it moves to the p. All right, so let's watch it here. Just really make sure you understand it. It tries the p. It says, all right, that's a word character, the i is a word, the c is a word character, the k is a word character, the e and the d. When it finally gets to the space, it says, oops! This is not a word character. Now let's move to the second part of the expression, that's the s, it's not an s either. Now we backtrack to the i. We tried the p. We tried all those possibilities, so now we backtrack to the i-- very important. Now we try again.

So we say, all right, does this match? And it moves along the first part of the expression till it gets to the space, and it says this is not a word character, then checks and says, oops, it's not an s either, so now it backtracks back to the c. It starts the same thing over there again, till it gets to the space, and then it backtracks to the k, and so on. Now you may be able to just look at this and see, well, that's not a very efficient way to do it, but the regex parser doesn't necessarily know that. We have picked a very simple example here where that first expression is just a simple word character, and we can look at this whole thing globally with our eyes and see that there's a better approach to it.

But it can't do that. It can't have that kind of global view of the whole thing; it has to move down the line and has to process it bit by bit. Because the regular expressions can be complex, it can't make assumptions ahead of time about what will and won't work; it has to try them out. So then when it finally gets to the d and to the space, it realizes that that didn't work out and then it works its way to the a, and of course it moves its way along the a until it gets to the s, which works, and we have a match. What I want you to see there is that backtracking that it does--that's the important part. Let's now try another example. Let's look at what happens when we have the asterisk or star instead of the plus.

Let's go ahead and just fastforward to the work picked. I think it's easier to see what happens when we move there. Now like the plus, the star is also greedy by default, so it's going to want to consume as many word characters as it can. So it's going to start out and it's going to do the exact same thing that the previous version did. But here's the difference. It gets to the last character and says, okay, this is not a word character and it's not an s. Before, it backtracked to the i, right? This time it backtracks to the p. Why? Because remember, there is a second possibility for the asterisk, which is the possibility that it didn't match anything.

So it goes back and says, all right, what if this p was not anything? What if I don't consume that? What if instead I take the lazy strategy and have it be zero and then check the next character and see if that would be a s? It's not, so it failed. So now it doesn't need to backtrack. It's tried both its strategies on p: it's tried the greedy strategy and the lazy strategy. Now it says, all right, let's start with I. Is i a word character? Yes it is. It moves its way along and goes back and pretty much works the same way that the plus did.

But, it pops back to the i this time instead of the c. So it goes back one character further, because it has the second possibility, a possibility that it didn't match anything. Now you may be wondering, well, what if we used the lazy plus or the lazy star in these cases? Well, they don't do much to help. It does the same checking; it just reverses the priority in which it does the checking. Laziness on its own won't solve efficiency issues, unless it just happens that the data is structured in a way so that one strategy would find the result faster. Remember, I gave you the analogy of searching your house for lost car keys and sunglasses.

Well if that happens that your keys and sunglasses are both in the living room then obviously any strategy that starts searching in the living room is going to be the better strategy. But you won't necessarily know that ahead of time. So in a single instance, greediness or laziness may get you faster results, but as a global principle, you don't know that. Now you may think that efficiency doesn't matter that much, but you do need to be conscious of it. Now in a simple string like this with a simple regular expression like this, the computer can calculate this in a blink of an eye. But with a complicated regular expression and let's say a 2,000-page document, then an inefficient regular expression can take many, many times longer.

So let's talk about what we can do to make our regular expressions more efficient when we are working with repetition. The basic rule is that efficient matching plus less backtracking equals speedy results. So, ways that we can get better matching is to define the quantity of our repeated expressions. We just saw that using the plus is faster than using asterisk. Even better is if we know how many times something occurs, let's tell it. Let's give the regex engine that information because it will help it speed it along. Now you won't always know that, but if you were looking for a five-letter word, well then tell it, I'm looking for a five letter word.

Don't let it just guess it for itself. You can also narrow the scope of the repeated expression. So for example, instead of just looking for a wildcard that's repeated, instead, if you're looking for letters, tell it, I'm looking for letters. Capital letters, lowercase letters, that's what allowed here. That rules out a lot of other stuff and saves it a lot of time, because the matching is now much more concise, and that also reduces a lot of false matches. By narrowing our scope, that's also just a good principle in general, to make sure we find what we're looking for and not things that we are not looking for.

And the last thing is you can provide clearer starting and ending points. Let's say for example that you were looking for something that was inside angle brackets like an HTML tag. Well, you might first think, well, what I'm looking for is the less than sign, followed by any set of characters, until we get to a greater than sign. That's what an HTML tag looks like. But an even better way to do that would be to say, I'm looking for any character which is not an ending square bracket. I have more narrowly defined it to say, really anything here that's not the square bracket is okay.

Another good way to handle this is with something that we are going to learn in a later chapter, which is using anchors and word boundaries, that is, to tell it where the start and end of the line are and where the start and end of the word should be, and that will save you a lot of time as well. Let's take a look at these tips in light of the example that we just saw. So we just saw that, for example, if you had the star that it could be improved by using the plus instead. That's already going to be an efficiency improvement just by doing that. We could further improve the plus by more narrowly defining it. Now it might not be a big improvement here, but we could say, all right, not just any word character, but capital letters and lowercase letters.

That's what we are looking for. That rules out digits and underscores. It might save us a little bit of time. Even better is if we knew we were looking for a lowercase word, we could just say, okay, lowercase letters only, and that way, for example, in our example we could have skipped over the word we and it wouldn't have done that. Or if we knew that we were looking for words that were capitalized, well, tell it that. Say the first letter should be a capital and what's after that should be lowercase. Again, just trying to be as specific as possible. Ask yourself, what I am really searching for, and can I narrow the scope of my expression to get a faster match? And then as I was saying, you could search for whole words only.

So one way to search for whole words would be to use spaces. You could say, all right, I am looking for something that has a space at the beginning or a space at the end. The only problem with that is that then it includes the space in your match. It would be returned in your match. Not necessarily a problem, but just be aware that it would be there. We can use anchors and word boundaries to get around that problem later on. But by using the whole words, by telling it not to scan every single possibility, well then it would scan the word picked, but it would not scan icked, right? Because it would say, oh, that's not the start of the word.

So by being just more specific in saying, I'm looking for something that has a space in front of it, well, now it would scan for space picked, but then when I got to that p, it would say oops, that's not a space, move on. i, not a space, move on. And it would just keep working its way down the line. So in a six-letter word like picked you would cut down the amount of testing that it did by one sixth. Now I am not trying to give you a single recipe that's going to work for you in all cases. You are going to have to use your brainpower and figure out, how can I make this particular regular expression better? What I want to do through is instill in you a sense of the importance of looking at the efficiency of these regular expressions, especially now that we are using repetition, so that you can write good, concise, and efficient regular expressions.

Show transcript

This video is part of

Image for Using Regular Expressions
Using Regular Expressions

59 video lessons · 12449 viewers

Kevin Skoglund
Author

 
Expand all | Collapse all
  1. 2m 18s
    1. Welcome
      56s
    2. Using the exercise files
      1m 22s
  2. 19m 55s
    1. What are regular expressions?
      3m 20s
    2. The history of regular expressions
      6m 40s
    3. Regular expression engines
      2m 44s
    4. Installing an engine
      4m 5s
    5. Notation conventions and modes
      3m 6s
  3. 21m 23s
    1. Literal characters
      6m 39s
    2. Metacharacters
      2m 1s
    3. The wildcard metacharacter
      4m 31s
    4. Escaping metacharacters
      4m 53s
    5. Other special characters
      3m 19s
  4. 31m 26s
    1. Defining a character set
      5m 49s
    2. Character ranges
      4m 49s
    3. Negative character sets
      4m 53s
    4. Metacharacters inside character sets
      5m 12s
    5. Shorthand character sets
      6m 30s
    6. POSIX bracket expressions
      4m 13s
  5. 36m 38s
    1. Repetition metacharacters
      7m 17s
    2. Quantified repetition
      6m 59s
    3. Greedy expressions
      6m 27s
    4. Lazy expressions
      6m 46s
    5. Using repetition efficiently
      9m 9s
  6. 20m 24s
    1. Grouping metacharacters
      4m 14s
    2. Alternation metacharacter
      4m 54s
    3. Writing logical and efficient alternations
      7m 33s
    4. Repeating and nesting alternations
      3m 43s
  7. 19m 19s
    1. Start and end anchors
      7m 21s
    2. Line breaks and Multiline mode
      4m 41s
    3. Word boundaries
      7m 17s
  8. 23m 33s
    1. Backreferences
      8m 57s
    2. Backreferences to optional expressions
      3m 51s
    3. Finding and replacing using backreferences
      7m 16s
    4. Non-capturing group expressions
      3m 29s
  9. 32m 31s
    1. Positive lookahead assertions
      6m 39s
    2. Double-testing with lookahead assertions
      7m 16s
    3. Negative lookahead assertions
      6m 10s
    4. Lookbehind assertions
      6m 26s
    5. The power of positions
      6m 0s
  10. 13m 13s
    1. About Unicode
      4m 19s
    2. Unicode in regular expressions
      4m 41s
    3. Unicode wildcards and properties
      4m 13s
  11. 1h 55m
    1. How to use this chapter
      5m 38s
    2. Matching names
      6m 33s
    3. Matching postal codes
      8m 54s
    4. Matching email addresses
      5m 0s
    5. Matching URLs
      8m 1s
    6. Matching decimal numbers and currency
      6m 45s
    7. Matching IP addresses
      7m 10s
    8. Matching dates
      7m 49s
    9. Matching times
      8m 59s
    10. Matching HTML tags
      8m 34s
    11. Matching passwords
      6m 49s
    12. Matching credit card numbers
      9m 36s
    13. Finding words near other words
      6m 38s
    14. Formatting with Search and Replace, pt. 1
      7m 22s
    15. Formatting with Search and Replace, pt. 2
      4m 15s
    16. Formatting with Search and Replace, pt. 3
      7m 10s
  12. 47s
    1. Goodbye
      47s

Start learning today

Get unlimited access to all courses for just $25/month.

Become a member
Sometimes @lynda teaches me how to use a program and sometimes Lynda.com changes my life forever. @JosefShutter
@lynda lynda.com is an absolute life saver when it comes to learning todays software. Definitely recommend it! #higherlearning @Michael_Caraway
@lynda The best thing online! Your database of courses is great! To the mark and very helpful. Thanks! @ru22more
Got to create something yesterday I never thought I could do. #thanks @lynda @Ngventurella
I really do love @lynda as a learning platform. Never stop learning and developing, it’s probably our greatest gift as a species! @soundslikedavid
@lynda just subscribed to lynda.com all I can say its brilliant join now trust me @ButchSamurai
@lynda is an awesome resource. The membership is priceless if you take advantage of it. @diabetic_techie
One of the best decision I made this year. Buy a 1yr subscription to @lynda @cybercaptive
guys lynda.com (@lynda) is the best. So far I’ve learned Java, principles of OO programming, and now learning about MS project @lucasmitchell
Signed back up to @lynda dot com. I’ve missed it!! Proper geeking out right now! #timetolearn #geek @JayGodbold
Share a link to this course

What are exercise files?

Exercise files are the same files the author uses in the course. Save time by downloading the author's files instead of setting up your own files, and learn by following along with the instructor.

Can I take this course without the exercise files?

Yes! If you decide you would like the exercise files later, you can upgrade to a premium account any time.

Become a member Download sample files See plans and pricing

Please wait... please wait ...
Upgrade to get access to exercise files.

Exercise files video

How to use exercise files.

Learn by watching, listening, and doing, Exercise files are the same files the author uses in the course, so you can download them and follow along Premium memberships include access to all exercise files in the library.


Exercise files

Exercise files video

How to use exercise files.

For additional information on downloading and using exercise files, watch our instructional video or read the instructions in the FAQ .

This course includes free exercise files, so you can practice while you watch the course. To access all the exercise files in our library, become a Premium Member.

Are you sure you want to mark all the videos in this course as unwatched?

This will not affect your course history, your reports, or your certificates of completion for this course.


Mark all as unwatched Cancel

Congratulations

You have completed Using Regular Expressions.

Return to your organization's learning portal to continue training, or close this page.


OK
Become a member to add this course to a playlist

Join today and get unlimited access to the entire library of video courses—and create as many playlists as you like.

Get started

Already a member ?

Become a member to like this course.

Join today and get unlimited access to the entire library of video courses.

Get started

Already a member?

Exercise files

Learn by watching, listening, and doing! Exercise files are the same files the author uses in the course, so you can download them and follow along. Exercise files are available with all Premium memberships. Learn more

Get started

Already a Premium member?

Exercise files video

How to use exercise files.

Ask a question

Thanks for contacting us.
You’ll hear from our Customer Service team within 24 hours.

Please enter the text shown below:

The classic layout automatically defaults to the latest Flash Player.

To choose a different player, hold the cursor over your name at the top right of any lynda.com page and choose Site preferences from the dropdown menu.

Continue to classic layout Stay on new layout
Exercise files

Access exercise files from a button right under the course name.

Mark videos as unwatched

Remove icons showing you already watched videos if you want to start over.

Control your viewing experience

Make the video wide, narrow, full-screen, or pop the player out of the page into its own window.

Interactive transcripts

Click on text in the transcript to jump to that spot in the video. As the video plays, the relevant spot in the transcript will be highlighted.

Learn more, save more. Upgrade today!

Get our Annual Premium Membership at our best savings yet.

Upgrade to our Annual Premium Membership today and get even more value from your lynda.com subscription:

“In a way, I feel like you are rooting for me. Like you are really invested in my experience, and want me to get as much out of these courses as possible this is the best place to start on your journey to learning new material.”— Nadine H.

Thanks for signing up.

We’ll send you a confirmation email shortly.


Sign up and receive emails about lynda.com and our online training library:

Here’s our privacy policy with more details about how we handle your information.

Keep up with news, tips, and latest courses with emails from lynda.com.

Sign up and receive emails about lynda.com and our online training library:

Here’s our privacy policy with more details about how we handle your information.

   
submit Lightbox submit clicked
Terms and conditions of use

We've updated our terms and conditions (now called terms of service).Go
Review and accept our updated terms of service.