Matching URLs

From: Using Regular Expressions

In this movie, let's take a look at how to write a regular expression that would match URLs. To begin with, let's get an idea of what kind of sample data we are working with. What can a URL be? Remember, in the introductory movie to this chapter, I told you that the best first step is to really get an idea of what you want to match. So, some good sample URLs: a very basic one, of course, would just be http://www.nowhere.com. The http portion is referred to as the protocol. The www is known as a subdomain, and then nowhere.com is known as the domain.

It's also valid to have a URL that doesn't have a subdomain; just to omit the www, dot. And it's also valid to have a different subdomain, like blog.nowhere.com. You can also have a different protocol. In addition to http, there is https, and that's the secure http protocol. You would use that with sites that use credit cards, or bank transactions; things like that. Now, all of these get us to the right server, but then once we are on the server, of course, we are going to start surfing around, and we are going to start hitting different pages. So the URL can also have the page information after it.

So, product_page.html. And, of course, those can be nested inside other folders, and the file endings might also include things like images. It's entirely possible that you would just have a URL that doesn't have an actual file ending. You might know that that would either use index.html by default, or a Web application might use it to do something else more complex. You can also append numbers in there, or you can use a query string at the end. So for example, product_page.php?; the question mark indicates that you've got a query string at the end, and then product=28.

That query string can contain a lot of different stuff. It can contain ampersands if we have more than one, it can contain square brackets, and percent signs. There are all sorts of characters that are allowed to be in there. You will also notice in this last example that we don't even have that trailing slash after .com anymore. Now it just goes straight to the query string: .com? So we won't want to take that trailing slash for granted. It's not always there. And in fact, we can actually use an anchor as well, which would be the hash or pound sign, and that would tell it what part of the page we wanted to go to after it got there.

Okay, so now that we have this sample data, I think we are ready to start thinking about how we want to match this. And I am looking at it; I am seeing several different parts. I am seeing that there's really the protocol portion, and that always comes before the ://. Then there is the domain portion, which may or may not include a subdomain. And then, once we get to the right server, there are the instructions for that server, and those can be made up of a page and a query string, just a page, just a query string, or something else, like an anchor. Let's try writing it by focusing on those three blocks: the protocol, the domain, and the location and query string portion.
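As a rough illustration of those three blocks, here is a minimal sketch in Python that splits a URL with plain string operations rather than a regular expression. The function name `split_url` and the choice of Python are assumptions made for illustration; neither comes from the course.

```python
def split_url(url):
    """Split a URL into (protocol, domain, rest) using plain string operations."""
    protocol, sep, remainder = url.partition("://")
    if not sep:
        # No "://" present: treat the whole string as domain + rest.
        protocol, remainder = "", url
    # The domain ends at the first "/", "?" or "#", whichever comes first.
    cut = len(remainder)
    for ch in "/?#":
        pos = remainder.find(ch)
        if pos != -1:
            cut = min(cut, pos)
    return protocol, remainder[:cut], remainder[cut:]


samples = [
    "http://www.nowhere.com",
    "https://blog.nowhere.com/product_page.html",
    "http://nowhere.com?product=28",
]
for url in samples:
    print(split_url(url))
```

The third sample has no slash after .com, so the split has to look for the question mark and the pound sign as well as the slash.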

To start with the protocol, we know that we want to use an anchor at the beginning, and we want to use multi-line anchors, because we are actually checking many of these at one time. So let's go ahead and just start with our first one, and let's match the beginning of the line against http://. So that works; that gets most of our cases. It doesn't allow for the special case where we have the secure connection, though. There are a couple of ways we can do this. We can either put an s, and then a question mark to make it optional. Now that matches. Or, a lot of people prefer to do it this way instead: open parentheses, http, and then https as an alternative option.

Why do it one way over the other? Well, you may choose the first one, because it's more concise. The nice thing about this one is it gives us more flexibility if we want to add other protocols later, like FTP. So I am going to go with this one. It's also a good idea for us to escape these forward slashes. Now, we don't have to do it here; it doesn't break it in the JavaScript context. But if this had been a regular expression inside a programming language, it might have had surrounding forward slashes around it, and then we would need to escape it. So I am going to go ahead and add those in there. Now we've successfully matched the protocol portion.
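The two ways of writing the protocol portion might look like the following sketch. It uses Python's `re` module as a stand-in for the regex tester in the video, and the exact patterns are an assumption, since the transcript does not show them character by character.

```python
import re

# Multi-line mode lets ^ match at the start of every line of sample data.
# Concise form: an optional "s" after "http".
PROTOCOL_CONCISE = re.compile(r"^https?:\/\/", re.MULTILINE)
# Alternation form: easier to extend later, e.g. (http|https|ftp).
PROTOCOL = re.compile(r"^(http|https):\/\/", re.MULTILINE)

sample_text = "\n".join([
    "http://www.nowhere.com",
    "https://www.nowhere.com",
    "ftp://www.nowhere.com",
])

matched = [m.group(0) for m in PROTOCOL.finditer(sample_text)]
print(matched)
```

Python does not need the forward slashes escaped, but escaping them is harmless and keeps the pattern portable to contexts where the expression is delimited by slashes.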

Depending on the context, you could, of course, make this whole protocol portion optional, because you can have a valid URL that's just nowhere.com, and that would just assume that the protocol was http://. But again, that depends on your context. If you wanted it optional, you know how you could turn it into an optional group. Now let's move on to the domain portion, which may or may not have a subdomain. The domain is very similar to what we saw with the e-mail addresses. So we could go ahead and use the same regular expression that we used to match the domain there. And if you want to refer back there, you'll remember that we talked about how this isn't specific enough to make sure that the domain actually exists.
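If the protocol should be optional, one way to sketch it is to wrap the whole protocol portion in a group followed by a question mark. The literal `nowhere\.com` below is only a placeholder for the domain portion, and the helper name `has_valid_prefix` is hypothetical.

```python
import re

# The outer group makes the entire "http://" or "https://" prefix optional.
OPTIONAL_PROTOCOL = re.compile(r"^((http|https):\/\/)?nowhere\.com$")


def has_valid_prefix(url):
    """Return True if the URL has an allowed protocol or no protocol at all."""
    return OPTIONAL_PROTOCOL.match(url) is not None


for url in ["nowhere.com", "http://nowhere.com", "https://nowhere.com", "ftp://nowhere.com"]:
    print(url, has_valid_prefix(url))
```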

But in this case, I think it's okay, because you know what? We don't actually know that the page exists after that anyway. A URL isn't necessarily valid until you test it, and see whether it actually exists. So I don't see it as a big problem if we don't know if the domain is valid, because we have no way of knowing if the URL is actually valid. So in this case, I think that the broader version is going to be better. But there is a problem here. While this does allow for domains and subdomains just fine, there is also another case that we haven't considered, which is where we could use an IP address. So one of the edge cases might be that we have 255.255.255.255.

Now, we will talk more about how to match IP addresses in a moment, but for now, I just want to make sure that we allow for something that will at least accommodate this basic usage of an IP address. One way that we can do that is to modify our regular expression. The first part already allows for numbers, with the backslash, lowercase w for word character. So instead of restricting this last portion to being only letters, uppercase and lowercase, we need it to allow numbers as well. So in order to do that, I am just going to change this back to being the backslash, lowercase w.

That now allows for that IP address. I am going to go ahead and remove this restriction as well, and just make it a simple plus. And then I could actually put this whole thing in parentheses, and say that there may be portions that repeat there. So what I'm saying is, there's some portion, and then after that there's a dot, and some more, and then a dot, and some more, and a dot, and some more, and a dot, and some more. I am not caring what those things are; they are just word characters. We also should allow this to be a hyphen inside here. Let's do that; there we go. So now we are matching all possibilities where we just have something, followed by a dot, followed by something, followed by a dot, followed by something, and so on.
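A sketch of the domain portion as described here: one run of word characters or hyphens, followed by one or more repetitions of a dot and another such run. The exact pattern is an assumption reconstructed from the description, and `domain_part` is a hypothetical helper.

```python
import re

# Protocol, then "something", then one or more ".something" segments.
# \w covers letters, digits and underscore; the hyphen is added explicitly.
PROTOCOL_AND_DOMAIN = re.compile(r"^(http|https):\/\/[\w\-]+(\.[\w\-]+)+")


def domain_part(url):
    """Return the matched protocol-plus-domain prefix, or None if there is no match."""
    m = PROTOCOL_AND_DOMAIN.match(url)
    return m.group(0) if m else None


for url in [
    "http://www.nowhere.com",
    "http://nowhere.com",
    "http://my-blog.nowhere.com/page",
    "http://255.255.255.255",
    "http://nowhere",
]:
    print(url, "->", domain_part(url))
```

Because this sketch requires at least one dot, a bare host name such as http://nowhere is rejected; whether that is desirable depends on the data being matched.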

And I won't try and get more specific. Again, if you want to, you can refer back to the movie about matching e-mails, or the upcoming movie about matching IP addresses. So now what about the page and query string portion? Well, the simplest approach is just to say there can be any character, using the wildcard, zero or more times, and then the end of the string, and that will match anything. And that's because there are so many possibilities for what could be there, and we don't want to try and rule them all out. So there could be semicolons, there could be question marks; there are all sorts of possibilities. Now, if you wanted to, you could be more specific, and you could say, alright, you know what? That first character that goes right here, separating it; that needs to be either a forward slash, a pound sign, or a question mark, and you can limit it to just those.
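The broadest version, with a wildcard for everything after the domain, could be sketched as follows. The pattern is an assumption assembled from the steps described so far, and the sample lines are illustrative.

```python
import re

# Protocol, domain, then any characters up to the end of the line.
# In multi-line mode, ^ and $ anchor at each line; "." does not cross newlines.
BROAD_URL = re.compile(
    r"^(http|https):\/\/[\w\-]+(\.[\w\-]+)+.*$",
    re.MULTILINE,
)

sample_text = "\n".join([
    "http://www.nowhere.com/folder/product_page.html",
    "http://nowhere.com?product=28&color=blue",
    "http://www.nowhere.com/page#section",
    "not a url",
])

matched_lines = [m.group(0) for m in BROAD_URL.finditer(sample_text)]
print(matched_lines)
```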

Of course, we need to make that optional. But once we've done that, I mean, we've made an optional group that says that it's one of those characters, but then right after that, we've got a wildcard. So if it doesn't match one of those characters, it's meaningless, because it will just say, okay, I guess it wasn't there, and the wildcard will still match everything. So the only way this becomes meaningful is if we start making the wildcard more specific. And anything we put here in the wildcard; it's probably going to include those characters anyway. So, I don't think it's actually that meaningful to try and figure out what that dividing character might be.
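A small experiment can make this point concrete: with a wildcard after it, an optional separator character changes nothing about what matches. Both patterns below are assumptions reconstructed from the description, and the sample URLs are illustrative.

```python
import re

BASE = r"^(http|https):\/\/[\w\-]+(\.[\w\-]+)+"

# Optional separator (slash, pound sign, or question mark) before the wildcard.
WITH_SEPARATOR = re.compile(BASE + r"[\/#?]?.*$")
# Wildcard only.
WITHOUT_SEPARATOR = re.compile(BASE + r".*$")

samples = [
    "http://www.nowhere.com",
    "http://www.nowhere.com/page.html",
    "http://nowhere.com?product=28",
    "http://www.nowhere.com;oops",  # separator is none of the three
    "nowhere",
]


def matched_text(pattern, url):
    """Return the full match, or None when the pattern does not match."""
    m = pattern.match(url)
    return m.group(0) if m else None


results = [
    (matched_text(WITH_SEPARATOR, u), matched_text(WITHOUT_SEPARATOR, u))
    for u in samples
]
for with_sep, without_sep in results:
    print(with_sep == without_sep, with_sep)
```

The fourth sample still matches under both patterns, which is why the separator only becomes meaningful once the wildcard itself is restricted.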

We can tighten up this wildcard just a little bit by putting in some of the most common characters that you would see in there, and then you could add to it if you think of more. For example, I notice that the semicolon is not in there. We can add in a semicolon as well. Then, of course, that character set needs to be repeated to be able to match everything. So this is the broadest possible match, and then you can tighten it up to fit your individual needs. The very last thing that I recommend you do when you are working with these regular expressions is look for groups that are being captured. If you're not using capturing, then turn them into non-capturing groups, like this. This group is a non-capturing group, and there we go. Those are the two, and now those are non-capturing groups, so we've made our regular expression a little more efficient.
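Putting the last two steps together, a tightened version might look like the sketch below. The character set is an assumption: the transcript names only some of the characters (ampersands, square brackets, percent signs, semicolons, question marks, the pound sign), so the list here is illustrative and should be adjusted to the data at hand.

```python
import re

# (?: ... ) groups without capturing, so no backreference is stored.
URL = re.compile(
    r"^(?:http|https):\/\/"          # protocol
    r"[\w\-]+(?:\.[\w\-]+)+"         # domain, subdomains, or dotted numbers
    r"[\w\-.,@?^=%&:;\/~+#\[\]]*$",  # page, query string, anchor
    re.MULTILINE,
)


def is_url(text):
    """Return True if the whole line looks like a URL under this sketch."""
    return URL.match(text) is not None


print(is_url("http://www.nowhere.com/folder/product_page.php?product=28&ids[]=3%20#top"))
print(is_url("http://www.nowhere.com/has space"))
print(URL.groups)  # number of capturing groups in the pattern
```

Because a space is not in the character set, the second sample is rejected, which the wildcard version would have accepted.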

This video is part of

Using Regular Expressions

59 video lessons · 12503 viewers

Kevin Skoglund
Author

 