Matching HTML tags

From: Using Regular Expressions

In this movie, we'll work on developing regular expressions that can match HTML tags. HTML tags typically look something like this. We've got an opening tag, which is made up of a less-than sign and a greater-than sign on either side of a word. You can also think of those as being angle brackets, and that's the opening tag. Then there is some text in the middle, and after that is the closing tag, which is just like the opening tag, but with a forward slash right before the word. That's what an HTML tag typically looks like. The word inside the tag can vary, depending on its purpose, so we can have a strong tag; we can also have an em tag to emphasize something.


We could have b for bold, or i for italics, and just have a single character there. Now incidentally, bold and italics are generally discouraged in favor of using strong and em in their place. So it's a very common task with regular expressions to scan across an entire Web site, and replace all of the b tags with strong tags, or all of the i tags with em tags. It's a perfect case where you might find yourself using regular expressions. Now, a single word is not the only thing that can go in that first tag. We can also have attributes, and the way we write an attribute is we have a space, then the attribute name, an equal sign, and then the value for that attribute in quotation marks, if it's properly formatted -- you may see it without quotation marks if it's not properly formatted. Then we have another space, and the next attribute-value pair. And there can be many of those: id, class, style; there are lots and lots of different possibilities.
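The find-and-replace task mentioned above can be sketched roughly as follows. This is only an illustration using Python's `re` module as a stand-in for whatever engine you prefer; the sample markup, the `REPLACEMENTS` table, and the `modernize` helper are all hypothetical names, not something from the course files.

```python
import re

# Hypothetical sample markup; not taken from the course's exercise files.
html = ('<p><b>Bold</b> and <i class="note">italic</i> text, '
        'plus <br /> and <img src="x.png" />.</p>')

# Assumed mapping from the old tag names to their replacements.
REPLACEMENTS = {"b": "strong", "i": "em"}

# Group 1: optional slash (present on a closing tag).
# Group 2: the tag name, b or i. The \b stops <br> and <img> from matching.
# Group 3: any attributes before the closing angle bracket.
TAG = re.compile(r"<(/?)(b|i)\b([^>]*)>")

def modernize(text: str) -> str:
    """Swap b/i tag names for strong/em, keeping slashes and attributes."""
    return TAG.sub(
        lambda m: f"<{m.group(1)}{REPLACEMENTS[m.group(2)]}{m.group(3)}>",
        text,
    )

result = modernize(html)
print(result)
```

The word boundary after the tag name is the important detail here: without it, the `b` alternative would also rewrite tags such as br.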

And last of all, let's not forget that there are some tags that can be self-closing tags; that is, they don't have any content in the middle, and they don't actually have a closing tag. They close themselves by just having a space, followed by the forward slash, and then that angle bracket at the end. HR, for horizontal rule, is a good example of that. Let's try writing some regular expressions that'll match these. So the first thing we want to do is turn on multi-line anchors, like we have been doing in the other movies, just to make sure that we match each line individually. Now, let's just try matching that first one. A lot of times I find that the easiest way to start is just to copy that value, and paste it up there.

Obviously, that's going to match, because it's a literal, word-for-word string, and then we can start playing with it, and see if we can adapt that a bit. So for example, here in the middle, we really don't care what this text is, so we can just say that that's going to be any character, followed by an asterisk, and we'll make it not greedy. Now, I'm going to put parentheses around it, just to sort of keep it separate from everything else that we're doing, but you don't have to. Okay, so that still matches our example here, but it doesn't match any of our other ones. We need to add more flexibility in here, so instead of just having strong here, let's put parentheses around this, and let's make it into an alternation. Let's say it can also be em, for example. We can do the same thing here at the end, and we'll make this also strong or em, in parentheses.
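The three stages just described (literal string, lazy wildcard in the middle, alternation on the tag name) might look like this in Python. The patterns are reconstructed from the narration rather than copied from the screen, and the sample lines and the `hits` helper are assumptions made for the sketch.

```python
import re

# Hypothetical sample lines, one tag per line, as in the narration.
lines = "\n".join([
    "<strong>Bold</strong>",
    "<em>Emphasized</em>",
    "<b>Bold</b>",
])

def hits(pattern: str) -> list[str]:
    """Return every full match, with ^ and $ anchored to each line."""
    return [m.group(0) for m in re.finditer(pattern, lines, re.MULTILINE)]

# Stage 1: the literal string, pasted as-is.
literal_hits = hits(r"^<strong>Bold</strong>$")

# Stage 2: any text in the middle, matched lazily and grouped.
lazy_hits = hits(r"^<strong>(.*?)</strong>$")

# Stage 3: an alternation on the tag name at both ends.
alternation_hits = hits(r"^<(strong|em)>(.*?)</(strong|em)>$")

print(literal_hits, lazy_hits, alternation_hits, sep="\n")
```

`re.MULTILINE` plays the role of the multi-line anchors option, so that `^` and `$` apply to each line rather than to the whole string.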

So now we've matched the first two tags, but we've actually introduced a problem that we may not have realized, which is that we could also have strong, mistake, and then em, and our expression treats that as a perfectly valid tag. Our tags are no longer balanced, and we're still getting a match. We want to make sure that we do have balanced tags, especially because it is possible to nest one tag inside another. We don't want to mistakenly grab the wrong tag. That's also why we used the lazy operator here after our star: to make sure that we don't consume too much, and that we grab the next closing tag that matches.

So one way we can do that is by using backreferences; we talked about those before. We are already capturing this group here. We've got parentheses around it, so it's being captured for us. Backslash 1 will now grab whatever value is found there, and reuse it again, as if we were copying it and pasting it into this spot. So now it matches strong and strong, and em and em, but not the weird combination where we had strong, and em after it. Okay, so what about our other ones down here? Bold, italics; we could just keep going. We could just keep listing all the tags that we wanted to define here, and certainly if we were trying to find certain tags, then that would make sense. We would want to itemize the tags. But there are a lot of HTML tags out there.
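A small comparison can make the balancing problem and the backreference fix concrete. As before, this is a sketch in Python with hypothetical sample lines; the pattern names are illustrative only.

```python
import re

# Hypothetical sample lines; the third one has mismatched tags.
lines = "\n".join([
    "<strong>Bold</strong>",
    "<em>Emphasized</em>",
    "<strong>mistake</em>",
])

# Two independent alternations: the closing tag need not equal the opening tag.
unbalanced = re.compile(r"^<(strong|em)>(.*?)</(strong|em)>$", re.MULTILINE)

# \1 reuses whatever group 1 captured, so the closing tag must be the same word.
balanced = re.compile(r"^<(strong|em)>(.*?)</\1>$", re.MULTILINE)

unbalanced_hits = [m.group(0) for m in unbalanced.finditer(lines)]
balanced_hits = [m.group(0) for m in balanced.finditer(lines)]

print(unbalanced_hits)
print(balanced_hits)
```

The first pattern accepts all three lines, including the mismatched one; the second accepts only the two lines whose closing tag repeats the opening tag.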

What I want to write is something that's more general; something that will match any HTML tag. So we know we could do that by just changing this to be a wildcard, make it plus, and we could even make it lazy, just like we had before. That'll make sure it doesn't grab too much, and that works. That actually does match our first four. But there's an even better way that we can approach this. This wildcard always feels a little bit sloppy to me; you always want to be careful with that wildcard. So instead, I'm going to change this definition here to say, well, you know what? Actually, this can be any character, but it can't be that closing angle bracket.

As long as we haven't hit a closing angle bracket yet, go ahead and keep consuming characters. It's a little bit faster, because at each character it just checks: have we got to the closing angle bracket yet? No. Have we got to the closing angle bracket yet? No. And then when we finally do hit one, it says, okay, we finally are at the closing angle bracket; we're ready to move on to the next part of the match. And I don't need this to be a lazy quantifier any more, because I've been a little bit more specific here about what I'm looking for. Now, notice that it's not matching the next one. Why is that? Well, it's because when we have these attributes in there, they break our backreference.
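Here is one way to see both generalizations side by side, and to see what the capture group holds once attributes are present. The sketch assumes Python's `re` module and hypothetical sample lines; the patterns are a reconstruction of what the narration describes.

```python
import re

span_line = '<span id="foo" class="bar">Some text</span>'

# Hypothetical sample lines: four plain tags plus one tag with attributes.
lines = "\n".join([
    "<strong>Bold</strong>",
    "<em>Emphasized</em>",
    "<b>Bold</b>",
    "<i>Italics</i>",
    span_line,
])

# Lazy wildcard for the tag name.
lazy_wildcard = re.compile(r"^<(.+?)>(.*?)</\1>$", re.MULTILINE)

# Negated character set: anything except the closing angle bracket.
negated = re.compile(r"^<([^>]+)>(.*?)</\1>$", re.MULTILINE)

wildcard_hits = [m.group(0) for m in lazy_wildcard.finditer(lines)]
negated_hits = [m.group(0) for m in negated.finditer(lines)]

# What group 1 holds for the opening tag that has attributes.
captured = re.match(r"<([^>]+)>", span_line).group(1)

print(wildcard_hits == negated_hits, len(negated_hits))
print(captured)
```

Both patterns match the four plain tags and skip the span line, because the backreference would need the closing tag to repeat the attributes as well as the tag name.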

Now this whole bit right here is being captured, and being used in the backreference. So if the closing tag also had id and class after it, well then we might get a match, but it doesn't. What we need to do, then, is separate this first word from any attributes that might appear after it. So one good way to do that is to say, alright, what could this first word be? Let's go ahead and move our capture to that. Let's not try to capture this. We're still going to use this for anything that might come after it, but for this first word -- the part that we actually want to capture -- we're going to instead say, well, that can be any letter, A to Z, and that does indeed have to be a letter, and then it's followed by a second character, which can be A to Z, or lowercase a to z, or 0 to 9. It's essentially any word character without the underscore. That's because there is a tag called H2.

That second one can occur zero or more times, and then after that, just to make sure that we've actually ended that word, let's put that there should be a word boundary there as well. So once we get to the word boundary, we'll know we've got the first word, and we can capture it, and then we've got our second part here that takes care of all of those attributes, and then our center section, followed by our tag at the end. So now you can see that our backreference is working, but in the process, we broke it for all of these up here. That's because they don't have these attributes up here, so we have to say it's zero or more times.

It's a possibility that there are no characters after the word boundary. The very next thing after the word boundary might be this angle bracket. So finally now, we've got it matching all five of these. Alright, last of all, let's look at this possibility for a self-closing tag. Now, you certainly could just search for that separately, or you might want to put in an alternation, let's say, inside this tag. Let's put it right here. All the way to this -- we will do it right here; we'll make that the alternation. So this one could be hr, slash, with a closing angle bracket, and now it matches as well.
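The tag-name capture, the word boundary, and the switch from one-or-more to zero-or-more attribute characters can be sketched like this. It assumes Python's `re` module and hypothetical sample lines; in particular, where the hr alternation is placed here is a guess, since the narration does not pin that down.

```python
import re

# Hypothetical sample lines: the five paired tags plus a self-closing hr.
lines = "\n".join([
    "<strong>Bold</strong>",
    "<em>Emphasized</em>",
    "<b>Bold</b>",
    "<i>Italics</i>",
    '<span id="foo" class="bar">Some text</span>',
    "<hr />",
])

NAME = r"([A-Za-z][A-Za-z0-9]*)"  # a letter, then letters or digits (e.g. h2)

# [^>]+ demands at least one character after the tag name, so plain tags fail.
strict = re.compile(rf"^<{NAME}\b[^>]+>(.*?)</\1>$", re.MULTILINE)

# [^>]* allows the angle bracket to follow the word boundary directly.
relaxed = re.compile(rf"^<{NAME}\b[^>]*>(.*?)</\1>$", re.MULTILINE)

# One possible placement of the literal hr alternative (an assumption).
with_hr = re.compile(rf"^<{NAME}\b[^>]*>(.*?)</\1>$|^<hr />$", re.MULTILINE)

strict_hits = [m.group(0) for m in strict.finditer(lines)]
relaxed_names = [m.group(1) for m in relaxed.finditer(lines)]
with_hr_hits = [m.group(0) for m in with_hr.finditer(lines)]

print(strict_hits)
print(relaxed_names)
print(len(with_hr_hits))
```

Because group 1 now holds only the tag name, the backreference no longer drags the attributes along with it.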

Now, let's revise this a little bit. Let's not be so specific about hr, because there are other possibilities that can be there as well. What we can actually do is just grab all of this here from this beginning section, and let's copy it. There we are, and then it actually can have attributes too. We don't need to capture this time; we don't need to do a backreference, so let's just grab this bit here that may or may not exist before we get to the end. Okay, so now we have the possibility of having attributes on our hr tag. But what happened to our other matches? Why did they stop working in this process? There's a simple alternation here; why did this suddenly stop working? And if you want to try it, take this back out, just this alternation, and you'll see that it does work, and then when you put it back in, it stops working again. Can you guess why? It's because we have this backreference here referring to capture number one.

What is capture number one? What's the first thing being captured? It's these parentheses that are grabbing the entire match. That's what's being referenced. So what we need to do is put a non-capturing group here at the beginning. Now it says, alright, don't capture this one, this is just an alternation. This is my first capture again, and so that's what the backreference now refers to. Be careful about that when you start adding in these parentheses. In fact, it's not a bad idea to go ahead and make these non-capturing as well, unless we're actually trying to capture them.
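The group-numbering problem and the non-capturing fix might be reproduced as follows. This is a sketch under assumptions: the patterns are reconstructed from the narration, the sample lines and the `compile_or_none` helper are hypothetical, and Python's `re` module stands in for the engine used in the video. As far as I know, Python refuses to compile a backreference to a group that is still open instead of quietly failing to match, so the helper treats that error as "no matches".

```python
import re

# Hypothetical sample lines, including self-closing and mismatched tags.
lines = "\n".join([
    "<strong>Bold</strong>",
    "<em>Emphasized</em>",
    "<b>Bold</b>",
    "<i>Italics</i>",
    '<span id="foo" class="bar">Some text</span>',
    "<hr />",
    '<hr class="rule" />',
    "<strong>mistake</em>",
])

NAME = r"[A-Za-z][A-Za-z0-9]*"

def compile_or_none(pattern: str):
    """Compile a pattern, or return None if the engine rejects it."""
    try:
        return re.compile(pattern, re.MULTILINE)
    except re.error:
        return None

# Capturing outer group: it becomes group 1, so \1 no longer means the tag name.
broken = compile_or_none(rf"^(<({NAME})\b[^>]*>(.*?)</\1>|<{NAME}\b[^>]*/>)$")
broken_hits = [] if broken is None else [m.group(0) for m in broken.finditer(lines)]

# Non-capturing groups (?:...) leave the tag name as group 1 again.
fixed = re.compile(rf"^(?:<({NAME})\b[^>]*>(?:.*?)</\1>|<{NAME}\b[^>]*/>)$",
                   re.MULTILINE)
fixed_hits = [m.group(0) for m in fixed.finditer(lines)]
tag_names = [m.group(1) for m in fixed.finditer(lines)]

print(broken_hits)
print(fixed_hits)
print(tag_names)
```

For the self-closing lines, group 1 is empty (`None` in Python), since the tag name is only captured in the paired-tag branch of the alternation.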

So now we've developed one regular expression that will match all of our HTML tags.

This video is part of

Using Regular Expressions

59 video lessons · 12490 viewers

Kevin Skoglund
Author

 
