Content management systems often use headings to create permanent URLs, but this can create problems if the text contains accented characters. This technique shows how to convert accented characters and ligatures from Western European languages to their ASCII equivalents.
- [David] Hi, I'm David Powers and welcome to this week's edition of PHP tips, tricks, and techniques designed to help you become a smarter, more productive PHP developer. This week I'm going to look at how to strip accents from text. Now if you're wondering why on earth that might be of interest to you please bear with me for just a minute. Many European languages including French, Spanish, and German use accents and ligatures. I've recently visited Norway so I've chosen an example from Norwegian.
This means places to visit near Tromsø and Kvaløya. The challenge is to strip the accents and convert the ligature into separate letters like this. But why do it? Well, headings of posts in content management systems such as WordPress are frequently used as permalinks, and removing the accents and ligatures makes URLs more user-friendly. But even if you never use any language other than English, I think you might find this PHP tip interesting because solutions aren't always immediately obvious.
PHP doesn't have a built-in function that'll do the job for you. So it requires thinking out of the box and this particular technique involves using some of the optional features of the htmlentities function that you might not have come across before. So with that explanation out of the way let's get to the PHP code. In the early days of the Web it was necessary to encode accented text using the htmlentities function like this.
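The on-screen code isn't captured in this transcript, so here is a minimal sketch of that kind of call. The Norwegian phrase is an assumption reconstructed from the description (an accented a, slashed o's, one ligature, and an ampersand), and the variable names are made up for illustration. The call relies on the UTF-8 default encoding of current PHP versions, and the file is assumed to be saved as UTF-8.

```php
<?php
// Assumed example phrase: "places to visit near Tromsø & Kvaløya".
$text = 'Steder å besøke nær Tromsø & Kvaløya';

// Convert every character that has a named HTML entity, including the ampersand.
$encoded = htmlentities($text);

echo $encoded, PHP_EOL;
// Steder &aring; bes&oslash;ke n&aelig;r Troms&oslash; &amp; Kval&oslash;ya
```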
In the browser you see exactly the same as you see here in the text but if you right-click and view the page source all of the accented characters and the ligature have been converted to their equivalent HTML character entities. The ampersand has also been converted. I'll come back to that a little later. The HTML character entities for accented characters begin with an ampersand immediately followed by a letter and then a description of the accent.
In the case of the ligature, after the ampersand you've got the two characters from the ligature, then lig, followed by the semicolon. So that's how we solve the problem of stripping the accents, by converting them first to HTML character entities and then extracting the first one or two letters after the ampersand. But before doing so let's consider the ampersand in the original text. Modern content management systems support accented characters and ligatures without the need to use HTML character entities but they normally convert ampersands.
So from a content management system what we would normally see is this. So if I save that and then refresh this in the browser, we end up with this: the literal text &amp;. If we look at the page source, what has happened is that the leading ampersand has been converted to &amp; and then it's followed by amp;, giving &amp;amp;. By default the htmlentities function encodes everything, including the ampersand at the beginning of character entities.
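To see the double encoding for yourself, here's a small sketch. The input string is an assumed example of what a content management system might hand you, with the ampersand already stored as an entity, and the variable names are hypothetical.

```php
<?php
// Assumed CMS output: accents kept as-is, ampersand already encoded.
$fromCms = 'Steder å besøke nær Tromsø &amp; Kvaløya';

// With the default settings, the "&" at the start of "&amp;" is encoded again.
$encoded = htmlentities($fromCms);

echo $encoded, PHP_EOL;
// Steder &aring; bes&oslash;ke n&aelig;r Troms&oslash; &amp;amp; Kval&oslash;ya
```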
And to prevent this double encoding we have to pass an optional argument to the htmlentities function. Unfortunately it's the fourth argument. So we need to pass two other optional arguments to the function as well. The first optional argument is a flag telling the function among other things how to handle quotes. Now because we're dealing with URLs we can leave in single quotes but not double quotes. So that argument needs to be a PHP constant which is ENT_COMPAT.
The next argument needs to be the encoding. We're using utf-8, so utf-8. And then the last argument to prevent that double encoding needs to be false. So if we save that and then run the script again now the ampersand is rendered correctly. So this line of script lays the foundation for removing the accents and ligatures. To save time I've completed the script in this other file, remove_accents.php, which you can find in the Download files for this video.
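As a rough sketch of the call just described, this passes all four arguments to htmlentities. The input strings and variable names are assumptions for illustration; the second call is only there to show how ENT_COMPAT treats the two kinds of quotes.

```php
<?php
// Assumed CMS output with the ampersand already encoded.
$fromCms = 'Steder å besøke nær Tromsø &amp; Kvaløya';

// Fourth argument false: existing entities such as &amp; are left alone.
$encoded = htmlentities($fromCms, ENT_COMPAT, 'UTF-8', false);
echo $encoded, PHP_EOL;
// Steder &aring; bes&oslash;ke n&aelig;r Troms&oslash; &amp; Kval&oslash;ya

// ENT_COMPAT converts double quotes but leaves single quotes untouched.
$quotes = htmlentities('"a" \'b\'', ENT_COMPAT, 'UTF-8', false);
echo $quotes, PHP_EOL;
// &quot;a&quot; 'b'
```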
It defines a function called remove_accents and it takes two arguments. The first one is the string that we want to remove the accents from, and the second one is the character set which is set by default to utf-8 so it becomes an optional argument. On line four the string is encoded as HTML character entities. Then we make four passes over the string to extract the unaccented characters and remove unwanted entities.
I've commented out the last three so we can examine what's happening. The first pass on line six extracts the leading letter of accented characters using preg_replace. It uses a long regular expression but it's pretty straightforward. Let's examine it. It begins with the ampersand. Then in a pair of parentheses we've got a capturing group. And inside that capturing group is a character class which is looking for a single letter from A to Z.
And that's followed by this long non-capturing group which looks for alternative text. And these are in fact the descriptions of all the various accents, so acute accent, cedilla, caron, circumflex, and so on. After that non-capturing group we're looking for the semicolon. Then after the last delimiter we've got the i flag, which makes it case-insensitive. And the second argument of preg_replace is \1, and what this does is replace the matched expression, in other words the HTML character entity, with the value that has been matched here, which is the leading character.
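The exact pattern from the exercise file isn't shown in the transcript, so treat this as a reconstruction of the first pass rather than the original code. The list of accent descriptions is an assumption covering the common Western European entities, and the input is an already-encoded version of the assumed example phrase.

```php
<?php
// Already-encoded input (assumed example phrase).
$encoded = 'Steder &aring; bes&oslash;ke n&aelig;r Troms&oslash; &amp; Kval&oslash;ya';

// "&", one captured letter, an accent description (non-capturing), then ";".
// The i flag makes the match case-insensitive.
$pattern = '/&([a-z])(?:acute|cedil|caron|circ|grave|ring|slash|tilde|uml);/i';

// \1 puts back only the captured leading letter.
$firstPass = preg_replace($pattern, '\1', $encoded);

echo $firstPass, PHP_EOL;
// Steder a besoke n&aelig;r Tromso &amp; Kvaloya
```

The ligature and the ampersand entity are untouched at this stage because neither aelig nor amp matches one of the accent descriptions.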
Then down here on line 16 I've got that Norwegian expression again. I've added double quotes at both ends and then I'm passing it to remove_accents. So if we load this page into a browser to run the script, there is the result. We've got rid of the accent over the A. We've also got rid of the slash in the Os here. But we've still got that ligature. And to deal with that ligature is the second pass.
It's here on line eight so let's remove those comments. We're doing pretty much the same thing here. We're using preg_replace with a regular expression and we're looking in this capturing group for two characters A to Z, followed by lig and the semicolon. And again it's case-insensitive and we're simply replacing the matched value here. So if we save that and run the script again, watch this ligature, that's gone.
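Here's a sketch of what that second pass might look like. The pattern is reconstructed from the description, not copied from the exercise file, and the input is the assumed example phrase after the first pass.

```php
<?php
// Input as it might look after the first pass (assumed example phrase).
$firstPass = 'Steder a besoke n&aelig;r Tromso &amp; Kvaloya';

// "&", two captured letters, the literal text "lig", then ";".
$secondPass = preg_replace('/&([a-z]{2})lig;/i', '\1', $firstPass);

echo $secondPass, PHP_EOL;
// Steder a besoke naer Tromso &amp; Kvaloya
```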
So we're pretty much there but there's just one problem. This version of the text contains those double quotes which can't be used in a URL. The second argument to htmlentities, ENT_COMPAT, means that they've been converted along with the accents and the ligature. And there might be other HTML character entities that don't match the regular expressions on lines six and eight. So if I uncomment line 10 we're using preg_replace again and this time we've got a negative lookahead.
So we're looking for a character entity that begins with an ampersand but isn't followed by amp. So we're not going to match the HTML character entity for the ampersand itself, but we're looking for any characters A to Z or zero to nine. And the replacement, the second argument, is an empty string. So this will match any HTML character entity except the one for an ampersand and replace it with an empty string. So if we refresh the browser we've got rid of those double quotes but we've kept the ampersand.
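A minimal sketch of that third pass, again reconstructed from the description. One assumption of mine: the lookahead here checks for amp; including the semicolon, which is slightly stricter than checking for amp alone. The input is the assumed example phrase with its surrounding double quotes already encoded.

```php
<?php
// Input after the first two passes, with the quotes encoded as &quot;.
$secondPass = '&quot;Steder a besoke naer Tromso &amp; Kvaloya&quot;';

// (?!amp;) is a negative lookahead: skip the entity for the ampersand itself.
// Any other named entity is replaced with an empty string.
$thirdPass = preg_replace('/&(?!amp;)[a-z0-9]+;/i', '', $secondPass);

echo $thirdPass, PHP_EOL;
// Steder a besoke naer Tromso &amp; Kvaloya
```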
And there's just one final pass needed if you're going to use the resulting text in a URL and that's here on line 12. If we remove the comments there we're using str_replace. We're looking for a space and we're replacing it with a hyphen. So if we save that and refresh the browser all those spaces are replaced by hyphens. And this solution will work for most if not all Western European languages. And even if you never deal with foreign languages I hope that this has inspired you to think around a problem rather than being fixated on trying to find a literal solution.
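Putting the four passes together, the function might look something like this. It's a sketch based on the walkthrough, not the contents of remove_accents.php: the regular expressions and the example phrase are assumptions, the type declarations need PHP 7 or later, and the line numbers won't match the ones mentioned in the video.

```php
<?php
// Sketch of a remove_accents() function as described in the walkthrough.
function remove_accents(string $text, string $charset = 'UTF-8'): string
{
    // Encode as HTML entities without double-encoding existing ones.
    $text = htmlentities($text, ENT_COMPAT, $charset, false);

    // Pass 1: keep only the leading letter of accented characters.
    $text = preg_replace(
        '/&([a-z])(?:acute|cedil|caron|circ|grave|ring|slash|tilde|uml);/i',
        '\1',
        $text
    );

    // Pass 2: keep the two letters of a ligature.
    $text = preg_replace('/&([a-z]{2})lig;/i', '\1', $text);

    // Pass 3: remove any other entity except the one for the ampersand.
    $text = preg_replace('/&(?!amp;)[a-z0-9]+;/i', '', $text);

    // Pass 4: replace spaces with hyphens for use in a URL.
    return str_replace(' ', '-', $text);
}

$slug = remove_accents('"Steder å besøke nær Tromsø &amp; Kvaløya"');
echo $slug, PHP_EOL;
// Steder-a-besoke-naer-Tromso-&amp;-Kvaloya
```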
Well, that's it for this week's PHP tips, tricks, and techniques. I hope you found it interesting. Thanks for watching.
Note: The exercise files are free to all members. The code is commented to enhance your learning, but you will need database connectivity for some files to run as intended.