Join Dan Sullivan for an in-depth discussion in this video Reformatting character data, part of Advanced SQL for Data Scientists.
- [Instructor] We often need to collect data from multiple sources. Sometimes, the same data is stored differently in different systems. For example, one database might use abbreviations for department names, while another database spells out the full name. We can reformat data to get it into a consistent format. Anytime you start working with a new data set, it is helpful to browse through the data to get a sense of how the data's formatted. So let's quickly look at how department names are stored in our database. Since we just need to see each name once, we'll use a distinct keyword. So we'll start with a SELECT statement, and we'll select department FROM staff, and as I'd mentioned, we want to use a DISTINCT keyword, so we'll put that in there.
There, now we'll see each department once, and as you can see, we have a set of department names appearing as they are stored in the database. If we wanted to reformat the department names to be in capitals or uppercase, we could use the UPPER keyword. And you'll notice the names are returned in uppercase. Similarly, if we wanted the department names in all lowercase, we'd use the LOWER keyword. Now, changing case is pretty simple. Sometimes, we might have data in two or more columns that we'd like to have in a single column.
For example, if we want to have a job title combined with a department name, we could concatenate two columns. Let's do that. Let's select job_title, and we'll use the concatenation operator, which is two pipes, and I want to put a dash in between the job title and the department name, so I'll concatenate that, and then I'll specify department, and we'll select this FROM staff. And what you'll notice is we have a job title followed by a dash, followed by the department name. Now, a nice feature of SQL is that was can give names to columns we create, such as when we reformat.
Let's call this new column Title_Department, and I do that by placing the alias, title_department, and I'll abbreviate that as D-E-P-T, after the concatenation string. Now, when I execute, you'll notice the title is changed to title_department. Depending on how data is stored in or extracted from other databases, you may find data has extra white spaces, like spaces and tabs. We can use the trim function to remove those extra characters.
So, for example, let's just create a column here, and we'll select the trim of a string value that has some spaces in there. Let's call it Software Engineer, and we have a couple spaces in the front, so let's add a couple spaces at the end, and let's execute that statement. Now what happens is, we return just the value of the string without leaving or trailing white spaces. Now, let's just verify that we've actually cut off those spaces. Let's first check the length of the string. So, with the extra spaces, we have 21 characters.
Now, let's check the length after applying the trim function. Now, this should reduce it by the number of white spaces, and it is. So we drop about four white spaces there. Reformatting can also entail adding new types of information, such as category columns that make it easier to select rows that meet some criteria. For example, let's list all employees with a title that begins with the word Assistant. And to do that, we'll create a new SELECT statement, and we'll select job_title, FROM staff, WHERE job title is like Assistant, and we'll use the wildcard, and we'll execute, and we'll see a number of titles have the word Assistant in the beginning.
Now, we might want to make it obvious in a results set if someone is an assistant, so we can create a new Boolean column called Is Assist which will be true or false, depending on whether someone is actually in an assistant. So, let's select job title and let's put in a Boolean expression here. Job_title like, and actually, we'll put a wildcard in front as well as after the term Assistant, that way, if someone has the word Assistant in their title, but it's not in the beginning, we can still catch it.
There. So, anytime a job title has the word Assistant in it, this Boolean will return true, and let's give it the name Is Assistant, which we'll abbreviate as Is_Asst, and we'll just execute it from the staff, and we'll do it for the entire staff, so we'll delete that where clause. Now what we find is that we have returned a value, Is_Asst, which is false when the word Assistant does not appear in the job title. But it is true, however, if, for example, we have Assistant Media Planner or Assistant Manager as the title.
So that's one way of adding new columns that can make it easier for other people to filter without having to understand all of the logic that might go behind, determining whether or not someone is an assistant.
The course begins with a brief overview of SQL. Then the five major topics a data scientist should understand when working with relational databases: basic statistics in SQL, data preparation in SQL, advanced filtering and data aggregation, window functions, and preparing data for use with analytics tools.
- Data manipulation
- ANSI standards
- SQL and variations
- Statistical functions in SQL
- String, numeric, and regular expression functions in SQL
- Advanced filtering techniques
- Advanced aggregation techniques
- Windowing functions for working with ordered data sets