Database normalization is the process of taking your database design through a set of rules called normal forms, so that it conforms to relational database standards. You really want to do this so that your database contains a minimum of duplicate or redundant data. It will contain data that's easy to get to, define, edit, and maintain, and you can perform operations on it, even difficult ones, without creating garbage inside and without invalidating its state.
It should be carried out for every database you design, and it's really not that hard. Yes, when you first start reading about database normalization, you're likely to run into phrases like "your database won't be in third normal form until every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every candidate key of R," but you don't have to get into all this language. You just have to understand that these are a set of rules developed about 40 years ago by E. F. Codd, the father of relational databases, and we step through them one, two, three: first normal form, second normal form, third normal form.
So what's first normal form? Well, it starts off with stuff we've been doing already. Your data needs to have a unique key. There are a few very rare situations in which you don't have a unique primary key, but we're going to have one for all our databases. Really, the heart of first normal form is that each of your columns, each of your fields, should contain one value and just one value, and there should be no repeating groups. Okay, what does this actually mean for our tables? Well, let's say, for example, I begin developing a customer table.
I've got a customer ID, so that's good. We've started off toward first normal form. I've got the name of the customer and the city they're based in. Then what I decide to do is say that all our customers have a representative, the person we talk to. So I add another column to the table. This would be the customer contact. Who do we speak to at ACME Corp? Who do we speak to at Two Trees or Acacia? The issue is what happens when one of these companies starts to grow a little bit, and we find out that we've got more than one contact. Well, there are a couple of ways you could deal with it.
You could just start stuffing extra data into that one column. So we could just start putting commas or any other delimiter and putting multiple values in the one Contact column. Well, this is a no-no. This is not in first normal form if you do this, because first normal form demands that every column, every field, contains one and only one value. If you decide to shove multiple values in like this, you'll find it harder to search, you'll find it harder to sort, you'll find it harder to maintain.
Well, what some people do is they rip that out, go back to the original one, and then they start adding more columns: Contact, Contact 2, Contact 3. This is what's called a repeating group, and there should be no repeating groups. The classic sign of a repeating group is fields with the same name and different numbers. We don't want either of these things. So what do we do? What we do is what we do for a lot of the normalization steps. We rip the Contact data out and give it its own customerContact table.
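As a rough sketch of that split, here is one way it might look using Python's built-in sqlite3 module. The table names customer and customerContact and the company names come from the example above; the column names, cities, and contact names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE customer (
    CustomerID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL,
    City       TEXT NOT NULL
);
-- One row per contact, so every field holds exactly one value.
CREATE TABLE customerContact (
    ContactID   INTEGER PRIMARY KEY,
    CustomerID  INTEGER NOT NULL REFERENCES customer(CustomerID),
    ContactName TEXT NOT NULL
);
""")

# Cities and contact names below are hypothetical sample data.
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "ACME Corp", "Springfield"),
    (2, "Two Trees", "Riverton"),
    (3, "Acacia", "Lakeside"),
])
conn.executemany(
    "INSERT INTO customerContact (CustomerID, ContactName) VALUES (?, ?)",
    [(1, "Ann Lee"), (2, "Bob Ortiz"),
     (3, "Cy Patel"), (3, "Dee Wong"), (3, "Eli Novak")],
)

def contacts_for(customer_id):
    """Return the contact names for one customer, sorted by name."""
    rows = conn.execute(
        "SELECT ContactName FROM customerContact "
        "WHERE CustomerID = ? ORDER BY ContactName",
        (customer_id,),
    ).fetchall()
    return [name for (name,) in rows]

print(contacts_for(3))  # ['Cy Patel', 'Dee Wong', 'Eli Novak']
```

Adding a fourth contact for a customer is now just another row, with no delimiter parsing and no Contact 4 column.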
This then has relationships. We get a one-to-many relationship between customer and customerContact, where we can go from customer 1 and find the contact, go from customer 2 and find the contact, go from customer 3 and find the three contacts, and this would get it into first normal form. That's step one, because to go on to second normal form, well, first you have to be in first normal form. You don't pick and choose. You go through this one, two, three. Second normal form has the rather puzzling phrase that "any non-key field should be dependent on the entire primary key," and that's about as simply as it can be phrased.
Now what does this actually mean? Well, for most of what we've done in this course, this isn't an issue for us. We're already in second normal form. Let me show you a table that currently is in first normal form, but not in second normal form. I have an events table here that has an ID of a course and a Date and a CourseTitle. Now what's actually happening is this table has been defined so that it's using two columns as the key to it. This is what's referred to as a compound primary key.
I can't use just the ID column as the key here, because as you see SQL101 appears multiple times, but I can combine the ID with the Date, and in a lot of cases this makes sense. The issue is that if you do this and use a compound key, you need to look at the other columns in this table. So I have got CourseTitle as Intro to SQL. Seats, five seats available. It's in room 14, and a lot of this information is unique to this one entry and that's fine.
But second normal form demands that all my non-key columns, the things that aren't keys (CourseTitle, Seats and Room), have to be dependent on the entire primary key. Now that is the case for Seats and Room. Those are unique values based on the fact that we're running this course on a particular date in a particular room with a certain number of seats available. But CourseTitle I could get from just the course ID part of the key. This might sound a bit ivory tower, but here's the impact.
What happens if somebody reaches into this table and changes that course ID by accident? It was SQL101. It's now changed to ASP101. Well, now I've got the wrong title for a piece of data. That's because my data is not in second normal form, and if I now look at this row, ASP101, Intro to SQL, well, which one is right? Is it the wrong ID or the wrong title? I don't know. So how do we fix this? Well, once again we're going to rip out the CourseTitle.
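A minimal sketch of that fix, again using Python's sqlite3 module: the title moves into a courses table keyed by course ID, and the events table keeps only the columns that depend on the whole compound key. The dates, the second event row, and the ASP101 title are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- The title depends on the course ID alone, so it is stored once per course.
CREATE TABLE courses (
    CourseID    TEXT PRIMARY KEY,
    CourseTitle TEXT NOT NULL
);
-- Seats and Room depend on the whole compound key (CourseID, Date).
CREATE TABLE events (
    CourseID TEXT    NOT NULL REFERENCES courses(CourseID),
    Date     TEXT    NOT NULL,
    Seats    INTEGER NOT NULL,
    Room     INTEGER NOT NULL,
    PRIMARY KEY (CourseID, Date)
);
""")

# Hypothetical sample rows; only SQL101, "Intro to SQL", 5 seats and room 14
# come from the example being discussed.
conn.executemany("INSERT INTO courses VALUES (?, ?)", [
    ("SQL101", "Intro to SQL"),
    ("ASP101", "Intro to ASP"),
])
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("SQL101", "2024-04-02", 5, 14),
    ("SQL101", "2024-05-07", 9, 11),
])

# Someone changes the course ID on one event.
conn.execute(
    "UPDATE events SET CourseID = 'ASP101' "
    "WHERE CourseID = 'SQL101' AND Date = '2024-05-07'"
)

# The title is looked up through the key, so it can never disagree with the ID.
titles = conn.execute("""
    SELECT e.CourseID, e.Date, c.CourseTitle
    FROM events AS e
    JOIN courses AS c ON c.CourseID = e.CourseID
    ORDER BY e.Date
""").fetchall()

for row in titles:
    print(row)
```

After the update, the changed event reports the ASP101 title, because there is no second copy of the title left in the events table to go stale.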
We're going to create a separate courses table where we can map the ID from the events table to the ID in the courses table and always have one specific title for one specific ID, and everything left in the events table is based on the whole key, in this case both ID and Date. Again, if you're not using compound keys, it's not really a concern. You can just step ahead, go right through second normal form and into third normal form. About as plain English as I can make this one: no non-key field, that is, nothing that isn't part of the primary key, should be dependent on another non-key field.
This is in a way similar to second normal form. It's still asking, can I figure out any of the fields I have from other fields that I have? So for example, I'm looking at an updated version of the events table. This is in both first normal form and second normal form, but it's not in third normal form. It's in first normal form: I have got my key, I don't have any repeating groups, and I don't have multiple values within a column. It's in second normal form because I have decided to change it to have a one-column primary key, which is EventID. But it's not in third normal form. Why? Well, what I can do is scan through my non-key fields, which for me is everything other than EventID.
SQL101 is occurring on the 2nd of April. There are apparently five seats available. That's being held in room 14. There is a capacity of 18. These values, the date, the availability, the room, could be different from row to row so they're fine. The issue is with the Capacity column. If we are looking at the room, so Room 14 has 18 seats and Room 11 has 24 seats and Room 8 has 12 seats, well that means one non-key field that we have, Capacity, is dependent on another non-key field, Room.
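One way to sketch this dependency, and the usual fix for it, with Python's sqlite3 module. The room numbers and capacities are the ones just mentioned; the extra event rows, the dates, and the table name events_flat are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The flat table: Capacity sits next to Room even though Room determines it.
conn.executescript("""
CREATE TABLE events_flat (
    EventID  INTEGER PRIMARY KEY,
    CourseID TEXT    NOT NULL,
    Date     TEXT    NOT NULL,
    Seats    INTEGER NOT NULL,
    Room     INTEGER NOT NULL,
    Capacity INTEGER NOT NULL
);
INSERT INTO events_flat VALUES
    (1, 'SQL101', '2024-04-02',  5, 14, 18),
    (2, 'SQL101', '2024-04-09',  7, 11, 24),
    (3, 'ASP101', '2024-04-16',  2,  8, 12),
    (4, 'ASP101', '2024-04-23', 10, 14, 18);
""")

# A room listed with two different capacities would be a contradiction.
conflicts = conn.execute("""
    SELECT Room FROM events_flat
    GROUP BY Room
    HAVING COUNT(DISTINCT Capacity) > 1
""").fetchall()
print(conflicts)  # [] for this sample data

# Split: Capacity moves to a rooms table, events keeps only Room.
conn.executescript("""
CREATE TABLE rooms (
    Room     INTEGER PRIMARY KEY,
    Capacity INTEGER NOT NULL
);
INSERT INTO rooms SELECT DISTINCT Room, Capacity FROM events_flat;

CREATE TABLE events (
    EventID  INTEGER PRIMARY KEY,
    CourseID TEXT    NOT NULL,
    Date     TEXT    NOT NULL,
    Seats    INTEGER NOT NULL,
    Room     INTEGER NOT NULL REFERENCES rooms(Room)
);
INSERT INTO events SELECT EventID, CourseID, Date, Seats, Room FROM events_flat;
""")

room_count = conn.execute("SELECT COUNT(*) FROM rooms").fetchone()[0]
capacity_for_event_1 = conn.execute("""
    SELECT r.Capacity
    FROM events AS e
    JOIN rooms AS r ON r.Room = e.Room
    WHERE e.EventID = 1
""").fetchone()[0]
print(room_count, capacity_for_event_1)  # 3 18
```

Each capacity is now stored once per room, so changing a room's capacity is a single-row update.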
If we can figure out Capacity from Room, we don't need to store it in the same table. We need to, you've guessed it, split this out into its own table. So we need to pull out Capacity from this table and just keep Room. That works as long as Room can always tell us the capacity, if we have, say, a Room table. Another example of third normal form would be this, which is quite common. You'll often see an orderItems table, which has an ID and a ProductID and a UnitPrice and a Quantity and a Total, but if I look here the Total is based on Quantity times UnitPrice.
Quantity and UnitPrice are both non-key fields. We can figure out what Total is from the other fields that we have. So we rip it out. Don't store information that's easily ascertained from other non-key fields. Now, in SQL Server you can actually create what's called a computed column, a calculated field that's not really stored in the database. So if you wanted this behavior as an easy way to scan the Total, particularly when you've got complex quantities, you can do that, and I'll show you that a little later. But don't store it, because there is nothing that would stop me from storing a UnitPrice of 100, a Quantity of 3, and a Total of 75,000.
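To see that problem in action, here is a small sketch using Python's sqlite3 module. It uses an ordinary SQLite view as a stand-in for a computed column, which is not the SQL Server feature itself; the names orderItems_bad and orderItemTotals and the product ID are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Storing Total: the database accepts a value that contradicts the row.
CREATE TABLE orderItems_bad (
    ID        INTEGER PRIMARY KEY,
    ProductID INTEGER NOT NULL,
    UnitPrice INTEGER NOT NULL,
    Quantity  INTEGER NOT NULL,
    Total     INTEGER NOT NULL
);
INSERT INTO orderItems_bad VALUES (1, 42, 100, 3, 75000);

-- Third normal form: keep only the independent facts.
CREATE TABLE orderItems (
    ID        INTEGER PRIMARY KEY,
    ProductID INTEGER NOT NULL,
    UnitPrice INTEGER NOT NULL,
    Quantity  INTEGER NOT NULL
);
INSERT INTO orderItems VALUES (1, 42, 100, 3);

-- Total is derived at query time, so it cannot drift out of sync.
CREATE VIEW orderItemTotals AS
SELECT ID, ProductID, UnitPrice, Quantity,
       UnitPrice * Quantity AS Total
FROM orderItems;
""")

stored_total = conn.execute(
    "SELECT Total FROM orderItems_bad WHERE ID = 1"
).fetchone()[0]
derived_total = conn.execute(
    "SELECT Total FROM orderItemTotals WHERE ID = 1"
).fetchone()[0]

print(stored_total, derived_total)  # 75000 300
```

Nothing in the first table objects to a total of 75,000 for three items at 100 each.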
It doesn't have to make sense and we want our data to make sense. Now in fact third normal form is quite an odd one, because you will actually find that a lot of tables out there are not in third normal form. A classic example is any table that's full of address information. If I look at a table like this and I see that I've got PostalCode being stored as the last column here, well, I can figure out what the City, the State and the name of the state are from the PostalCode. That means I have non-key fields that are dependent on another non-key field.
You've probably had situations yourself where you're talking to someone on the phone and filling in an address, and they don't actually ask for the city and state. They just say, "Can I get the postal code?" because all they really need is the postal code. If I get the postal code, I can figure out the rest of it. Now AddressLine1 is not dependent on the PostalCode. So AddressLine1 isn't a problem. It's City, State and the name of the state. We could rip out that information and put it in a separate table. Now, a lot of the time we don't do that, just because it makes it easier and quicker to scan through, say, address tables, and in fact the process of taking it all the way to third normal form, ripping the stuff out and then deciding to put it back in, is what's called denormalization.
But make no mistake. If you decide to store the City and the State and the name of the state in your address table, you are storing redundant data. There is no real reason why you should store the ZIP code 91502 and the city Burbank and the state of CA and the full name California a thousand times, when you could get it all from having a ZIP code database with a city and a state in it. So you might denormalize to make things a bit more efficient, but do it knowingly. And those really are the three steps that we would go through.
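To make the address example concrete, here is a sketch of the fully normalized version using Python's sqlite3 module. It assumes, as the discussion above does, that each postal code maps to exactly one city and state. The table and column names and the street addresses are hypothetical; only 91502, Burbank, CA, and California come from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
-- City and state are stored once per postal code.
CREATE TABLE postalCodes (
    PostalCode TEXT PRIMARY KEY,
    City       TEXT NOT NULL,
    State      TEXT NOT NULL,
    StateName  TEXT NOT NULL
);
-- The address table keeps only what the postal code cannot tell us.
CREATE TABLE addresses (
    AddressID    INTEGER PRIMARY KEY,
    AddressLine1 TEXT NOT NULL,
    PostalCode   TEXT NOT NULL REFERENCES postalCodes(PostalCode)
);
INSERT INTO postalCodes VALUES ('91502', 'Burbank', 'CA', 'California');
INSERT INTO addresses (AddressLine1, PostalCode) VALUES
    ('100 Example St', '91502'),
    ('200 Sample Ave', '91502');
""")

# Rebuild the full address with a join whenever it is needed.
full_addresses = conn.execute("""
    SELECT a.AddressLine1, p.City, p.State, p.StateName, a.PostalCode
    FROM addresses AS a
    JOIN postalCodes AS p ON p.PostalCode = a.PostalCode
    ORDER BY a.AddressID
""").fetchall()

for row in full_addresses:
    print(row)
```

The join is the cost of the normalized design. Denormalizing means copying City, State, and StateName back into the address table to avoid that join, and accepting the redundancy that comes with it.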
You can take normalization even further, into what are called Boyce-Codd normal form and fourth and fifth normal forms, but that's really not very typical and I've very, very rarely run across it. We want to take our database designs through first normal form, which is about having our primary key and no repeating fields; second normal form, making sure our data is based on the whole key; and third normal form, making sure our data is based on nothing but the key. Or if you prefer, the quick mnemonic is that your data should always be based on the key, the whole key, and nothing but the key. So help me God.