Duplicate data in a database threatens the integrity of the database on a number of levels. Special care needs to be taken to maintain values at the same time across tables, or values can become inconsistent.
- [Speaker] The typical spreadsheet contains lots of columns that will contain the same piece of information over and over. When these types of structures are moved into a database environment, sometimes the same habits come along for the ride. The first problem in developing a relational database is not realizing when you're storing redundant information, and efforts need to be taken to eliminate any potential tendency towards storing duplicate data. Duplicate data are an issue for a couple of reasons. First, they slow down the system by taking up extra bandwidth across the network and extra disk space on the server.
Second, they're difficult to maintain, requiring multiple reads and writes to update records and keep things in sync. And finally, duplicated data runs the risk of becoming inconsistent. Let's take a look at how duplicates can occur and how the database's design can help us out of the situation. Here we have a simplified table that's keeping track of our customer's purchases. We have the data broken down by the invoice number, and we can see the name of the customer, their shipping address, what product they ordered and who manufactured that particular product. We have three invoices for the same customer, Bread Express, who ordered three different kinds of olive oil made by the same company, Two Trees.
So far, so good. This structure gives us all the information that we need to correctly ship the product to the customer. But what happens if Bread Express calls us up and says that they made a mistake? They recently moved and they need all the oils to go to their new storefront. In order to make that change in the database, we need to update the address on all three invoice lines. This creates a couple of opportunities to mix up our record-keeping. What happens if we only update two and miss the third? Or what happens if we type in two correctly but mistype one of them? This structure, while it initially seems like it will meet our needs, becomes difficult to manage in the long run.
The solution is to remove the components that repeat over and over again from the main invoice's table, and to place them in completely different tables. We can do this by determining which fields depend on which other fields. In our case, the address really isn't telling us information about the invoice itself, but it's about the customer that placed the order. In other words, the address depends on the customer. Wherever we see this is the case, we'll do the following: First, we'll remove the column that isn't dependent on the main topic of the table and place it in a new table.
Then, we'll make a copy of the column that it was dependent on and place that in a new table alongside. If we apply this to our invoice's table, we'll remove the "Address" column from the main table and create a new table that is more specifically about our customers. Now we have a simple way of looking up the correct address whenever we need it. When we want to know where to ship Bread Express' order, we'll look to this new customer's table we just made and we'll find the address listed there a single time. More importantly, in getting back to our example, it makes updates a breeze. We simply change the information in a single location now instead of having to go through and update every invoice.
Now you might have noticed that we have a similar situation here with our manufacturers. Although it might not be as apparent since those three invoices we're looking at are all for different products. Now you might have noticed that we have a similar situation here with our manufacturers. Although it might not be as apparent since the three invoices we're looking at are all for different products. Here, the product's manufacturer is describing the product being ordered, and is not really dependent on the invoice or the customer placing the order. I'm going to make a huge assumption here that we only have a single supplier for each type of oil. This may or may not be the case, and would be something that I need to verify before implementing the following change.
But if this is the case, then if we had more invoices showing, you would expect to see that every time First Cold Press Olive Oil was ordered that it was supplied by the Two Trees Olive Oil Company. Again, the solution is to decompose our tables into smaller groupings of information about a single topic. So we'll take out the "Manufacturer" column from the invoice's table and make a new table about "Products." Now if a manufacturer changes for a particular oil, it is a simple update to a single area of our database rather than going through every invoice and changing it there. So that's one way a well-structured database can help keep our information up-to-date.
By eliminating the redundancy that's often present in simpler spreadsheet files, such as we saw here with the customer addresses, and by making use of multiple tables to divide up the data into smaller related groups, it makes the data easier to manage in the long term.
Adam Wilbert covers the basics of relational database design, regardless of whether you use Access, FileMaker, Open Office, or SQL Server. Learn how to prevent data anomalies, gather requirements to plan your design, and develop a conceptual data model—translating your ideas into components like tables, relationships, queries, and views. Plus, learn about logical design considerations that can help you construct a database that is easy to maintain.
- What is a database management system (DBMS)?
- Moving through the database development cycle
- Preventing duplicate, inconsistent, and conflicting data entries
- Gathering requirements
- Developing relationships
- Identifying key fields
- Following a naming convention
- Developing the actual database