From the course: Database Foundations: Data Structures

Data integrity concepts

From the course: Database Foundations: Data Structures

Start my 1-month free trial

Data integrity concepts

- [Instructor] Databases are built to store data. And when you go through all of the trouble of setting up a server and preparing a database, you want to ensure that the data that eventually goes in is trustworthy and that it has integrity. But what does it mean for data to have integrity? To me, it means that I can rely on the data being accurate, complete, and consistent. With proper database design, we can build in a number of different systems that help maintain a trustworthy data repository. To do so, we need to approach database design without a well-thought-through plan, a construction blueprint of sorts, and not just start putting pieces together on the fly. By planning ahead and taking advantage of the technologies provided by modern relational database management systems, we can build a solid home for our data that protects it from becoming untrustworthy. So this is the overarching theme for this course, learning how to build a database that maintains the integrity of your valuable data. The data structures that you create to house your data will go a long way to providing these kinds of protections. Let's take a look at some examples. Accuracy describes how close a stored value is to its intended value. In an ideal world, these would be exactly the same, but we've all made typos, accidentally misspelled a word, or typed in a number incorrectly. Automated sensors can have bugs or become uncalibrated and start recording incorrect measurements. Rather than just accepting every value that's entered, the database can be built in a way that's smarter. By evaluating data for accuracy at the point of entry, you can build in validation rules and checks to help identify potentially inaccurate values and then not allow it to be stored in the first place. Completeness is a measure of how thorough every record is. Incomplete data impacts the usefulness of the entire dataset beyond just the row that might be affected. There shouldn't be any gaps or missing values in the data whenever possible. For instance, if I wanted to find out how much revenue a particular store made in March and that value is missing from the database, then not only can I not answer that specific question, but I also can't find out how much money the company made for the entire quarter or how much tax to pay for the fiscal year. Even if every other store has revenue reported, the entire dataset is useless for these kinds of larger questions, all because of one missing value. There's a trickle-down effect that happens when important values are absent from a dataset, so preventing this situation from occurring is critical to having a useful, trustworthy database. And finally, there's the problem of consistency. When you find two values in a database that should be the same but aren't, it's difficult to know which one is correct. Let's say that I have the monthly sales for a store recorded as $1,000 every day for an entire week, but then I also have the total weekly sales reported as $10,000. Clearly there's a discrepancy here where the numbers just don't add up. So did the store make $7,000 as indicated by adding up the reported daily revenues or 10,000 for the entire week? Which one of these is correct? Once you have to ask that question, you start wondering if perhaps both are wrong and the true value is something else entirely. Data discrepancies immediately call the integrity of the entire dataset into question. And you never want to be in a position where you can't separate truth from fiction. Luckily, many of these issues can be addressed through the careful design of the database right from the beginning. To address accuracy problems, constraints and rules can be added. Values must fall within an acceptable range before they're stored and follow the rules dictated by specific data types. For completeness, you can enforce required data columns and maintain relationships to data that's already stored as a way to minimize the amount of new values that need to be entered for every row. And to make sure that your data is consistent, it's possible to establish standardized data formats and lookup lists of acceptable values. For instance, names must be spelled according to a verified list. Also, you can not store the results of calculations, but instead store the raw input values only, and rerun calculations whenever you need them. With this kind of approach to database development, you can address and prevent many of the problems that make a dataset untrustworthy. Databases that are thoughtfully designed are easier to full useful information from, easier to manage, and easier to maintain.

Contents