Join Simon Allardice for an in-depth discussion in this video What are databases?, part of Programming Foundations: Databases.
- As with most things in programming, if you want to understand databases, then don't jump directly into the features of database software. First, go the other way. Realize these things were designed to solve a problem, so what's the problem? Well, that might seem like an easy one. You have some data, some information. You, your company, your clients, or even the business that you want to create, there is some data you need to store. And this data could be anything.
It could be information about customers, products, employees, orders, details about the visits to your website. This data could be in text format, it could be names and descriptions, or numeric amounts, or dates, or this could be document files or images, audio, or video. But here's the thing: you can already store data. You could just open up a text file and type your information into that, or open up a spreadsheet and do that too. If you've got documents, just organize them in folders.
Now, many small businesses would start with something like this, with a spreadsheet, and a lot will continue like this for years, and that might be OK, because just having data is not a good enough reason to need a database. Having data is not the problem. The problem is what comes next. And there's a lot of potential problems, but I'll describe six of them without even getting technical. Size of your data is a potential problem, ease of updating your data, accuracy of it, security of it, redundancy in it, importance of the data.
Let's take these one by one. First, size. Because what starts off as a small amount of data has a tendency to turn into a large amount of data. And if your spreadsheet solution is nice and speedy when there's 100 lines in it, what happens when there's 2,000,000 lines in it? Now, maybe you'd start to split that up into a bunch of different files, and then you hit a sub-problem, speed. How would you find anything? Next, there's the idea of ease of updating you data.
What happens if two people need to edit this spreadsheet at the same time? How about 20 at the same time, or 200 people at the same time? If you work with a file based system, that's just not going to happen. You'll end up with everybody overwriting everybody else's changes. Then, there's accuracy of the data. Is there anything that would actually prevent me from typing incorrect data into a spreadsheet like this, to fat finger a date as being in the past when it should be in the future, or just missing some vital piece of information out with the intention that I'll come back to that and do it later? Well, usually there's nothing that would prevent me from doing so, so as these files get older and older, they start to fill up with data that isn't accurate, it isn't consistent, it isn't trustworthy anymore.
And then there's the idea of security. Most of the time, you need to share this data, but just because it needs to be shared with someone, doesn't mean it needs to be shared with everyone, because some of this could be sensitive. It could be payroll information, health care information. So who would get to view it, and who would get to edit it, and if they do get to edit this data, is all we know just the last person who edited this file, or do we know who made every entry and every change at every point? Because with some information, health care, financial, there are laws in place to make sure you're auditing every alteration to sensitive data.
And there's the idea of redundancy, or duplication of your data, having multiple copies of the same data. Now, redundancy by itself is not a bad thing. With backups, it's good to have redundancy, but in the data itself, not so good, because it leads to conflicts. Maybe one file ends up with two entries for the same product. One says the product is $12, another says the product is $10. Which one's right? Or maybe that information is spread over a couple of different files, one owned by one department, one by another.
Which one is true? And then there's importance of your data. You've probably felt the pain of working in a spreadsheet or a word processor, where something happens. There's a crash or a disconnect, and you lose five minutes of work or an hour of work or a day. It's annoying, it's inconvenient. But if the information you just lost was yesterday's orders, or the allergies of a patient, or the details of a stock trade, or the seat allocation of a transatlantic flight that someone just paid for, that's not inconvenient that's unacceptable.
The data is critical. With many companies, your data is your entire business. You can't lose any of it, not a single change. Now, you might have one of these problems. You might have all of these problems and more besides. But these are the reasons we need a database. These are the problems. It's not about providing somewhere to put your data. We could already do that. It's about allowing that data to grow and still be manageable, still be easy to get to everything, about letting it be easily updated, even by lots of people at the same time, about keeping it accurate and internally consistent, keeping it secure and controlling access to it, who gets to read it, who gets to edit it, and tracking exactly what they did and when they did it.
It's about making sure that, while your backups might be redundant, because that data is critical, your data itself is not, there is one version of the truth, and that once you make a change to that data, you will never lose that change, you'll never lose that data. This is what a database can give us: structure. It's not a place to put your data. It's a structured system to put your data in that imposes rules upon that data, and the rules are yours, because the importance of these things changes, based on your needs.
Maybe your main problem is size, you have terabytes of information, then you'll have different priorities from someone who has much smaller amounts of data, but where every single change must be controlled, audited, and logged the moment that it's made. And this is why, if your experience is just using a database someone else has made, you haven't seen what's important. It's totally forgivable that someone whose only time working with databases is using an old FileMaker or a Microsoft Access database that they think this is what a database is.
It's like a spreadsheet, it's just more tedious to use. And that's not it at all. This is the true power of a database. It's invisible. It's the things you can't see that are going on in the background: the security, the enforced integrity of the data, the ability to get to it fast and get to it reliably, that it's robust, it can withstand lots of people at the same time, and even correctly survive crashes and hardware issues without corrupting the data. And that's what we need to do here, understand how to describe our structure and define those rules, so all this invisible stuff will actually happen.
Now, up to this point, I haven't gone into any particular database software or even any particular type of database, because there are different types, and some of them are better at one or more of these problems, and we'll get there, but there's a couple more terms to cover before we do.
- What is a database?
- Why do you need a database?
- Choosing primary keys
- Identifying columns and selecting data types
- Defining relationships: one-to-one, one-to-many, and many-to-many
- Understanding normalization
- Creating queries to create, insert, update, and delete data
- Understanding indexing and stored procedures
- Exploring your database options