Learn which cost considerations you'll have to weigh when implementing a big data system. While it's true that these systems were designed to run on commodity (i.e., cheap) hardware, they often require many servers to run efficiently, which increases the cost.
- [Narrator] Now let's think about the cost of big data and the myth that big data is, for some reason, cheap. Let's look at the deployment models and think about how we can actually deploy a Hadoop system. Hadoop is, again, just one of the most popular big data platforms, and that's why I'm using it here. It's the oldest and we have the most data about it. The on-premise installation option is one where you physically stand this up in your data center. If you're doing it right and you want the system to stay reliable if your data center fails, maybe because of an earthquake or a flood or something like that, you're going to want to replicate that data into multiple data centers.
So at the minimum, you're talking about a number of servers in your cluster, maybe 12 or more, really just to get to the base level. And if each server costs, let's say, $40,000, which is a fair price for a server, then 12 of them already puts you at close to half a million dollars ($480,000) before licensing fees, implementation costs, power, and all that stuff. If we move over a little bit to the right and go a step closer towards the cloud, there's the Hadoop appliance option, which is also an on-premise option. Here you get a hardware solution that's already built for you.
The software is installed, everything is wired in. They basically roll in a rack of servers or just install 'em into some racks that you already have set up. Turn it on, connect it to your network, and boom, you're up and running. So that saves you a little bit of time on the installation, configuration, and all that, but it's still an expensive solution for you. Then, getting closer and closer to the cloud, you can host things online. Here you just click a couple of buttons, choose the size of your instances on something like Amazon's EC2, and then you have Hadoop up and running.
So this again is something where you're just hosting it somewhere instead of having the physical machines yourself. You're essentially having Amazon or Google or Microsoft or whoever do that for you. And that's fine. But you're still going to need at least a dozen servers to really do it right. And then on the far side you have Hadoop as a service. This is an interesting one, because you don't actually have control of the servers or the configuration. You simply start throwing data at this Hadoop service, and it handles it, does its own thing, and scales appropriately. Now, if we take a look at a recent study that was done at the Apache Hadoop summit, you can see how the cost, without considering risk, plays out for these different implementation scenarios.
Hadoop as a service comes out to just over four million dollars over three years, on-premise to about five and a half, and Amazon EMR, which would be the Hadoop as a service, comes close to seven. The Hadoop distribution on EC2, that's where they host the servers for you, is just slightly over, right there at seven million dollars over three years. Not really cheap when you consider the cost of some of the other services you may be looking at. Now, throw in risk: think about the vendor's viability and whether or not that's going to work for you, architectural control, data protection, loss of intellectual property, or loss of privacy.
If you throw those things into your risk model and calculate it, then things jump up quite a bit. So here you can see all of those things are almost doubled. I mean, some of these are approaching 120 million dollars over three years. So it's something to consider: the cost can be significant. Now, that said, if you have a very small amount of data, you really don't need to be going down this route just yet. You really should start by asking, is this even necessary?
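As a rough check on the on-premise hardware arithmetic from earlier in the narration, here is a minimal Python sketch. It uses the narrated figures of 12 servers at $40,000 each. The function name `cluster_hardware_cost` and the `data_centers` multiplier are hypothetical, and the idea that replicating into a second data center means buying a second full cluster is an assumption made only for illustration, not a figure from the study.

```python
def cluster_hardware_cost(servers: int, cost_per_server: int, data_centers: int = 1) -> int:
    """Up-front hardware cost in dollars.

    Excludes licensing fees, implementation costs, and power.
    Assumes (hypothetically) that each data center holds a full copy of the cluster.
    """
    return servers * cost_per_server * data_centers


# Narrated figures: a minimum of about 12 servers at roughly $40,000 each.
base = cluster_hardware_cost(servers=12, cost_per_server=40_000)

# Assumption for illustration: one extra data center doubles the hardware.
replicated = cluster_hardware_cost(servers=12, cost_per_server=40_000, data_centers=2)

print(f"Single data center: ${base:,}")
print(f"Two data centers:   ${replicated:,}")
```

Under these assumptions the hardware alone comes to $480,000 for one data center and $960,000 for two, before any of the other costs mentioned in the narration.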