Understand different methods of setting up big data systems both on-premise and on the cloud.
- [Narrator] I've been hearing a lot lately that big data systems are easy to set up, that we've figured it out. There are some ways to optimize, and it's certainly a lot easier now than it was in previous years, but really I feel that this is a myth. Big data systems aren't really that easy to set up when you consider the full spectrum of what somebody might consider easy. First, let's take a look at Cloud setup considerations. With Cloud setups, there are things to consider, like whether you use a service, meaning Hadoop as a service, such as EMR, which is Amazon's Elastic MapReduce service, or whether you just host the machines, so instead of having your own physical hardware, you have someone else run it for you.
Amazon, Google, and Microsoft all offer both of these options. The next thing to think about with Cloud setup is the response times and the backup. These are usually covered by SLAs, or Service Level Agreements, and they're really critical when you decide how to go about your Cloud setup, because different regions and different types of environments will have different SLAs, which get you different results. So, when you do your needs assessment and try to figure out what the needs are for your big data platform, you're really going to want to focus on the response times and the backup protocols.
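As a rough illustration of that needs assessment, here is a minimal Python sketch that checks which regions' SLAs satisfy a required response time and backup interval. Everything in it is an assumption made for the example: the `Sla` class, its fields, the region names, and the numbers are hypothetical and do not come from any provider's actual agreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    """Hypothetical SLA terms; field names and units are illustrative only."""
    max_response_ms: int        # slowest response time the SLA allows
    backup_interval_hours: int  # longest gap between backups


def meets_needs(offered: Sla, required: Sla) -> bool:
    """True when the offered SLA is at least as strict as the requirement."""
    return (
        offered.max_response_ms <= required.max_response_ms
        and offered.backup_interval_hours <= required.backup_interval_hours
    )


def acceptable_regions(offers: dict[str, Sla], required: Sla) -> list[str]:
    """Return the sorted names of regions whose SLA satisfies the requirement."""
    return sorted(name for name, sla in offers.items() if meets_needs(sla, required))


# Made-up numbers for illustration; real terms come from the provider's contract.
offers = {
    "region-a": Sla(max_response_ms=200, backup_interval_hours=24),
    "region-b": Sla(max_response_ms=500, backup_interval_hours=12),
    "region-c": Sla(max_response_ms=150, backup_interval_hours=6),
}
required = Sla(max_response_ms=250, backup_interval_hours=24)

print(acceptable_regions(offers, required))  # ['region-a', 'region-c']
```

Here region-b drops out because its response time is too slow, even though its backups are more frequent than required. Both criteria have to hold, which is why it helps to write the needs down before comparing offerings.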
You also need to think about zoning. Recently, some legislation was passed in the European Union that changed some of the rules about where data can physically reside. If you have clients in Europe and you have sensitive, personal information about them, that data needs to reside within the EU. So, while the Cloud does make this easier because you can simply choose the right zone, it may complicate things in terms of your configuration and how you actually access that data from different parts of the world. The last thing, as I just mentioned, is access.
How are people actually going to access the data? Are they going to use a BI or analytics tool? Are they going to write a program against it with something like Java or .NET or Python? Or do they want to use SQL? The platform you choose should have some nice, friendly interface for them to use. Not all platforms offer all of those things, so, depending on what your needs are, you're going to need to focus on those considerations. Now, the on-premise side, where you actually have the hardware in your physical location, is a lot more involved and, of course, not something that is very easy.
First off, you have to choose the right hardware and, depending on how big your company is, you may have some other constraints from the data center team that you have to work around. You have to figure out the physical locations, so where should those systems actually be located? Just like when you choose a zone for your Cloud environment, you're going to need to think about where the data centers are located. What if a data center catches fire? Will that totally take your system down, or is it replicated to a different location? Also, with a physical environment, you have a unique set of considerations around physical security. Who actually has access to the building? When and how are those things monitored? Do we have records, and can we actually find that information out if there ever was a breach and it was determined that it happened at the physical location? Lastly, there's network access.
So, sometimes, when you're really doing this, it can complicate things, because how your network is set up and how your big data platform is set up need to work together. Your systems need to be able to talk to the platform, and so do the people in your company. So, that's fine and good, but you may want to have an application that lives outside of your firewall that also has access to this. So how do you set that up? An on-premise installation can be a bit more complicated than your Cloud one, so you really have to think about that before you go forward with an on-premise installation.