Join Alan Simon for an in-depth discussion in this video, Countering the anti-big-data argument, part of Transitioning from Data Warehousing to Big Data.
- Even if all of your analysis work clearly tells you that Big Data is the next logical step for your organization, others in your organization, or maybe even you, might still have some misconceptions about Big Data and Hadoop as that next logical evolution of today's Data Warehousing. Let's take a look at some of those misconceptions and explore what the real story is behind each of them. Many people equate Data Warehousing with Relational Databases, and see the two as inseparable. Their argument is that the only possible platform for a Data Warehouse is a Relational Database such as SQL Server, or Oracle, or DB2.
The reality is that dating back to the earliest days of modern Data Warehousing, we've always had different platforms, not just Relational Technology. In the 1990s, we saw many Data Warehouses and Data Marts built on multi-dimensional technology. And for a number of years, one of the strongest arguments in this world of Business Intelligence and Data Warehousing was MOLAP vs. ROLAP, or Multi-dimensional OLAP vs. Relational OLAP.
Eventually, relational technology won out. And then multi-dimensional cubes wound up being folded into the relational technology. But then, once the mid-2000s arrived, we started trying to build larger and larger Data Warehouses, to the point where our Relational Databases started to become strained under the volume of data we were trying to work with. Technologies such as Data Warehousing Appliances, or special databases that looked relational but really used different underlying platforms, started to become popular.
And they, in turn, wound up giving way to Hadoop and other Big Data solutions. The important thing to remember is that our BI tools and applications interface with our Data Warehouses through SQL. So underneath that layer, we could have tried-and-true relational technology, or newer Hadoop platforms. Think of it this way: when you get into a car to go to the grocery store, you're going to interface with that car to steer it the same way, and use the foot pedals the same way, whether the power plant for that car is a traditional gasoline engine, or all electric, or maybe a hybrid.
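To make the point about the SQL layer concrete, here is a minimal sketch in Python. It assumes the reporting code talks to the warehouse only through the standard Python DB-API 2.0 interface (PEP 249). The `sales` table, the query, and the function names are hypothetical, and an in-memory SQLite database stands in for whatever engine actually sits underneath.

```python
import sqlite3

# Hypothetical report query; the table and column names are illustrative only.
REPORT_SQL = """
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
"""


def run_report(conn):
    """Run the report through any DB-API 2.0 (PEP 249) connection.

    Nothing here depends on which engine produced `conn`.
    """
    cur = conn.cursor()
    cur.execute(REPORT_SQL)
    return cur.fetchall()


def make_demo_backend():
    """Build a stand-in backend: an in-memory SQLite database with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("East", 100.0), ("West", 250.0), ("East", 75.0)],
    )
    return conn


print(run_report(make_demo_backend()))
```

In this sketch, `run_report` would not need to change if the connection came from a DB-API-compliant driver for a relational warehouse or for a SQL-on-Hadoop engine, provided that engine accepts the same SQL. That assumption does not always hold, because SQL dialects differ in places.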
What's underneath that layer, through which we interface with the data, shouldn't matter, whether it's relational or Hadoop. Many people still believe that SQL running on top of Hadoop is excruciatingly slow, making it unusable for business intelligence, and unsuitable for advanced Analytics. We've seen how the first generation of Hadoop technology had a Data Warehousing infrastructure called Hive, and a language called HiveQL. Queries wound up passing through the MapReduce system, which did adversely impact performance in many situations.
However, in the second generation of Hadoop, many of the Hadoop vendors are creating their own solutions that bypass MapReduce, and therefore dramatically increase the performance of their SQL interfaces to the underlying HDFS, the Hadoop Distributed File System. Another common misconception that goes back to the earliest days of Data Warehousing is that Data Warehouses should only contain extremely high quality Structured Data. The way we built Data Warehouses in the early '90s did require this.
But this is really an artifact of using Relational Databases, and of the way we built Data Warehouses in the early 1990s. If we look at Business Intelligence and Analytics needs today, our data-driven insights often require unstructured and semi-structured data. And the way we architect Data Warehouses and Big Data environments these days, we have a higher tolerance for data that's not necessarily as clean as it could be. So we still bring that data in as quickly as possible, and then run it through different layers within our overall architecture.
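The idea of bringing data in quickly and refining it in later layers can be sketched as follows. This is a minimal illustration, not a reference architecture. It assumes a two-layer design with a raw landing table and a cleaned table, and it uses an in-memory SQLite database as a stand-in for the real platform. The table names, sample rows, and cleaning rules are all hypothetical.

```python
import sqlite3

# Hypothetical incoming records, some of them dirty.
RAW_ROWS = [
    ("1001", " East ", "100.50"),
    ("1002", "west", "n/a"),   # unparseable amount, still loaded as-is
    ("1003", "East", "75"),
    ("1004", "", "20"),        # missing region, still loaded as-is
]


def load_raw(conn, rows):
    """Load step: land every record untouched, as text, in the raw layer."""
    conn.execute(
        "CREATE TABLE raw_sales (order_id TEXT, region TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)


def build_clean_layer(conn):
    """Transform step: derive a cleaned layer inside the data store itself."""
    conn.execute(
        """
        CREATE TABLE clean_sales AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               UPPER(TRIM(region))       AS region,
               CAST(amount AS REAL)      AS amount
        FROM raw_sales
        WHERE TRIM(region) <> ''        -- drop rows with no region
          AND amount GLOB '[0-9]*'      -- keep amounts that start with a digit
        """
    )


def run_elt():
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_ROWS)
    build_clean_layer(conn)
    return conn


conn = run_elt()
print(conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0])
print(conn.execute("SELECT * FROM clean_sales ORDER BY order_id").fetchall())
```

The design choice worth noting is that the dirty rows are never rejected at load time. They stay in the raw layer, so a later change to the cleaning rules can be applied without going back to the source systems.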
Many people believe that Hadoop is far too difficult to use and manage when compared to our Data Warehousing and BI tools. But as we mentioned earlier, the newest generation of Hadoop technology is far easier to use, and in fact, far more robust than the first generation of technology. And the usability and maintainability gap between Data Warehousing and Big Data is rapidly closing. Additionally, those with Business Intelligence and Data Warehousing skills can much more readily transfer those skills into this new world of Big Data than they could in the past.
Another misconception is that Hadoop has to be a multi-million dollar solution, suitable only for the largest companies and governmental agencies, and that it's far too expensive to use for Data Warehousing and Data Marts, especially for smaller companies. The reality, though, is that there are many lower-cost, cloud-based Hadoop solutions that have some version of a subscription model, and even small and medium-sized companies can take advantage of this new technology without having to invest millions of dollars up front.
Hadoop, of course, is one of the major Big Data platforms. But many organizations look at the volumes of data they have and think that there is no way that they can justify a Big Data solution, whether Hadoop or any other platform. What's important to understand, though, is that the paradigm of using Hadoop and Big Data, ingesting data as quickly as possible, ELT (extract, load, transform) instead of ETL (extract, transform, load), and all the things that we've seen, may still make sense for your organization even if you don't have petabytes of data.
And then, coupled with lower-cost subscription models, you can still start to take advantage of Hadoop for advanced Analytics without having to make significant investments, even if you don't have incredibly large volumes of data. The Hadoop platform is infinitely scalable. And it's very easy to start small and then expand as needed. Many people still see Business Intelligence and Data Mining, our Predictive and Discovery Analytics, as two totally different disciplines, and therefore believe they should be hosted and maintained in two totally different environments.
It's important to remember, though, that modern Business Intelligence and Analytics should be thought of as a continuum that includes Descriptive, Predictive, Discovery, and Prescriptive Analytics. And if all we have are Descriptive Analytics, our traditional Business Intelligence, without the other models, then the insights we gain are far less actionable than if we are able to support that entire continuum. Make sure you base your arguments for your architecture and your underlying technology on facts, not misconceptions.
Think about what you need to drive actionable insights for your organization, and that will take you in the right direction.
- Exploring big data, Hadoop, and analytics
- Examining the shortcomings of traditional data warehousing
- Comparing big data architectures for next-generation data warehousing
- Understanding alternatives
- Building a roadmap
- Managing big data-driven projects
- Monitoring and measuring success