Join Alan Simon for an in-depth discussion in this video Managing a big-data-driven DW project, part of Transitioning from Data Warehousing to Big Data.
- Even with all of the advanced and extremely powerful technology you'll be working with, the years of best practices gathered from millions of project management efforts still apply to implementing big data and analytics. Beyond these project management fundamentals, though, you need to be very aware of specific concerns that usually rise to the surface any time an organization embarks on a significant paradigm shift, as you'll be doing here by transitioning from traditional data warehousing to big data. A paradigm shift means having to think and act differently, even though on the surface there may be a number of similarities between the old, traditional ways of doing things and what you now want to accomplish. So whereas a data warehouse and a big data environment both have a similar purpose, to consolidate and integrate data, we need to think and act very differently for big data than we did for data warehousing.
Specifically, you need to pay attention to the following areas to make sure that everyone on your team is on board with what you need to accomplish and how to do so. You need to make sure that what you build out does not look too much like traditional data warehousing. You want to avoid a common problem with big data, where your information is not cleansed or standardized at all. You want to make sure that what you eventually develop is not too heavily weighted toward traditional business intelligence, or descriptive analytics.
You want to pay special attention to the data fragmentation problem and try to make it go away as quickly as possible. And finally, you want to avoid a common problem with reporting and analytics: the lack of follow-through to prescriptive actions. Let's look at each of these in a little more detail. One of the fundamental premises of data warehousing is being very selective about the data we bring in and consolidate. We've seen how one family of data warehousing methodologies is based on identifying specific reports, dashboards, and other uses of our data, and even the other family of methodologies, those that are data-first in nature, still requires some degree of selectivity in the data we bring in.
We need to be careful not to compromise our big data efforts by falling back into these practices. We're practicing ELT, not ETL, which means we should avoid needing approvals, validations, and other governance for every piece of data we go after. We want to be very fast and very agile as we bring in data, and not fall back on our old practices. At the same time, there's somewhat of a myth about big data that we no longer need to standardize or cleanse our data at all once it arrives in Hadoop.
The truth is that despite embracing the ELT paradigm of ingesting data as quickly as possible, we still need to focus on master data management, data standardization, and data quality every bit as much; we just do so within the Hadoop environment rather than before the data lands in our target system. If we don't take care of data quality and standardization, we'll soon be facing issues with our reports and dashboards, and wind up putting the entire big data effort at risk. Many of our team members may have deep expertise in data warehousing and business intelligence, which is good in some ways, but could also be a problem if they wind up focusing almost totally on "tell me what happened" types of reports.
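The ELT pattern described above, land the data fast and untouched, then standardize and quality-check it inside the target environment, can be sketched in plain Python. This is a minimal illustration only: the record fields, the state-code table, and the quality rule are all invented here, and a real implementation would run inside Hadoop-based tooling rather than in-memory lists.

```python
# Hypothetical raw source records -- loaded as-is, warts and all.
RAW_SOURCE = [
    {"customer": "  Acme Corp ", "state": "Calif.", "revenue": "1200"},
    {"customer": "Acme Corp", "state": "CA", "revenue": "950"},
    {"customer": "Beta LLC", "state": "tx", "revenue": None},
]

# Illustrative master-data lookup for standardizing state codes.
STATE_CODES = {"calif.": "CA", "california": "CA", "ca": "CA", "tx": "TX"}

def extract_and_load(source):
    """ELT 'E' and 'L': land the data untouched. No approvals or
    transformations gate the ingest, so it stays fast and agile."""
    return list(source)

def transform_in_target(landed):
    """ELT 'T': standardization and data-quality checks run *after*
    landing, inside the target environment."""
    clean, rejects = [], []
    for rec in landed:
        if rec["revenue"] is None:  # data-quality rule: revenue required
            rejects.append(rec)
            continue
        clean.append({
            "customer": rec["customer"].strip(),  # standardize names
            "state": STATE_CODES.get(rec["state"].lower(), rec["state"]),
            "revenue": float(rec["revenue"]),
        })
    return clean, rejects

landed = extract_and_load(RAW_SOURCE)
clean, rejects = transform_in_target(landed)
```

The key point mirrored here is the ordering: nothing blocks the load, but the cleansing still happens, inside the target rather than in front of it.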
That's traditional BI, or what's now known as descriptive analytics. They may wind up producing only scattered or fragmented predictive analytics, and little or no discovery analytics, so we need to guard against this particular problem and make sure they focus on the entire analytics continuum. Addressing data fragmentation is something we have tried to do going back to the earliest days of data warehousing, and with big data, we may still find ourselves facing a fragmentation problem across our enterprise.
We may see significant usage of existing data marts and spreadmarts, or data marts built on top of spreadsheets, and in fact we may see new data marts and spreadmarts popping up in different business organizations. We'll also wind up seeing analytic silos in SAS or SPSS, which is where the predictive and discovery analytics are occurring. We want to be on guard against this, because it could significantly undermine the overall efficiency and cost effectiveness of our big data effort.
Data fragmentation may also occur even within our big data environment, where we produce many different departmental reports and analytics but don't do a very good job of cross-functional reporting and analysis. And finally, one of the important things we're after with big data is carrying our data-driven insights through to decisions, and then making sure that we take appropriate actions. Very often, we do a decent job of producing predictive and discovery analytics that tell us what's likely to happen, or that surface interesting and important things in our data, but we wind up not doing anything with them.
We might wind up saying, time and again, "If we had only known," or, "Maybe we did know something, but why didn't we do something about it?" So we want to guard against these problems as well. Here are some critical success factors to help deal with these issues. We want to make sure that everyone who's working with the business, as the environment is being built out, or even after it's operational and we're developing new analytics, is advocating the entire analytics continuum: descriptive, predictive, and discovery, looking into the past, the present, the future, and the unknown.
We also want them driving all of those analytics through to logical prescriptive end points, so that actions are taken on the data-driven insights and the decisions being made. We want to make sure that data fragmentation is top of mind: we want to do our best to prevent new data marts and spreadmarts from popping up, and we want a retirement plan for our existing data marts and spreadmarts as a key part of our overall road map. We want to make sure that our architecture and our road map are aligned with key business imperatives such as strategic sourcing, supply chain re-engineering, or major business process work. And finally, we need some sort of a chief data officer, either by official title or at least in a de facto role, someone who oversees the entire effort.
Focusing on these critical success factors will help us significantly corral the typical issues that we see whenever we're involved in a paradigm shift such as moving from data warehousing to big data.
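The analytics continuum the talk keeps returning to, descriptive, predictive, and discovery, can be illustrated with a deliberately tiny Python sketch. The sales figures and thresholds below are invented for illustration; real work would run over governed data in the big data environment, not an in-memory list.

```python
# Hypothetical monthly sales for one product line.
sales = [100, 104, 98, 103, 180, 101]

# Descriptive analytics ("tell me what happened"): summarize the past.
average = sum(sales) / len(sales)

# Predictive analytics ("what is likely to happen"): a naive
# least-squares trend line, extrapolated one period ahead.
n = len(sales)
x_mean = (n - 1) / 2
slope = (sum((i - x_mean) * (y - average) for i, y in enumerate(sales))
         / sum((i - x_mean) ** 2 for i in range(n)))
forecast = average + slope * (n - x_mean)

# Discovery analytics ("show me what I didn't know to ask about"):
# flag values sitting far from the median.
median = sorted(sales)[n // 2]
anomalies = [y for y in sales if abs(y - median) > 0.5 * median]
```

The prescriptive follow-through the talk warns about is exactly what happens next: the flagged anomaly should trigger a decision and an action, not just another report.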