Learn the backstory behind Spark and how it was created as a project at Berkeley only a few years ago.
- [Instructor] Let's take a look now at the origins of Spark. To trace our modern data ecosystem, which started with Hadoop, we have to go back to 2003. That year, some developers began working on a new idea: an open distributed computing platform, a project they called Nutch. In 2006 those same developers were hired by Yahoo, and the work was released as an open source project called Hadoop. Around the same time, Google published a paper describing MapReduce, its programming model for processing large datasets, and Hadoop included a Java implementation of that model.
As Hadoop grew in popularity as a way to store massive volumes of data, another startup with massive volumes of data, Facebook, wanted to give its data scientists and analysts an easier way to work with the data in Hadoop. So they created Hive. At this point we have two ways of interacting with Hadoop data: MapReduce, which is Java-based, batch-oriented, and pretty slow, and Hive, which is basically a SQL abstraction on top of MapReduce. So while Hive makes the queries easier to write, the same issues of being batch-oriented and slow persisted.
This is where Spark was born. At this point lots of tech companies were adopting Hadoop and other big data solutions, yet there weren't really any great interfaces for data scientists to use. So in 2009, a few folks at UC Berkeley started a new project, one not dependent on Hadoop, to give data scientists easier access to big data. This is the actual inception of Spark. A year later the team open sourced Spark under the BSD license, and the world was introduced to it.
Three years later, in 2013, the team donated the code to Apache, but it wasn't until 2014 that Spark became an official top-level Apache project, which is a big deal. It means that Spark now has a huge community of data professionals supporting and evolving the platform.
- Understanding Spark
- Reviewing Spark components
- Where Spark shines
- Understanding data interfaces
- Working with text files
- Loading CSV data into DataFrames
- Using Spark SQL to analyze data
- Running machine learning algorithms using MLlib
- Querying streaming data
- Connecting BI tools to Spark