Learn about file systems used with Hadoop. These include HDFS and cloud-based data lakes such as AWS S3 and Google Cloud Storage (GCS).
- [Instructor] Let's talk a little bit more about modern file systems for Hadoop. Core HDFS has not changed much, even with some of the newer Apache distributions. At the time I recorded my Hadoop fundamentals course, we were on Apache distribution 2.5, and the most current as of this recording is 2.7. There have been some improvements to the core distribution around enterprise needs such as encryption. However, the core remains much the same, and it's used for the same kinds of batch jobs and extract, transform, and load (ETL) jobs that it traditionally has been.
What's new in file systems for Hadoop is this idea of cloud-based file systems. For Amazon, that's S3. For Google, that's Google Cloud Storage. And for Microsoft Azure, that's Blob storage. This is sometimes called a data lake, and it's really changing the landscape of Hadoop because it removes the need to take data that you pull in to a file location and then move it over to HDFS. It enables more use of Hadoop because there are fewer steps in the process and because cloud-based file systems are cheaper for storing massive amounts of information than HDFS, whether HDFS runs locally or in the cloud.
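To make the idea of reading data where it already lives a little more concrete, here is a minimal Python sketch. It is not from the course. It assumes the URI schemes commonly used by Hadoop connectors (`hdfs`, `s3a`, `gs`, `wasbs`), and the bucket, host, and path names are hypothetical. The point is only that a job can be pointed at a cloud location by URI instead of first copying the files into HDFS.

```python
from urllib.parse import urlparse

# Assumption: these are the URI schemes commonly used by Hadoop connectors.
# The mapping is illustrative, not an exhaustive or official list.
SCHEME_TO_STORE = {
    "hdfs": "HDFS",
    "s3a": "Amazon S3",
    "gs": "Google Cloud Storage",
    "wasbs": "Azure Blob storage",
}


def describe_location(uri):
    """Return (store name, container or host, path) for a storage URI."""
    parsed = urlparse(uri)
    store = SCHEME_TO_STORE.get(parsed.scheme)
    if store is None:
        raise ValueError(f"unrecognized storage scheme: {parsed.scheme!r}")
    return store, parsed.netloc, parsed.path


def is_cloud_store(uri):
    """True when the URI points at a cloud object store rather than HDFS."""
    store, _, _ = describe_location(uri)
    return store != "HDFS"


if __name__ == "__main__":
    # Hypothetical locations, used only for illustration.
    for uri in (
        "hdfs://namenode:8020/data/logs/",
        "s3a://example-bucket/data/logs/",
        "gs://example-bucket/data/logs/",
    ):
        print(uri, "->", describe_location(uri), "cloud:", is_cloud_store(uri))
```

In a real job, the URI would be handed to whatever engine reads the data; the helper functions here only illustrate how the location is expressed.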
So this is a really important change in the world of Hadoop, and I see more and more customers interested in applying Hadoop-style processes, such as MapReduce or some of the newer processes like Spark jobs, to files that they traditionally would just keep in S3, using some other mechanism to query and look at them. Examples of these files would be log files from all the nodes of their networks globally. Another example would be event files from IoT devices. Another example would be medical information.
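As a rough illustration of what a MapReduce-style process over log files looks like, here is a small sketch in plain Python. It is a stand-in for the pattern, not Hadoop or Spark code, and the log format (`node level message`) and the sample lines are hypothetical. The map step turns each line into a count of one for its (node, level) pair, and the reduce step merges those counts.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines in an assumed "node level message" format.
LOG_LINES = [
    "node-1 ERROR disk full",
    "node-2 INFO heartbeat",
    "node-1 INFO heartbeat",
    "node-3 ERROR timeout",
    "node-1 ERROR disk full",
]


def map_line(line):
    """Map step: emit a count of 1 keyed by (node, level)."""
    node, level, _message = line.split(" ", 2)
    return Counter({(node, level): 1})


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def count_by_node_and_level(lines):
    """Run the map step on every line, then fold the results together."""
    return reduce(reduce_counts, map(map_line, lines), Counter())


if __name__ == "__main__":
    for key, count in sorted(count_by_node_and_level(LOG_LINES).items()):
        print(key, count)
```

On a cluster, the map and reduce steps would run in parallel across many files; here they run in a single process so the shape of the computation is easy to see.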
In addition to the cloud-based file systems, certain commercial vendors are creating enhancements to the HDFS file system and offering alternatives. We're going to be focusing on Databricks, who I think is really leading the industry now. Their file system is the Databricks File System, and we'll be working with that in this course as well.
Released 7/5/2017
- Relate which file system is typically used with Hadoop.
- Explain the differences between Apache and commercial Hadoop distributions.
- Cite how to set up an IDE: VS Code + Python extension.
- Relate the value of Databricks community edition.
- Compare YARN vs. Standalone.
- Review various streaming options.
- Recall how to select your programming language.
- Describe the Databricks environment.
Skill Level Intermediate
Video: File system used with Hadoop