In this movie, you'll learn the concept of distributed storage.
- [Voiceover] HDFS is the Hadoop Distributed File System. The distributed part just means that it can run across multiple machines. In this particular course, we're going to use a virtual machine that comes with a single DataNode and a single NameNode. Typically, clusters come with a replication factor of three, and all that means is that there are three copies of every single block. In our case, there will be just one copy of every block, living on a sole DataNode. We're not very worried about redundancy or data loss here, because with a single DataNode the probability of a failure is fairly low.

To start off with some terminology: a DataNode in Hadoop, or in HDFS rather, is what contains the blocks for the file system, the actual data that you're reading off of the cluster. A NameNode doesn't actually hold any of that data, but it knows where all of that data is: it contains the mappings between each block and the DataNode that block exists on. So, to summarize, the NameNode knows where everything is, and the DataNodes hold the actual data.
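The replication factor described above is controlled per cluster in the `hdfs-site.xml` configuration file. As a sketch, a single-DataNode setup like the course's virtual machine would typically set `dfs.replication` to 1, while a production cluster would leave it at the default of 3 (the file paths and surrounding values here are illustrative, not taken from the course VM):

```xml
<!-- hdfs-site.xml: illustrative fragment for a single-DataNode setup -->
<configuration>
  <property>
    <!-- Number of copies HDFS keeps of each block; default is 3 -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Once a cluster is running, you can inspect the actual block-to-DataNode mappings the NameNode maintains with `hdfs fsck <path> -files -blocks -locations`, which reports each file's blocks, their replication, and which DataNodes hold them.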
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop distributed file system (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administering clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups