Hadoop 2 Quick-Start Guide

Douglas Eadline  
October 2015


An easy, accessible guide to Big Data technology, this book covers all the basics students need to know to install and use Hadoop 2 on both personal computers and servers, and to navigate the entire Apache Hadoop ecosystem. Hadoop 2 is demystified: this guide explains the problems Hadoop solves, shows how it relates to Big Data, and demonstrates how both administrators and users work with it. From its Getting Started checklist/flowchart to its roadmap of additional resources, Hadoop 2 Quick-Start Guide is the perfect Hadoop 2 starting point for students to master Big Data.


  • Helps students get Hadoop up and running fast with clear, well-tested beginner-level instructions and examples
  • Includes hands-on coverage of HDFS, running programs, benchmarking, MapReduce, higher-level tools, YARN, administration, and more
  • Demystifies Hadoop 2

Table of Contents

Foreword    xi

Preface xiii

Acknowledgments    xix

About the Author xxi


Chapter 1: Background and Concepts    1

Defining Apache Hadoop  1

A Brief History of Apache Hadoop  3

Defining Big Data  4

Hadoop as a Data Lake  5

Using Hadoop: Administrator, User, or Both  6

First There Was MapReduce  7

Moving Beyond MapReduce with Hadoop V2   13

The Apache Hadoop Project Ecosystem   15

Summary and Additional Resources   18


Chapter 2: Installation Recipes    19

Core Hadoop Services   19

Planning Your Resources   21

Installing on a Desktop or Laptop   23

Installing Hadoop with Ambari   40

Installing Hadoop in the Cloud Using Apache Whirr   56

Summary and Additional Resources   62


Chapter 3: Hadoop Distributed File System Basics 63

Hadoop Distributed File System Design Features   63

HDFS Components   64

HDFS User Commands   72

HDFS Web GUI   77

Using HDFS in Programs   77

Summary and Additional Resources   83


Chapter 4: Running Example Programs and Benchmarks 85

Running MapReduce Examples   85

Running Basic Hadoop Benchmarks   95

Summary and Additional Resources   98


Chapter 5: Hadoop MapReduce Framework    101

The MapReduce Model   101

MapReduce Parallel Data Flow   104

Fault Tolerance and Speculative Execution   107

Summary and Additional Resources   109


Chapter 6: MapReduce Programming 111

Compiling and Running the Hadoop WordCount Example   111

Using the Streaming Interface   116

Using the Pipes Interface   119

Compiling and Running the Hadoop Grep Chaining Example   121

Debugging MapReduce   124

Summary and Additional Resources   128


Chapter 7: Essential Hadoop Tools    131

Using Apache Pig   131

Using Apache Hive   134

Using Apache Sqoop to Acquire Relational Data   139

Using Apache Flume to Acquire Data Streams   148

Manage Hadoop Workflows with Apache Oozie   154

Using Apache HBase   163

Summary and Additional Resources   169


Chapter 8: Hadoop YARN Applications 171

YARN Distributed-Shell   171

Using the YARN Distributed-Shell   172

Structure of YARN Applications   178

YARN Application Frameworks   179

Summary and Additional Resources   184


Chapter 9: Managing Hadoop with Apache Ambari 185

Quick Tour of Apache Ambari   186

Managing Hadoop Services   194

Changing Hadoop Properties   198

Summary and Additional Resources   204


Chapter 10: Basic Hadoop Administration Procedures   205

Basic Hadoop YARN Administration   206

Basic HDFS Administration   208

Capacity Scheduler Background   220

Hadoop Version 2 MapReduce Compatibility   222

Summary and Additional Resources   225


Appendix A: Book Webpage and Code Download 227


Appendix B: Getting Started Flowchart and Troubleshooting Guide    229

Getting Started Flowchart   229

General Hadoop Troubleshooting Guide   229


Appendix C: Summary of Apache Hadoop Resources by Topic 243

General Hadoop Information   243

Hadoop Installation Recipes   243

HDFS   244

Examples   244

MapReduce   245

MapReduce Programming   245

Essential Tools   245

YARN Application Frameworks   246

Ambari Administration   246

Basic Hadoop Administration   247


Appendix D: Installing the Hue Hadoop GUI    249

Hue Installation   249

Starting Hue   253

Hue User Interface   253


Appendix E: Installing Apache Spark   257

Spark Installation on a Cluster   257

Starting Spark across the Cluster   258

Installing and Starting Spark on the Pseudo-distributed Single-Node Installation   260

Run Spark Examples   260


Index   261



Douglas Eadline began his career as a practitioner and a chronicler of the Linux cluster HPC revolution and now documents Big Data analytics. Starting with the first Beowulf Cluster how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering virtually all aspects of High Performance Computing (HPC). Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld Magazine, and was senior HPC editor for Linux Magazine. Currently, he is a writer and consultant to the HPC/Data Analytics industry and leader of the Limulus Personal Cluster Project (limulus.basement-supercomputing.com). He authored Hadoop Fundamentals LiveLessons, Second Edition (2015), and Apache Hadoop YARN LiveLessons (2014), and is coauthor of Apache Hadoop™ YARN (2014), all from Addison-Wesley.