Beyond Storage

BigData Investigation 10 – Using Hadoop Streaming on Hadoop Cluster in Pseudo-Distributed Mode

In this post I will explain how to run the Hadoop Streaming utility on a Hadoop cluster in Pseudo-Distributed Mode. Hadoop Streaming uses executables or scripts to create a MapReduce job and submits the job to a Hadoop cluster. In an earlier post I explained how to run Hadoop Streaming in Local (Standalone) Mode. …

BigData Investigation 9 – Installing Apache Hadoop in Pseudo-Distributed Mode

In this post I will explain how to configure Apache Hadoop in Pseudo-Distributed Mode. In an earlier post I explained how to install Apache Hadoop in Local (Standalone) Mode. Now I will apply the configuration changes required to switch that installation to Pseudo-Distributed Mode. Step 1 – Install Apache Hadoop in Local (Standalone) Mode: …

BigData Investigation 8 – Using Hadoop Streaming on Hadoop Cluster in Local (Standalone) Mode

In this post I will explain how to run the Hadoop Streaming utility on a Hadoop cluster in Local (Standalone) Mode. Hadoop Streaming uses executables or scripts to create a MapReduce job and submits the job to a Hadoop cluster. In an earlier post I explained how to download and install Apache Hadoop in …

BigData Investigation 7 – Installing Apache Hadoop in Local (Standalone) Mode

In this post I will explain how to download Apache Hadoop and install it on CentOS 7 Linux in Local (Standalone) Mode. In earlier posts I used the Cloudera QuickStart VM to describe how to create MapReduce applications with Python and Hadoop Streaming. Using pre-configured Hadoop clusters like the Cloudera QuickStart VM is convenient …

BigData Investigation 6 – Hadoop Cluster Modes

In the last post (BigData Investigation 5 – MapReduce with Python and Hadoop Streaming) we came across different Hadoop cluster modes. This post explains the three supported modes in which a Hadoop cluster can be configured. Fully-Distributed Mode allows configuring Hadoop clusters ranging from a few nodes to thousands …

BigData Investigation 5 – MapReduce with Python and Hadoop Streaming

In this post I will explain the Hadoop Streaming utility. Hadoop Streaming uses executables or scripts to create a MapReduce job and submits the job to a Hadoop cluster. Hadoop’s programming model is called MapReduce. In a previous post I explained MapReduce using a Unix pipe which includes two Python scripts and a few …
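
To make the idea of scripts acting as mapper and reducer more concrete, here is a minimal sketch in Python. It is not the example from the post or from the Hadoop Book; it is a generic word count, and the function names `mapper`, `reducer` and `run_local` are made up for illustration. It assumes the usual streaming convention of one tab-separated key/value pair per line, with the mapper output sorted by key before it reaches the reducer. The sort step is emulated locally with `sorted`, much like a `sort` command in a Unix pipe.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word (hypothetical word-count mapper)."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; expects records already sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Emulate map -> sort -> reduce locally, without any Hadoop cluster."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["big data big storage", "data data"]):
        print(record)
```

In a real streaming job the mapper and reducer would be two separate scripts that read from standard input and write to standard output; they are combined into one file here only so that the sketch can be run on its own.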

BigData Investigation 4 – MapReduce Explained

In this post I will explain MapReduce. MapReduce is Hadoop’s programming model for analyzing data. I use the Hadoop Book for my investigation of BigData. MapReduce is covered in chapter 2. Let’s study the examples to understand MapReduce. All code examples of the Hadoop Book are available on GitHub. First we need to copy the example data …

BigData Investigation 3 – Installing the Cloudera QuickStart VM on VirtualBox

In this post I will show how to install the Cloudera QuickStart VM on VirtualBox. I need a Hadoop cluster to try the examples in the Hadoop Book. Appendix A of the book describes how to install Hadoop. However, there is also a hint to use a virtual machine (VM) which comes with a pre-configured, …

BigData Investigation 2 – My Travel Guide: The Hadoop Book

There are plenty of online courses available which introduce Hadoop. As an old hand, though, I prefer a book. I browsed my preferred online book store and ordered “Hadoop: The Definitive Guide” by Tom White. I chose this book for several reasons. First, the book provides plenty of code examples which can be downloaded …

BigData Investigation 1 – Introduction

Hi there, I am Ulf Troppens. I am starting this blog to share insights from my investigation of BigData. As a storage professional I am curious about BigData and Hadoop. I support several customers who have operated multi-petabyte filesystems for years now. I read that some Hadoop installations are very big (multiple 100PB), though it seems that …