Hadoop Fundamentals for Researchers

Apache’s open-source Hadoop provides MapReduce processing as a cloud service. Hadoop can also be installed on a private cluster of processors. The cloud or cluster user submits pre-written programs in the MapReduce paradigm to process giga-, tera-, or petabytes of data. For example, Facebook has a Hadoop Distributed File System (HDFS) cluster processing 100 petabytes. As such, Hadoop is a key software technology for running Big Data applications. Researchers can also utilise Hadoop; for example, the BLAST algorithm for searching nucleotide databases is suitable for Hadoop implementation. In this tutorial, apart from outlining the Hadoop architecture and organisation, we consider the cloud concept, followed by the technologies that have contributed to Hadoop, such as ZooKeeper, the Google File System, and Google’s Bigtable. A guide to estimating MapReduce performance will be provided. Finally, the tutorial considers Scala and Spark, which operate at a higher level than Hadoop and are essentially software aids to programmer productivity.
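
To make the MapReduce paradigm mentioned above concrete, the following is a minimal single-machine sketch in Python of the three stages a MapReduce job passes through: map, shuffle (grouping by key), and reduce. It does not use Hadoop itself. The function names (map_fn, shuffle, reduce_fn, run_job) and the word-count task are illustrative assumptions, not part of any Hadoop API. On a real cluster the framework would distribute the map and reduce calls across nodes and perform the shuffle over the network.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key.

    In Hadoop this grouping is done by the framework between the map
    and reduce phases; here it is simulated in memory.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: combine all values that share one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the hypothetical job: map every record, shuffle, then reduce."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    records = ["big data big clusters", "data moves to the cluster"]
    print(run_job(records))
```

Because each call to map_fn depends only on its own record, and each call to reduce_fn depends only on the values for its own key, both phases can in principle run in parallel on separate machines. That independence is what lets the paradigm scale to very large inputs.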