Hadoop / Spark

Hadoop is a framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. In Amazon Web Services, Hadoop is available as a service named "Elastic MapReduce" (EMR).


Hadoop is frequently used by researchers for processing either massive data files (larger than you might be able to store on your local workstation) or a great quantity of files. It has become a critical element to companies and organizations that need to digest vast data sets, or continual data streams, on a regular basis.


The programming paradigm of Hadoop is called “mapreduce,” a two-tier process that both maps datasets and then reduces them into output data. Hadoop is one of the most frequently used tools when it comes to “big data”, as it can scale to thousands of servers (or more), tackling data sets of many petabytes (PB). Here are the two steps, explained further:

Mapping takes input data and processes it into key-value pairs.
Reducing takes these pairs, aggregates them, and then (often) performs some form of analysis upon them.