Summary of the different Big Data Tools

CS4225 Summary (Hadoop Related Tools)

Performance Guidelines for Big Data Tools

What is Hadoop?

It is a framework that stores big data in a distributed way and processes it in parallel.

Components of Hadoop:

Distributed File System (HDFS) (Storage unit of Hadoop):

HDFS is specially designed for storing huge datasets on commodity machines. It has 2 core components: (1) NameNode (2) DataNode.

There is only 1 NameNode. There can be multiple DataNodes.

The HDFS cluster follows a master/slave design: the NameNode is the master and the DataNodes are the slaves.

NameNode responsibilities:

Stores the metadata of the file system. Maintains and manages the DataNodes.

Directs clients to DataNodes for writes and reads.

No data moves through the NameNode.

DataNode responsibilities:

Stores the actual data. Performs the reading, writing and processing work.

Performs replication.

Heartbeats are the signals that each DataNode continuously sends to the NameNode to show that it is alive.
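The split of responsibilities above can be sketched in a few lines of Python. This is only a toy model, not how HDFS is implemented: the class name `NameNode`, its methods, and the 10-second timeout are all made up for illustration. It shows the two ideas from the notes: the NameNode keeps only metadata (which DataNodes hold which block) and uses heartbeats to decide which DataNodes are still alive.

```python
import time


class NameNode:
    """Toy metadata server (hypothetical): block locations and liveness only."""

    def __init__(self, heartbeat_timeout=10.0):
        self.heartbeat_timeout = heartbeat_timeout  # assumed value, in seconds
        self.block_locations = {}  # (path, block_index) -> list of DataNode ids
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat

    def receive_heartbeat(self, datanode_id, now=None):
        # A DataNode reports that it is alive.
        self.last_heartbeat[datanode_id] = time.time() if now is None else now

    def live_datanodes(self, now=None):
        # DataNodes whose last heartbeat is recent enough.
        now = time.time() if now is None else now
        return sorted(
            dn for dn, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )

    def register_block(self, path, block_index, datanode_ids):
        # Record which DataNodes hold the replicas of a block.
        self.block_locations[(path, block_index)] = list(datanode_ids)

    def locate_block(self, path, block_index, now=None):
        # Only locations are returned; the client then talks to the
        # DataNodes directly, so no file data passes through here.
        live = set(self.live_datanodes(now))
        return [dn for dn in self.block_locations[(path, block_index)] if dn in live]


nn = NameNode(heartbeat_timeout=10.0)
for dn in ("dn1", "dn2", "dn3"):
    nn.receive_heartbeat(dn, now=0.0)
nn.register_block("/logs/a.txt", 0, ["dn1", "dn2", "dn3"])
nn.receive_heartbeat("dn1", now=8.0)
nn.receive_heartbeat("dn3", now=8.0)
print(nn.locate_block("/logs/a.txt", 0, now=15.0))  # dn2 has gone silent
```

In the example, `dn2` stops sending heartbeats, so the client is only pointed to `dn1` and `dn3`. A real NameNode would also trigger re-replication of the blocks that `dn2` held.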

Hadoop MapReduce (Processing unit of Hadoop)

Processing is done on the slave nodes and the final results are returned to the master node.

Map Reduce Runtime


Note that the Map output is still written to disk after the Map phase finishes and before the Reduce phase starts, to ensure stable storage.

Keys arrive at each reducer in sorted order. But there is no enforced ordering across reducers.

The components mentioned above execute in the following order:
Mapper -> Combiner -> Partitioner -> Reducer
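To make the order above concrete, here is a minimal single-process word count in plain Python. It is a sketch of the data flow only, not Hadoop's API: the function names (`mapper`, `combiner`, `partitioner`, `reducer`, `run_job`) and the byte-sum partitioner are assumptions chosen for illustration, and nothing is written to disk.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def combiner(pairs):
    # Local pre-aggregation of ONE mapper's output, to cut shuffle traffic.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partitioner(key, num_reducers):
    # Toy deterministic partitioner (assumption): decides which reducer
    # receives the key. Python's built-in hash() is salted per run for str.
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    return key, sum(values)


def run_job(input_splits, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in input_splits:  # one mapper per input split
        mapped = [pair for line in split for pair in mapper(line)]
        for key, value in combiner(mapped):
            partitions[partitioner(key, num_reducers)][key].append(value)
    # Each reducer sees its own keys in sorted order.
    return [
        [reducer(key, grouped[key]) for key in sorted(grouped)]
        for grouped in partitions
    ]


print(run_job([["a b a"], ["b c"]], num_reducers=2))
```

Each inner list of the result is the output of one reducer. Keys are sorted within a reducer, but the reducers are not ordered relative to each other, which matches the note above about sorted keys.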


Hadoop + DB = HadoopDB

But... some more problems with Hadoop



NoSQL Databases


How is Presto different from Hive?


Problems with Map Reduce

-> Jobs need to be written in Java. Hive provides a friendlier interface that needs only SQL, but Map Reduce itself offers just the Map and Reduce APIs. In reality, we need to do more work than that.
-> Built for batch processing; not good for real-time processing.
-> Awkward even for simple operations such as filters and joins, because everything has to fit the key-value pattern.
-> Performance. It does poorly on workloads with lots of data shuffling, and each Map Reduce job reads its input from disk and writes its output to disk.
-> Unfit for processing graphs.
-> Stateless execution. This doesn't fit use cases like k-means that need iterative execution.

(1) RDD

(2) DataFrame

Python DF, Java/Scala DF and SQL are all front ends to the DataFrame API.

(3) DataSet

Check out another article which I have posted on Spark.

Characteristics of Graph Algorithms

Systems for Graph Data Processing

Graph Processing in Hadoop:


Pregel Model

Case Scenario application:

Pregel Implementation

More applications: PageRank, single-source shortest path, bipartite matching and semi-clustering algorithms have all been implemented using Pregel’s programming model.

Page Rank
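As a rough illustration of how PageRank fits the Pregel model, here is a small single-machine simulation in Python. It is a sketch, not Pregel or Giraph code: the function name `pregel_pagerank` and its parameters are made up, the damping factor 0.85 and the 30 supersteps are assumed defaults, and every vertex is assumed to appear as a key in `out_edges` with at least one outgoing edge (vertices with no out-edges would leak rank in this simplified version). In each superstep, a vertex reads the messages sent to it in the previous superstep, updates its rank, and sends its rank divided by its out-degree to its neighbours.

```python
def pregel_pagerank(out_edges, num_supersteps=30, damping=0.85):
    """out_edges: vertex -> list of out-neighbours (every vertex is a key)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}  # messages from the previous superstep
    for superstep in range(num_supersteps + 1):
        outbox = {v: [] for v in out_edges}
        for v, neighbours in out_edges.items():
            if superstep >= 1:
                # Vertex program: combine incoming contributions.
                rank[v] = (1.0 - damping) / n + damping * sum(inbox[v])
            if superstep < num_supersteps and neighbours:
                share = rank[v] / len(neighbours)
                for u in neighbours:
                    outbox[u].append(share)
        inbox = outbox  # barrier: messages become visible in the next superstep
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for vertex, value in sorted(pregel_pagerank(graph).items()):
    print(vertex, round(value, 3))
```

The swap from `outbox` to `inbox` at the end of the loop plays the role of the synchronisation barrier between supersteps: no vertex sees a message in the same superstep in which it was sent.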

Shortest Path
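Single-source shortest path can be sketched in the same vertex-centric style. Again this is a toy Python simulation, not Pregel or Giraph code: `pregel_sssp` is a made-up name, edge weights are assumed to be non-negative, and every vertex is assumed to appear as a key in `edges`. A vertex is only active in a superstep if it received messages; it keeps the smallest distance it has seen, and if that improves its current value it sends updated candidate distances to its neighbours. The computation stops when no messages are in flight.

```python
import math


def pregel_sssp(edges, source):
    """edges: vertex -> list of (neighbour, weight); every vertex is a key."""
    dist = {v: math.inf for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0.0]  # kick-off message: the source is at distance 0
    while any(inbox.values()):  # stop when no messages are in flight
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays halted
            candidate = min(messages)
            if candidate < dist[v]:
                dist[v] = candidate
                for u, weight in edges[v]:
                    outbox[u].append(candidate + weight)
        inbox = outbox  # barrier between supersteps
    return dist


graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2)],
    "b": [("c", 1)],
    "c": [],
    "d": [],  # unreachable from "s"
}
print(pregel_sssp(graph, "s"))
```

Note how `b` is first reached with distance 4 directly from `s`, then corrected to 3 one superstep later when the message via `a` arrives. Unreachable vertices such as `d` keep the value infinity.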

Refer to the original paper:

Refer to these very informative slides below for more on Pregel and Giraph.



GraphX = Spark for Graph

Systems for Streaming

Apache Storm

Nimbus and ZooKeeper are used for cluster management. A graph (the topology of spouts and bolts) is mapped to the machines of the cluster.

To Explore More (TODO):

Overall Picture of Hadoop

Comparisons between Spark, Hadoop, Storm

Data Processing Model


Ease of Development

Fault Tolerance (Handling process/node level failures)

Why is efficient large scale graph processing challenging?

Graph Processing in Hadoop:

Comparisons between Pregel and Giraph

The End :)
