BigData Keywords: A BigData User Dictionary of related technology definitions and much more


In this article you will learn

  • This article collects jargon from BigData-related technologies.
  • It gives you the best possible article links and books to understand the topics more deeply.
  • As BigData-related technologies are still evolving, this article will keep being updated over time based on my personal experience, learning, and findings.

Hadoop characteristics

  • Hadoop is “data-parallel” but “process-sequential”. Within a job, parallelism happens within the map phase as well as within the reduce phase. But these two phases cannot run in parallel: the reduce phase cannot start until the map phase is fully completed.
  • All data being accessed by the map process must be frozen (no updates can happen) until the whole job is completed. This means Hadoop processes data in chunks in a batch-oriented fashion, making it unsuitable for stream-based processing where data flows in continuously and immediate processing is needed.
  • Data communication happens via a distributed file system (HDFS). Latency is introduced because extensive network I/O is involved in moving data around (e.g., HDFS needs to write 3 copies of each block synchronously). This latency is not an issue for batch-oriented processing, where throughput is the primary factor, but it means Hadoop is not suitable for online access where low latency is critical.

Hadoop is NOT good at the following

  • Perform online data access where low latency is critical (Hadoop can be used together with HBase or a NoSQL store to deliver low-latency query responses)
  • Perform random, ad-hoc processing of a small subset of data within a large data set (Hadoop is designed to scan all data in parallel)
  • Process small data volumes (for data volumes below the hundred-GB range, many more mature solutions exist)
  • Perform real-time, stream-based processing where data arrives continuously and immediate processing is needed (to keep the overhead small enough, data typically needs to be batched for at least 30 minutes, so you won’t see the current data until those 30 minutes have passed)


How Hadoop works

  1. Data is broken into blocks of 64 or 128 MB.
  2. The blocks are distributed to the nodes in the cluster.
  3. The Job Tracker starts a scheduler to track each node’s output.
  4. When all nodes are done, the final output is generated.
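The steps above can be sketched with a toy word count. This is a minimal single-process sketch, not real Hadoop code; the function names, the two string “blocks”, and the input text are all illustrative stand-ins for HDFS splits and cluster tasks.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in a block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key across all map outputs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two tiny "blocks" stand in for the 64/128 MB HDFS splits described above.
blocks = ["big data big", "data big"]

# As noted earlier, the reduce phase cannot start until every map task is done.
mapped = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 3, 'data': 2}
```

The shuffle step is where the sequential barrier lives: it cannot group by key until all map outputs exist, which is why the reduce phase must wait.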

Keywords in BigData Technologies

Data Locality Move computation closer to data to avoid network congestion.
GPU Graphic Processing Unit
Big Data 3V’s (Volume, Velocity, Variety)

Volume- The sheer amount of data to be stored and processed.

Velocity- The rate at which data grows and arrives.

Variety- The different kinds of data formats (structured, semi-structured, unstructured).


Big Data Problems 1.     How to store and compute efficiently

2.     Data analysis- how fast the analysis can be done

3.     The total cost of doing the above two steps

RDBMS scalability issues 1.     Big Data needs the data to be de-normalized and pre-aggregated for faster query execution, which is the main issue with an RDBMS.

2.     Indexes and queries need to be re-tuned from time to time.

3.     No horizontal scalability- you can’t simply add more hardware to bring computation time down; instead you have to keep tuning queries.

4.     RDBMSs are designed for structured data only.

RDD Model
  • Resilient Distributed Datasets, a concept introduced by Spark: an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
  • RDDs are collections of objects that are partitioned across the cluster and can be stored on the nodes.
  • They are built through graphs of parallel transformations, such as map-reduce and group-by, similar to the graphs used to compute results in Dryad. RDDs are automatically rebuilt on failure by the runtime system.
  • Spark offers this abstraction embedded in several programming languages, including Java, Scala, and Python.
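The RDD idea can be illustrated with a toy, single-process stand-in. This is not Spark’s actual implementation; `ToyRDD` and its methods are made up for illustration, and a real RDD is partitioned across a cluster rather than held in one list.

```python
class ToyRDD:
    """A toy stand-in for Spark's RDD: immutable, lazily evaluated,
    and rebuilt on demand by replaying its lineage of transformations."""

    def __init__(self, compute):
        self._compute = compute  # lineage: how to (re)build this dataset

    @staticmethod
    def parallelize(data):
        items = list(data)
        return ToyRDD(lambda: iter(items))

    def map(self, fn):
        # Transformations return a new RDD; the parent is never mutated.
        return ToyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # Actions trigger evaluation by replaying the lineage graph.
        return list(self._compute())

rdd = ToyRDD.parallelize(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

Because each transformation only records *how* to compute its result, a lost partition can be rebuilt by replaying the chain from the source data, which is how Spark achieves fault tolerance without replicating every intermediate dataset.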
Data Model A way to structure and store data in a database
Replication Copying the same data from one node to another, for availability
Majority read/write A quorum scheme in which a read or write succeeds only once a majority of replicas acknowledge it, so every read overlaps the latest write
Latency The delay from an input into a system to the desired output
Hadoop A distributed system (with a master-slave configuration) to handle Big Data, which typically solves the following problems-

  • Data transportation
  • Scaling up and down
  • Handling partial failures of the application
Hadoop core components
  1. HDFS [for storage]
  2. Map-Reduce [for processing]
Hadoop Cluster A set of machines that run Hadoop’s core components – HDFS and Map-Reduce
Node A single machine in a Hadoop cluster; each node runs the HDFS and Map-Reduce components
Name Node Hadoop node running HDFS on the master, i.e. the node that stores the filesystem metadata (not the data itself) for the cluster
Data Node Hadoop node running HDFS on a slave; it stores the actual data blocks
Job Tracker Hadoop node running Map-Reduce on the master
Task Tracker Hadoop node running Map-Reduce on a slave
Apache Spark An open-source distributed computing engine/framework for data processing and analytics. It is part of the broader Hadoop ecosystem.

It supports a variety of data sources (Kafka, MongoDB, HDFS, Hive, etc.), environments (Spring, Docker, Hadoop, OpenStack, etc.) and applications (Mahout, Hive, Thunder, Sparkling, etc.).

Spark has several components- Core, SQL, Streaming, MLlib

Spark Core is the base engine which supports-

  • Memory management
  • Fault recovery
  • Task management (scheduling, distribution, monitoring)
  • Storage system interaction

Spark supports- Iterative, Interactive and Batch data processing.

Note– Hadoop MapReduce (written in Java) is limited to batch data processing. While Hadoop MapReduce stores intermediate data on disk, Spark keeps it in memory; hence Spark (written in Scala) is better suited to real-time and iterative data processing.
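The disk-versus-memory difference can be sketched with a crude single-machine analogy. This is not real Hadoop or Spark code; the file, the loop counts, and the helper function are arbitrary illustrations of the two I/O patterns.

```python
import os
import tempfile

def read_from_disk(path):
    """Each call re-reads the data from disk, the way Hadoop MapReduce
    writes and re-reads intermediate results between jobs."""
    with open(path) as f:
        return [int(line) for line in f]

# Write a small dataset to disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("\n".join(str(n) for n in range(10)))

# MapReduce-style iteration: every pass pays the disk round trip again.
total_disk = sum(sum(read_from_disk(path)) for _ in range(3))

# Spark-style iteration: load once, keep the dataset in memory across passes.
cached = read_from_disk(path)
total_mem = sum(sum(cached) for _ in range(3))

print(total_disk, total_mem)  # 135 135 -- same answer, different I/O pattern
os.remove(path)
```

Both approaches compute the same result; the point is that an iterative algorithm (e.g. machine learning training loops) pays the disk cost on every pass under MapReduce, but only once under Spark’s in-memory model.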

Apache Mahout A machine learning library for Hadoop


What Next?
This space will keep being updated over time. Keep an eye on it, and I promise to share the best information on BigData technologies.

Happy Learning!!

Nirbhaya Bhava!!

