In this article you will learn
- This article uses jargon from BigData-related technologies.
- It provides links to the best available articles and books to help you understand the topics more deeply.
- As BigData-related technologies are still evolving, this article will keep getting updated from time to time based on my personal experience, learning, and findings.
- Hadoop is “data-parallel”, but “process-sequential”. Within a job, parallelism happens within the map phase as well as the reduce phase, but these two phases cannot run in parallel: the reduce phase cannot start until the map phase is fully completed.
- All data being accessed by the map process must be frozen (no updates can happen) until the whole job is completed. This means Hadoop processes data in chunks in a batch-oriented fashion, making it unsuitable for stream-based processing where data flows in continuously and immediate processing is needed.
- Data communication happens via a distributed file system (HDFS). Latency is introduced because extensive network I/O is involved in moving data around (e.g., HDFS writes three copies of each block synchronously). This latency is not an issue for batch-oriented processing, where throughput is the primary factor, but it means Hadoop is not suitable for online access where low latency is critical.
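The “data-parallel, process-sequential” barrier can be sketched in a few lines of plain Python (threads stand in for cluster nodes; this is a toy illustration, not Hadoop):

```python
# Toy sketch of the map/reduce barrier: map tasks run in parallel,
# but reduce cannot start until pool.map has returned, i.e. until
# every map task has completed.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_fn(chunk):
    # Toy map task: emit (word length, 1) pairs for each word.
    return [(len(word), 1) for word in chunk]

chunks = [["big", "data"], ["hadoop", "map", "reduce"]]

with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_fn, chunks))  # parallel map phase

# The reduce phase only begins here, after the full map phase is done.
grouped = defaultdict(list)
for partial in mapped:
    for key, value in partial:
        grouped[key].append(value)
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)
```

Note how nothing in the reduce section can execute until `pool.map` has fully returned; this is exactly why a single slow map task delays the whole job.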
Hadoop is NOT good at the following—
- Online data access where low latency is critical (Hadoop can be used together with HBase or another NoSQL store to deliver low-latency query responses)
- Random, ad-hoc processing of a small subset of data within a large data set (Hadoop is designed to scan all data in parallel)
- Processing small data volumes (for volumes under the hundred-GB range, many more mature solutions exist)
- Real-time, stream-based processing where data arrives continuously and immediate processing is needed (to keep the overhead small enough, data typically needs to be batched for at least 30 minutes, so you won’t see the current data until 30 minutes have passed)
How Hadoop works—
- Data is broken into blocks of 64 or 128 MB.
- Blocks are distributed to the nodes.
- The Job Tracker starts the scheduler and tracks each node’s output.
- When all nodes are done, the final output is generated.
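The four steps above can be walked through with a classic word count in plain Python (a conceptual toy, not real Hadoop; the tiny chunks stand in for 64/128 MB blocks):

```python
from collections import defaultdict

text = "the quick brown fox jumps over the lazy dog the end"

# 1. Data broken into blocks (Hadoop uses 64/128 MB; here, 4 words each).
words = text.split()
blocks = [words[i:i + 4] for i in range(0, len(words), 4)]

# 2. Each block goes to a "node", which runs a map task on it.
def map_task(block):
    return [(word, 1) for word in block]

mapped = [map_task(block) for block in blocks]

# 3. The scheduler collects each node's output; the shuffle step
#    groups the emitted (word, 1) pairs by key.
grouped = defaultdict(list)
for node_output in mapped:
    for word, count in node_output:
        grouped[word].append(count)

# 4. When all nodes are done, reduce produces the final output.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result["the"])  # 3
```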
Keywords in BigData Technologies—
|Data Locality||Move computation closer to data to avoid network congestion.|
|GPU||Graphics Processing Unit|
|Big Data||3V’s (Volume, Velocity, Variety)
Volume- Amount of data
Velocity- Rate at which data grows
Variety- Kinds of data formats.
Ref: http://www.hadoopinrealworld.com/what-is-big-data/ [Read it, the example is awesome]
|Big Data Problem||1. How to store and compute efficiently
2. Data analysis- how fast the analysis can be done
3. Total cost of doing the above 2 steps
|RDBMS scalability issue||1. Big Data requires de-normalizing and pre-aggregating the data for faster query execution, which is a main issue with RDBMS.
2. Indexes and query optimizations need changing from time to time.
3. No horizontal scalability- meaning you can’t add more hardware to bring down computation time; you can only tune queries.
4. RDBMS are designed for structured data.
|Data Model||A way to structure and store data in a database|
|Replication||Keeping copies of the same data on multiple nodes, for availability|
|Latency||The delay from input into a system to the desired output|
|Hadoop||A distributed system (with master-slave configuration) to handle Big Data; it typically solves the storage and computation problems described above|
|Hadoop core components||HDFS (distributed storage) and Map-Reduce (distributed computation)|
|Hadoop Cluster||A set of machines which executes Hadoop’s core components – HDFS and Map-Reduce|
|Name Node||The HDFS master node; it stores the metadata (file system namespace and block locations) rather than the data itself|
|Data Node||An HDFS slave node; it stores the actual data blocks|
|Job Tracker||The Map-Reduce master node; it schedules jobs and tracks them|
|Task Tracker||A Map-Reduce slave node; it executes the tasks assigned by the Job Tracker|
|Apache Spark||An open-source distributed computing engine/framework for data processing and analytics. It is part of the Hadoop ecosystem.
It supports a variety of data sources (Kafka, MongoDB, HDFS, Hive, etc.), environments (Spring, Docker, Hadoop, OpenStack, etc.) and applications (Mahout, Hive, Thunder, Sparkling).
Spark has several components- Core, SQL, Streaming, MLlib.
Spark Core is the base engine; the other components (SQL, Streaming, MLlib) are built on top of it.
Spark supports- iterative, interactive and batch data processing.
Note– Hadoop MapReduce (written in Java) is limited to batch data processing. While Hadoop MapReduce stores intermediate data on disk, Spark (written in Scala) keeps data in memory, which makes it better suited to (near) real-time and iterative data processing.
|Apache Mahout||A machine learning library for Hadoop|
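The two ideas behind the Spark note above, lazy transformations and in-memory caching, can be sketched with a hypothetical mini “RDD” class (this is NOT the real Spark API, just a conceptual toy):

```python
# Conceptual sketch of an RDD-like object: map/filter only *record* the
# transformation (lazy), and collect() runs the pipeline once, caching
# the result in memory instead of writing it to disk as MapReduce does.
class MiniRDD:
    def __init__(self, data, transforms=None):
        self._data = data
        self._transforms = transforms or []
        self._cache = None  # in-memory result, reused by later actions

    def map(self, fn):  # lazy: returns a new MiniRDD, computes nothing
        return MiniRDD(self._data, self._transforms + [("map", fn)])

    def filter(self, fn):  # lazy as well
        return MiniRDD(self._data, self._transforms + [("filter", fn)])

    def collect(self):  # action: actually runs the recorded pipeline
        if self._cache is None:
            out = list(self._data)
            for kind, fn in self._transforms:
                if kind == "map":
                    out = [fn(x) for x in out]
                else:
                    out = [x for x in out if fn(x)]
            self._cache = out
        return self._cache

rdd = MiniRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(rdd.collect())  # [1, 9, 25]
```

Because the result stays cached in memory, an iterative algorithm that calls `collect()` repeatedly pays the computation cost only once; in MapReduce, each iteration would re-read its input from disk.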
This space will keep getting updated over time. Keep an eye on it, and I promise to share the best information on BigData-related technologies.