Add MapReduce description to core hadoop (#24073)

pull/26257/head
thinkinbee 2018-10-23 00:15:27 +05:30 committed by Randell Dawson
parent e17ffcdfd9
commit 71559e7aa8
1 changed files with 6 additions and 1 deletions


@ -30,7 +30,12 @@ At the time of its release, Hadoop was capable of processing data on a larger sc
Data is stored in the Hadoop Distributed File System (HDFS). Using MapReduce, Hadoop processes data in parallel chunks (processing several parts at the same time) rather than in a single queue. This reduces the time needed to process large data sets.
HDFS works by storing large files divided into chunks, and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.
HDFS works by storing large files divided into chunks (also known as blocks), and replicating them across many servers. Having multiple copies of files creates redundancy, which protects against data loss.
MapReduce is a parallel processing framework that, in essence, performs three operations:
- Map: Each datanode processes its portion of the data locally and writes the intermediate output to a temporary location.
- Shuffle: The datanodes redistribute (shuffle) the intermediate data based on the output keys, so that all data belonging to a given key ends up on the same datanode.
- Reduce: Each datanode now aggregates the data for its keys. This step is independent across datanodes and is processed in parallel.
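The three phases above can be sketched as a single-machine word-count analogy in plain Python (Hadoop itself runs these phases as distributed jobs across datanodes; the function names here are purely illustrative):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit intermediate (key, value) pairs -- here, (word, 1) per word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, so each key's data lands together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values; each key is independent,
    # so in Hadoop this step runs in parallel across datanodes
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {"big": 3, "data": 2, "ideas": 1}
```

In real Hadoop, the shuffle is handled by the framework between the user-supplied map and reduce functions; only the map and reduce logic is written by the programmer.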
### Hadoop Ecosystem