How do Hadoop ChainMapper and ChainReducer reduce disk IO?
I want to chain multiple MapReduce jobs, i.e. the previous MapReduce job's output is the input of the next MapReduce job. Because the output is large, the disk IO overhead is heavy, so I am trying to find an alternative solution that reduces the IO bottleneck. I found the ChainMapper/ChainReducer API. The documentation mentions the following properties:
"Using the ChainMapper and the ChainReducer classes is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And immediate benefit of this pattern is a dramatic reduction in disk IO."
But I don't quite understand why using ChainMapper/ChainReducer reduces disk IO. And to reduce IO, how should I use these two APIs?
As per my understanding, even though you have multiple mappers, the chain mapper treats them as a single task. Until that task is complete, there is no intermediate write to disk: records pass from one mapper to the next in memory.
See the statement below from the Javadoc.
The Mapper classes are invoked in a chained (or piped) fashion, the output of the first becomes the input of the second, and so on until the last Mapper, the output of the last Mapper will be written to the task's output.
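To make the piped behaviour concrete, here is a minimal plain-Java sketch. It does not use the Hadoop API: the class `ChainSketch`, its methods, and the `recordsWritten` counter are all hypothetical names invented for illustration, and "writing to disk" is only simulated by counting records. It compares running three map stages as separate jobs (each stage materialises its output) with running them chained inside one task (only the last stage's output is written).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from Hadoop.
final class ChainSketch {
    // Simulated count of records materialised to disk.
    static int recordsWritten = 0;

    // Stand-in for writing a stage's output to HDFS or local disk.
    static List<String> writeToDisk(List<String> records) {
        recordsWritten += records.size();
        return new ArrayList<>(records);
    }

    // Three toy "mappers": trim, upper-case, append a marker.
    static List<Function<String, String>> demoStages() {
        List<Function<String, String>> stages = new ArrayList<>();
        stages.add(String::trim);
        stages.add(String::toUpperCase);
        stages.add(s -> s + "!");
        return stages;
    }

    // Separate jobs: every stage writes its full output before the next one starts.
    static List<String> runAsSeparateJobs(List<String> input,
                                          List<Function<String, String>> stages) {
        List<String> current = input;
        for (Function<String, String> stage : stages) {
            List<String> out = new ArrayList<>();
            for (String record : current) {
                out.add(stage.apply(record));
            }
            current = writeToDisk(out);
        }
        return current;
    }

    // Chained: each record flows through all stages in memory;
    // only the output of the last stage is written.
    static List<String> runChained(List<String> input,
                                   List<Function<String, String>> stages) {
        List<String> out = new ArrayList<>();
        for (String record : input) {
            String value = record;
            for (Function<String, String> stage : stages) {
                value = stage.apply(value);
            }
            out.add(value);
        }
        return writeToDisk(out);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" a", "b ", " c ");
        List<Function<String, String>> stages = demoStages();

        recordsWritten = 0;
        runAsSeparateJobs(input, stages);
        int separate = recordsWritten;

        recordsWritten = 0;
        runChained(input, stages);
        int chained = recordsWritten;

        // prints "separate=9 chained=3"
        System.out.println("separate=" + separate + " chained=" + chained);
    }
}
```

Both variants produce the same final records; the chained one simply skips the intermediate writes between stages. This sketch only models the writes between map stages and says nothing about the shuffle between the map and reduce phases.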