
Which is better: Tez or MR?

MR

Read the input data from the file (disk access 1)
Run the mappers
Write the map output (disk access 2)
Run shuffle and sort, reading the mappers' intermediate output (disk access 3)
Write the shuffle-and-sort output (disk access 4)
Run the reducers, which read the sorted data (disk access 5)
Write the reducers' output (disk access 6)
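
A quick way to see this staged behaviour is to force the MR engine for a session and look at the plan. This is only a sketch; the table and query below are placeholders, not from the original post.

-- Force the classic MapReduce engine for this session
SET hive.execution.engine=mr;

-- EXPLAIN shows the query broken into separate map/reduce stages,
-- each of which writes its intermediate output to disk
EXPLAIN
SELECT dept, COUNT(*)        -- placeholder query
FROM employees               -- placeholder table
GROUP BY dept;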

TEZ

Irrespective of the tasks, Tez first creates a DAG (Directed Acyclic Graph) of the whole job.
It is similar to Spark in this respect, though it was developed well before Spark.

It executes the plan, but there is no need to read data from disk at every step.
Once it is ready to do some calculation, it reads the data from disk, performs all the steps, and produces the output.
One read and one write.

Pros: one read and one write.
It is efficient because it does not access the disk multiple times and keeps intermediate results in memory.
Vectorized execution can also be enabled on top of it.
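
As a sketch (session settings only, not from the original post), switching Hive to Tez and turning on vectorization looks like this:

-- Run queries on the Tez engine instead of classic MapReduce
SET hive.execution.engine=tez;

-- Enable vectorized execution, which processes rows in batches
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- The same query now compiles to a single Tez DAG (one read, one write)
EXPLAIN
SELECT dept, COUNT(*)        -- placeholder query
FROM employees               -- placeholder table
GROUP BY dept;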

Last but not least:
If the table is partitioned and there are delta files (from updates, for example), I think MR works but Tez does not. You may have to run compaction to convert the delta files into base files, and then Tez will work.
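
If you hit that situation, a major compaction merges the delta files into a new base file. A minimal sketch follows; the table and partition names are placeholders:

-- Trigger a major compaction on the affected partition so the delta
-- files produced by updates are rewritten into a base file
ALTER TABLE orders PARTITION (ds='2018-01-01') COMPACT 'major';

-- Watch the compaction queue until the request shows as succeeded
SHOW COMPACTIONS;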

