Steps involved

We start by implementing the SerDe interface and setting up the internal state variables needed by other methods.

    public class ColumnarMapSerDe implements SerDe {
      private List<String> columnNames;
      private ObjectInspector objectInspector;
      private Map<String, String> rowMap;

The initialize method is called when a table is created. It is responsible for verifying that the table definition is compatible with the underlying serialization and deserialization mechanism. In this case, we only support strings.

      @Override
      public void initialize(Configuration conf, Properties tableProperties)
          throws SerDeException {
        final List<TypeInfo> columnTypes =
            TypeInfoUtils.getTypeInfosFromTypeString(
                tableProperties.getProperty(LIST_COLUMN_TYPES));

        // verify types
        for (TypeInfo type : columnTypes) {
          if (!type.getCategory().equals(PRIMITIVE) ||

The serialize method is called whenever a ro
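The strings-only check described for initialize can be tried out without Hive on the classpath. The following is a minimal sketch, not the SerDe's actual code: the class name StringOnlyColumnCheck and the method verifyStringColumns are hypothetical, the type string is assumed to be a flat list of type names separated by ':' or ',' (no complex types such as map or struct), and IllegalArgumentException stands in for the SerDeException a real SerDe would throw.

```java
import java.util.ArrayList;
import java.util.List;

public class StringOnlyColumnCheck {

    /**
     * Hypothetical helper: splits a flat column-type string such as
     * "string:string" and rejects every type that is not "string".
     * Assumption: names are separated by ':' or ',' and are never complex types.
     */
    static List<String> verifyStringColumns(String columnTypes) {
        if (columnTypes == null || columnTypes.trim().isEmpty()) {
            throw new IllegalArgumentException("No column types given");
        }
        List<String> verified = new ArrayList<>();
        for (String raw : columnTypes.split("[:,]")) {
            String type = raw.trim().toLowerCase();
            if (!type.equals("string")) {
                // A real SerDe would throw SerDeException here instead.
                throw new IllegalArgumentException(
                    "Unsupported column type '" + type + "': only string is supported");
            }
            verified.add(type);
        }
        return verified;
    }

    public static void main(String[] args) {
        // All-string table definition: accepted.
        System.out.println(verifyStringColumns("string:string:string"));
        // Mixed table definition: rejected with a descriptive message.
        try {
            verifyStringColumns("string:int");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at table-definition time keeps an incompatible schema from surfacing later as a runtime error while rows are being read or written.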
MR

Read the data from the file (disk access = 1)
Run mappers
Write map output (disk access = 2)
Run shuffle and sort, which reads the intermediate output of the mappers (disk access = 3)
Write shuffle and sort output (disk access = 4)
Run reducers, which read the sorted data (disk access = 5)
Write reducer output (disk access = 6)

TEZ

Regardless of the task, Tez first creates a DAG (Directed Acyclic Graph) of the whole job. It is similar to Spark in this respect, although it was not developed before Spark: Spark was open-sourced in 2010, and Tez entered the Apache Incubator in 2013. Tez builds the execution plan without reading any data from disk. Once it is ready to compute, it reads the data from disk, performs all the steps, and produces the output: one read and one write.

Pros:
One read and one write.
Efficient, as it does not access the disk multiple times and keeps intermediate results in memory.
Vectorization is enabled on top of it.

Last but not least, if the table is partitioned and there are delta files (from updates, for example), I think MR works but not Tez. You may have to run compaction to convert the delta fil
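To put numbers on the MR versus Tez comparison above, here is a rough sketch of the disk-access counting used in these notes. It is a simplified model, not a measurement: the class DiskAccessModel and its methods are hypothetical, and it assumes that every chained MapReduce job pays the full six accesses while a Tez DAG pays one read and one write however many stages it contains.

```java
public class DiskAccessModel {

    // From the MR walkthrough: six disk accesses for one MapReduce job.
    static final int MR_ACCESSES_PER_JOB = 6;

    // From the Tez description: one read of the input, one write of the output.
    static final int TEZ_ACCESSES_PER_DAG = 2;

    /** Disk accesses for a query compiled into a chain of MapReduce jobs. */
    static int mapReduceAccesses(int chainedJobs) {
        if (chainedJobs < 1) {
            throw new IllegalArgumentException("chainedJobs must be at least 1");
        }
        return chainedJobs * MR_ACCESSES_PER_JOB;
    }

    /** Disk accesses for the same query run as a single Tez DAG (assumed constant). */
    static int tezAccesses() {
        return TEZ_ACCESSES_PER_DAG;
    }

    public static void main(String[] args) {
        for (int jobs = 1; jobs <= 3; jobs++) {
            System.out.printf("jobs=%d  MR=%d  Tez=%d%n",
                jobs, mapReduceAccesses(jobs), tezAccesses());
        }
    }
}
```

Under these assumptions the gap widens with every extra job in the chain, which is the intuition behind the "one read and one write" advantage listed under Pros.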