
Posts

Custom SerDe

Steps involved

We start by implementing the SerDe interface and setting up the internal state variables needed by other methods.

public class ColumnarMapSerDe implements SerDe {

    private List<String> columnNames;
    private ObjectInspector objectInspector;
    private Map<String, String> rowMap;

The initialize method is called when a table is created. It is responsible for verifying that the table definition is compatible with the underlying serialization and deserialization mechanism. In this case, we only support strings.

    @Override
    public void initialize(Configuration conf, Properties tableProperties)
            throws SerDeException {
        final List<TypeInfo> columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(
                tableProperties.getProperty(LIST_COLUMN_TYPES));
        // verify types
        for (TypeInfo type : columnTypes) {
            if (!type.getCategory().equals(PRIMITIVE) ||

The serialize method is called whenever a row...
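The excerpt is cut off above, midway through the type check and before the serialize method is shown. Purely as an illustration (this is not the original post's code), a minimal serialize for string-only columns might look like the sketch below: it walks the row with a StructObjectInspector and joins the field values with a tab delimiter, both of which are assumptions here.

    @Override
    public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
        // Hive hands us one row as a struct; anything else is unexpected here
        if (objInspector.getCategory() != ObjectInspector.Category.STRUCT) {
            throw new SerDeException("Expected a struct (one row), got " + objInspector.getTypeName());
        }
        StructObjectInspector structInspector = (StructObjectInspector) objInspector;
        List<? extends StructField> fields = structInspector.getAllStructFieldRefs();

        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                row.append('\t'); // assumed field delimiter
            }
            Object fieldData = structInspector.getStructFieldData(obj, fields.get(i));
            row.append(fieldData == null ? "" : fieldData.toString());
        }
        return new Text(row.toString());
    }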
Recent posts

Which is better: Tez or MR?

MR

Read the data from the file (disk access = 1)
Run mappers
Write the map output (disk access = 2)
Run shuffle and sort, which reads the intermediate output of the mappers (disk access = 3)
Write the shuffle-and-sort output (disk access = 4)
Run reducers, which read the sorted data (disk access = 5)
Write the reducer output (disk access = 6)

Tez

Irrespective of the tasks, it first creates a DAG (Directed Acyclic Graph). This is similar to Spark, but it was developed well before Spark. It executes the plan without repeatedly reading data from disk: once it is ready to do some calculation, it gets the data from disk, performs all the steps, and produces the output. One read and one write.

Pros:
One read and one write.
Efficient, as it won't access the disk multiple times and stores intermediate results in memory.
Vectorization is enabled on top of it.

Last but not least: if the table is partitioned and there are delta files (from updates, for example), I think MR works but Tez does not. You may have to run compaction to merge the delta files; a sketch of switching the engine and triggering a compaction follows below.
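As a concrete (hypothetical) illustration of that last point, the sketch below switches the execution engine for a session over Hive JDBC and triggers a major compaction; the HiveServer2 endpoint, table, and partition names are made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EngineAndCompaction {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint, table, and partition
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Pick the execution engine for this session: "tez" or "mr"
            stmt.execute("SET hive.execution.engine=tez");
            // Merge delta files into base files so either engine can read the partition
            stmt.execute("ALTER TABLE sales PARTITION (dt='2020-01-01') COMPACT 'major'");
        }
    }
}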
Will sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source HDFS location?

There is no relationship between the number of mappers and the number of blocks. Only if the HDFS file itself has duplicate records will duplicates appear in the target; sqoop cannot help in resisting those.

If duplicate rows were seen in the target table, and adding a PK constraint failed due to a PK violation even though the source does not have duplicate rows, one possible scenario is that your target table already had records, maybe because of a previous incomplete sqoop job. Please check whether the target table has a key which is also in the source.

One workaround for this scenario is to use the parameter "--update-mode allowinsert". In your query, add these parameters: --update-key --update-mode allowinsert. This will ensure that if the key is already present in the table the record will get updated, and if the key is not present sqoop will do an insert.
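For reference, a sketch of what that export could look like when driven from Java via Sqoop 1.x's programmatic entry point (Sqoop.runTool); the connection string, credentials, table, and key column are placeholders.

import org.apache.sqoop.Sqoop;

public class ExportWithUpsert {
    public static void main(String[] args) {
        String[] sqoopArgs = new String[] {
            "export",
            "--connect", "jdbc:mysql://dbhost/sales",      // placeholder JDBC URL
            "--username", "etl_user",                      // placeholder credentials
            "--table", "target_table",
            "--export-dir", "/user/etl/target_table",
            "--update-key", "id",                          // key used to match existing rows
            "--update-mode", "allowinsert"                 // update if the key exists, insert otherwise
        };
        // Sqoop.runTool runs the export in-process and returns the exit code
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}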

Handling CSV with fields enclosed in double quotes and separated by commas

In case you need to load a CSV file in which fields are terminated by ',' (or anything else) and the values also include the quote character, below is the format to follow with a SerDe:

CREATE TABLE tbl_netflix (
    show_id int,
    title string,
    director string,
    casting string,
    country string,
    date_added date,
    release_year int,
    rating string,
    duration string,
    listed_in string,
    description string,
    type string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar"     = "\""
)
STORED AS TEXTFILE;

Error handling in Hadoop MapReduce

Based on the documentation, there are a few ways error handling is performed in MapReduce. Below are a few:

a. Custom counters using an enum - increment one for every failed record.
b. Log the error and analyze it later.

Counters give the number of failed records. However, to get the identifier of the failed record (maybe its unique key), the details of the exception that occurred, and the node on which the error occurred, we need to perform centralized log analysis, since there are many nodes running. Logstash is one tool that is available.

Apart from these, are there any other ways to handle error scenarios without manual intervention? Any tools, references, and best practices are welcome.

I think the same technique applies to any distributed application, with minor changes. A few questions to ask when working on error handling: Should the job be stopped if an error occurred during data validation? Most big data use cases might be OK with leaving a few bad records, but if your use case cannot tolerate bad records, the job should probably be stopped.
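A minimal sketch of option (a): a mapper that counts malformed records with an enum counter and logs enough context for later centralized analysis. The CSV-like record format, class names, and counter name are assumptions, not part of the original post.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class ParsingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Logger LOG = Logger.getLogger(ParsingMapper.class);
    private static final IntWritable ONE = new IntWritable(1);

    // Custom counter; the total is aggregated per job and visible in the job UI / client output
    public enum ErrorCounters { MALFORMED_RECORD }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");   // assumed CSV-like input
        if (fields.length < 2) {
            // Count the failure and log an identifier (here the byte offset) for later log analysis
            context.getCounter(ErrorCounters.MALFORMED_RECORD).increment(1);
            LOG.warn("Skipping malformed record at offset " + key.get() + ": " + value);
            return;
        }
        context.write(new Text(fields[0]), ONE);
    }
}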

Downloading a file from a jar

Trying to access a file which is inside a jar file for some requirement. Assume that I have a file "test" under src/main/resources inside "example.jar". We can access the file "test" by unzipping the jar file:

unzip -q -j <JAR NAME> <FILENAME>

The -q option is for quiet mode, and -j junks (does not recreate) the directory paths stored in the archive.
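If the file only needs to be read from inside a running JVM rather than extracted on the command line, a small sketch using the classpath is shown below; the class name and the output path /tmp/test are arbitrary choices, not from the original post.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ResourceReader {
    public static void main(String[] args) throws IOException {
        // "test" was placed under src/main/resources, so it sits at the classpath root inside example.jar
        try (InputStream in = ResourceReader.class.getResourceAsStream("/test")) {
            if (in == null) {
                throw new IOException("Resource 'test' not found on the classpath");
            }
            // Copy the resource out of the jar to a local file
            Files.copy(in, Paths.get("/tmp/test"));
        }
    }
}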

Unexpected end of ZLIB input stream

You have to call close() on the GZIPOutputStream before you attempt to read it. The final bytes of the file will only be written when the file is actually closed. (This is irrespective of any explicit buffering in the output stack. The stream only knows to compress and write the last bytes when you tell it to close. A flush() probably won't help... though calling finish() instead of close() should work. Look at the javadocs.)

Below is sample code (don't treat it as good code):

String name = "/tmp/test";
GZIPOutputStream gz = new GZIPOutputStream(new FileOutputStream(name));
gz.write(10);
gz.close();
System.out.println(new GZIPInputStream(new FileInputStream(name)).read());
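To illustrate the finish() remark, a small variant of the sample above under the same assumptions (another throwaway path, same caveat about this not being good code):

String name = "/tmp/test2";                       // another throwaway path
GZIPOutputStream gz = new GZIPOutputStream(new FileOutputStream(name));
gz.write(10);
gz.finish();                                      // writes the remaining compressed data and the GZIP trailer,
                                                  // but leaves the underlying FileOutputStream open
System.out.println(new GZIPInputStream(new FileInputStream(name)).read());
gz.close();                                       // still close the stream when you are done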