Skip to main content
Will sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source hdfs location?
There is no relationship with number of mappers and number of the blocks,
Only in case if the hdfs file does have any duplicate records then sqoop cannot help in resisting the duplicates.
if duplicate rows were seen in target table and when you tried to add PK constraint, it failed due to PK violation, further, the source does not have duplicate rows. One possible scenario is that your Target table could already have records which maybe because of a previous incomplete sqoop job. Please check whether target table has key which is also in source.
One workaround for this scenario is, use parameter "--update-mode allowinsert". In your query, add these parameters, --update-key --update-mode allowinsert. This will ensure that if key is already present in table then the record will get updated else if key is not present then sqoop will do an insert.

Comments

Popular posts from this blog

Error handling in hadoop map reduce

Based on the documentation, there are a few ways, how the error handling is performed in map reduce. Below are the few: a. Custom counters using enum - increment for every failed record. b. Log error and analyze later. Counters give the number of failed records. However to get the identifier of the failed record(may be its unique key), and details of the exception occurred, node on which the error occurred - we need to perform centralized log analysis and there are many nodes running. Logstash is on which is available. Apart from these, are there any other ways to handle the error scenarios, without manual intervention. Any tools, references, best practices are welcome. I think the same technique applies to any distributed applications, with minor changes. Few questions to ask, when working with error handling: Should the job be stopped if an error occurred in data validation. Most of the Big data use cases might be ok to leave few bad records. But if your usecase wa...

Custom SerDe

Steps invoved We start by implementing the  SerDe   interface and setting up the internal state variables needed by other methods. public class ColumnarMapSerDe implements SerDe { private List < String > columnNames; private ObjectInspector objectInspector; private Map < String, String > rowMap; The initialize method is called when a table is created. It is responsible for verifying that the table definition is compatible with the underlying serialization and deserialization mechanism. In this case, we only support strings. @Override public void initialize( Configuration conf, Properties tableProperties) throws SerDeException { final List < TypeInfo > columnTypes = TypeInfoUtils.getTypeInfosFromTypeString( tableProperties.getProperty(LIST_COLUMN_TYPES)); // verify types for (TypeInfo type : columnTypes) { if (!type.getCategory().equals(PRIMITIVE) || The  serialize  method is c...

LeaseExpired Exception

So finally i Could determine What is going on /2018/week1/abc /2018/week2/abc /2018/week3/abc /2018/week1/xyz /2018/week2/xyz /2018/week3/xyz Notice the same file names across different weeks. And then each of these files were being written to HDFS using the DFSClient. So essentially multiple mappers were trying to write the "same file"(observe filename abc, xyz) even though the files were actually different. As the client has to acquire a lease before writing the file and as the client writing the first of the "abc" file was not releasing the lease while during the writing process, the other client trying to write the other "abc" (lets say of the week2)  was  throwing the LeaseExpiredException with the Lease Mismatch Message. But this still does not explain why the client which first acquired the lease for the write did not succeed. I mean i would expect in such a case that the first writers of every such files to succeed.No idea This type ...