Configuring Hadoop logging to avoid too many log files

I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the ext3 filesystem allows only 32000 subdirectories per directory). It looks like the same problem described in this question: Error in Hadoop MapReduce

My question is: does anyone know how to configure Hadoop to roll the log directory or otherwise prevent this? I'm trying to avoid simply setting the "mapred.userlog.retain.hours" and/or "mapred.userlog.limit.kb" properties, because I want to keep the log files.

I was also hoping to configure this in log4j.properties, but looking at the Hadoop 0.20.2 source, it appears to write directly to the log files instead of going through log4j. Perhaps I don't fully understand how it uses log4j.

Solution 1:

Unfortunately, there isn't a configurable way to prevent that. Every task for a job gets its own directory in history/userlogs, which holds the stdout, stderr, and syslog output files for that task. The retain-hours setting will help keep too many of those from accumulating, but if you want to keep the logs you'd have to write a log rotation tool to auto-tar them.
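
A rotation tool of that kind could look roughly like the sketch below. It is not part of Hadoop: the function name archive_old_userlogs, the archive destination, and the age cutoff are all assumptions made for illustration, and it assumes GNU-style find and tar. It writes one tarball per task directory and removes the directory only if the tarball was created, which works because regular files do not count toward the subdirectory limit.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop tool): archive task-log directories that
# were last modified more than a given number of days ago, then remove them.
archive_old_userlogs() {
    userlogs_dir=$1    # e.g. "$HADOOP_LOG_DIR/userlogs"
    archive_dir=$2     # assumed destination for the tarballs
    min_age_days=$3    # archive directories older than this many days

    mkdir -p "$archive_dir" || return 1

    # Look only at the immediate subdirectories of the userlogs directory.
    find "$userlogs_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$min_age_days" |
    while IFS= read -r task_dir; do
        name=$(basename "$task_dir")
        # Delete the original only if the tarball was written successfully.
        if tar -czf "$archive_dir/$name.tar.gz" -C "$userlogs_dir" "$name"; then
            rm -rf "$task_dir"
        fi
    done
}
```

It could be run from cron on each node, for example as archive_old_userlogs "$HADOOP_LOG_DIR/userlogs" /some/archive/dir 7, where the archive path and the 7-day cutoff are placeholders.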

We had this problem too when we were writing to an NFS mount, because all nodes would share the same history/userlogs directory. This means one job with 30,000 tasks would be enough to break the FS. Logging locally is really the way to go when your cluster actually starts processing a lot of data.

If you are already logging locally and still manage to process 30,000+ tasks on one machine in less than a week, then you are probably creating too many small files, causing too many mappers to spawn for each job.
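
To see how close a node is to the limit, a small check like the following may help. The function names and the threshold are hypothetical, and the sketch assumes a find that supports -mindepth and -maxdepth.

```shell
#!/bin/sh
# Hypothetical helper: print the number of immediate subdirectories of $1.
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Hypothetical check: succeeds (exit status 0) when the directory $1 holds
# at least $2 subdirectories.
userlogs_near_limit() {
    count=$(count_subdirs "$1")
    [ "$count" -ge "$2" ]
}
```

For example, userlogs_near_limit "$HADOOP_LOG_DIR/userlogs" 30000 && echo "userlogs is close to the subdirectory limit" would print a warning before the 32000 mark is reached; the 30000 threshold is an arbitrary safety margin.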

Solution 2:

Our solution is to modify our data collection process to concatenate files before running any jobs.
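
As a rough illustration of that idea, the sketch below merges every file in a local staging directory into a single file before it is loaded for a job. The function name concat_inputs and both paths are assumptions, not part of our actual collection process, and it assumes line-oriented text files that each end with a newline.

```shell
#!/bin/sh
# Hypothetical helper: concatenate all regular files directly inside
# input_dir into output_file, in sorted name order.
# output_file must live outside input_dir, or it would be merged into itself.
concat_inputs() {
    input_dir=$1
    output_file=$2

    # Start from an empty output file.
    : > "$output_file" || return 1

    find "$input_dir" -maxdepth 1 -type f | sort |
    while IFS= read -r f; do
        cat "$f" >> "$output_file"
    done
}
```

The merged file can then be uploaded with hadoop fs -put, so that the job sees one large input rather than many small ones and spawns fewer mappers.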

Solution 3:

Set the environment variable HADOOP_ROOT_LOGGER to "WARN,console" before starting Hadoop, so that only messages at WARN level or above are logged:

export HADOOP_ROOT_LOGGER="WARN,console"
hadoop jar start.jar
