I have been working with Hadoop in production for a while, and here are some of the performance tuning tips I learned on the job. Many of my tasks saw performance improvements of over 50%. These guidelines work well at my workplace; I hope they can help you as well.
Tuning Hadoop run-time parameters
Hadoop provides a set of CPU, memory, disk, and network options for performance tuning. Most Hadoop tasks are not CPU-bound; what we usually look into is optimizing memory usage and disk spills.
Memory tuning
The general rule for memory tuning is: use as much memory as you can, but don't trigger swapping. The parameter you can set for task memory is mapred.child.java.opts. You can put it in your configuration file.
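For example (the 2 GB heap below is just a placeholder; pick a value based on what your monitoring shows, and leave headroom so the node never swaps):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <!-- placeholder heap size; adjust to what the node can actually spare -->
  <value>-Xmx2048m</value>
</property>
```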
You can find the best memory parameters by monitoring memory usage on the servers with Ganglia, Cloudera Manager, or Nagios. Cloudera has a slide deck focused on memory usage tuning; the link is here
Minimize the map disk spill
Disk IO is usually the performance bottleneck. There are a lot of parameters you can tune to minimize spilling. The ones I use the most are:
- Compress mapper output
- Use about 70% of heap memory for the spill buffer in the mapper
In your configuration file, you can write:
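(A sketch of these two settings; the codec and buffer size here are placeholder choices, not universal values.)

```xml
<!-- compress intermediate mapper output; Snappy is one common codec choice -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- sort/spill buffer in MB; roughly 70% of a hypothetical 1 GB mapper heap -->
<property>
  <name>io.sort.mb</name>
  <value>700</value>
</property>
```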
Although you can further tune the reducer buffer, the mapper sort record percent, and various other settings, I found the best thing to do is to reduce the mapper output size. Most of the time, performance is fast enough after I refactor the mapper to output as little data as possible. For more information, check the same Cloudera performance tuning guide.
Tuning mapper tasks
Unlike reducers, whose count you can specify explicitly, the number of mapper tasks is set implicitly. The tuning goal for mappers is to control the number of mappers and the size of each one's input. When dealing with large files, Hadoop splits the file into smaller chunks so that mappers can run in parallel. However, initializing a new mapper task usually takes a few seconds, and that is overhead we want to minimize. These are the things you can do:
- Reuse JVM tasks.
- If the average mapper running time is shorter than one minute, you can increase mapred.min.split.size, so that fewer mappers are allocated per slot, which reduces the mapper initialization overhead.
- Use CombineFileInputFormat for a bunch of smaller files. I had an implementation that also uses mapred.min.split.size to implicitly control the mapper size. You can find the project on GitHub; the explanation of the project can be found on my blog.
The configuration file would look like this:
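(The split size below is just a placeholder; the right value depends on how long your mappers actually run.)

```xml
<!-- reuse the child JVM for an unlimited number of tasks per job -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
<!-- placeholder: 512 MB minimum split, so many small inputs collapse into fewer mappers -->
<property>
  <name>mapred.min.split.size</name>
  <value>536870912</value>
</property>
```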
Use a configuration file and command-line arguments to set parameters
When I first started on Hadoop, I set those parameters inside the Java program, but that was hard-coded and inflexible. Thankfully, Hadoop provides the Tool interface and the ToolRunner class to parse those parameters for you. Here's a sample program:
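(A minimal sketch of such a driver; the class name, job setup, and argument handling are placeholders.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already holds everything passed via -conf files or -D options
        Job job = Job.getInstance(getConf(), "my job"); // Hadoop 2 style; use new Job(getConf()) on 1.x
        job.setJarByClass(MyJob.class);

        // set your mapper, reducer, and output types here ...

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-conf, -D, ...) before calling run()
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}
```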
If your main class implements the interface, your program can take a config file as input:
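(Jar name, class name, and paths below are placeholders.)

```bash
hadoop jar myjob.jar com.example.MyJob -conf my-conf.xml /input /output
```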
You can even pass extra parameters through the command line like this:
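(Again with placeholder names; -D overrides a single property for this run.)

```bash
hadoop jar myjob.jar com.example.MyJob -D mapred.reduce.tasks=10 /input /output
```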
Setting configuration as run-time arguments makes it easier to test different parameters without recompiling the program.
Tuning application-specific performance
Beyond the general Hadoop parameter setup, you can optimize your map-reduce program with some small tricks. Here are the ones I use the most.
Minimize your mapper output
Recall that mapper spills are a serious performance bottleneck. The size of the mapper output affects disk IO, network IO, and memory usage during the shuffle phase. Minimizing the mapper output can improve overall performance a lot.
To do this, you can try the following:
- Filter out records on the mapper side, not on the reducer side.
- Use minimal data to form your map output key and map output value.
- Extend BinaryComparable or use Text for your map output key.
- Set the mapper output to be compressed.
Of all the optimization tips, I found this one makes the biggest difference to many of my tasks, unless I simply can't find a smaller key to shrink the mapper output.
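As a rough illustration of a couple of these tips on the driver side (the writable types and flag here are placeholders, not the exact settings from my jobs):

```java
// Inside the Tool's run() method:
Configuration conf = getConf();
conf.setBoolean("mapred.compress.map.output", true);   // compress intermediate mapper output

Job job = Job.getInstance(conf, "compact map output");
// compact fixed-size writables keep the shuffle payload far smaller than verbose Text blobs
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(VIntWritable.class);
```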
Balancing the reducers' load
Another common performance issue you might encounter is unbalanced reducer tasks: one or a few reducers take most of the mapper output and run extremely long compared to the others.
To solve this, you can either:
- Implement a better hash function in your Partitioner class (see the sketch after this list).
- If you know which keys are causing the issue, write a preprocessing job that separates them using MultipleOutputs, then use another map-reduce job to process those special keys.
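A skeleton of the first option might look like this; the mixing step is only an example, and the point is to plug in whatever hash spreads your particular keys more evenly (key/value types must match your map output types):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical custom partitioner sketch
public class BalancedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int hash = key.toString().hashCode();
        hash ^= hash >>> 16;                                  // extra mixing, purely illustrative
        return (hash & Integer.MAX_VALUE) % numPartitions;    // mask sign bit so the result is non-negative
    }
}
```

Register it in the driver with job.setPartitionerClass(BalancedPartitioner.class).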
Conclusion
It's fun to write raw map-reduce jobs because it gives you more precise control over performance tuning. If you have already worked with Hive or Pig, I encourage you to try optimizing the same job with raw map-reduce. You will find a lot of performance gains and more room to tune. If you are curious, you can also check Yahoo's Hadoop performance tuning guides.