In modern computing, IO is often the biggest bottleneck. I recently ran into a problem: reading a 355MB index file into memory and doing run-time lookups based on that index. This process is repeated by thousands of Hadoop job instances, so fast IO is a must. By using the
java.nio API I sped the process up from 194.054 seconds to 0.16 seconds! Here’s how I did it.
The Data to Process
This performance tuning exercise is very specific to the data I’m working on, so it’s better to explain the context first. We have a long IP list (26 million entries in total) that we want to keep in memory. The IPs are in text form, and we transform each one into a signed integer and store it in a Java array. (We use signed integers because Java doesn’t support unsigned primitive types…) The transformation is pretty straightforward:
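A minimal sketch of the conversion, assuming a helper named `ipToInt` (the class and method names here are illustrative, not the original listing):

```java
class IpToInt {
    // Convert a dotted-quad IP string (e.g. "10.0.0.1") into a signed 32-bit int
    // by shifting each octet into place, most significant octet first.
    static int ipToInt(String ip) {
        String[] parts = ip.split("\\.");
        int result = 0;
        for (String part : parts) {
            result = (result << 8) | Integer.parseInt(part);
        }
        return result;
    }
}
```

Note that an address like 255.255.255.255 comes out as -1, which is exactly the signed-integer wraparound mentioned above.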
However, reading the IPs in text form line by line is really slow.
Strategy 1: Line-by-line text processing
This approach is straightforward: just a standard readline program in Java.
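A sketch of this line-by-line reader, assuming the same illustrative `ipToInt` conversion (names and file path are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class TextIpReader {
    // Illustrative conversion helper (dotted quad -> signed int).
    static int ipToInt(String ip) {
        int result = 0;
        for (String part : ip.split("\\.")) {
            result = (result << 8) | Integer.parseInt(part);
        }
        return result;
    }

    // Read one IP per line from the text file and fill an int array.
    static int[] read(String path, int count) throws IOException {
        int[] ips = new int[count];
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            int i = 0;
            while ((line = reader.readLine()) != null && i < count) {
                ips[i++] = ipToInt(line.trim());
            }
        }
        return ips;
    }
}
```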
The resulting time was 194.054 seconds.
Strategy 2: Encode ip in binary format
The
ip.tsv file is 355MB, which is inefficient both to store and to read. Since I’m only reading it into an array, why not store it as one big chunk of binary data and read that back when I need it? This can be done with DataInputStream and DataOutputStream. After the conversion, the file shrank to 102MB.
Here’s the code to read ip in binary format:
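A sketch of this approach with DataOutputStream and DataInputStream (class and method names are illustrative; the write side is included so the round trip is visible):

```java
import java.io.*;

class BinaryIpFile {
    // Write the int array as raw 4-byte ints (DataOutputStream is big-endian).
    static void write(String path, int[] ips) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            for (int ip : ips) out.writeInt(ip);
        }
    }

    // Read the ints back one at a time with DataInputStream.
    static int[] read(String path, int count) throws IOException {
        int[] ips = new int[count];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            for (int i = 0; i < count; i++) {
                ips[i] = in.readInt();
            }
        }
        return ips;
    }
}
```

Even with buffering, each `readInt()` call still pulls the data out four bytes at a time, which helps explain the disappointing timing below.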
The resulting time was 72 seconds. Much slower than I expected.
Strategy 3: Read the file using java.nio API
java.nio is a newer IO API that maps to low-level system calls. With these calls we can perform libc-style operations like
fread, and bulk-copy from disk to memory. For the C API, see the GNU C library reference.
The terminology in C and Java differs a little. In C, you control file IO through file descriptors, while in
java.nio you use a FileChannel for reading, writing, or manipulating the position in the file. Another difference is that in C you can bulk-copy directly with the
fread call, whereas in Java you need an additional
ByteBuffer layer to map the data. To understand how it works, it’s easiest to read the code:
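A sketch of the bulk-read approach with FileChannel and ByteBuffer (names are illustrative; a matching write helper is included so the example is self-contained):

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class NioIpReader {
    // Write the whole int array in one channel write (big-endian by default).
    static void write(String path, int[] ips) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(ips.length * 4);
        buffer.asIntBuffer().put(ips);
        try (FileOutputStream fos = new FileOutputStream(path);
             FileChannel channel = fos.getChannel()) {
            channel.write(buffer);
        }
    }

    // Bulk-read the whole file into one ByteBuffer, then pull the ints
    // out in a single get() instead of one read call per int.
    static int[] read(String path, int count) throws IOException {
        int[] ips = new int[count];
        try (FileInputStream fis = new FileInputStream(path);
             FileChannel channel = fis.getChannel()) {
            ByteBuffer buffer = ByteBuffer.allocate(count * 4);
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep filling until the buffer holds the whole file
            }
            buffer.flip(); // switch the buffer from writing mode to reading mode
            buffer.asIntBuffer().get(ips);
        }
        return ips;
    }
}
```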
The code should be largely self-documenting. The main thing to note is the byte buffer’s
flip() method. This call switches the buffer from writing mode (being filled from disk) to reading mode, so that we can read the data into an int array via the
get() method. Another thing worth mentioning is that Java reads and writes data in big-endian byte order by default. You can use
ByteBuffer.order(ByteOrder.LITTLE_ENDIAN) to change the byte order if you need to. For more about
ByteBuffer here’s a good blog post that explains it in detail.
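The byte-order point can be seen in a tiny example: with the default big-endian order the most significant byte comes first, while little-endian puts the least significant byte first (class and method names here are just for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    // Big-endian (Java's default): int 1 is stored as 00 00 00 01.
    static byte firstByteBigEndian(int value) {
        ByteBuffer buf = ByteBuffer.allocate(4);
        buf.putInt(value);
        return buf.get(0);
    }

    // Little-endian: int 1 is stored as 01 00 00 00.
    static byte firstByteLittleEndian(int value) {
        ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(value);
        return buf.get(0);
    }
}
```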
With this implementation, the resulting time was
0.16 seconds! Glory to the java.nio API!