Capture Directory Context in Hadoop Mapper

I have been using Hadoop for data processing and data warehousing for a while. One problem we encountered is that the map-reduce framework abstracts its input from files into lines, which makes it really difficult to apply logic that depends on the source file or directory. Things got worse when we needed to aggregate data across various versions of the input sources. After digging into the Hadoop source code, here is my solution.

Yet Another Monad Tutorial in 15 Minutes

Functional programming has become popular these days, but unlike object-oriented languages, FP languages differ greatly from one another. Some use strict evaluation while others evaluate lazily; many new concurrency models have been introduced; furthermore, state is handled differently too.

Haskell, for example, does not have mutable state, but uses its powerful type system to construct the stateful program flow normally used in other languages. As you might guess, Monad is one of the types that does the trick. Defining a Monad type is much like defining a class in an object-oriented language. However, a Monad can do much more than a class. It’s a type that can be used for exception handling, constructing parallel program workflows, or even a parser generator!

By learning Monad, you’ll gain a different perspective on how to program, and rethink the composition of data logic beyond the object-oriented programming kingdom.

Convert Utf8 Literals in Java

I thought this problem had already been solved, but it hasn’t: consider a string like \xe6\x84\x8f\xe6\xb3\x95\xe5\x8d\x8a\xe5\xaf\xbc hello world. How can you transform it into the UTF-8 encoded string 意法半导 hello world? Note that the string you get is encoded in ASCII, not UTF-8; the original UTF-8 bytes were transferred into hex literals. I thought I could use whatever library turned up in the first Google result, but there’s actually no trivial solution out there on the web.
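As a rough illustration of the problem, here is a minimal sketch in plain Java. It is not necessarily the approach taken in the full article. The class name HexLiteralDecoder and its decode method are hypothetical, and the sketch assumes the input is pure ASCII and that only the \xNN escape form needs handling.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the article): turns "\xNN" hex literals
// back into the UTF-8 text they spell out.
class HexLiteralDecoder {

    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (isHexEscape(input, i)) {
                // Two hex digits after "\x" form one raw byte.
                int high = Character.digit(input.charAt(i + 2), 16);
                int low = Character.digit(input.charAt(i + 3), 16);
                bytes.write(high * 16 + low);
                i += 4;
            } else {
                // Assumption: everything outside an escape is plain ASCII.
                byte[] raw = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
                bytes.write(raw, 0, raw.length);
                i++;
            }
        }
        // Decode the collected bytes as UTF-8 in one pass, so multi-byte
        // characters that span several escapes are reassembled correctly.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    // True when input has the form \xHH starting at position i.
    private static boolean isHexEscape(String input, int i) {
        return input.charAt(i) == '\\'
                && i + 3 < input.length()
                && input.charAt(i + 1) == 'x'
                && Character.digit(input.charAt(i + 2), 16) >= 0
                && Character.digit(input.charAt(i + 3), 16) >= 0;
    }
}
```

The key design choice is to collect all bytes first and decode once at the end; decoding each escape on its own would break characters that are encoded as three bytes.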

Java Fast IO Using java.nio API

In modern computing, IO is always a big bottleneck. I recently encountered a problem: reading a 355MB index file into memory and doing run-time lookups based on the index. This process will be repeated by thousands of Hadoop job instances, so fast IO is a must. By using the java.nio API I sped the process up from 194.054 seconds to 0.16 seconds! Here’s how I did it.
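The excerpt does not say which part of java.nio was used, so the following is only a sketch of one common option: memory-mapping the file with FileChannel.map. The class MappedIndexReader and the index layout (fixed-width 8-byte records) are assumptions made for illustration, and the timings quoted above are not claimed for this code.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical reader: assumes the index is a flat array of 8-byte
// big-endian longs, which is an illustrative layout only.
class MappedIndexReader {

    // Maps the whole file read-only. The mapping stays valid after the
    // channel is closed. A single mapping is limited to Integer.MAX_VALUE
    // bytes, which is enough for a file of a few hundred megabytes.
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Number of fixed-width records in the mapped index.
    static int recordCount(MappedByteBuffer index) {
        return index.capacity() / Long.BYTES;
    }

    // Absolute read: does not move the buffer position, so repeated
    // lookups do not interfere with each other.
    static long recordAt(MappedByteBuffer index, int i) {
        return index.getLong(i * Long.BYTES);
    }
}
```

With a mapping like this, a lookup is a direct read from the mapped region rather than a stream read through intermediate buffers, which is why this style is often tried first for large read-only files.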

Process Small Files on Hadoop Using CombineFileInputFormat (2)

Following the previous article, in this post I ran several benchmarks and tuned the performance from 3 hours 34 minutes down to 6 minutes 8 seconds!

Process Small Files on Hadoop Using CombineFileInputFormat (1)

Processing small files is an old, typical problem in Hadoop. On Stack Overflow people are advised to use CombineFileInputFormat, but I haven’t found a good step-by-step article that teaches you how to use it. So, I decided to write one myself.

My Emacs Setting on Servers

My desktop Emacs config is complicated, but I need a minimal config for Emacs installed on Ubuntu servers. These are my notes on how to configure Emacs on servers in a way that works for me.

Minimal NodeJS Router

Here’s the problem: you’re prototyping a website that has a powerful front-end framework like EmberJS or AngularJS and syncs JSON data with your NodeJS back-end, but you want your NodeJS code to be light and clean.

You can use a NodeJS framework like restify, expressJS, director or whatever, but is there a way to write a minimal router using regex and switch statements? Yes.

Writing Java Programs on a Remote Server

Recently I started to work on Hadoop and big data processing, but I was frustrated with Eclipse and the development environment. We run Hadoop on a remote cluster, but develop map-reduce programs on a laptop. The development cycle was pretty slow because we needed to upload the jar for every release. Another thing is that Eclipse is too inefficient for a Vim and Emacs hacker like me. Thankfully I’m not the only one who thinks this way; Eric Van Dewoestine developed Eclim, which lets you work on Java programs with a headless Eclipse and Vim/Emacs! Here are the installation steps:

Laziness and Memoization in Clojure

I now have a job at supplyframe Inc., and luckily I can use Clojure for work! Clojure is a young language created by Rich Hickey in 2007. It uses Lisp syntax, has immutable data structures by default, and supports both strict and lazy evaluation. As Chris Okasaki suggested:

Strict evaluation is useful in implementing worst-case data structures and lazy evaluation is useful in implementing amortized data structures.

It’s really cheap to define lazy or strict data structures in Clojure that have low amortized cost even when used persistently. Let’s dig into the source code and see how Clojure implements them.
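To make the idea of laziness plus memoization concrete before reading any real source, here is a small Java sketch of a memoized thunk: the computation is deferred until first use and its result is cached afterwards. This is an illustration of the general technique only, not Clojure’s actual implementation; the class name MemoizedThunk is hypothetical.

```java
import java.util.function.Supplier;

// Hypothetical illustration: a deferred computation whose result is
// cached after the first call to get().
final class MemoizedThunk<T> implements Supplier<T> {
    private Supplier<? extends T> thunk; // set to null once evaluated
    private T value;

    MemoizedThunk(Supplier<? extends T> thunk) {
        this.thunk = thunk;
    }

    @Override
    public synchronized T get() {
        if (thunk != null) {
            value = thunk.get(); // runs at most once
            thunk = null;        // release the closure for garbage collection
        }
        return value;
    }
}
```

Checking the thunk rather than the value lets a computation that legitimately returns null be cached too. The first call pays the full cost and every later call is a field read, which is the kind of cost profile that amortized analysis of lazy data structures relies on.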