I had done a small data mining project for a client some time back. He had supplied me with some publicly available data.

Since the data was still lying around with me, and it being public data, I decided to play around with it. The raw data is plain text in CSV format. A year's data comes to 60k+ text files, totalling about 4 GB of disk space. 4 GB does not fall into the category of “Big-Data”, but trying to do a simple

mv <src>/* <dst>

results in an error

-bash: mv: Argument list too long
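The shell expands * and passes the entire file list to mv as arguments, and the kernel caps the total size of an argument list. A quick way to see that limit (in bytes) on either Linux or OS X:

getconf ARG_MAX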

A simple solution is a loop that moves one file at a time, so each mv gets only a single argument

for f in <src>/*
do
    mv "$f" <dst>
done
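Another standard fix, if you are on GNU coreutils, is find with -exec ... +, which batches the filenames into argument lists sized to fit under the limit. The -t flag (name the target directory first) is GNU mv, so this sketch assumes Linux:

find <src> -maxdepth 1 -type f -exec mv -t <dst> {} +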

These are sequential, single-threaded operations though, so enter GNU Parallel. Now I can do the same, but this time in parallel, utilising all the available cores.

ls | parallel mv {} <dst>
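Here parallel reads the filenames from stdin and, by default, runs one job per CPU core. Piping ls works fine for these files, but as a hedge against filenames containing spaces or newlines, a null-delimited find can feed parallel instead (the -0 option tells it to split input on NUL):

find <src> -maxdepth 1 -type f -print0 | parallel -0 mv {} <dst>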

Next, I wanted to count the total number of lines across the 60k+ text files.

cat * | parallel --pipe wc -l | awk '{s+=$1} END {print s}'

The thing I noticed was that doing a cat * on 60k+ files works on Linux but fails on OS X, presumably because OS X allows a much smaller ARG_MAX, so the expanded file list no longer fits. I guess had I not been an OS X user, I might not have discovered GNU Parallel!
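A variant that should work on both systems, assuming GNU Parallel is installed, is to skip the cat * expansion entirely and let parallel read the filenames from stdin; each job prints a per-file line count, and awk sums them as before:

ls | parallel wc -l | awk '{s+=$1} END {print s}'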

For more such examples, do visit the GNU Parallel Tutorial.

Note: More examples will be added as I keep working.