I gave a talk titled “Exploratory Data Analysis” at GNUnify 2014. The intention of the talk was to show people how easy it is to visualise data with tools like pandas, and to introduce some great tools like OpenRefine for cleaning data.

I used some data I had pulled from the Twitter stream for the #NowListening hashtag. The original idea was to discover new music, but I later realised that this was heading into the realm of Natural Language Processing, so I stuck to showcasing some analysis based on the metadata of each tweet.
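To give a flavour of what metadata-only analysis looks like, here is a minimal sketch using just the Python standard library. It is not the pandas code from the talk: the three sample records are made up, and I am assuming each tweet is one JSON object per line with `created_at` and `lang` fields, as in the Twitter streaming format.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical sample records; real data would be read from a file
# of streamed tweets, one JSON object per line.
raw_lines = [
    '{"created_at": "Sat Feb 15 09:05:00 +0000 2014", "lang": "en", "text": "#NowListening track one"}',
    '{"created_at": "Sat Feb 15 09:40:00 +0000 2014", "lang": "es", "text": "#NowListening track two"}',
    '{"created_at": "Sat Feb 15 21:15:00 +0000 2014", "lang": "en", "text": "#NowListening track three"}',
]

tweets = [json.loads(line) for line in raw_lines]

# Count tweets per language, ignoring the tweet text entirely.
by_lang = Counter(t["lang"] for t in tweets)

# Count tweets per hour of day (UTC), parsed from the timestamp.
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
by_hour = Counter(
    datetime.strptime(t["created_at"], TIME_FORMAT).hour for t in tweets
)

print(by_lang.most_common())
print(sorted(by_hour.items()))
```

With pandas the same counts would typically come from loading the records into a DataFrame and calling `value_counts()` on a column, which is also the easiest route to a quick bar chart.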

I briefly showcased how good old Unix tools like sed, awk and GNU Parallel can be used to manipulate data.

I also showed how data from the Indian Government Data Portal can be used to see trends in mobile versus landline penetration and usage patterns.

Lastly, I used StatsModels, a Python module, to show how linear regression is done and how the results change when you fit only a subset of the variables versus all the variables available. I got the data from the website of the book “An Introduction to Statistical Learning”.
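As a rough illustration of that last point, here is a small sketch of how a coefficient can shift once a second, correlated variable enters the model. This is not the notebook from the talk: the numbers are made up (not the book's data), and instead of StatsModels it uses a hand-rolled least-squares solver in plain Python so that it runs without any dependencies. The function name `ols` is my own.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of predictor tuples, one per observation.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up data where y = 1 + 2*x1 + 3*x2 exactly, with x1 and x2 correlated.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

partial = ols([(a,) for a in x1], y)   # y regressed on x1 only
full = ols(list(zip(x1, x2)), y)       # y regressed on x1 and x2

print("x1 only  :", [round(c, 3) for c in partial])
print("x1 and x2:", [round(c, 3) for c in full])
```

Fitted on x1 alone, the slope absorbs part of the effect of the omitted x2 and comes out much larger than the 2 used to generate the data; with both variables in the model, the generating coefficients are recovered. StatsModels performs the same fit and additionally reports standard errors, p-values and R-squared.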

I also wanted to showcase the use of OpenRefine. But the organisers insisted, at the last minute, that I use a GNU/Linux machine, since my Apple MacBook, running a proprietary operating system, could not be used at a GNU event. I did not have time to move my data and install OpenRefine onto the new box that was given to me.

The slide deck of the talk is available on SlideShare. The code (in the form of IPython Notebooks) and data are available on GitHub in the repository Exploratory-Data-Analysis.