An Introduction to Data Wrangling

This introduction to data wrangling covers obtaining, cleaning and using data. It’s organized as a series of simple tasks that you work through.

Task 1: Finding a Question

Data isn’t an end in itself. We usually want data to help us answer some question or help us do some activity.

So your first task is to think of a question that you’d like to answer and that could be answered by getting hold of some data.

Examples:

  • How many people are without clean drinking water in the world this year?
  • How much did my country spend on defence last year?
  • What is a normal weight for someone my age?

You can find more questions, and requests for specific datasets, on http://getthedata.org/

For the purposes of exposition, the rest of this course focuses on the specific example question shown below. Of course, you should focus on your own question.

Our question: How much financial support did the US government give to banks and other companies during the financial crisis of 2008-2009?

Task 2: Finding Data to Answer your Question

Now it’s time to go and find data to answer your question.

We’ll focus on our own example question but you can use the same techniques for your own question.

Our question: How much financial support did the US government give to banks and other companies during the financial crisis of 2008-2009?

In locating relevant data you have two options:

  • Use a standard web search engine such as Google or Yahoo.
  • Go directly to a specific service that lets you search data relevant to your topic. For example, if what you are looking for is likely to be in government statistics, you can go directly to the site of your official statistics agency. Alternatively, you can go to a dedicated data hub like http://thedatahub.org/.

We usually recommend the first option, because a search on a generic engine will usually also find datasets held in more specific systems.

So in our case, we’ll begin with a search on Google:

“US government bailout data”

Using Google or another search engine to find data is something of an art, but the general approach is to start with a basic search and then follow links from the results, either to reach the data directly or to refine your search.
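If you find yourself trying several variations of a search, it can help to build the search links in a small script. The following is a minimal Python sketch, not a required part of the task. It assumes that Google’s search page takes the query in a `q` parameter. The function name `search_url` and the extra terms "TARP" and "csv" are only illustrative choices for refining the basic search.

```python
from urllib.parse import urlencode

# Assumption: the search engine accepts the query text in a "q" parameter.
SEARCH_BASE = "https://www.google.com/search"


def search_url(*terms):
    """Build a search URL from one or more search terms (hypothetical helper)."""
    query = " ".join(terms)
    return SEARCH_BASE + "?" + urlencode({"q": query})


# Start with the basic search from the text.
basic = search_url("US government bailout data")

# Then refine it with extra terms (illustrative choices, not from the course).
refined = search_url("US government bailout data", "TARP", "csv")

print(basic)
print(refined)
```

Opening each printed link in a browser gives you the results for that search, so you can compare how much each refinement helps.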

So in our case it becomes apparent that the US government has released official data. One link obtained is:

http://finacialstability.org

As of June 2011 this link redirects to:

http://www.treasury.gov/initiatives/financial-stability
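Links to data sources move over time, so it can be worth checking where an address currently ends up. The following Python sketch shows one way to do that with the standard library. The function name `final_url` and its `opener` parameter are hypothetical names chosen for this example, and what it returns today may well differ from what we saw in June 2011.

```python
import urllib.request


def final_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return the address a URL ends up at after any HTTP redirects.

    `opener` is a hypothetical hook so the function can be tried out
    without network access; by default it is urllib's urlopen.
    """
    # urlopen follows redirects automatically; geturl() reports the final URL.
    with opener(url, timeout=timeout) as response:
        return response.geturl()


# Example use (needs network access, and the old address may no longer resolve):
# print(final_url("http://finacialstability.org"))
```

If the address printed differs from the one you typed, the site has redirected you, and it is the final address that is worth recording alongside your data.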

In the left-hand sidebar there is an Investment Programs option, which gives a list of investments. We have hit the mother lode. However, there are also some summary sites that seem to have consolidated versions of the data. For example, via http://getthedata.org/questions/218/size-and-terms-of-us-government-support-for-insurer-aig we have:

http://subsidyscope.org/bailout/tarp/