Data Curation

This assignment will be the first part of your class project. I want you to download the raw data from the appropriate source(s) and make it ‘ready’ for the final analysis.

For this you can choose to work on the same topic as you chose for Assignment 1 or choose an entirely new topic. In either case, I would like you to reach a stage from which you can undertake the analysis and get meaningful results.

Let me structure the problem as follows. I will need you to submit two things:

  1. A report describing the progress
  2. An Excel (or CSV format) sheet with the curated data.

The report needs to include:

  1. The problem selected (it can be the same as Assignment 1 or different)
  2. The model being explored
  3. Variables used (input and output) and their categories (e.g., numeric, categorical, ordered)
  4. Data source for each variable (URL, library, paper, etc.)
  5. Curation process: what did you need to do to curate the data? Show snippets of the raw data.

Please be direct in answering the questions. I expect this report to be 1-2 pages for most people.

The Excel data sheet should include a table with the input and output variables. It could look something like this.

Try to have between 2 and 7 input variables and between 30 and 10,000 rows of data. If your data requires a lot of manual effort, I would suggest sticking to 30 rows. If you can find a lot of data with minimal additional effort, then try to go for a larger number of rows.
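
If you save the curated data as CSV, a quick check against the size guidelines above can be sketched as follows. This is a minimal sketch under stated assumptions: the file has a single header row, every column not listed as an output is treated as an input, and the function name `check_curated_sheet` is hypothetical.

```python
import csv

# Size guidelines from the assignment text.
MIN_INPUTS, MAX_INPUTS = 2, 7
MIN_ROWS, MAX_ROWS = 30, 10_000


def check_curated_sheet(path, output_columns):
    """Return a list of guideline problems found in a curated CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # assumption: the first row is the header
        # Count only rows that contain at least one non-empty cell.
        n_rows = sum(1 for row in reader if any(cell.strip() for cell in row))

    problems = []
    missing = [name for name in output_columns if name not in header]
    if missing:
        problems.append(f"output column(s) not found: {missing}")

    # Assumption: every column that is not an output is an input.
    n_inputs = len([name for name in header if name not in output_columns])
    if not MIN_INPUTS <= n_inputs <= MAX_INPUTS:
        problems.append(
            f"{n_inputs} input variables; expected {MIN_INPUTS} to {MAX_INPUTS}"
        )
    if not MIN_ROWS <= n_rows <= MAX_ROWS:
        problems.append(f"{n_rows} data rows; expected {MIN_ROWS} to {MAX_ROWS}")
    return problems
```

Calling `check_curated_sheet("curated_data.csv", ["y"])` on a hypothetical file with an output column named `y` returns an empty list when the sheet is within the guidelines, and a list of short messages otherwise.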

Please submit your work via the dropbox as you did for the earlier two assignments. 

