Data Wrangling

Goals for this week

  1. Cleaning data to make it usable
  2. Pushing data to the web

Listen

This week we hear from Fiona Smith Hale, the Chief Knowledge Officer at Ingenium. I also reflect on how this course was built and my hopes for it, and suggest that we dial things back a bit in light of, well, the world not having gone as planned. That is to say, when I built the course I had imagined things (gestures wildly) having improved a lot by this point. But we’ve gone back into the Red zone, and everyone’s tired. So let’s dial things back a bit.

Do

If you’re feeling overwhelmed, you can take a break this week. Just continue to play with the notebooks from last week, and get a feel for pulling data into a notebook. You don’t have to do the items below unless you feel so inclined.

Now Optional:
  1. Cleaning Data with R; this notebook works with data that Chantal collected. In Discord, once you’ve finished with this, bounce some ideas around about what you might like to do with this data; you could even go back to the early weeks in the course and look at some of the examples there for ideas. (The point is: every project has to do data cleaning in order to get things to the point where you can begin to explore it more, but in cleaning it, we open some possibilities and close down others. Think about how this impacts GLAM data and research more generally…)
  2. Work with data in tabular format. For this, I’d like you to launch and follow along with Melanie Walsh’s excellent four-part tutorial on working with tabular data in Python using the ‘pandas’ data package. To launch the individual parts of that tutorial, look for the rocket ship icon and hit that (or download the notebooks and run them on your own machine).
  3. Go further: put data online with Datasette. Subset some data you’ve already collected from an Ottawa collection. Think about which columns of the data would be useful, and which cases (rows) of the data might make a good collection. Don’t include actual image files if you have them; instead use a link to the online location of the images. This will be more challenging than you might first expect, since it requires you to transform the data, document your decisions along the way, and then do command-line work to get it into an online space you control. You might wish instead to treat this as more of a thought experiment (noting the things you’d need to learn to achieve it).
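If it helps to see the shape of the subsetting step in item 3, here is a minimal sketch using only Python's standard library. Everything specific in it is an assumption for illustration: the records, the field names, the URLs, the year cut-off, and the `collection.db` filename are all made up, and your own data will call for different decisions. The idea is simply to keep the useful columns, keep the rows that make a coherent collection, store links to images rather than the images themselves, and write the result to a SQLite file, which is the format Datasette serves.

```python
import sqlite3

# Hypothetical records standing in for data you have already collected.
# Field names and URLs are invented for this example.
records = [
    {"id": 1, "title": "Snow plough", "year": 1925,
     "image_url": "https://example.org/img/1.jpg", "internal_note": "check date"},
    {"id": 2, "title": "Hand lantern", "year": 1890,
     "image_url": "https://example.org/img/2.jpg", "internal_note": ""},
    {"id": 3, "title": "Radio receiver", "year": 1962,
     "image_url": "https://example.org/img/3.jpg", "internal_note": "duplicate?"},
]

# Decision to document in your journal: which columns are kept, and why.
KEEP_COLUMNS = ["id", "title", "year", "image_url"]


def subset(rows, columns, min_year):
    """Keep only the chosen columns, and only rows dated min_year or later."""
    return [
        {col: row[col] for col in columns}
        for row in rows
        if row["year"] is not None and row["year"] >= min_year
    ]


def write_sqlite(rows, columns, db_path, table="items"):
    """Write the rows to a SQLite table, replacing any earlier version."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(row[col] for col in columns) for row in rows],
        )
    conn.close()


# Keep twentieth-century items only (an arbitrary cut-off for the example).
selected = subset(records, KEEP_COLUMNS, min_year=1900)
write_sqlite(selected, KEEP_COLUMNS, "collection.db")
```

With Datasette installed, running `datasette collection.db` on the command line should let you browse the table locally; getting it into an online space you control is a further step that this sketch does not cover.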

I don’t expect you to do these on your own in splendid isolation; I also don’t expect that everyone will complete them. Just push yourself until you get stuck, then talk about it / look for help in Discord.

With tech work, if it doesn’t come together in about 30 minutes, it won’t come in an hour. So take a break. Close the laptop. Call somebody up for help. Find another pair of eyes to look at the problem. I don’t want to hear that you labored heroically for 2 hours to do something. Jump into our social space and ask for advice.

Log your work

For your digital work, it is critical that you keep notes on what works, what doesn’t, what error messages you received, what help you received from others, what websites you went to, and so on.

Create a repository on GitHub; you can make it private.

Make a text file and call it journal.md. Put the date in it, write brief notes so that when you come back to all of this, you’ll know what you were doing.
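Writing the journal by hand in any text editor is perfectly fine. If you would rather add entries from a notebook, a small helper along these lines would do it. This is only a sketch: the function name `add_entry` and the layout of each entry (a date heading followed by the note) are my own assumptions, not a required format.

```python
from datetime import date
from pathlib import Path


def add_entry(note, journal=Path("journal.md"), today=None):
    """Append a dated entry to the journal, creating the file if needed."""
    today = today or date.today()
    # One heading per entry, in ISO format so entries sort by date.
    entry = f"## {today.isoformat()}\n\n{note.strip()}\n\n"
    with journal.open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry


# Example use (uncomment to write to journal.md in the current folder):
# add_entry("Tried the cleaning notebook; got stuck on an error, asked in Discord.")
```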

Drag and drop this file, and any other supporting materials you wish, onto your repository; once they’ve uploaded, hit the ‘commit’ button.

Share your repo in our Discord space if you want me or someone else to have a look, whether you’ve hit problems or have victories to report!

While this isn’t graded, per se, you will need this material when it comes to writing the documentation for the GLAM notebook you eventually create. Get in the habit of keeping careful process notes.