A simple scraper

This workbook uses a Python package called Beautiful Soup to pull together information from the collections search page at the Canadian Museum of History.

Useful links, queries

https://programminghistorian.org/en/lessons/intro-to-beautiful-soup

https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=25

https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=200&view=list

This notebook assumes Python 3.

If you are viewing this on the course website, you're seeing a static version; right-click and save the .ipynb file to use it on your own machine, or upload it to GitHub for live use via Binder.

In [ ]:
# the first time you run this, you need to install beautiful soup.
# afterwards, you don't.

!pip install beautifulsoup4
In [ ]:
# we need this so that we can grab stuff off the web
import requests
In [ ]:
# target webpage
url = "https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=500&view=list"

# Getting the webpage, creating a Response object.
response = requests.get(url)
 
# Extracting the source code of the page.
data = response.text
 
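One small addition worth making before parsing anything (this line isn't in the original notebook): check that the request actually succeeded, so a failed download gives a clear error instead of an empty page.

In [ ]:
# stop with a clear error if the request failed (e.g. a 404)
response.raise_for_status()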
In [ ]:
from bs4 import BeautifulSoup

# we give that data to BS so that it can extract what we're interested in
# ('lxml' is a fast parser but needs installing separately:
# !pip install lxml - or use python's built-in 'html.parser' instead)
soup = BeautifulSoup(data, 'lxml')

print(soup.prettify())

# and lo! the original html of the search page results
In [ ]:
# so let's find the links to the data we're interested in 
# ie the individual records

# the html tag for a hyperlink is <a>
links = soup.find_all('a')

for link in links:
    print(link)

Pause

Notice that some of the <a> tags carry an attribute called class, which tells your browser that this particular tag should be understood as being different from other tags. If you read the html carefully, you can see that there are certain kinds of <a> tags we want. The next block iterates through the html looking for the ones marked with the class 'collection-item-wrapper' - these hold the direct URL to an item record.

In [ ]:
# so let's get those links
# find_all can filter on the class attribute directly
links = soup.find_all('a', {'class': 'collection-item-wrapper'})

for link in links:
    name = link.contents[0]
    fullLink = link.get('href')
    print(name)
    print(fullLink)
In [ ]:
# let's write those links to a file
import csv

# a with-block makes sure the file is properly closed (and flushed) when we're done
with open("histmuse.csv", "w", newline="") as outfile:
    f = csv.writer(outfile)
    f.writerow(["Name", "Link"]) # Write column headers as the first line

    links = soup.find_all('a')
    for link in links:
        name = link.contents[0] if link.contents else "" # some <a> tags are empty
        fullLink = link.get('href')
        f.writerow([name, fullLink])

Okay! The csv you now have is a bit messy, admittedly, but you could easily clean it up in Excel or a text editor so that it just looks like this:

https://www.historymuseum.ca/collections/artifact/1337564/
https://www.historymuseum.ca/collections/artifact/2359060/
https://www.historymuseum.ca/collections/artifact/1316383/
https://www.historymuseum.ca/collections/artifact/2365313/
https://www.historymuseum.ca/collections/artifact/2365193/
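
If you'd rather do that cleanup in Python than by hand, here's a minimal sketch. It assumes the histmuse.csv written above, and it keys on the /collections/artifact/ pattern you can see in those links:

In [ ]:
# keep only the artifact links from histmuse.csv and write them to urls.txt
import csv

with open("histmuse.csv") as infile, open("urls.txt", "w") as outfile:
    reader = csv.reader(infile)
    next(reader) # skip the header row
    for row in reader:
        if len(row) > 1 and "/collections/artifact/" in row[1]:
            outfile.write(row[1] + "\n")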

If you saved that cleaned-up file as urls.txt and then passed it to wget at the command line, like this:

wget -i urls.txt -r --no-parent -nd -w 2 --limit-rate=100k

you'd get a folder of data (html pages, in this case).

If you don't have wget installed on your machine, follow the instructions at Programming Historian.
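
Or, if installing wget is a hassle, you could fetch the pages with the requests library you imported earlier. Here's a rough stand-in for the wget command above (the 2-second pause mirrors wget's -w 2 flag):

In [ ]:
# a rough python stand-in for the wget command above
import time

with open("urls.txt") as urlfile:
    urls = [line.strip() for line in urlfile if line.strip()]

for url in urls:
    page = requests.get(url)
    # name each saved file after the artifact id at the end of the URL
    artifact_id = url.rstrip("/").split("/")[-1]
    with open(artifact_id + ".html", "w", encoding="utf-8") as out:
        out.write(page.text)
    time.sleep(2) # be polite to the server, like wget's -w 2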

But wait! There's other data from the original search screen we could grab. Look at the original html:

<span class="collection-item-metadata location">Canada</span>
<span class="collection-item-metadata artifact-number">2011.175.11</span>
<span class="collection-item-metadata date-made">1977</span>

Let's grab that.

Modify the code to grab other information of interest

Using what you know, download other kinds of metadata and put it together into a csv. Here are some partial examples to get you started.

In [ ]:
spans = soup.find_all('span', {'class': 'collection-item-metadata artifact-number'})
for span in spans:
    print(span)
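
Note that print(span) shows the whole tag; span.text gives just the text inside it. Here's a minimal sketch pulling the three metadata fields together, under one assumption worth checking against the html: that every result carries all three spans, in matching order.

In [ ]:
# pair up the three metadata fields for each search result
# (assumes every result has all three spans, in the same order)
locations = soup.find_all('span', {'class': 'collection-item-metadata location'})
numbers = soup.find_all('span', {'class': 'collection-item-metadata artifact-number'})
dates = soup.find_all('span', {'class': 'collection-item-metadata date-made'})

for location, number, date in zip(locations, numbers, dates):
    print(location.text, number.text, date.text)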

Let's grab some images

In [ ]:
imgs = soup.find_all('img', {'class': 'collection-item-image'})
for img in imgs:
    print(img)
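
Each of those <img> tags carries the image's address in its src attribute. As a starting point, here's a minimal download sketch; it assumes src holds a full URL (check the output above to confirm), and the filename handling is deliberately naive:

In [ ]:
# save each search-result image to disk
# (assumes each img tag's src attribute is a full URL)
import os

imgs = soup.find_all('img', {'class': 'collection-item-image'})
for img in imgs:
    src = img.get('src')
    if not src:
        continue
    filename = os.path.basename(src.split("?")[0]) # naive filename from the URL
    picture = requests.get(src)
    with open(filename, "wb") as imgfile:
        imgfile.write(picture.content)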

Do you see how you could, for instance, gather a dataset of images of a particular kind of object? Write the complete code in a new cell below. It's ok if you do it in a series of steps.

In [ ]: