{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A simple scraper \n",
"\n",
"This workbook uses a python package called 'Beautiful Soup' to pull together information from the collections search page at the Canadian Museum of History.\n",
"\n",
"Useful links, queries\n",
"\n",
"https://programminghistorian.org/en/lessons/intro-to-beautiful-soup\n",
" \n",
"https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=25\n",
"\n",
"https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=200&view=list\n",
"\n",
"This notebook assumes python 3. \n",
"\n",
"If you are viewing this on the course website, this is a static version; Right-click and save as the [ipynb file](simple-scraper.ipynb) to use on your own machine, or upload to Github for live use via [Binder](https://binder.org).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the first time you run this, you need to install beautiful soup.\n",
"# afterwards, you don't.\n",
"\n",
"!pip install beautifulsoup4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we need this so that we can grab stuff off the web\n",
"import requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# target webpage\n",
"url = \"https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=500&view=list\"\n",
"\n",
"# Getting the webpage, creating a Response object.\n",
"response = requests.get(url)\n",
" \n",
"# Extracting the source code of the page.\n",
"data = response.text\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"#we give that data to BS so that it can extract what we're interested in\n",
"soup = BeautifulSoup(data, 'lxml')\n",
"\n",
"print(soup.prettify())\n",
"\n",
"# and lo! the original html of the searh page results\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# so let's find the links to the data we're interested in \n",
"# ie the individual records\n",
"\n",
"# the html tag for a hyperlink is \n",
"links = soup.find_all('a')\n",
"\n",
"for link in links:\n",
" print(link)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pause \n",
"\n",
"Notice that sometimes, in the `` tags, there are things called `class`, which tell your browser that this _particular_ tag should be understood as being kinda different from other tags. If you parse the html carefully, you can see that there are some kinds of `` tags that we might want. The block before iterates through the html looking for the kind that mark off 'collection-item-wrapper' - the direct URL to an item record.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# so lets get those links\n",
"\n",
"for link in links:\n",
" names = link.contents[0]\n",
" fullLink = link.get('href', {'class': 'collection-item-wrapper'})\n",
" print(names)\n",
" print(fullLink)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# let's write those links to a file\n",
"import csv\n",
"\n",
"f = csv.writer(open(\"histmuse.csv\", \"w\"))\n",
"f.writerow([\"Name\", \"Link\"]) # Write column headers as the first line\n",
"\n",
"links = soup.find_all('a')\n",
"for link in links:\n",
" names = link.contents[0]\n",
" fullLink = link.get('href')\n",
" # print(names)\n",
" # print(fullLink)\n",
" f.writerow([names, fullLink])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay! So that's a bit messy, but you now have a csv that's a bit messy, admittedly, but you could easily clean it up in excel or a text editor so that it just looks like this:\n",
"\n",
"```\n",
"https://www.historymuseum.ca/collections/artifact/1337564/\n",
"https://www.historymuseum.ca/collections/artifact/2359060/\n",
"https://www.historymuseum.ca/collections/artifact/1316383/\n",
"https://www.historymuseum.ca/collections/artifact/2365313/\n",
"https://www.historymuseum.ca/collections/artifact/2365193/\n",
"```\n",
"\n",
"If you saved that cleaned up file as `urls.txt` and then pass that file to wget at the command line, like this:\n",
"\n",
"`wget -i urls.txt -r --no-parent -nd -w 2 --limit-rate=100k`\n",
"\n",
"you'd bet a folder of data (html pages, in this case).\n",
"\n",
"If you don't have wget installed on your machine, follow the instructions at [Programming Historian](https://programminghistorian.org/en/lessons/automated-downloading-with-wget)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But wait! There's other data from the original search screen we could grab. Look at the original html:\n",
"\n",
"```html\n",
"Canada\n",
"2011.175.11\n",
"1977\n",
"```\n",
"\n",
"Let's grab that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modify the code to grab other information of interest\n",
"\n",
"Using what you know, download other kinds of metadata and put it together into a csv. Here are some partial examples to get you started."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spans = soup.find_all('span', {'class': 'collection-item-metadata artifact-number'})\n",
"for span in spans:\n",
" print(span)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's grab some images"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trs = soup.find_all('img', {'class': 'collection-item-image'})\n",
"for tr in trs:\n",
" print(tr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do you see how you could for instance gather a dataset of images of a particular kind of object? Write the complete code in a new cell below. It's ok if you do it in a series of steps."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}