Commit 348846e

Use wayback package, remove internetarchive module, take 2 (#512)
The `internetarchive` module has been split off as a new package named `wayback` (we’re replacing an older, unmaintained package of the same name). This updates our code here to use it and delete the old `internetarchive` module. Some parts of the `internetarchive` module (namely formatting memento responses as dicts) were not generic, and so have been moved directly into `cli` because they did not make sense as part of the new `wayback` package. They’ll probably be further refactored later.

This change was originally done in #511, but that PR was incomplete and had to be reverted. Fixes applied here:

- Fixed references to error types. These require importing from a private part of wayback, so we might want to merge and release edgi-govdata-archiving/wayback#7, which provides public access to those, first.
- Fixed some broken method calls.
- Fix the fact that `date` got renamed to `timestamp` on `CdxRecord` right before the 0.2.0 release.
- Add an end-to-end test that runs the whole import process, and covers some of the error cases we encounter from the various clients. It’s not perfect and doesn’t cover all errors, but was a lot quicker to get done than a completely custom test with mock client instances that purposely trigger all the various possible errors.
1 parent 7a83cf9 commit 348846e

16 files changed: +13344 -1617 lines changed

README.md

Lines changed: 1 addition & 3 deletions
@@ -23,8 +23,6 @@ This component is intended to hold various backend tools serving different tasks
 
 Working and Under Active Development:
 
-* A Python API to the Internet Archive Wayback Machine's archived webpage
-snapshots in ``web_monitoring.internetarchive``
 * A Python API to the web-monitoring-db Rails app in ``web_monitoring.db``
 * Python functions and a command-line tool for importing snapshots from the
 Internet Archive into web-monitoring-db.
@@ -102,7 +100,7 @@ Point your browser or ``curl`` at ``http://localhost:4000``.
 
 ## Releases
 
-New releases of the diffing server are published automatically as Docker images by CircleCI when someone pushes to the `release` branch. They are availble at https://hub.docker.com/r/envirodgi/ui. See [web-monitoring-ops](https://github.com/edgi-govdata-archiving/web-monitoring-ops) for how we deploy releases to actual web servers. We do not yet publish releases of the library-style tools in this repo (e.g. `db.py`, `internetarchive.py`).
+New releases of the diffing server are published automatically as Docker images by CircleCI when someone pushes to the `release` branch. They are availble at https://hub.docker.com/r/envirodgi/ui. See [web-monitoring-ops](https://github.com/edgi-govdata-archiving/web-monitoring-ops) for how we deploy releases to actual web servers. We do not yet publish releases of the library-style tools in this repo (e.g. `db.py`).
 
 Images are tagged with the SHA-1 of the git commit they were built from. For example, the image `envirodgi/processing:446ae83e121ec8c2207b2bca563364cafbdf8ce0` was built from [commit `446ae83e121ec8c2207b2bca563364cafbdf8ce0`](https://github.com/edgi-govdata-archiving/web-monitoring-processing/commit/446ae83e121ec8c2207b2bca563364cafbdf8ce0) in web-monitoring-processing.
 

docs/source/index.rst

Lines changed: 0 additions & 1 deletion
@@ -21,5 +21,4 @@ developers. Contributions are welcome! See the
 
 installation
 db_api
-wayback
 data_sources

docs/source/wayback.rst

Lines changed: 0 additions & 143 deletions
This file was deleted.

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -13,3 +13,4 @@ toolz ~=0.10.0
 tornado ~=6.0.3
 tqdm ~=4.40.0
 tzlocal ~=2.0
+wayback ~=0.2.1

scripts/ia_healthcheck

Lines changed: 9 additions & 8 deletions
@@ -9,7 +9,8 @@ from datetime import datetime, timedelta
 import random
 import sentry_sdk
 import sys
-from web_monitoring import db, internetarchive
+from web_monitoring import db
+from wayback import WaybackClient
 
 
 # The current Sentry client truncates string values at 512 characters. It
@@ -61,14 +62,14 @@ def wayback_has_captures(url, from_date=None):
     -------
     list of JSON
     """
-    try:
-        with internetarchive.WaybackClient() as wayback:
-            versions = wayback.list_versions(url, from_date=from_date)
+    with WaybackClient() as wayback:
+        versions = wayback.search(url, from_date=from_date)
+        try:
             next(versions)
-    except ValueError:
-        return False
-    else:
-        return True
+        except StopIteration:
+            return False
+        else:
+            return True
 
 
 def output_results(statuses):
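
The switch from `except ValueError` to `except StopIteration` in `wayback_has_captures` fits standard Python iterator behavior: calling `next()` on an exhausted iterator raises `StopIteration`. Below is a minimal sketch of that pattern, assuming (as the diff suggests) that `search()` returns a lazy iterator. `has_any_result` and `fake_search` are hypothetical names used only for illustration, and a plain generator stands in for a real `WaybackClient`, so nothing here touches the network.

```python
def has_any_result(results):
    """Return True if the iterable yields at least one item."""
    iterator = iter(results)
    try:
        next(iterator)
    except StopIteration:
        # An exhausted iterator signals "no items" with StopIteration,
        # not ValueError.
        return False
    else:
        return True


def fake_search(records):
    """Hypothetical stand-in for a lazy search: yields records one at a time."""
    yield from records


print(has_any_result(fake_search([])))
print(has_any_result(fake_search(["capture-1", "capture-2"])))
```

Only the first item is pulled from the iterator, so in this sketch the check does not need to walk the full result set.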
