urlwatch - a tool for monitoring webpages for updates

This script is intended to help you watch URLs and get notified (via email or in your terminal) of any changes. The change notification will include the URL that has changed and a unified diff of what has changed.
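The notification format described above can be approximated with Python's standard difflib module. This is only a sketch of the idea, not urlwatch's actual code: the function name `describe_change` and the "CHANGED:" prefix are assumptions made for illustration.

```python
import difflib


def describe_change(url, old_text, new_text):
    """Return a plaintext change report for url, or None if nothing changed.

    Hypothetical helper: the name and report layout are illustrative only.
    """
    if old_text == new_text:
        return None
    # Compare line by line; lineterm="" keeps the header lines free of newlines
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    )
    return "CHANGED: %s\n%s" % (url, "\n".join(diff))


if __name__ == "__main__":
    print(describe_change("http://example.org/", "a\nb\n", "a\nc\n"))
```

Such a report is plain text, so it can be printed to the terminal or used as the body of a mail without further conversion.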

The script supports the use of a filtering hook function to strip trivially-varying elements of a webpage.
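As a rough illustration of such a hook, the sketch below removes clock times from one site's pages before they are compared. The function name `strip_noise`, its `(url, data)` signature, the example URL and the regular expression are all assumptions; the example hooks shipped with urlwatch show the real interface.

```python
import re

# Hypothetical noise pattern: clock times such as "12:34" or "12:34:56"
CLOCK_TIME = re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b")


def strip_noise(url, data):
    """Return data with trivially-varying parts removed.

    The (url, data) signature is an assumption for this sketch.
    """
    # Only touch pages from the site known to embed a changing timestamp
    if url.startswith("http://example.org/"):
        data = CLOCK_TIME.sub("", data)
    return data


if __name__ == "__main__":
    print(strip_noise("http://example.org/news", "Generated at 12:34:56 today"))
```

Keeping the filter keyed on the URL means pages from other sites pass through untouched.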

Basic features

  • Simple configuration (text file, one URL per line)
  • Easily hackable (clean Python implementation)
  • Can run as a cronjob and mail changes to you
  • Always outputs only plaintext - no HTML mails :)
  • Supports removing noise (always-changing website parts)
  • Example hooks to filter content in Python
  • Uses If-Modified-Since header to save bandwidth (new in 1.9)
  • Converts non-UTF-8 web pages to UTF-8 for mail (new in 1.10)
  • Treats non-zero shell exit codes as errors (new in 1.11)
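The If-Modified-Since feature listed above relies on a standard HTTP conditional request. A minimal sketch with Python's standard library might look like this; the function name `build_conditional_request` is hypothetical, and how urlwatch itself stores the time of the last fetch is not shown here.

```python
import email.utils
import urllib.request


def build_conditional_request(url, last_fetch_timestamp):
    """Build a request asking for url only if it changed since the last fetch.

    last_fetch_timestamp is seconds since the epoch (an assumption of this sketch).
    """
    request = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT
    http_date = email.utils.formatdate(last_fetch_timestamp, usegmt=True)
    request.add_header("If-Modified-Since", http_date)
    return request


if __name__ == "__main__":
    req = build_conditional_request("http://example.org/", 0)
    print(req.header_items())
```

When such a request is passed to urllib.request.urlopen and the page is unchanged, a server that honours the header answers with status 304, which urlopen raises as an HTTPError with code 304; the caller can treat that as "no update" without downloading the body.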

Download

Official Debian package (by Franck Joncourt)

Package information: http://packages.debian.org/urlwatch

If you have sid repositories enabled, you can install urlwatch via:

    apt-get install urlwatch

Source tarball

You can download the source tarball of urlwatch here:

Python Package Index

urlwatch is also indexed in the Python Package Index as "urlwatch":

Advanced features

  • Clean up "bad" HTML (long lines, etc.) with python-utidylib
  • Convert iCalendar files (*.ics) to plaintext using ical2text
  • Convert HTML to plaintext using lynx, html2text or a regex
  • Watch output of shell commands (new in 1.9)
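For the regex-based HTML-to-plaintext conversion mentioned in the list, a crude sketch could look like the following. It is not the conversion urlwatch ships; the function name `html_to_text` and the exact patterns are assumptions, and a regex cannot handle every HTML document, which is why lynx or html2text are the more robust options.

```python
import html
import re


def html_to_text(markup):
    """Crudely convert HTML to plaintext using regular expressions only."""
    # Drop script and style elements together with their contents
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", markup)
    # Replace every remaining tag with a space so adjacent words do not merge
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; into the characters they stand for
    text = html.unescape(text)
    # Collapse runs of whitespace and discard lines that end up empty
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1>\n<p>Hello &amp; <b>welcome</b></p>"))
```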

3rd party patches / Contributions

License

urlwatch is released under the terms of the BSD license.

Code repository

The Git repository of urlwatch now has a more permanent home over at repo.or.cz/w/urlwatch.git.

To check out the code using Git, use this command:

    git clone git://repo.or.cz/urlwatch.git

How do I..

..watch only an element on a website?

If you are lucky, the element has an "id" attribute (other attributes work just as well) that you can use with the BeautifulSoup library to extract that part of the HTML document:

      from BeautifulSoup import BeautifulSoup

      # Parse the downloaded page and keep only the element with the given id
      soup = BeautifulSoup(data)
      data = str(soup.find(id='tisiDocumentBody'))
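If BeautifulSoup is not installed, a rough stand-in can be sketched with Python's standard html.parser module. This is a swapped-in technique, not the approach shown above: it returns only the text inside the element rather than its HTML, and the names `ElementTextById` and `extract_text_by_id` are hypothetical. It also assumes reasonably well-formed markup in which non-void elements are explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementTextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.depth = 0       # nesting depth inside the wanted element
        self.done = False    # set once the wanted element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.wanted_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text_by_id(markup, wanted_id):
    """Return the text of the element with the given id, or '' if absent."""
    parser = ElementTextById(wanted_id)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()
```

Inside a filter, `data = extract_text_by_id(data, 'tisiDocumentBody')` would then play the role of the BeautifulSoup lines above, with the difference that tags are dropped from the result.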

Information about the User-Agent

Since version 1.3, urlwatch sends a better User-Agent string. More information about this User-Agent string can be found on this page.

Thomas Perl (thp at this domain), jabber: thp@jabber.org