
Accessing Documents on the Web: the urllib module

While the cgi module is useful on the server side of the World Wide Web, the urllib module is useful when developing applications that act as clients to the World Wide Web. Although Python can be used to write full-scale browsers (see the Grail project at http://grail.sourceforge.net), the urllib module is most useful for applications that automate interaction with the web.

The core of the module is the urlopen function. This function accepts a URL as an argument and returns a file-like object which allows you to access the contents of the URL using any of the standard file methods (read, readline, or readlines; see Section 5.4.1 for details). An optional second argument is a url-encoded string (see the urlencode function below) which provides data to be sent to the URL if it is a CGI program. If the URL provided does not begin with http:// or ftp://, the request is treated as a request for a local file.
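
As a minimal sketch of the file-like interface described above, the following reads a small temporary file through urlopen using a file: URL, so no network connection is needed. The import fallback is an assumption about the reader's interpreter: this section uses the Python 2 urllib module, and in Python 3 the same functions are found in urllib.request. The helper name read_lines and the file contents are invented for illustration.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2, as in this section
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 location

def read_lines(url):
    """Open url and return its contents as a list of lines."""
    f = urlopen(url)
    try:
        return f.readlines()   # read and readline work the same way
    finally:
        f.close()

# Create a throwaway local file to stand in for a web document.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'first line\nsecond line\n')
os.close(fd)

# Build a file: URL for the temporary file and read it back.
url = 'file:' + pathname2url(path)
lines = read_lines(url)
os.remove(path)
```

Note that under Python 3 the lines come back as bytes rather than text strings.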

As a simple example of the use of the urlopen function, the CNN web site, http://www.cnn.com, displays the top headline in a prominent font; at the time of this writing, the top headline can be identified as the anchor on a line marked with the class cnnMainT1Headline. Since the headline is an active link, it is surrounded by anchor tags, i.e. <a> or <A> and </a> or </A>. We can write a regular expression to extract the headline from these tags:

import re, sys, urllib

headlinepat = re.compile(r'<.*cnnMainT1Headline.*><a.*>(.*)</a>', re.I)
All that remains is to access the contents of the page with the urlopen function:
try:
    f = urllib.urlopen('http://www.cnn.com')
except IOError:
    sys.stderr.write("Couldn't connect to CNN website\n")
    sys.exit(1)

contents = f.read()
headline = headlinepat.findall(contents)
print headline[0]
Since findall returns a list, only the first element of the list is printed.
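
If the page layout changes and the pattern no longer matches, findall returns an empty list and indexing it fails. The following sketch wraps the extraction in a function that returns None in that case. The function name first_headline and the sample markup are hypothetical; the markup is invented only so the example can run without contacting the site.

```python
import re

# The pattern from the text: the headline is the anchor text on a line
# marked with the class cnnMainT1Headline.
headlinepat = re.compile(r'<.*cnnMainT1Headline.*><a.*>(.*)</a>', re.I)

def first_headline(contents):
    """Return the first matching headline, or None if there is none."""
    matches = headlinepat.findall(contents)
    if not matches:
        return None
    return matches[0]

# Hypothetical markup, invented for this example.
sample = ('<div class="cnnMainT1Headline">'
          '<a href="/story.html">Sample headline</a></div>\n')

headline = first_headline(sample)
missing = first_headline('<p>nothing here</p>\n')
```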

The urlopen function can also be used to send data to a CGI program. The information can either be embedded in the URL itself, or it can be passed as the second argument to urlopen, in which case it is sent in the body of a POST request. In the first case, it is important to make sure that special characters (blanks, punctuation, etc.) are converted to the appropriate codes using the quote function of the urllib module. In the second case, the urlencode function accepts a dictionary of values and converts them to the appropriate form:

>>> travelplans = {'dest': 'Costa Rica','month': 'Jun','day' : 25}
>>> urllib.urlencode(travelplans)
'month=Jun&day=25&dest=Costa+Rica'
We could contact the fictitious travel agency CGI program in Section 8.11.1 with a program like this one:
urllib.urlopen('http://www.travelagency.com/cgi-bin/query',
               urllib.urlencode(travelplans))
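
For the other case, where the data is embedded in the URL itself, a sketch along these lines shows what quote and urlencode produce. Whether the fictitious travel agency program would accept its data in the URL is an assumption made for illustration. The import fallback is also an assumption about the interpreter: in Python 3 these functions are found in urllib.parse.

```python
try:
    from urllib import quote, urlencode        # Python 2, as in this section
except ImportError:
    from urllib.parse import quote, urlencode  # Python 3 location

base = 'http://www.travelagency.com/cgi-bin/query'

# quote replaces characters that are not safe in a URL; a blank becomes %20.
dest = quote('Costa Rica')

# urlencode also accepts a list of pairs, which keeps the fields in a
# known order (a dictionary may be encoded in any order).
query = urlencode([('dest', 'Costa Rica'), ('month', 'Jun'), ('day', 25)])

# Data embedded in the URL follows a question mark.
url = base + '?' + query
```

Note that urlencode represents a blank as a plus sign, while quote uses %20.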

As a more realistic example, many websites offer stock quotes by accepting a company's ticker symbol as part of a query string in their URL. One such example is http://www.quote.com; to display a page of information about a stock with ticker symbol xxx, you could point your browser to

http://finance.lycos.com/home/stocks/quotes.asp?symbols=xxx
Examination of the HTML text returned by this URL shows that the current quote is the first bold (i.e. between <b> and </b> tags) text on the line following a line with the time at which the quote was issued. We can extract a current quote for a stock with the following function:
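import sys, re, urllib

def getquote(symbol):
    # line containing the time at which the quote was issued
    lspat = re.compile(r'\d?\d:\d\d[ap]m .+T')
    # text between <b> and </b> tags
    boldpat = re.compile('<b>(.*?)</b>', re.I)
    url = 'http://finance.lycos.com/home/stocks/quotes.asp?symbols=%s' % \
          symbol
    f = urllib.urlopen(url)
    quote = None        # returned if no quote can be found
    lastseen = 0
    while 1:
        line = f.readline()
        if not line:
            break
        if lastseen:
            # this is the line following the time line
            found = boldpat.findall(line)
            if found:
                quote = found[0]
            break
        if lspat.search(line):
            lastseen = 1
    f.close()

    return quote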
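
Because the site cannot be contacted from a test, one way to check the scanning logic is to separate it from the network access and run it on a list of lines. This is only a sketch: the function name quote_from_lines and the sample lines are hypothetical, and the real page may look quite different.

```python
import re

lspat = re.compile(r'\d?\d:\d\d[ap]m .+T')     # line with the quote time
boldpat = re.compile('<b>(.*?)</b>', re.I)     # text between bold tags

def quote_from_lines(lines):
    """Return the first bold text on the line after the time line, or None."""
    lastseen = False
    for line in lines:
        if lastseen:
            found = boldpat.findall(line)
            if found:
                return found[0]
            return None
        if lspat.search(line):
            lastseen = True
    return None

# Hypothetical page fragment, invented for this example.
sample = [
    '<td>Last trade 4:01pm ET</td>\n',
    '<td><b>27.45</b></td><td><b>+0.30</b></td>\n',
]

price = quote_from_lines(sample)
nothing = quote_from_lines(['<b>1.00</b>\n'])
```

The same function could be handed the result of readlines on the object returned by urlopen.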

The syntax of the URLs accepted by urlopen allows embedding a username/password pair or optional port number in the URL. Suppose we wish to access the site http://somesite.com, using user name myname and password secret, through port 8080 instead of the usual default of 80. The following call to urlopen could be used:

urllib.urlopen('http://myname:secret@somesite.com:8080')
A similar scheme can be used to access files on FTP (File Transfer Protocol) servers. For more control over FTP, Python also provides the ftplib module.
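
To see how the pieces of such a URL are laid out, it can be taken apart with the urlparse function. This function is not part of the urllib module described in this section; it is brought in here only as an illustration, and the import fallback is an assumption about the interpreter (the urlparse module in Python 2, urllib.parse in Python 3).

```python
try:
    from urlparse import urlparse       # Python 2
except ImportError:
    from urllib.parse import urlparse   # Python 3

parts = urlparse('http://myname:secret@somesite.com:8080')

# The components embedded in the URL are available as attributes.
user = parts.username
password = parts.password
host = parts.hostname
port = parts.port        # an integer, not a string
```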
Phil Spector 2003-11-12