Python Web Scraping with lxml

Required Modules

lxml, cssselect

If lxml and cssselect are not already installed, run the following command from a terminal:

pip install lxml cssselect --user

The "--user" switch will install the modules in the user context, instead of globally, and won't require "sudo".

For this post, I'll be scraping the MacUpdate web site, a curated listing of Mac software.

Imports

We begin by importing the modules we'll be using:

import lxml.html as lh
from urllib2 import urlopen
from contextlib import closing
  1. urlopen, from the urllib2 module, allows us to pull down the HTML from the MacUpdate website.

  2. lxml.html, aliased to lh for convenience, is used to parse the raw downloaded HTML into a structure from which we can then extract the specific items we want.

  3. closing, from contextlib, allows us to wrap the urlopen call in a with block and automatically close the connection for us. The alternative would be something like:

f = urlopen('https://www.macupdate.com/')
doc = lh.fromstring(f.read())
f.close()

It's one less thing to think about (closing the handle returned by urlopen).

Breakdown

We first request the HTML from the MacUpdate website and parse it into a variable called doc:

with closing(urlopen('https://www.macupdate.com/')) as f:
    doc = lh.fromstring(f.read())

The information that we want to access is contained within a div with a class of app-name-container. So let's tell lxml, via the lh alias, to retrieve only those divs and their contents:

for item in doc.cssselect('div.app-name-container'):

This returns a Python list of lxml HtmlElement objects.
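As a quick sanity check, and as a sketch that assumes the page structure described above, you can look at what the selector matched before looping over it:

matches = doc.cssselect('div.app-name-container')
print 'Matched {} divs'.format(len(matches))  # one per listed app
print type(matches[0])                        # lxml.html.HtmlElement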

The for...in block allows us to loop over the returned list in order to access the individual information items we want. We then retrieve the child elements for each, which are returned as a Python list:

    y = item.getchildren()
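If you're not sure which child holds which field, a quick throwaway sketch inside the loop (the 40-character slice is just to keep the output readable) prints each child's index, tag, and text, which is how the indices used below can be worked out:

    for i, child in enumerate(y):
        print i, child.tag, child.text_content().strip()[:40]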

Finally, using indices, we retrieve the actual content that we want from the list and print it out to the terminal screen:

    print 'Name: {}'.format(y[0].text_content().strip())
    print 'Price: {}'.format(y[3].text_content().strip())
    print 'Size: {}'.format(y[4].text_content().strip())
    print 'Description: {}\n'.format(y[5].text_content().strip())

We could just as easily stuff each into a dict and reference the items that way.
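For example, a sketch of that approach (the dict keys here are just illustrative names, not anything the site defines) might look like this:

apps = []
for item in doc.cssselect('div.app-name-container'):
    y = item.getchildren()
    apps.append({
        'name': y[0].text_content().strip(),
        'price': y[3].text_content().strip(),
        'size': y[4].text_content().strip(),
        'description': y[5].text_content().strip(),
    })

for app in apps:
    print '{name} ({price}, {size})'.format(**app)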


Conclusion

lxml, combined with cssselect, makes it very easy to scrape a website and pull out just the items we want.

Full Code Listing

#!/usr/bin/env python

import lxml.html as lh
from urllib2 import urlopen
from contextlib import closing

# Download the MacUpdate front page and parse it into an element tree
with closing(urlopen('https://www.macupdate.com/')) as f:
    doc = lh.fromstring(f.read())

# Each app entry lives in a div with the class app-name-container
for item in doc.cssselect('div.app-name-container'):
    y = item.getchildren()
    print 'Name: {}'.format(y[0].text_content().strip())
    print 'Price: {}'.format(y[3].text_content().strip())
    print 'Size: {}'.format(y[4].text_content().strip())
    print 'Description: {}\n'.format(y[5].text_content().strip())