Python Web Scraping with lxml
Required Modules
lxml, cssselect
If lxml and cssselect are not already installed, run the following command from a terminal:
pip install lxml cssselect --user
The "--user" switch installs the modules in the user context rather than globally, so "sudo" isn't required.
For this post, I'll be scraping the MacUpdate web site, a curated directory of Mac software.
Imports
We begin by importing the modules we'll be using.
import lxml.html as lh
from urllib2 import urlopen
from contextlib import closing
- urlopen, from the urllib2 module, allows us to pull down the html from the macupdate website.
- lxml.html, aliased to lh for convenience, is used to parse the raw downloaded html into a format from which we can then extract the specific items we want.
- closing, from contextlib, allows us to wrap the urlopen call in a with block, which will automatically close the connection for us. The alternative would be something like:
f = urlopen('https://www.macupdate.com/')
doc = lh.fromstring(f.read())
f.close()
It's one less thing to think about (closing the handle returned via urlopen).
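To see what closing buys us, here's a minimal, self-contained sketch. It uses an in-memory StringIO object as a stand-in for the handle returned by urlopen, since closing works with any object that has a close() method:

```python
from contextlib import closing
from io import StringIO

# StringIO stands in for a urlopen() handle; closing() wraps
# any object that exposes a close() method.
buf = StringIO(u'<html><body>hello</body></html>')
with closing(buf) as f:
    data = f.read()

# Once the with block exits, close() has been called for us.
print(data)
print(buf.closed)  # True
```
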
Breakdown
We first request the html from the macupdate website, and save it into a variable called doc:
with closing(urlopen('https://www.macupdate.com/')) as f:
    doc = lh.fromstring(f.read())
The information that we want to access is contained within a div with a class of app-name-container. So let's tell lxml, via the lh alias, to retrieve only those divs and their contents:
for item in doc.cssselect('div.app-name-container'):
This returns a list of lxml.html.HtmlElement objects. The for...in block allows us to loop over the returned list in order to access the individual information items we want. We then retrieve the child elements of each, which are returned as a Python list:
y = item.getchildren()
Finally, using indices, we retrieve the actual content that we want from the list and print it out to the terminal screen:
print 'Name: {}'.format(y[0].text_content().strip())
print 'Price: {}'.format(y[3].text_content().strip())
print 'Size: {}'.format(y[4].text_content().strip())
print 'Description: {}\n'.format(y[5].text_content().strip())
We could just as easily stuff each into a dict and reference the items that way.
Conclusion
lxml combined with cssselect allows us to very easily scrape a website and pull out the items that we want.
Full Code Listing
#!/usr/bin/env python
import lxml.html as lh
from urllib2 import urlopen
from contextlib import closing

with closing(urlopen('https://www.macupdate.com/')) as f:
    doc = lh.fromstring(f.read())

for item in doc.cssselect('div.app-name-container'):
    y = item.getchildren()
    print 'Name: {}'.format(y[0].text_content().strip())
    print 'Price: {}'.format(y[3].text_content().strip())
    print 'Size: {}'.format(y[4].text_content().strip())
    print 'Description: {}\n'.format(y[5].text_content().strip())