Web scraping with Python (part 1: crawling)

Example One: I am looking for my next job, so I subscribe to many job sites (Monster, for example) in order to receive email notifications of new job ads. But I’d rather check these in my RSS aggregator than in my mailbox, or in some sort of aggregating Web platform. That way, I could filter, sort, rank and compare these numerous job ads much more easily.

Example Two: I want to buy a digital camcorder, so I want to compare the available models. Such a comparison implies ranking the most common models according to their characteristics. Unfortunately, the many sites providing reviews or comparisons of camcorders are rarely comprehensive, and they don’t let me compare models according to my own way of ranking and weighting camcorder features (dvspot, for example). So I would prefer to pull all the technical data from these sites and manipulate it locally on my computer. Unfortunately, this data is embedded in HTML, and it may be complex to extract it automatically from all the presentation code.

These are common situations: interesting data is spread all over the web and embedded in HTML presentation code. How can you consolidate this data so that you can analyze and process it with your own tools? In the near future, I expect this data will be published in a form that computers can process directly (this is what the Semantic Web intends to do). Until now, I used to do it with Excel (importing Web data, then cleaning it and the like), and I must admit that Excel is fairly good at it. But I’d like more automation and more scripting for this process, so that I don’t end up inventing complex Excel macros or formulas just to automate web site crawling, HTML extraction and data cleaning. With such an itch to scratch, I tried to address this problem with Python.

This series of posts introduces my current hacks for automating web site crawling and data extraction from HTML pages. The current output of these scripts is a bunch of CSV files that can be further processed… in Excel. I wish I could output RDF instead of CSV, so there remains much room for improvement (see RDF Web Scraper for a similar approach). Anyway, here is part one: how to crawl complex web sites with Python. The next part will deal with data extraction from the retrieved web pages, involving much HTML cleansing and parsing.

My crawlers are fully based on John J. Lee’s mechanize framework for Python. There are other tools available in Python, and several other approaches exist for automating the crawling of web sites. Note that you can also try to scrape the screens of legacy terminal-based applications with the help of Python (this is called « screen scraping »). Some approaches to web crawling automation rely on recording the behaviour of a user equipped with a web browser and then reproducing this same behaviour in an automated session. That is an attractive and futuristic approach. But it implies that you find a way to guess the intended automatic crawling behaviour from a simple example. In other words, with this approach, either you ask the user to click on every web link (all the job postings…), which removes any value from automating the task, or your system « guesses » what automatic behaviour is expected just by recording a sample of what a human agent would do. Too complex… So I preferred a more down-to-earth solution, in which you write simple crawling scripts « by hand ». (You may still be interested in automatically recording user sessions in order to be more productive when writing your crawling scripts.) To summarize: my approach is fully based on mechanize, so you may consider the following code as examples of using mechanize in « real-world » situations.

For the sake of clarity, let’s first focus on the part of the code that is specific to your crawling session (to the site you want to crawl). Let’s take the example of the dvspot.com site, which you may try to crawl in order to download detailed descriptions of camcorders:

    # Go to home page
    #
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
    #
    # Navigate through the paginated list of cameras
    # (MAX_NR_OF_ITEMS_PER_SESSION is defined in the full script)
    #
    nr_of_items = 0   # items saved so far, across all pages
    next_page = 0
    while next_page == 0:
        #
        # Display and save details of every listed item
        #
        url = b.response.url
        next_element = 0
        while next_element >= 0:
            try:
                b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
                next_element = next_element + 1
                nr_of_items = nr_of_items + 1
                print save_response(b,"dvspot_camera_"+str(next_element))
                # go back to the listing page
                b.open(url)
                # if you crawled too many items, stop crawling
                if nr_of_items >= MAX_NR_OF_ITEMS_PER_SESSION:
                    next_element = -1
                    next_page = -1
            except LinkNotFoundError:
                # You certainly reached the last item in this page
                next_element = -1
        #
        if next_page == 0:
            try:
                b.open(url)
                b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
                print "processing Next Page"
            except LinkNotFoundError:
                # You reached the last page of the listing of items
                next_page = -1

You may have noticed that the structure of this code (conditional loops) depends on the organization of the site you are crawling (paginated results, …). You also have to specify the rules that trigger « clicks » from your crawler. In the above example, your script first follows every link containing « cameraDetail » in its URL (url_regex). Then it follows the first link containing « Next Page » in the hyperlink text (text_regex).
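
To make these two matching rules more concrete, here is a minimal sketch of what « follow the links whose URL matches » and « follow the link whose text matches » amount to. It is not mechanize code: it is written for Python 3 with the standard library only, the names LinkCollector, matching_links and SAMPLE_PAGE are made up for this illustration, and the sample page is a hypothetical listing only loosely modelled on the dvspot example. It assumes patterns are searched anywhere in the URL or text.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def matching_links(html, url_regex=None, text_regex=None):
    """Return the links whose URL and text match the given patterns."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        (url, text)
        for url, text in parser.links
        if (url_regex is None or re.search(url_regex, url))
        and (text_regex is None or re.search(text_regex, text))
    ]


# Hypothetical listing page (an assumption, not the real dvspot markup)
SAMPLE_PAGE = """
<a href="cameraDetail.php?id=1">Camcorder A</a>
<a href="cameraDetail.php?id=2">Camcorder B</a>
<a href="about.php">About</a>
<a href="cameraList.php?start=20">Next Page</a>
"""

detail_links = matching_links(SAMPLE_PAGE, url_regex=r"cameraDetail")
next_links = matching_links(SAMPLE_PAGE, text_regex=r"Next Page")
print(detail_links)
print(next_links)
```

In the crawler, the position of a link in such a list of matches is what the nr argument selects: nr=0 is the first match, nr=1 the second, and so on until no match is left.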

This kind of script is usually easy to design and write, but it can become complex when the web site is improperly designed. There are two sources of difficulties. The first one is bad HTML. Bad HTML may crash the mechanize framework. This is why you often have to pre-process the HTML, either with the help of an HTML tidying library or with simple string substitutions when your tidy library breaks the HTML too much (this may be the case when the web designer improperly decided to use nested HTML forms). Designing the proper HTML pre-processor for the web site you want to crawl can be tricky, since you may have to dive into the faulty HTML and the mechanize error tracebacks in order to identify the HTML mistakes and work around them. I hope that future versions of mechanize will implement more robust HTML parsing capabilities. The ideal solution would be to integrate the Mozilla HTML parsing component, but I guess this would be hard work. Let’s cross our fingers.

Here are useful examples of pre-processors (as introduced by some other mechanize users and developers):

class TidyProcessor(BaseProcessor):
      def http_response(self, request, response):
          options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding='latin1',
                   force_output=1
                   )
          r = tidy.parseString(response.read(), **options)
          return FakeResponse(response, str(r))
      https_response = http_response
#
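class MyProcessor(BaseProcessor):
      # Fixes, by plain string substitution, the HTML mistakes identified "by hand"
      def http_response(self, request, response):
          r = response.read()
          # remove the stray double quote that follows "image" in the faulty HTML
          r = r.replace('"image""','"image"')
          # NOTE: as displayed here, this substitution replaces a quote with itself;
          # the pattern seems to have been altered by the weblog engine
          r = r.replace('"','"')
          return FakeResponse(response, r)
      https_response = http_response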
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
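
The « simple string substitutions » idea can be tried out on its own, without any browser or handler. The following is only a sketch under assumptions: it is written for Python 3, the names CLEANUP_RULES and clean_html are invented for this illustration, and the rules shown are examples (the real ones depend on the mistakes you find in the HTML of the site you crawl).

```python
# Each rule is a (faulty text, corrected text) pair, applied in order.
# These two rules are examples only, not a list taken from a real site.
CLEANUP_RULES = [
    ('"image""', '"image"'),   # stray second quote after an attribute value
    ('</form></form>', '</form>'),  # hypothetical doubled closing tag
]


def clean_html(html, rules=CLEANUP_RULES):
    """Apply every substitution rule to the page and return the result."""
    for faulty, corrected in rules:
        html = html.replace(faulty, corrected)
    return html


broken = '<form><input type="image"" src="go.gif"></form></form>'
print(clean_html(broken))
```

Keeping the rules in a list makes it easy to add one rule per mistake as you discover them in the error tracebacks, and to test the cleaning step on saved pages before plugging it into the crawler.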

The second source of difficulties comes from non-RESTful sites. As an example, the APEC site (a French Monster-like job site) is based on a proprietary web framework, which means that you cannot rely on link URLs to automate your browsing session. It took me some time to understand that, once logged in, every time you click on a link, you are presented with a new frameset referring to the URLs that contain the interesting data you are looking for. And these URLs seem to depend on your session. No permalinks, if you prefer. This makes the crawling process even trickier. To deal with this difficulty when you write your crawling script, you have to open both your favorite text editor (to write the script) and your favorite web browser (Firefox of course!) to inspect the pages. One key piece of knowledge is mechanize’s « find_link » capabilities. They are documented in the _mechanize.py source code, in the docstring of the find_link method. They are the arguments you will provide to b.follow_link in order to automate your crawler’s « clicks ». For convenience, let me reproduce them here:

text: link text between link tags: <a href="blah">this bit</a> (as
returned by pullparser.get_compressed_text(), ie. without tags but
with opening tags "textified" as per the pullparser docs) must compare
equal to this argument, if supplied
text_regex: link text between tag (as defined above) must match the
regular expression object passed as this argument, if supplied
name, name_regex: as for text and text_regex, but matched against the
name HTML attribute of the link tag
url, url_regex: as for text and text_regex, but matched against the
URL of the link tag (note this matches against Link.url, which is a
relative or absolute URL according to how it was written in the HTML)
tag: element name of opening tag, eg. "a"
predicate: a function taking a Link object as its single argument,
returning a boolean result, indicating whether the links
nr: matches the nth link that matches all other criteria (default 0)

Links include anchors (a), image maps (area), and frames (frame,iframe).
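
Since links include frames, a frameset page such as the session-dependent ones described above can be explored with the same kind of selection. The sketch below illustrates the tag, predicate and nr criteria in Python 3 with the standard library only. It is not the mechanize implementation: pick_link, AllLinksParser, the frameset markup and the example.com URLs are all assumptions made for this illustration, and unlike Link.url the URLs are resolved against the page address here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attribute holding the target URL for each kind of link element
URL_ATTRIBUTE = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}


class AllLinksParser(HTMLParser):
    """Collect (tag, absolute_url) for anchors, image maps and frames."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attribute = URL_ATTRIBUTE.get(tag)
        if attribute is None:
            return  # not a link element
        target = dict(attrs).get(attribute)
        if target:
            self.links.append((tag, urljoin(self.base_url, target)))


def pick_link(html, base_url, tag=None, predicate=None, nr=0):
    """Return the nr-th link matching tag and predicate, or None."""
    parser = AllLinksParser(base_url)
    parser.feed(html)
    parser.close()
    matches = [
        link for link in parser.links
        if (tag is None or link[0] == tag)
        and (predicate is None or predicate(link))
    ]
    return matches[nr] if nr < len(matches) else None


# Hypothetical frameset with session-dependent URLs
FRAMESET = """
<frameset cols="20%,80%">
  <frame name="menu" src="menu.jsp?session=abc123">
  <frame name="content" src="/offers/list.jsp?session=abc123">
</frameset>
<a href="help.html">Help</a>
"""
BASE = "http://example.com/app/index.jsp"

print(pick_link(FRAMESET, BASE, tag="frame", nr=1))
print(pick_link(FRAMESET, BASE, predicate=lambda link: "menu" in link[1]))
```

Selecting a frame by its position or by a predicate, rather than by its full URL, is what makes a script usable when the URLs change with every session.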

Enough with explanations. Now comes the full code that automatically downloads camcorder descriptions from dvspot.com. I distribute this code here under the GPL (legally speaking, I don’t own the copyleft of this entire code since it is based on several snippets I gathered from the web and the wwwsearch mailing list). Anyway, please copy-paste-taste!

from mechanize import Browser,LinkNotFoundError
from ClientCookie import BaseProcessor
from StringIO import StringIO
# import tidy
#
import sys
import re
from time import gmtime, strftime
#
# The following two lines are specific to the site you want to crawl
# they provide some capabilities to your crawler for it to be able
# to understand the meaning of the data it is crawling ;
# as an example for knowing the age of the crawled resource
#
from datetime import date
# from my_parser import parsed_resource
#
"""
 Let's declare some customized pre-processors.
 These are useful when the HTML you are crawling through is not clean enough for mechanize.
 When you crawl through bad HTML, mechanize often raises errors.
 So either you tidy it with a strict tidy module (see TidyProcessor)
 or you tidy some errors you identified "by hand" (see MyProcessor).
 Note that because the tidy module is quite strict on HTML, it may change the whole
 structure of the page you are dealing with. As an example, in bad HTML, you may encounter
 nested forms or forms nested in tables or tables nested in forms. Tidying them may produce
 unintended results such as closing the form too early or making it empty. This is the reason
 you may have to use MyProcessor instead of TidyProcessor.
"""
#
class FakeResponse:
      def __init__(self, resp, nudata):
          self._resp = resp
          self._sio = StringIO(nudata)
#
      def __getattr__(self, name):
          try:
              return getattr(self._sio, name)
          except AttributeError:
              return getattr(self._resp, name)
#
class TidyProcessor(BaseProcessor):
      def http_response(self, request, response):
          options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding='latin1',
                   force_output=1
                   )
          r = tidy.parseString(response.read(), **options)
          return FakeResponse(response, str(r))
      https_response = http_response
#
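class MyProcessor(BaseProcessor):
      # Fixes, by plain string substitution, the HTML mistakes identified "by hand"
      def http_response(self, request, response):
          r = response.read()
          # remove the stray double quote that follows "image" in the faulty HTML
          r = r.replace('"image""','"image"')
          # NOTE: as displayed here, this substitution replaces a quote with itself;
          # the pattern seems to have been altered by the weblog engine
          r = r.replace('"','"')
          return FakeResponse(response, r)
      https_response = http_response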
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
#
"""
 Let's declare some utility methods that will enhance mechanize browsing capabilities
"""
#
def find(b,searchst):
    b.response.seek(0)
    lr = b.response.read()
    return re.search(searchst, lr, re.I)
#
def save_response(b,kw='file'):
    """Saves last response to timestamped file"""
    name = strftime("%Y%m%d%H%M%S_",gmtime())
    name = name + kw + '.html'
    f = open('./'+name,'w')
    b.response.seek(0)
    f.write(b.response.read())
    f.close()
    return "Response saved as %s" % name
#
"""
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.
"""
#
def dvspot_crawl():
    """
     Here starts the browsing session.
     For every move, I could have put as a comment an equivalent PBP command line.
     PBP is a nice scripting layer on top of mechanize.
     But it does not allow looping or conditional browsing.
     So I preferred scripting directly with mechanize instead of using PBP
     and then adding an additional layer of scripting on top of it.
    """
#
    MAX_NR_OF_ITEMS_PER_SESSION = 500
    #
    # Go to home page
    #
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
    #
    # Navigate through the paginated list of cameras
    #
    nr_of_items = 0   # items saved so far, across all pages
    next_page = 0
    while next_page == 0:
        #
        # Display and save details of every listed item
        #
        url = b.response.url
        next_element = 0
        while next_element >= 0:
            try:
                b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
                next_element = next_element + 1
                nr_of_items = nr_of_items + 1
                print save_response(b,"dvspot_camera_"+str(next_element))
                # go back to the listing page
                b.open(url)
                # if you crawled too many items, stop crawling
                if nr_of_items >= MAX_NR_OF_ITEMS_PER_SESSION:
                    next_element = -1
                    next_page = -1
            except LinkNotFoundError:
                # You reached the last item in this page
                next_element = -1
        #
        if next_page == 0:
            try:
                b.open(url)
                b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
                print "processing Next Page"
            except LinkNotFoundError:
                # You reached the last page of the listing of items
                next_page = -1
    #
    return
#
#
#
if __name__ == '__main__':
#
    """ Note that you may need to specify your proxy first.
    On windows, you do :
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080
    """
    #
    dvspot_crawl()
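
The control flow of dvspot_crawl() (visit every item of a listing page, move on to the next page, stop after a maximum number of items) can be studied separately from the network code. Here is a small sketch of that flow in Python 3. Everything in it is an assumption made for illustration: FAKE_SITE is an in-memory stand-in for the pages a crawler would fetch, and crawl is not part of the script above.

```python
# Hypothetical in-memory "site": each listing page has item ids and an
# optional next page. It stands in for pages fetched over HTTP.
FAKE_SITE = {
    "page1": {"items": ["cam1", "cam2", "cam3"], "next": "page2"},
    "page2": {"items": ["cam4", "cam5"], "next": "page3"},
    "page3": {"items": ["cam6"], "next": None},
}


def crawl(site, start_page, max_items):
    """Visit listing pages in order and collect at most max_items items."""
    collected = []
    page = start_page
    while page is not None:
        for item in site[page]["items"]:
            if len(collected) >= max_items:
                return collected       # session limit reached
            collected.append(item)     # a real crawler would save the page here
        page = site[page]["next"]      # None on the last listing page
    return collected


print(crawl(FAKE_SITE, "page1", max_items=4))
print(crawl(FAKE_SITE, "page1", max_items=500))
```

Note that the limit is counted over the whole session and not per listing page, which is what a maximum number of items per session suggests.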

In order to run this code, you will have to install mechanize 0.0.8a, pullparser 0.0.5b, clientcookie 0.4.19, clientform 0.0.16 and utidylib. I used Python 2.3.3. I think the latest version of clientcookie was to be integrated into Python 2.4. To install mechanize, pullparser, clientcookie and clientform, just proceed the usual way for each package:

python setup.py build
python setup.py install
python setup.py test

Last but not least: you should be aware that you may be breaking the terms of service of the website you are trying to crawl. Thanks to dvspot for providing such valuable camcorder data to us!

The next part will deal with processing the downloaded HTML pages and extracting useful data from them.

26 thoughts on « Web scraping with python (part 1 : crawling) »

Pingback: AkaSig » Blog Archive » Web scraping with Python (part II)
Roman 16/03/05

Maybe you might get some additional ideas from this commercial product:

Lixto Visual Wrapper

This component is responsible for locating and extracting desired data from the Web. While working with Lixto Visual Wrapper, an operator generates a wrapper program which converts relevant information from HTML to XML. Lixto Visual Wrapper generates a so-called XML companion for given Web pages.

http://www.lixto.com/show.php?page=vw_architecture&lg=EN

Cheers,
Roman
Phil 16/03/05

I’ve used Mechanoid, a fork of Mechanize, with success. (I can’t remember why I chose it over Mechanize, but just thought it was worth pointing out.)

http://mechanoid.sourceforge.net/

–Phil.
Sig (article author) 16/03/05

Thank you Phil and Roman for your tips.

Phil, if ever you remember why you thought that Mechanoid was better than Mechanize, please tell us !

— Sig
Nick 20/03/05

Source code is messed up a bit by the blog engine. All quote symbols were hidden by an oblique stroke.
Sig (article author) 21/03/05

Thank you Nick, it seems that the upgrade process of my weblog engine was not that smooth. I’ll have to fix that… :-(
Lethalman 03/04/05

I’m using mechanize too and I would like to try mechanoid, but I didn’t find examples of how to use it… Does someone know where to find very simple examples of mechanoid usage?
agb 05/08/05

I think the reason for using mechanoid is that it seems to be more self contained. Looking at the mechanize dox, install consists of a few more packages in addition to mechanize. So, mechanoid seems easier.

But this is the perspective of a guy that’s researching session scraping techniques and has yet to use either package, so ymmv.
Alex 18/08/05

Take a look at SWExplorerAutomation (http://home.comcast.net/~furmana/SWIEAutomation.htm)(SWEA). SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from the Web pages and the expressions can be visually defined using SWEA designer.
Sig (article author) 23/08/05

Alex, I think your SWEA is great. Your screencast is really impressive. I wish we had an equivalent capability to use Mozilla’s Gecko engine from Python. People preferring GUIs to raw python scripts should definitely have a look at SWEA. Thanks for your feedback, Alex.
Pingback: AkaSig » Blog Archive » Web scraping, web mashing
Joe 16/03/07

BaseProcessor seems to be causing problems in the code. Does it still exist? Is there a workaround?
Joe 16/03/07

Tweaked Code to work with new versions of scripts. Working as of March 15 2007:

from mechanize import Browser,LinkNotFoundError
from ClientCookie import BaseHandler
from StringIO import StringIO
# import tidy
#
import sys
import re
from time import gmtime, strftime
#
# The following two lines are specific to the site you want to crawl
# they provide some capabilities to your crawler for it to be able
# to understand the meaning of the data it is crawling ;
# as an example for knowing the age of the crawled resource
#
from datetime import date
# from my_parser import parsed_resource
#
"""
 Let's declare some customized pre-processors.
 These are useful when the HTML you are crawling through is not clean enough for mechanize.
 When you crawl through bad HTML, mechanize often raises errors.
 So either you tidy it with a strict tidy module (see TidyProcessor)
 or you tidy some errors you identified "by hand" (see MyProcessor).
 Note that because the tidy module is quite strict on HTML, it may change the whole
 structure of the page you are dealing with. As an example, in bad HTML, you may encounter
 nested forms or forms nested in tables or tables nested in forms. Tidying them may produce
 unintended results such as closing the form too early or making it empty. This is the reason
 you may have to use MyProcessor instead of TidyProcessor.
"""
#
class FakeResponse:
      def __init__(self, resp, nudata):
          self._resp = resp
          self._sio = StringIO(nudata)
#
      def __getattr__(self, name):
          try:
              return getattr(self._sio, name)
          except AttributeError:
              return getattr(self._resp, name)
#
class TidyProcessor(BaseHandler):
      def http_response(self, request, response):
          options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding='latin1',
                   force_output=1
                   )
          r = tidy.parseString(response.read(), **options)
          return FakeResponse(response, str(r))
      https_response = http_response
#
class MyProcessor(BaseHandler):
      def http_response(self, request, response):
          r = response.read()
          r = r.replace('"image""','"image"')
          r = r.replace('"','"')
          return FakeResponse(response, r)
      https_response = http_response
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
#
"""
 Let's declare some utility methods that will enhance mechanize browsing capabilities
"""
#
def find(response,searchst):
    response.seek(0)
    lr = response.read()
    return re.search(searchst, lr, re.I)
#
def save_response(response,kw='file'):
    """Saves last response to timestamped file"""
    name = strftime("%Y%m%d%H%M%S_",gmtime())
    name = name + kw + '.html'
    f = open('./'+name,'w')
    response.seek(0)
    f.write(response.read())
    f.close()
    return "Response saved as %s" % name
#
"""
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.
"""
#
def dvspot_crawl():
    """
     Here starts the browsing session.
     For every move, I could have put as a comment an equivalent PBP command line.
     PBP is a nice scripting layer on top of mechanize.
     But it does not allow looping or conditional browsing.
     So I preferred scripting directly with mechanize instead of using PBP
     and then adding an additional layer of scripting on top of it.
    """
#
    MAX_NR_OF_ITEMS_PER_SESSION = 500
    #
    # Go to home page
    #
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
    #
    # Navigate through the paginated list of cameras
    #
    nr_of_items = 0   # items saved so far, across all pages
    next_page = 0
    while next_page == 0:
        #
        # Display and save details of every listed item
        #
        url = b.geturl()
        next_element = 0
        while next_element >= 0:
            try:
                response1 = b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
                next_element = next_element + 1
                nr_of_items = nr_of_items + 1
                print save_response(response1,"dvspot_camera_"+str(next_element))
                b.open(url)
                # if you crawled too many items, stop crawling
                if nr_of_items >= MAX_NR_OF_ITEMS_PER_SESSION:
                    next_element = -1
                    next_page = -1
            except LinkNotFoundError:
                # You reached the last item in this page
                next_element = -1
        #
        if next_page == 0:
            try:
                b.open(url)
                response2 = b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
                print "processing Next Page"
            except LinkNotFoundError:
                # You reached the last page of the listing of items
                next_page = -1
    #
    return
#
#
#
if __name__ == '__main__':
#
    """ Note that you may need to specify your proxy first.
    On windows, you do :
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080
    """
    #
    dvspot_crawl()
Sig (article author) 11/07/07

Hey Joe, this is great! Thanks a lot for that update.
Joe Elizondo 11/07/07

No problem, if you ever make a version of the parser working with the newest beautiful soup release let me know! Still trying to tweak that in my spare time.
sunny 25/08/08

Can I have the code for PHP?
Sig (article author) 25/08/08

Sorry : no PHP version available unless you code one ! :)
mehdi 16/09/08

I want an SDK or DLL that converts speech to phonetics, the way mbrola converts phonetics to speech. Thanks a lot
Lin 29/01/09

Hi Sig,

The example website is not working anymore. I tried to use amazon.com but the code does not seem to work properly.

lin
Lin 29/01/09

By the way, this is the website I used.

http://www.amazon.com/s/ref=nb_ss_etk_ce_av_?url=node%3D1065836%2C172630&field-keywords=&x=12&y=20

Could you help me out? Thank you a lot!
Sig (article author) 29/01/09

mehdi : you are not on the speech related page.

Lin : OK. I see dvspot.com is dead. For any site (amazon included) you have to write your own crawler method. Take the dvspot_crawl() method above as an example and write your own amazon_crawl() method. Hopefully Amazon won’t require you to tweak preprocessors and its HTML will be understood by the mechanize framework without preprocessing.

Furthermore, this article is now 4 years old and the underlying libraries have evolved quite a bit (mechanize and beautifulsoup). If I were you, I would also try to adapt this example code to new versions of these libraries : this may help and simplify the work required for your amazon_crawl() method.

Good luck with this. Please publish your amazon example here so that other readers can have more examples to take inspiration from.
Lin 05/02/09

Hi Sig,

I sent you an email to sig@sig.levillage.org yesterday but I am not sure if that is the right account. If you get my email, could you send a reply?

Thank you,
Lin
Sig 05/02/09

Lin: message well received. I just answered it a couple of minutes ago. In brief : can’t help atm because of other priorities. Give dapper.net a try, too. Good luck !
Jorge Gonzalez 21/09/09

Another very good fit for this task is Scrapy, which is much faster than Mechanize (for crawling) and has a very nice and well documented API:

http://scrapy.org

It also has its own mechanism for extracting data from web pages using XPath, which is arguably more convenient (and definitely faster) than BeautifulSoup, but you can keep using BeautifulSoup for extracting data and just use its crawler.
Andrew A. Sailer 07/03/10

This is great! Thanks for your article. I am new at python and this is a big help.
Sig (article author) 15/12/11

I tested scrapy and I confirm it is a very good solution for python crawlers and scrapers.

Comments are closed.

Bytes for good

Innover, Servir, Entreprendre !
