Comments on: Web scraping with python (part 1 : crawling)

By: Sig

Sig — Thu, 15 Dec 2011 15:12:00 +0000

I tested Scrapy and I can confirm it is a very good solution for Python crawlers and scrapers.

By: Andrew A. Sailer

Andrew A. Sailer — Sat, 06 Mar 2010 23:42:42 +0000

This is great! Thanks for your article. I am new to Python and this is a big help.

By: Jorge Gonzalez

Jorge Gonzalez — Mon, 21 Sep 2009 21:44:31 +0000

Another very good fit for this task is Scrapy, which is much faster than Mechanize (for crawling) and has a very nice and well-documented API:

http://scrapy.org

It also has its own mechanism for extracting data from web pages using XPath, which is arguably more convenient (and definitely faster) than BeautifulSoup, but you can keep using BeautifulSoup for extracting data and just use its crawler.
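
To give a feel for the XPath-style extraction mentioned above, here is a minimal sketch that uses only Python's standard library (`xml.etree.ElementTree`, which supports a limited XPath subset) rather than Scrapy's own selectors. The sample markup, the `camera` class name and the `extract_titles` helper are made up for illustration, and the snippet assumes well-formed markup; real pages usually need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from any real site).
SAMPLE_PAGE = """
<html>
  <body>
    <ul>
      <li class="camera"><a href="/cameraDetail?id=1">Camera One</a></li>
      <li class="camera"><a href="/cameraDetail?id=2">Camera Two</a></li>
      <li class="ad"><a href="/promo">Special offer</a></li>
    </ul>
  </body>
</html>
"""


def extract_titles(page):
    """Return (title, href) pairs for links inside <li class="camera">."""
    root = ET.fromstring(page.strip())
    # ElementTree understands a subset of XPath, including attribute predicates.
    links = root.findall(".//li[@class='camera']/a")
    return [(a.text, a.get("href")) for a in links]


if __name__ == "__main__":
    for title, href in extract_titles(SAMPLE_PAGE):
        print(title, href)
```

A single path expression replaces the explicit loop over tags and attributes that a tree-walking approach would need, which is the convenience the comment refers to.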

By: Sig

Sig — Thu, 05 Feb 2009 00:03:43 +0000

Lin: message well received. I answered it a couple of minutes ago. In brief: I can't help at the moment because of other priorities. Give dapper.net a try, too. Good luck!

By: Lin

Lin — Wed, 04 Feb 2009 22:19:24 +0000

Hi Sig,

I sent you an email to sig@sig.levillage.org yesterday but I am not sure if that is the right account. If you get my email, could you send a reply?

Thank you,
Lin

By: Sig

Sig — Thu, 29 Jan 2009 20:03:50 +0000

mehdi: you are not on the speech-related page.

Lin: OK. I see dvspot.com is dead. For any site (Amazon included) you have to write your own crawler method. Take the dvspot_crawl() method above as an example and write your own amazon_crawl() method. Hopefully Amazon won't require you to tweak preprocessors, and its HTML will be understood by the mechanize framework without preprocessing.

Furthermore, this article is now 4 years old and the underlying libraries (mechanize and BeautifulSoup) have evolved quite a bit. If I were you, I would also try to adapt this example code to the new versions of these libraries: this may simplify the work required for your amazon_crawl() method.

Good luck with this. Please publish your amazon example here so that other readers can have more examples to take inspiration from.
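
As a starting point for a site-specific method of that kind, here is a minimal Python 3 sketch of the same loop shape as dvspot_crawl(): collect the item links on a listing page, then follow the "Next Page" link until there is none or a cap is reached. It is not the article's code and does not use mechanize. `LinkCollector`, `crawl_listing`, the `fetch` callable and the `example.test` pages are all hypothetical names chosen for illustration, and `fetch` is injected so the logic can be tried without touching the network.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl_listing(fetch, start_url, item_regex, next_text, max_items=500):
    """Walk a paginated listing and return the URLs of the item pages.

    fetch(url) must return the page's HTML as a string.
    """
    items, seen_pages, url = [], set(), start_url
    while url and url not in seen_pages and len(items) < max_items:
        seen_pages.add(url)
        collector = LinkCollector()
        collector.feed(fetch(url))
        next_url = None
        for href, text in collector.links:
            absolute = urljoin(url, href)
            if re.search(item_regex, href) and len(items) < max_items:
                items.append(absolute)
            elif text == next_text:
                next_url = absolute
        url = next_url
    return items


# Hypothetical two-page site used instead of real HTTP requests.
FAKE_SITE = {
    "http://example.test/list?page=1":
        '<a href="/item/1">A</a> <a href="/item/2">B</a> '
        '<a href="/list?page=2">Next Page</a>',
    "http://example.test/list?page=2": '<a href="/item/3">C</a>',
}

if __name__ == "__main__":
    start = "http://example.test/list?page=1"
    for item_url in crawl_listing(FAKE_SITE.__getitem__, start,
                                  r"/item/\d+", "Next Page"):
        print(item_url)
```

For a real site, `fetch` could wrap `urllib.request.urlopen` or a mechanize browser, and the item pattern and "Next Page" text would have to be adapted to that site's markup.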

By: Lin

Lin — Thu, 29 Jan 2009 16:58:48 +0000

By the way, this is the website I used.

http://www.amazon.com/s/ref=nb_ss_etk_ce_av_?url=node%3D1065836%2C172630&field-keywords=&x=12&y=20

Could you help me out? Thank you a lot!

By: Lin

Lin — Thu, 29 Jan 2009 16:45:55 +0000

Hi Sig,

The example website is not working anymore. I tried to use amazon.com instead, but the code does not seem to work properly.

lin

By: mehdi

mehdi — Tue, 16 Sep 2008 14:51:33 +0000

I want an SDK or DLL that converts speech to phonetics, the way MBROLA converts phonetics to speech. Thanks a lot.

By: Sig

Sig — Mon, 25 Aug 2008 12:07:37 +0000

Sorry: no PHP version available unless you code one! :)

By: sunny

sunny — Mon, 25 Aug 2008 07:04:05 +0000

Can I have the code for PHP?

By: Joe Elizondo

Joe Elizondo — Wed, 11 Jul 2007 20:46:29 +0000

No problem. If you ever make a version of the parser that works with the newest Beautiful Soup release, let me know! I'm still trying to tweak that in my spare time.

By: Sig

Sig — Wed, 11 Jul 2007 13:49:27 +0000

Hey Joe, this is great! Thanks a lot for that update.

By: Joe

Joe — Fri, 16 Mar 2007 01:54:09 +0000

Tweaked the code to work with the new versions of the scripts. Working as of March 15, 2007:

    from mechanize import Browser, LinkNotFoundError
    from ClientCookie import BaseHandler
    from StringIO import StringIO
    # import tidy   # uncomment if you use TidyProcessor
    # import sys
    import re
    from time import gmtime, strftime
    #
    # The following two lines are specific to the site you want to crawl.
    # They provide some capabilities to your crawler for it to be able
    # to understand the meaning of the data it is crawling;
    # as an example, for knowing the age of the crawled resource.
    # from datetime import date
    # from my_parser import parsed_resource
    #
    """
    Let's declare some customized pre-processors.
    These are useful when the HTML you are crawling through is not clean
    enough for mechanize. When you crawl through bad HTML, mechanize often
    raises errors. So either you tidy it with a strict tidy module
    (see TidyProcessor) or you tidy some errors you identified "by hand"
    (see MyProcessor).
    Note that because the tidy module is quite strict on HTML, it may change
    the whole structure of the page you are dealing with. As an example, in
    bad HTML, you may encounter nested forms or forms nested in tables or
    tables nested in forms. Tidying them may produce unintended results such
    as closing the form too early or making it empty. This is the reason you
    may have to use MyProcessor instead of TidyProcessor.
    """
    #
    class FakeResponse:
        """Wraps a response so that reads return the pre-processed HTML."""
        def __init__(self, resp, nudata):
            self._resp = resp
            self._sio = StringIO(nudata)
        #
        def __getattr__(self, name):
            # Look at the rewritten body first, then at the original response
            try:
                return getattr(self._sio, name)
            except AttributeError:
                return getattr(self._resp, name)
    #
    class TidyProcessor(BaseHandler):
        def http_response(self, request, response):
            options = dict(output_xhtml=1,
                           add_xml_decl=1,
                           indent=1,
                           output_encoding='utf8',
                           input_encoding='latin1',
                           force_output=1)
            r = tidy.parseString(response.read(), **options)
            return FakeResponse(response, str(r))
        https_response = http_response
    #
    class MyProcessor(BaseHandler):
        def http_response(self, request, response):
            r = response.read()
            r = r.replace('"image""','"image"')
            # As shown here this replacement is a no-op; its first argument
            # was presumably an HTML entity lost when the comment was rendered.
            r = r.replace('"','"')
            return FakeResponse(response, r)
        https_response = http_response
    #
    # Open a browser and optionally choose a customized HTML pre-processor
    b = Browser()
    b.add_handler(MyProcessor())
    #
    """
    Let's declare some utility methods that will enhance mechanize browsing
    capabilities
    """
    #
    def find(response, searchst):
        response.seek(0)
        lr = response.read()
        return re.search(searchst, lr, re.I)
    #
    def save_response(response, kw='file'):
        """Saves last response to timestamped file"""
        name = strftime("%Y%m%d%H%M%S_", gmtime())
        name = name + kw + '.html'
        f = open('./' + name, 'w')
        response.seek(0)
        f.write(response.read())
        f.close()
        return "Response saved as %s" % name
    #
    """
    Hereafter is the only (and somewhat big) script that is specific to the
    site you want to crawl.
    """
    #
    def dvspot_crawl():
        """
        Here starts the browsing session.
        For every move, I could have put as a comment an equivalent PBP
        command line. PBP is a nice scripting layer on top of mechanize.
        But it does not allow looping or conditional browsing. So I preferred
        scripting directly with mechanize instead of using PBP and then
        adding an additional layer of scripting on top of it.
        """
        #
        MAX_NR_OF_ITEMS_PER_SESSION = 500
        items_crawled = 0   # running total across all pages
        #
        # Go to home page
        #
        b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
        #
        # Navigate through the paginated list of cameras
        #
        next_page = 0
        while next_page == 0:
            #
            # Display and save details of every listed item
            #
            url = b.geturl()
            next_element = 0
            while next_element >= 0:
                try:
                    response1 = b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
                    next_element = next_element + 1
                    items_crawled = items_crawled + 1
                    print save_response(response1, "dvspot_camera_" + str(next_element))
                    b.open(url)
                    # if you crawled too many items, stop crawling
                    if items_crawled >= MAX_NR_OF_ITEMS_PER_SESSION:
                        next_element = -1
                        next_page = -1
                except LinkNotFoundError:
                    # You reached the last item in this page
                    next_element = -1
            #
            try:
                b.open(url)
                response2 = b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
                print "processing Next Page"
            except LinkNotFoundError:
                # You reached the last page of the listing of items
                next_page = -1
        #
        return
    #
    #
    #
    if __name__ == '__main__':
        #
        """ Note that you may need to specify your proxy first.
        On windows, you do :
        set HTTP_PROXY=http://proxyname.bigcorp.com:8080
        """
        #
        dvspot_crawl()

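The "tidy by hand" idea behind MyProcessor in the comment above can be tried on its own, without mechanize. The following is a minimal Python 3 sketch under that assumption: `preprocess_html` and `SITE_FIXES` are hypothetical names, and the second fix (collapsing a doubled closing form tag) is invented for illustration; only the stray-quote fix mirrors the listing.

```python
def preprocess_html(raw, fixes):
    """Apply hand-written (bad, good) substitutions to raw HTML, in order."""
    for bad, good in fixes:
        raw = raw.replace(bad, good)
    return raw


# Hypothetical fixes for one site. The first pair mirrors the stray quote
# removed by MyProcessor; the second is made up for this example.
SITE_FIXES = [
    ('"image""', '"image"'),
    ("</form></form>", "</form>"),
]

if __name__ == "__main__":
    broken = '<form><input type="image""></form></form>'
    print(preprocess_html(broken, SITE_FIXES))
```

Keeping the fixes in a plain list makes it easy to maintain one list per site and to see exactly which markup errors were patched, which is harder to tell after a strict tidy pass.
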
By: Joe

Joe — Fri, 16 Mar 2007 01:06:22 +0000

BaseProcessor seems to be causing problems in the code. Does it still exist? Is there a workaround?

By: AkaSig » Blog Archive » Web scraping, web mashing

AkaSig » Blog Archive » Web scraping, web mashing — Thu, 08 Mar 2007 11:51:34 +0000

[…] 5 Ways to Mix, Rip, and Mash Your Data introduces promising web and desktop applications that extract structured data feeds from web sites and mix them together into something possibly useful to you. Think of things like getting filtered Monster job ads as a convenient RSS feed, along with job ads from your other favorite job sites. This reminds me of my Python hacks for automating web crawling and web scraping. Sometimes, I wish I could find time for working a bit further on that… […]

By: Sig

Sig — Tue, 23 Aug 2005 07:01:25 +0000

Alex, I think your SWEA is great. Your screencast is really impressive. I wish we had an equivalent capability to use Mozilla’s Gecko engine from Python. People preferring GUIs to raw python scripts should definitely have a look at SWEA. Thanks for your feedback, Alex.

By: Alex

Alex — Thu, 18 Aug 2005 17:26:50 +0000

Take a look at SWExplorerAutomation, or SWEA (http://home.comcast.net/~furmana/SWIEAutomation.htm). SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from Web pages, and the expressions can be defined visually using the SWEA designer.

By: agb

agb — Fri, 05 Aug 2005 15:04:37 +0000

I think the reason for using mechanoid is that it seems to be more self-contained. Looking at the mechanize docs, installation requires a few more packages in addition to mechanize. So mechanoid seems easier.

But this is the perspective of a guy that’s researching session scraping techniques and has yet to use either package, so ymmv.

By: Lethalman

Lethalman — Sun, 03 Apr 2005 12:35:13 +0000

I'm using mechanize too and I would like to try mechanoid, but I didn't find any examples of how to use it… Does anyone know where to find very simple examples of using mechanoid?