Comments on: Web scraping with Python (part II)

By: Tripp Lilley

Tripp Lilley — Wed, 25 Aug 2010 19:16:01 +0000

« I am struggling with these clauses in my new job contract that let my employer kind-of own anything I create (in case there would be some software patents to produce out of it…)… :( »

Don’t worry about patentability… Template::Extract, a Perl module, available at CPAN, predates this by a year, and so should give your employer no grounds for claiming novelty of the invention.

Version 0.36, the first release available at CPAN:
http://search.cpan.org/~autrijus/Template-Extract-0.36/

I, too, thought that « reverse templating » would be a good way to approach this set of problems. I was part way into the « thinking about how it would work » process when I found Template::Extract, which freed me up to think about other problems :-)

By: evden eve nakliye

evden eve nakliye — Fri, 06 Mar 2009 10:33:45 +0000

Thanks for this text.

By: Sig

Sig — Tue, 03 Mar 2009 08:35:14 +0000

DF asks me (by email):

I read your articles on Web Scraping with Python and I’m wondering if over the years you’ve come across more advanced ways to solve your problems.

More specifically, do you know of any practical way to extract the same type of data (i.e. camera products) from multiple websites of varying structure, without having to write custom code for each website (or perhaps with very minimal custom code)?

I know there are some websites capable of doing this all automatically. Take for instance http://www.vast.com – they have loads of data.

Do you have any insights into this kind of technology?

Thanks Sig – I’d love to hear what you have to say…

Here is my answer:

First of all, the ideal solution remains to have the sites publish this data in a structured way (say RDF/OWL or JSON for instance). Most often, if they don’t publish it in such a structured way, it may mean that they don’t allow you to scrape their data, and you may get into legal trouble because of copyright laws.

That being said, there have been attempts at easing the process of custom-coding the scraping of specific sites. The two most interesting solutions I played with (a couple of years ago) are Openkapow and Dapper.

The advantage of Openkapow over custom script-based scraping is that it offers a rich scrape-robot development environment (GUI) which eases the process of analyzing the HTML structure. But running and exploiting these robots has proved to be less flexible and less easy than running your own homemade scrapers.

Dapper has a significant strength: it allows structure to be learnt by the machine based on examples. You provide Dapper with several samples of pages to extract data from and it « automagically » identifies recurrent HTML patterns which allow it to extract data. There must be some machine learning algorithm behind it, as far as I can see. But Dapper has drawbacks: these algorithms are OK for 80% of cases but the other 20% won’t be parseable by Dapper, and Dapper requires the page to be a list of many items (think of a paginated list of search results). Dapper does not seem to be suitable for the technical sheet of a camera, for instance. And Dapper scrapers can’t easily be combined: you can’t easily script the navigation in a complex site unless you combine Dapper with things like Yahoo Pipes.

As a conclusion, I would say that simple and easily accessible paginated lists of results are worth some dappering. Openkapow is the tool to use if you can’t script by yourself. But the definitive answer to complex and robust scraping remains homemade scripts.

There may be other valuable alternatives I don’t know of. I have not been spending much time on scraping since I wrote this article.

Please come and share the results of your own experiments as further comments!

By: Sig

Sig — Mon, 23 Feb 2009 09:36:53 +0000

In reply to web tasarım. Of course. It parses HTML, whatever the application engine behind the pages is.

By: web tasarım

web tasarım — Mon, 23 Feb 2009 01:43:14 +0000

May I use it to parse asp.net pages?

By: emrinho

emrinho — Tue, 13 Jan 2009 20:26:22 +0000

Wow pal, awesome tutorial. Thanks for the information, but I kinda got stuck at step 6. I couldn’t properly add the attributes. Well, thanks anyway. It was a different experience for me.

By: Sig

Sig — Mon, 24 Apr 2006 15:02:33 +0000

Juancho: thx for your thx. :)

I hope I will have some opportunity to refresh this code a bit and to extend its functionality. Unfortunately, at the moment, I am struggling with these clauses in my new job contract that let my employer kind-of own anything I create (in case there would be some software patents to produce out of it…)… :(

By: Juancho

Juancho — Wed, 29 Mar 2006 01:08:25 +0000

I know this is way after the fact, but Adam, I had that problem as well. uTidylib wraps a C library (I think) and therefore uses the ctypes package to execute C code from Python.

If your installation doesn’t have ctypes installed, then uTidylib has its own version that it tries to use (in pvt_ctypes). But that version didn’t work on my computer either; I think it is designed for Python 2.3.

The solution: just install the latest ctypes library from SourceForge on your computer, and then uTidylib will use that version instead of its own private version. I hope that helps.

Sig – thx much for the code. I’ve been playing with it extensively for some time now.
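
To see which copy of a module the interpreter would actually pick up, a small check along these lines may help. It is only a sketch for a current Python 3, where ctypes ships with the standard library (the thread above concerns Python 2.4, where it was a separate install). `locate_module` is a hypothetical helper name and is not part of uTidylib.

```python
import importlib.util


def locate_module(name):
    """Return the path a top-level module would be loaded from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # "tidy" is the package name that uTidylib installs, per the traceback in this thread.
    for name in ("ctypes", "tidy"):
        origin = locate_module(name)
        if origin is None:
            print(name + ": not importable from this interpreter")
        else:
            print(name + ": would be loaded from " + origin)
```

If the path printed for ctypes points inside the tidy package (a pvt_ctypes directory), the private copy is the one being used.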

By: Sig

Sig — Tue, 13 Sep 2005 13:08:54 +0000

Adam, I apologize for having forgotten to answer you sooner (job switch + holidays in-between)... The traceback you provided says that your tidy lib complains about some DLL that can't be loaded. I suggest that you check your uTidylib installation. Maybe try to uninstall and re-install it and see if that fixes your problem. If it does not work, I suggest that you ask the uTidylib project team for support. Once again I sincerely apologize and I hope that you can fix this problem.

By: adam

adam — Fri, 03 Jun 2005 08:18:58 +0000

Dear Sig,
Thanks for your hard work!
I have a problem when I use the project’s ZIP distribution.
I installed these packages following your guide step by step, but when I ran « web_parser », I was told « ImportError: DLL load failed ».

The error messages printed to the shell console are as follows:
Traceback (most recent call last):
  File "D:\download\python\web_parser\web_parser.py", line 8, in ?
    import tidy
  File "D:\Python24\Lib\site-packages\tidy\__init__.py", line 38, in ?
    from tidy.lib import parse, parseString
  File "D:\Python24\Lib\site-packages\tidy\lib.py", line 16, in ?
    import ctypes
  File "D:\Python24\lib\site-packages\tidy\pvt_ctypes\ctypes.zip\ctypes\__init__.py", line 13, in ?
ImportError: DLL load failed: 找不到指定的模块。 (in English: "The specified module could not be found.")

Can you help me? Thanks.

By: Sig

Sig — Mon, 30 May 2005 13:58:46 +0000

Adam: yes, you may. This piece of code does not care about the technology that generates your web pages. Indeed, once your ASP.Net code has been accessed and run by your web server, the output sent to the web browser is pure HTML. The only technology that it may have problems with is Javascript, when there is too much of it in a web page (when most of the HTML is generated at run-time on the client side). Another limitation: when your HTML is really, really far from valid, you may have to tidy it with an included pre-processor (see the source code for more explanation on this point).

So anyway, yes, you can certainly parse HTML generated by ASP.Net pages.
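
The Javascript limitation mentioned above can be illustrated with a small sketch. It uses the standard library's `html.parser` rather than the parser from this article, and the sample page and the names `TextCollector` and `visible_text` are made up for the illustration. The point is only that a static HTML parser sees the markup the server sent, not what a script would add later in the browser.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text found in the static HTML, skipping the source code of scripts."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script source is delivered as data too; it is code, not page content.
        if not self._in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: the product name only exists after the script runs in a browser.
page = (
    '<html><body><h1>Cameras</h1><div id="list"></div>'
    '<script>document.getElementById("list").innerHTML = "Model X";</script>'
    "</body></html>"
)

if __name__ == "__main__":
    print(visible_text(page))
```

The heading is found, but "Model X" never appears as page content because nothing executes the script.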

By: adam

adam — Mon, 30 May 2005 08:51:16 +0000

May I use it to parse asp.net pages?

By: Sig

Sig — Tue, 24 May 2005 14:54:23 +0000

Alex: it sounds nice. Do you have some piece of code so that we can try it?

By: alex

alex — Mon, 23 May 2005 13:38:05 +0000

Thanks. I am currently working further on a similar idea: compare two pages from the same site to find the fields that change. With very little programmed intelligence you can then get the article description, date, title, price … à la Froogle.
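
The comparison idea described above could be sketched as follows. This is not alex's code (none was posted); it is a minimal illustration using the standard library's `difflib`, and the function names and the two sample pages are assumptions made for the example.

```python
import difflib
import re


def tokenize(html):
    """Split HTML into a flat list of tags and text chunks, dropping empty pieces."""
    return [piece for piece in re.split(r"(<[^>]+>)", html) if piece.strip()]


def varying_fields(page_a, page_b):
    """Return (text_in_a, text_in_b) pairs for the parts that differ between two pages."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    fields = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Whatever is not shared by both pages is a candidate data field.
            fields.append(("".join(a[i1:i2]).strip(), "".join(b[j1:j2]).strip()))
    return fields


# Two hypothetical product pages generated from the same site template.
page_a = '<html><h1>Camera A</h1><span class="price">$199</span></html>'
page_b = '<html><h1>Camera B</h1><span class="price">$249</span></html>'

if __name__ == "__main__":
    for left, right in varying_fields(page_a, page_b):
        print(repr(left), "vs", repr(right))
```

The shared markup acts as the template and the differing chunks are the candidate fields; deciding which one is the title and which one is the price would still need some extra logic.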

By: taprackbang

taprackbang — Wed, 13 Apr 2005 02:57:54 +0000

missing template part: <div class="buying" pseudotal:content="title_authors" pseudotal:option="FULL_CONTENT">

By: taprackbang

taprackbang — Wed, 13 Apr 2005 02:56:36 +0000

Since title and authors are mixed together in the div node, I want to get the whole div tag and process it myself. In my template, I have <div class="buying" pseudotal:content="title_authors" pseudotal:option="FULL_CONTENT">, but my script still treated title_authors as a regular field and did not return the whole HTML string. After I put in my change, it worked and returned the whole div tag.

By: Sig

Sig — Tue, 12 Apr 2005 08:55:27 +0000

taprackbang: thank you for your suggestion; it looks better than the current version. But why do you say that the option wasn’t working for you? Please describe the problem you were encountering.

By: taprackbang

taprackbang — Thu, 07 Apr 2005 00:35:44 +0000

The option wasn’t working for me. Then I added one line:

try:
    option = finish['pseudotal:option']
    path.append(option)  # my change

near the end of template.shortest_path(), and it works, because extract() looks for that option value. What do you think?

By: Sig

Sig — Wed, 16 Mar 2005 11:01:36 +0000

Quick notice: Leonard Richardson (BeautifulSoup’s author) gave me positive feedback about my suggested modifications. He plans to include these features in version 2.0 of BeautifulSoup, which he is working on. I don’t plan on making any additional contribution to BeautifulSoup in the near future, but I will certainly update this web parser as soon as v.2.0 of BS is available. Thank you Leonard!

By: Sig

Sig — Mon, 14 Mar 2005 08:45:49 +0000

taprackbang: I should package this stuff so that it gets installed more easily, but I don't have any precise idea about how to do this. Regarding malformed HTML (Amazon's mismatched form tag), I suggest that you write an appropriate regex in a "MyProcessor"-like preprocessor class, as suggested in the first part of this article. Your pre-processor will help with crawling Amazon and saving its pages in a more adequate format. Thanks all for your positive feedback: keep on commenting, I enjoy that a lot! :-)
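
As a rough sketch of the regex approach suggested above: the real preprocessor interface is the one described in the first part of the article and is not reproduced in these comments, so the class name `FormStrippingPreprocessor` and its `process` method are hypothetical. The sketch assumes that dropping form tags altogether is acceptable, since a mismatched form tag carries no data worth extracting.

```python
import re

# Matches opening and closing form tags, e.g. <form action="..."> and </form>.
FORM_TAG = re.compile(r"</?form\b[^>]*>", re.IGNORECASE)


class FormStrippingPreprocessor:
    """Remove form tags so that a mismatched pair can no longer confuse the HTML parser."""

    def process(self, html):
        return FORM_TAG.sub("", html)


if __name__ == "__main__":
    # Hypothetical fragment where the form is closed after its enclosing div.
    broken = '<div><form action="/s"><input name="q"></div></form>'
    print(FormStrippingPreprocessor().process(broken))
```

The input elements are kept; only the form wrapper that breaks the nesting is removed.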