<?xml version="1.0" encoding="ISO-8859-15"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Web scraping with Python (part II)</title>
	<atom:link href="http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/</link>
	<description>Innover, servir, entreprendre.</description>
	<lastBuildDate>Tue, 16 Mar 2010 14:17:59 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: evden eve nakliye</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-140097</link>
		<dc:creator>evden eve nakliye</dc:creator>
		<pubDate>Fri, 06 Mar 2009 10:33:45 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-140097</guid>
		<description>Thanks for this text.</description>
		<content:encoded><![CDATA[<p>Thanks for this text.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-139933</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 03 Mar 2009 08:35:14 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-139933</guid>
		<description>DF asks me (by email) :

&lt;blockquote&gt;I read your articles on Web Scraping with Python and I&#039;m wondering if over the years you&#039;ve come across more advanced ways to solve your problems.

More specifically, do you know of any practical way to extract the same type of data (e.g. camera products) from multiple websites of varying structure, without having to write custom code for each website (or perhaps with very minimal custom code)?

I know there are some websites capable of doing this all automatically.  Take for instance www.vast.com - they have loads of data.

Do you have any insights into this kind of technology?

Thanks Sig - I&#039;d love to hear what you have to say...
&lt;/blockquote&gt;

Here is my answer :

First of all, the ideal solution remains to have the sites publish this data in a structured format (RDF/OWL or JSON, for instance). Most often, if they don&#039;t publish in such a format, it is because they don&#039;t allow you to scrape their data, and you may get into legal trouble because of copyright laws.

That being said, there have been attempts at easing the process of custom-coding the scraping of specific sites. The two most interesting solutions I played with (a couple of years ago) are Openkapow and Dapper.

The advantage of Openkapow over custom script-based scraping is that it offers a rich development environment (GUI) for scraping robots, which eases the process of analyzing the HTML structure. But running and exploiting these robots has proved less flexible and less easy than running your own homemade scrapers.

Dapper has a significant strength: it lets the machine learn the structure from examples. You provide Dapper with several sample pages to extract data from and it &quot;automagically&quot; identifies recurrent HTML patterns which allow it to extract data. There must be some machine learning algorithm behind it, AFAICS. But Dapper has drawbacks: these algorithms are OK for 80% of cases but the other 20% won&#039;t be parseable by Dapper, and Dapper requires the page to be a list of many items (think of a paginated list of search results). Dapper does not seem to be suitable for the technical sheet of a camera, for instance. And Dapper scrapers can&#039;t easily be combined: you can&#039;t easily script the navigation in a complex site unless you combine Dapper with things like Yahoo Pipes.

In conclusion, I would say that simple, easily accessible paginated lists of results are worth trying with Dapper. Openkapow is the tool to use if you can&#039;t write scripts yourself. But the definitive answer to complex and robust scraping remains homemade scripts.

There may be other valuable alternatives I don&#039;t know of. I have not spent much time on scraping since I wrote this article.

Please come and share the results of your own experiments as further comments !</description>
		<content:encoded><![CDATA[<p>DF asks me (by email) :</p>
<blockquote><p>I read your articles on Web Scraping with Python and I&#8217;m wondering if over the years you&#8217;ve come across more advanced ways to solve your problems.</p>
<p>More specifically, do you know of any practical way to extract the same type of data (e.g. camera products) from multiple websites of varying structure, without having to write custom code for each website (or perhaps with very minimal custom code)?</p>
<p>I know there are some websites capable of doing this all automatically.  Take for instance <a href="http://www.vast.com">http://www.vast.com</a> &#8211; they have loads of data.</p>
<p>Do you have any insights into this kind of technology?</p>
<p>Thanks Sig &#8211; I&#8217;d love to hear what you have to say&#8230;
</p></blockquote>
<p>Here is my answer :</p>
<p>First of all, the ideal solution remains to have the sites publish this data in a structured format (RDF/OWL or JSON, for instance). Most often, if they don&#8217;t publish in such a format, it is because they don&#8217;t allow you to scrape their data, and you may get into legal trouble because of copyright laws.</p>
<p>That being said, there have been attempts at easing the process of custom-coding the scraping of specific sites. The two most interesting solutions I played with (a couple of years ago) are Openkapow and Dapper.</p>
<p>The advantage of Openkapow over custom script-based scraping is that it offers a rich development environment (GUI) for scraping robots, which eases the process of analyzing the HTML structure. But running and exploiting these robots has proved less flexible and less easy than running your own homemade scrapers.</p>
<p>Dapper has a significant strength: it lets the machine learn the structure from examples. You provide Dapper with several sample pages to extract data from and it &#8220;automagically&#8221; identifies recurrent HTML patterns which allow it to extract data. There must be some machine learning algorithm behind it, AFAICS. But Dapper has drawbacks: these algorithms are OK for 80% of cases but the other 20% won&#8217;t be parseable by Dapper, and Dapper requires the page to be a list of many items (think of a paginated list of search results). Dapper does not seem to be suitable for the technical sheet of a camera, for instance. And Dapper scrapers can&#8217;t easily be combined: you can&#8217;t easily script the navigation in a complex site unless you combine Dapper with things like Yahoo Pipes.</p>
<p>In conclusion, I would say that simple, easily accessible paginated lists of results are worth trying with Dapper. Openkapow is the tool to use if you can&#8217;t write scripts yourself. But the definitive answer to complex and robust scraping remains homemade scripts.</p>
<p>There may be other valuable alternatives I don&#8217;t know of. I have not spent much time on scraping since I wrote this article.</p>
<p>Please come and share the results of your own experiments as further comments !</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-139583</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 23 Feb 2009 09:36:53 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-139583</guid>
		<description>Of course. It parses HTML, whatever the application engine behind the pages is.</description>
		<content:encoded><![CDATA[<p>Of course. It parses HTML, whatever the application engine behind the pages is.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: web tasar&#305;m</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-139562</link>
		<dc:creator>web tasar&#305;m</dc:creator>
		<pubDate>Mon, 23 Feb 2009 01:43:14 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-139562</guid>
		<description>May I use it to parse asp.net pages?</description>
		<content:encoded><![CDATA[<p>May I use it to parse asp.net pages?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: emrinho</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-137841</link>
		<dc:creator>emrinho</dc:creator>
		<pubDate>Tue, 13 Jan 2009 20:26:22 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-137841</guid>
		<description>Wow pal, awesome tutorial. Thanks for the information, but I kind of got stuck at step 6: I couldn&#039;t add the attributes properly. Well, thanks anyway. It was a different experience for me.</description>
		<content:encoded><![CDATA[<p>Wow pal, awesome tutorial. Thanks for the information, but I kind of got stuck at step 6: I couldn&#8217;t add the attributes properly. Well, thanks anyway. It was a different experience for me.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-44853</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 24 Apr 2006 15:02:33 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-44853</guid>
		<description>Juancho: thx for your thx. :)

I hope I will get an opportunity to refresh this code a bit and to extend its functionality. Unfortunately, at the moment, I am struggling with clauses in my new job contract that let my employer more or less own anything I create (in case some software patents could be produced out of it...)... :(</description>
		<content:encoded><![CDATA[<p>Juancho: thx for your thx. :)</p>
<p>I hope I will get an opportunity to refresh this code a bit and to extend its functionality. Unfortunately, at the moment, I am struggling with clauses in my new job contract that let my employer more or less own anything I create (in case some software patents could be produced out of it&#8230;)&#8230; :(</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Juancho</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-44599</link>
		<dc:creator>Juancho</dc:creator>
		<pubDate>Wed, 29 Mar 2006 01:08:25 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-44599</guid>
		<description>I know this is way after the fact, but Adam, I had that problem as well. uTidylib wraps a C library (I think) and therefore uses the ctypes package to call C code from Python.

If your installation doesn&#039;t have ctypes installed, then uTidylib falls back on its own bundled version (in pvt_ctypes). But that version didn&#039;t work on my computer either - I think it is designed for Python 2.3.

Solution: just install the latest ctypes library from SourceForge on your computer, and uTidylib will then use that version instead of its own private copy. I hope that helps.

Sig - thx much for the code.  I&#039;ve been playing with it extensively for some time now.</description>
		<content:encoded><![CDATA[<p>I know this is way after the fact, but Adam, I had that problem as well. uTidylib wraps a C library (I think) and therefore uses the ctypes package to call C code from Python.</p>
<p>If your installation doesn&#8217;t have ctypes installed, then uTidylib falls back on its own bundled version (in pvt_ctypes). But that version didn&#8217;t work on my computer either &#8211; I think it is designed for Python 2.3.</p>
<p>Solution: just install the latest ctypes library from SourceForge on your computer, and uTidylib will then use that version instead of its own private copy. I hope that helps.</p>
<p>Sig &#8211; thx much for the code.  I&#8217;ve been playing with it extensively for some time now.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-40987</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 13 Sep 2005 13:08:54 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-40987</guid>
		<description>Adam,

I apologize for having forgotten to answer you sooner (job switch + holidays in-between)...

The traceback you provided says that your tidy library complains about a DLL that can&#039;t be loaded. I suggest that you check your uTidylib installation. Maybe try uninstalling and re-installing it to see if that fixes your problem.

If that does not work, I suggest that you ask &lt;a href=&quot;http://developer.berlios.de/projects/utidylib&quot; rel=&quot;nofollow&quot;&gt;the uTidylib project team&lt;/a&gt; for support.

Once again, I sincerely apologize and I hope that you can fix this problem.</description>
		<content:encoded><![CDATA[<p>Adam,</p>
<p>I apologize for having forgotten to answer you sooner (job switch + holidays in-between)&#8230;</p>
<p>The traceback you provided says that your tidy library complains about a DLL that can&#8217;t be loaded. I suggest that you check your uTidylib installation. Maybe try uninstalling and re-installing it to see if that fixes your problem.</p>
<p>If that does not work, I suggest that you ask <a href="http://developer.berlios.de/projects/utidylib">the uTidylib project team</a> for support.</p>
<p>Once again, I sincerely apologize and I hope that you can fix this problem.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: adam</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-36009</link>
		<dc:creator>adam</dc:creator>
		<pubDate>Fri, 03 Jun 2005 08:18:58 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-36009</guid>
		<description>Dear Sig,
  Thanks for your hard work!
I have a problem when I use your project&#039;s ZIP distribution.
I installed the packages by following your guide step by step, but when I run &quot;web_parser&quot;, I get &quot;ImportError: DLL load failed&quot;.

The error output in the shell console is as follows:
Traceback (most recent call last):
  File &quot;D:\download\python\web_parser\web_parser.py&quot;, line 8, in ?
    import tidy
  File &quot;D:\Python24\Lib\site-packages\tidy\__init__.py&quot;, line 38, in ?
    from tidy.lib import parse, parseString
  File &quot;D:\Python24\Lib\site-packages\tidy\lib.py&quot;, line 16, in ?
    import ctypes
  File &quot;D:\Python24\lib\site-packages\tidy\pvt_ctypes\ctypes.zip\ctypes\__init__.py&quot;, line 13, in ?
ImportError: DLL load failed: The specified module could not be found.

Can you help me? Thanks.
</description>
		<content:encoded><![CDATA[<p>Dear Sig,<br />
  Thanks for your hard work!<br />
I have a problem when I use your project&#8217;s ZIP distribution.<br />
I installed the packages by following your guide step by step, but when I run &#8220;web_parser&#8221;, I get &#8220;ImportError: DLL load failed&#8221;.</p>
<p>The error output in the shell console is as follows:<br />
Traceback (most recent call last):<br />
  File "D:\download\python\web_parser\web_parser.py", line 8, in ?<br />
    import tidy<br />
  File "D:\Python24\Lib\site-packages\tidy\__init__.py", line 38, in ?<br />
    from tidy.lib import parse, parseString<br />
  File "D:\Python24\Lib\site-packages\tidy\lib.py", line 16, in ?<br />
    import ctypes<br />
  File "D:\Python24\lib\site-packages\tidy\pvt_ctypes\ctypes.zip\ctypes\__init__.py", line 13, in ?<br />
ImportError: DLL load failed: The specified module could not be found.</p>
<p>Can you help me? Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-35834</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 30 May 2005 13:58:46 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35834</guid>
		<description>Adam: yes, you may. This piece of code does not care about the technology that generates your web pages. Indeed, once your ASP.NET code has been run by your web server, the output sent to the web browser is pure HTML. The only technology it may have problems with is JavaScript, when there is too much of it in a web page (that is, when most of the HTML is generated at run time on the client side). Another limitation: when your HTML is really far from valid, you may have to tidy it with the included pre-processor (see the source code for more explanation on this point).

So anyway, yes, you can certainly parse HTML generated by ASP.NET pages.</description>
		<content:encoded><![CDATA[<p>Adam: yes, you may. This piece of code does not care about the technology that generates your web pages. Indeed, once your ASP.NET code has been run by your web server, the output sent to the web browser is pure HTML. The only technology it may have problems with is JavaScript, when there is too much of it in a web page (that is, when most of the HTML is generated at run time on the client side). Another limitation: when your HTML is really far from valid, you may have to tidy it with the included pre-processor (see the source code for more explanation on this point).</p>
<p>So anyway, yes, you can certainly parse HTML generated by ASP.NET pages.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: adam</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-35833</link>
		<dc:creator>adam</dc:creator>
		<pubDate>Mon, 30 May 2005 08:51:16 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35833</guid>
		<description>May I use it to parse asp.net pages?</description>
		<content:encoded><![CDATA[<p>May I use it to parse asp.net pages?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-35644</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 24 May 2005 14:54:23 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35644</guid>
		<description>Alex : it sounds nice. Do you have some piece of code so that we can try it ?</description>
		<content:encoded><![CDATA[<p>Alex : it sounds nice. Do you have some piece of code so that we can try it ?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: alex</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-35562</link>
		<dc:creator>alex</dc:creator>
		<pubDate>Mon, 23 May 2005 13:38:05 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35562</guid>
		<description>Thanks. I am currently working further on a similar idea: compare two pages from the same site to find the fields that differ. With very little programmed intelligence you can then get the article description, date, title, price ... a la Froogle.</description>
		<content:encoded><![CDATA[<p>Thanks. I am currently working further on a similar idea: compare two pages from the same site to find the fields that differ. With very little programmed intelligence you can then get the article description, date, title, price &#8230; a la Froogle.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-28153</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Wed, 13 Apr 2005 02:57:54 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-28153</guid>
		<description>missing template part: 
&lt;div class=&quot;buying&quot;  pseudotal:content=&quot;title_authors&quot; pseudotal:option=&quot;FULL_CONTENT&quot;&gt;</description>
		<content:encoded><![CDATA[<p>missing template part:<br />
&lt;div class="buying"  pseudotal:content="title_authors" pseudotal:option="FULL_CONTENT"&gt;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-28152</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Wed, 13 Apr 2005 02:56:36 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-28152</guid>
		<description>Since title and authors are mixed together in the div node, I want to get the whole div tag and process it myself. In my template, I have . But my script still treated title_authors as a regular field and did not return the whole HTML string. After I put in my change, it works and returns the whole div tag.</description>
		<content:encoded><![CDATA[<p>Since title and authors are mixed together in the div node, I want to get the whole div tag and process it myself. In my template, I have . But my script still treated title_authors as a regular field and did not return the whole HTML string. After I put in my change, it works and returns the whole div tag.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-27967</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 12 Apr 2005 08:55:27 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-27967</guid>
		<description>taprackbang : thank you for your suggestion ; it looks better than the current version. But why do you say that the option wasn&#039;t working for you ? Please describe the problem you were encountering.</description>
		<content:encoded><![CDATA[<p>taprackbang : thank you for your suggestion ; it looks better than the current version. But why do you say that the option wasn&#8217;t working for you ? Please describe the problem you were encountering.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-25775</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Thu, 07 Apr 2005 00:35:44 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-25775</guid>
		<description>The option wasn&#039;t working for me, so I added one line:

       try:
          option = finish[&#039;pseudotal:option&#039;]
          path.append(option)  # my change

near the end of template.shortest_path(), and it works, because extract() looks for that option value. What do you think?</description>
		<content:encoded><![CDATA[<p>The option wasn&#8217;t working for me, so I added one line:</p>
<p>       try:<br />
          option = finish['pseudotal:option']<br />
          path.append(option)  # my change</p>
<p>near the end of template.shortest_path(), and it works, because extract() looks for that option value. What do you think?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-19566</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Wed, 16 Mar 2005 11:01:36 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-19566</guid>
		<description>Quick notice: Leonard Richardson (BeautifulSoup&#039;s author) gave me positive feedback about my suggested modifications. He is going to include these features in version 2.0 of BeautifulSoup, which he is working on. I don&#039;t plan on making any additional contributions to BeautifulSoup in the near future, but I will certainly update this web parser as soon as v2.0 of BS is available. Thank you, Leonard!</description>
		<content:encoded><![CDATA[<p>Quick notice: Leonard Richardson (BeautifulSoup&#8217;s author) gave me positive feedback about my suggested modifications. He is going to include these features in version 2.0 of BeautifulSoup, which he is working on. I don&#8217;t plan on making any additional contributions to BeautifulSoup in the near future, but I will certainly update this web parser as soon as v2.0 of BS is available. Thank you, Leonard!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-19466</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 14 Mar 2005 08:45:49 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-19466</guid>
		<description>taprackbang : I should package this stuff so that it gets installed more easily but I don&#039;t have any precise idea about how to do this. Regarding malformed HTML (Amazon&#039;s mismatched form tag), I suggest that you write an appropriate regex in a &quot;MyProcessor&quot; class-like preprocessor &lt;a href=&quot;http://sig.levillage.org/?p=588&quot; rel=&quot;nofollow&quot;&gt;as suggested in the first part of this article&lt;/a&gt;. Your pre-processor will help crawling Amazon and saving its pages in a more adequate format.

Thanks all for your positive feedback : keep on commenting, I enjoy that a lot !  :-)</description>
		<content:encoded><![CDATA[<p>taprackbang : I should package this stuff so that it gets installed more easily but I don&#8217;t have any precise idea about how to do this. Regarding malformed HTML (Amazon&#8217;s mismatched form tag), I suggest that you write an appropriate regex in a &#8220;MyProcessor&#8221; class-like preprocessor <a href="http://sig.levillage.org/?p=588">as suggested in the first part of this article</a>. Your pre-processor will help crawling Amazon and saving its pages in a more adequate format.</p>
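<p>A minimal sketch of such a pre-processing step, assuming the page is available as a plain string before it is handed to tidy. The names FORM_TAG and strip_form_tags are hypothetical and are not part of web_parser or of the &#8220;MyProcessor&#8221; class; the regex only illustrates removing the form tags that taprackbang removed by hand.</p>

```python
import re

# Hypothetical pattern (an assumption, not taken from web_parser): matches
# opening and closing form tags, whatever their attributes or letter case.
# Limitation: an attribute value containing a literal ">" would cut the match short.
FORM_TAG = re.compile(r"<[/]?form\b[^>]*>", re.IGNORECASE)


def strip_form_tags(html):
    """Return html with every form start/end tag removed; the tags' contents are kept."""
    return FORM_TAG.sub("", html)
```

<p>The cleaned string can then be passed on to tidy and to the parser as before. Only the form tags themselves are dropped, so fields nested inside a form stay available for extraction.</p>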
<p>Thanks all for your positive feedback : keep on commenting, I enjoy that a lot !  :-)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/comment-page-1/#comment-19343</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Sat, 12 Mar 2005 17:44:05 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-19343</guid>
		<description>This is great. After spending a great deal of time setting up all the prerequisite packages, I was able to scrape Amazon&#039;s web page to get the book title and price easily. However, Amazon&#039;s web page has a mismatched form tag which caused tidy to choke, so I had to manually remove all the form tags before the tidy and web_parser function calls.

It works right out of the box, and made web scraping so much easier. Great work. Thanks!!</description>
		<content:encoded><![CDATA[<p>This is great. After spending a great deal of time setting up all the prerequisite packages, I was able to scrape Amazon&#8217;s web page to get the book title and price easily. However, Amazon&#8217;s web page has a mismatched form tag which caused tidy to choke, so I had to manually remove all the form tags before the tidy and web_parser function calls.</p>
<p>It works right out of the box, and made web scraping so much easier. Great work. Thanks!!</p>
]]></content:encoded>
	</item>
</channel>
</rss>
