<?xml version="1.0" encoding="ISO-8859-15"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Web scraping with Python (part II)</title>
	<atom:link href="http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/</link>
	<description>Innover, servir, entreprendre.</description>
	<pubDate>Thu, 04 Dec 2008 01:17:50 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
		<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-44853</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 24 Apr 2006 15:02:33 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-44853</guid>
		<description>Juancho: thx for your thx. :)

I hope I will get a chance to refresh this code a bit and extend its functionality. Unfortunately, at the moment, I am struggling with clauses in my new job contract that let my employer kind of own anything I create (in case some software patents could be produced out of it...)... :(</description>
		<content:encoded><![CDATA[<p>Juancho: thx for your thx. :)</p>
<p>I hope I will get a chance to refresh this code a bit and extend its functionality. Unfortunately, at the moment, I am struggling with clauses in my new job contract that let my employer kind of own anything I create (in case some software patents could be produced out of it&#8230;)&#8230; :(</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Juancho</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-44599</link>
		<dc:creator>Juancho</dc:creator>
		<pubDate>Wed, 29 Mar 2006 01:08:25 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-44599</guid>
		<description>I know this is way after the fact, but Adam, I had that problem as well. uTidylib wraps a C lib (I think) and therefore uses the ctypes package to call C code from Python.

If your installation doesn't have ctypes installed, uTidylib tries to use its own bundled version (in pvt_ctypes). But that version didn't work on my computer either; I think it was built for Python 2.3.

Solution: just install the latest ctypes library from SourceForge on your computer, and uTidylib will then use that version instead of its own private copy. I hope that helps.

Sig - thx much for the code.  I've been playing with it extensively for some time now.</description>
		<content:encoded><![CDATA[<p>I know this is way after the fact, but Adam, I had that problem as well. uTidylib wraps a C lib (I think) and therefore uses the ctypes package to call C code from Python.</p>
<p>If your installation doesn&#8217;t have ctypes installed, uTidylib tries to use its own bundled version (in pvt_ctypes). But that version didn&#8217;t work on my computer either; I think it was built for Python 2.3.</p>
<p>Solution: just install the latest ctypes library from SourceForge on your computer, and uTidylib will then use that version instead of its own private copy. I hope that helps.</p>
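<p>A quick way to see which ctypes copy Python actually picks up is to print where the module was loaded from. This is only a sketch and not part of uTidylib: the describe_ctypes name is made up for illustration, and it is written for a current Python 3 rather than the Python 2.4 setup discussed in this thread.</p>

```python
import ctypes


def describe_ctypes():
    """Return (version, file location) of the ctypes package that got imported."""
    # Hypothetical helper, not a uTidylib function.
    # The version attribute is read defensively in case it is absent.
    version = getattr(ctypes, "__version__", "unknown")
    return version, ctypes.__file__


version, location = describe_ctypes()
print("ctypes", version, "loaded from", location)
```

<p>If the import itself fails, no system-wide ctypes is installed, which is the situation where uTidylib falls back on its private copy.</p>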
<p>Sig - thx much for the code.  I&#8217;ve been playing with it extensively for some time now.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-40987</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 13 Sep 2005 13:08:54 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-40987</guid>
		<description>Adam,

I apologize for not answering you sooner (job switch + holidays in between)...

The traceback you provided says that your tidy lib complains about a DLL that can't be loaded. I suggest that you check your uTidylib installation. Maybe try to uninstall and re-install it and see if that fixes your problem.

If it does not work, I suggest that you ask &lt;a href="http://developer.berlios.de/projects/utidylib" rel="nofollow"&gt;the uTidylib project team&lt;/a&gt; for support.

Once again, I sincerely apologize, and I hope that you were able to fix this problem.</description>
		<content:encoded><![CDATA[<p>Adam,</p>
<p>I apologize for not answering you sooner (job switch + holidays in between)&#8230;</p>
<p>The traceback you provided says that your tidy lib complains about a DLL that can&#8217;t be loaded. I suggest that you check your uTidylib installation. Maybe try to uninstall and re-install it and see if that fixes your problem.</p>
<p>If it does not work, I suggest that you ask <a href="http://developer.berlios.de/projects/utidylib">the uTidylib project team</a> for support.</p>
<p>Once again, I sincerely apologize, and I hope that you were able to fix this problem.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: adam</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-36009</link>
		<dc:creator>adam</dc:creator>
		<pubDate>Fri, 03 Jun 2005 08:18:58 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-36009</guid>
		<description>Dear Sig,
  Thanks for your hard work!
I have trouble when I use your project's ZIP distribution.
I installed the packages following your guide step by step, but when I run "web_parser", I get "ImportError: DLL load failed".

The error message printed to the shell console is the following:
Traceback (most recent call last):
  File "D:\download\python\web_parser\web_parser.py", line 8, in ?
    import tidy
  File "D:\Python24\Lib\site-packages\tidy\__init__.py", line 38, in ?
    from tidy.lib import parse, parseString
  File "D:\Python24\Lib\site-packages\tidy\lib.py", line 16, in ?
    import ctypes
  File "D:\Python24\lib\site-packages\tidy\pvt_ctypes\ctypes.zip\ctypes\__init__.py", line 13, in ?
ImportError: DLL load failed: The specified module could not be found.

Can you help me? Thanks.
</description>
		<content:encoded><![CDATA[<p>Dear Sig,<br />
  Thanks for your hard work!<br />
I have trouble when I use your project&#8217;s ZIP distribution.<br />
I installed the packages following your guide step by step, but when I run &#8220;web_parser&#8221;, I get &#8220;ImportError: DLL load failed&#8221;.</p>
<p>The error message printed to the shell console is the following:<br />
Traceback (most recent call last):<br />
  File "D:\download\python\web_parser\web_parser.py", line 8, in ?<br />
    import tidy<br />
  File "D:\Python24\Lib\site-packages\tidy\__init__.py", line 38, in ?<br />
    from tidy.lib import parse, parseString<br />
  File "D:\Python24\Lib\site-packages\tidy\lib.py", line 16, in ?<br />
    import ctypes<br />
  File "D:\Python24\lib\site-packages\tidy\pvt_ctypes\ctypes.zip\ctypes\__init__.py", line 13, in ?<br />
ImportError: DLL load failed: The specified module could not be found.</p>
<p>Can you help me? Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-35834</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 30 May 2005 13:58:46 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35834</guid>
		<description>Adam: yes, you may. This piece of code does not care about the technology that generates your web pages. Once your ASP.NET code has been run by your web server, the output sent to the web browser is pure HTML. The only technology it may have problems with is JavaScript, when a page relies on it so heavily that most of the HTML is generated at run time on the client side. Another limitation: when your HTML is really far from valid, you may have to tidy it with the included pre-processor (see the source code for more explanation on this point).

So anyway, yes, you can certainly parse HTML generated by ASP.NET pages.</description>
		<content:encoded><![CDATA[<p>Adam: yes, you may. This piece of code does not care about the technology that generates your web pages. Once your ASP.NET code has been run by your web server, the output sent to the web browser is pure HTML. The only technology it may have problems with is JavaScript, when a page relies on it so heavily that most of the HTML is generated at run time on the client side. Another limitation: when your HTML is really far from valid, you may have to tidy it with the included pre-processor (see the source code for more explanation on this point).</p>
<p>So anyway, yes, you can certainly parse HTML generated by ASP.NET pages.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: adam</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-35833</link>
		<dc:creator>adam</dc:creator>
		<pubDate>Mon, 30 May 2005 08:51:16 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35833</guid>
		<description>May I use it to parse asp.net pages?</description>
		<content:encoded><![CDATA[<p>May I use it to parse asp.net pages?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-35644</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 24 May 2005 14:54:23 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35644</guid>
		<description>Alex: that sounds nice. Do you have some code that we could try?</description>
		<content:encoded><![CDATA[<p>Alex: that sounds nice. Do you have some code that we could try?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: alex</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-35562</link>
		<dc:creator>alex</dc:creator>
		<pubDate>Mon, 23 May 2005 13:38:05 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-35562</guid>
		<description>Thanks. I am currently working further on a similar idea: compare two pages from the same site to find the fields that change. With very little programmed intelligence you can then get article description, date, title, price ... a la Froogle.</description>
		<content:encoded><![CDATA[<p>Thanks. I am currently working further on a similar idea: compare two pages from the same site to find the fields that change. With very little programmed intelligence you can then get article description, date, title, price &#8230; a la Froogle.</p>
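<p>The page-comparison idea can be sketched with the standard difflib module. This is not Alex's code (none was posted): it assumes each page has already been reduced to a list of text fragments, and the changed_fields name and the sample data are made up for illustration.</p>

```python
import difflib


def changed_fields(page_a, page_b):
    """Return (old, new) pairs for the runs of fragments that differ."""
    matcher = difflib.SequenceMatcher(a=page_a, b=page_b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Fragments shared by both pages are boilerplate; the rest is data.
        if tag != "equal":
            changes.append((page_a[i1:i2], page_b[j1:j2]))
    return changes


# Two invented product pages from the same hypothetical site.
page_a = ["Title:", "Dive Into Python", "Price:", "$39.99", "In stock"]
page_b = ["Title:", "Learning Python", "Price:", "$27.50", "In stock"]

for old, new in changed_fields(page_a, page_b):
    print(old, "vs", new)
```

<p>The fragments that stay identical act as the template, and whatever sits between them is a candidate field such as a title or a price.</p>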
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-28153</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Wed, 13 Apr 2005 02:57:54 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-28153</guid>
		<description>missing template part: 
&#60;div class="buying"  pseudotal:content="title_authors" pseudotal:option="FULL_CONTENT"&#62;</description>
		<content:encoded><![CDATA[<p>missing template part:<br />
&lt;div class=&quot;buying&quot;  pseudotal:content=&quot;title_authors&quot; pseudotal:option=&quot;FULL_CONTENT&quot;&gt;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-28152</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Wed, 13 Apr 2005 02:56:36 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-28152</guid>
		<description>Since title and authors are mixed together in the div node, I want to get the whole div tag and process it myself. In my template, I have . But my script still treated title_authors as a regular field and was not returning the whole HTML string. After I put in my change, it works and returns the whole div tag.</description>
		<content:encoded><![CDATA[<p>Since title and authors are mixed together in the div node, I want to get the whole div tag and process it myself. In my template, I have . But my script still treated title_authors as a regular field and was not returning the whole HTML string. After I put in my change, it works and returns the whole div tag.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-27967</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 12 Apr 2005 08:55:27 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-27967</guid>
		<description>taprackbang: thank you for your suggestion; it looks better than the current version. But why do you say that the option wasn't working for you? Please describe the problem you were encountering.</description>
		<content:encoded><![CDATA[<p>taprackbang: thank you for your suggestion; it looks better than the current version. But why do you say that the option wasn&#8217;t working for you? Please describe the problem you were encountering.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-25775</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Thu, 07 Apr 2005 00:35:44 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-25775</guid>
		<description>The option wasn't working for me, so I added one line:

       try:
          option = finish['pseudotal:option']
          path.append(option)  # my change

near the end of template.shortest_path(), and it works, because extract() looks for that option value. What do you think?</description>
		<content:encoded><![CDATA[<p>The option wasn&#8217;t working for me, so I added one line:</p>
<p>       try:<br />
          option = finish['pseudotal:option']<br />
          path.append(option)  # my change</p>
<p>near the end of template.shortest_path(), and it works, because extract() looks for that option value. What do you think?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-19566</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Wed, 16 Mar 2005 11:01:36 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-19566</guid>
		<description>Quick note: Leonard Richardson (BeautifulSoup's author) gave me positive feedback on my suggested modifications. He plans to include these features in version 2.0 of BeautifulSoup, which he is working on. I don't plan on making any additional contributions to BeautifulSoup in the near future, but I will certainly update this web parser as soon as v2.0 of BS is available. Thank you, Leonard!</description>
		<content:encoded><![CDATA[<p>Quick note: Leonard Richardson (BeautifulSoup&#8217;s author) gave me positive feedback on my suggested modifications. He plans to include these features in version 2.0 of BeautifulSoup, which he is working on. I don&#8217;t plan on making any additional contributions to BeautifulSoup in the near future, but I will certainly update this web parser as soon as v2.0 of BS is available. Thank you, Leonard!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-19466</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 14 Mar 2005 08:45:49 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-19466</guid>
		<description>taprackbang: I should package this stuff so that it installs more easily, but I don't have a precise idea of how to do this. Regarding malformed HTML (Amazon's mismatched form tag), I suggest that you write an appropriate regex in a preprocessor modeled on the "MyProcessor" class, &lt;a href="http://sig.levillage.org/?p=588" rel="nofollow"&gt;as suggested in the first part of this article&lt;/a&gt;. Your pre-processor will help with crawling Amazon and saving its pages in a more suitable format.

Thanks, all, for your positive feedback: keep on commenting, I enjoy that a lot! :-)</description>
		<content:encoded><![CDATA[<p>taprackbang: I should package this stuff so that it installs more easily, but I don&#8217;t have a precise idea of how to do this. Regarding malformed HTML (Amazon&#8217;s mismatched form tag), I suggest that you write an appropriate regex in a preprocessor modeled on the &#8220;MyProcessor&#8221; class, <a href="http://sig.levillage.org/?p=588">as suggested in the first part of this article</a>. Your pre-processor will help with crawling Amazon and saving its pages in a more suitable format.</p>
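<p>As a rough illustration of such a regex, the sketch below drops every opening and closing form tag while keeping what sits between them. It is an assumption-laden example, not the article's code: strip_form_tags and FORM_TAG are made-up names, and how to plug the function into a "MyProcessor"-style class depends on the code from part one.</p>

```python
import re

# Angle brackets are written as \x3c and \x3e so that this snippet
# can sit inside an HTML page without any escaping.
FORM_TAG = re.compile(r"\x3c/?form\b[^\x3e]*\x3e", re.IGNORECASE)


def strip_form_tags(html):
    """Remove opening and closing form tags, keeping everything between them."""
    # Hypothetical helper: only the tags themselves are deleted.
    return FORM_TAG.sub("", html)
```

<p>Running this on the raw page before handing it to tidy mirrors the manual form-tag removal described in the comment that prompted this reply.</p>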
<p>Thanks, all, for your positive feedback: keep on commenting, I enjoy that a lot! :-)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: taprackbang</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-19343</link>
		<dc:creator>taprackbang</dc:creator>
		<pubDate>Sat, 12 Mar 2005 17:44:05 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-19343</guid>
		<description>This is great. After spending a great deal of time setting up all the prerequisite packages, I was able to scrape Amazon's web page and get the book title and price easily. However, Amazon's web page has a mismatched form tag which caused tidy to choke, so I had to manually remove all the form tags before the tidy and web_parser function calls.

It works right out of the box, and made web scraping so much easier. Great work. Thanks!!</description>
		<content:encoded><![CDATA[<p>This is great. After spending a great deal of time setting up all the prerequisite packages, I was able to scrape Amazon&#8217;s web page and get the book title and price easily. However, Amazon&#8217;s web page has a mismatched form tag which caused tidy to choke, so I had to manually remove all the form tags before the tidy and web_parser function calls.</p>
<p>It works right out of the box, and made web scraping so much easier. Great work. Thanks!!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kedai</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-19334</link>
		<dc:creator>kedai</dc:creator>
		<pubDate>Sat, 12 Mar 2005 15:04:09 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-19334</guid>
		<description>nice.  thanks for the link to beautiful soup.  probably will try and use that as a parser for KebasData (a zope product to scrape the web, what else).

currently, KebasData uses regex, and as noted by Leonard Richardson, regex is a double-edged sword; it can help and it can also cause more trouble ;)

</description>
		<content:encoded><![CDATA[<p>nice.  thanks for the link to beautiful soup.  probably will try and use that as a parser for KebasData (a zope product to scrape the web, what else).</p>
<p>currently, KebasData uses regex, and as noted by Leonard Richardson, regex is a double-edged sword; it can help and it can also cause more trouble ;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Reinout van Rees</title>
		<link>http://www.akasig.org/2005/03/11/web-scraping-with-python-part-ii/#comment-18943</link>
		<dc:creator>Reinout van Rees</dc:creator>
		<pubDate>Fri, 11 Mar 2005 12:54:41 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=599#comment-18943</guid>
		<description>Now that's a nice solution! Halfway through your article I had a real AH!!! moment: this is simply a good idea. http://vanrees.org/weblog/1110545603</description>
		<content:encoded><![CDATA[<p>Now that&#8217;s a nice solution! Halfway through your article I had a real AH!!! moment: this is simply a good idea. <a href="http://vanrees.org/weblog/1110545603">http://vanrees.org/weblog/1110545603</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>
