<?xml version="1.0" encoding="ISO-8859-15"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Web scraping with python (part 1 : crawling)</title>
	<atom:link href="http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/</link>
	<description>Innover, servir, entreprendre.</description>
	<lastBuildDate>Mon, 15 Mar 2010 15:01:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Andrew A. Sailer</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-157552</link>
		<dc:creator>Andrew A. Sailer</dc:creator>
		<pubDate>Sat, 06 Mar 2010 23:42:42 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-157552</guid>
		<description>This is great! Thanks for your article. I am new to Python and this is a big help.</description>
		<content:encoded><![CDATA[<p>This is great! Thanks for your article. I am new to Python and this is a big help.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jorge Gonzalez</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-151262</link>
		<dc:creator>Jorge Gonzalez</dc:creator>
		<pubDate>Mon, 21 Sep 2009 21:44:31 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-151262</guid>
		<description>Another very good fit for this task is Scrapy, which is much faster than Mechanize (for crawling) and has a very nice, well-documented API:

http://scrapy.org

It also has its own mechanism for extracting data from web pages using XPath, which is arguably more convenient (and definitely faster) than BeautifulSoup, but you can keep using BeautifulSoup for extracting data and just use its crawler.</description>
		<content:encoded><![CDATA[<p>Another very good fit for this task is Scrapy, which is much faster than Mechanize (for crawling) and has a very nice, well-documented API:</p>
<p><a href="http://scrapy.org">http://scrapy.org</a></p>
<p>It also has its own mechanism for extracting data from web pages using XPath, which is arguably more convenient (and definitely faster) than BeautifulSoup, but you can keep using BeautifulSoup for extracting data and just use its crawler.</p>
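<p>As a rough illustration of the XPath style of extraction described above, here is a minimal Python 3 sketch. It does not use Scrapy: it relies on the standard library's ElementTree, which only supports a small XPath subset and requires well-formed markup, so treat it as a stand-in under those assumptions. The sample page and the <code>extract_detail_links</code> helper are made up for the example.</p>

```python
# Illustration only: Scrapy is not used here. The standard library's
# ElementTree supports a limited XPath subset and needs well-formed markup.
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing page (invented for this sketch).
PAGE = """
<html><body>
  <ul>
    <li><a class="detail" href="/cameraDetail.php?id=1">Camera One</a></li>
    <li><a class="detail" href="/cameraDetail.php?id=2">Camera Two</a></li>
    <li><a class="nav" href="/cameraList.php?start=20">Next Page</a></li>
  </ul>
</body></html>
"""


def extract_detail_links(markup):
    """Return (text, href) pairs for every <a class="detail"> element."""
    root = ET.fromstring(markup.strip())
    # XPath: any <a> descendant whose class attribute equals "detail".
    anchors = root.findall(".//a[@class='detail']")
    return [(a.text, a.get("href")) for a in anchors]


links = extract_detail_links(PAGE)
for text, href in links:
    print(text, href)
```

<p>Real-world HTML is rarely well-formed, which is where a dedicated selector library or BeautifulSoup becomes useful.</p>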
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-138746</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Thu, 05 Feb 2009 00:03:43 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-138746</guid>
		<description>Lin: message well received. I answered it a couple of minutes ago. In brief: I can&#039;t help at the moment because of other priorities. Give dapper.net a try, too. Good luck!</description>
		<content:encoded><![CDATA[<p>Lin: message well received. I answered it a couple of minutes ago. In brief: I can&#8217;t help at the moment because of other priorities. Give dapper.net a try, too. Good luck!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lin</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-138740</link>
		<dc:creator>Lin</dc:creator>
		<pubDate>Wed, 04 Feb 2009 22:19:24 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-138740</guid>
		<description>Hi Sig, 

I sent you an email to sig@sig.levillage.org yesterday but I am not sure if that is the right account. If you get my email, could you send a reply? 

Thank you, 
Lin</description>
		<content:encoded><![CDATA[<p>Hi Sig, </p>
<p>I sent you an email to <a href="mailto:sig@sig.levillage.org">sig@sig.levillage.org</a> yesterday but I am not sure if that is the right account. If you get my email, could you send a reply? </p>
<p>Thank you,<br />
Lin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-138483</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Thu, 29 Jan 2009 20:03:50 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-138483</guid>
		<description>mehdi: you are not on the speech-related page.

Lin: OK, I see that dvspot.com is dead. For any site (Amazon included), you have to write your own crawler method. Take the dvspot_crawl() method above as an example and write your own amazon_crawl() method. Hopefully Amazon won&#039;t require you to tweak preprocessors, and its HTML will be understood by the mechanize framework without preprocessing.

Furthermore, this article is now 4 years old, and the underlying libraries (mechanize and BeautifulSoup) have evolved quite a bit. If I were you, I would also try to adapt this example code to the new versions of these libraries: this may simplify the work required for your amazon_crawl() method.

Good luck with this. Please publish your Amazon example here so that other readers have more examples to take inspiration from.</description>
		<content:encoded><![CDATA[<p>mehdi: you are not on the speech-related page.</p>
<p>Lin: OK, I see that dvspot.com is dead. For any site (Amazon included), you have to write your own crawler method. Take the dvspot_crawl() method above as an example and write your own amazon_crawl() method. Hopefully Amazon won&#8217;t require you to tweak preprocessors, and its HTML will be understood by the mechanize framework without preprocessing.</p>
<p>Furthermore, this article is now 4 years old, and the underlying libraries (mechanize and BeautifulSoup) have evolved quite a bit. If I were you, I would also try to adapt this example code to the new versions of these libraries: this may simplify the work required for your amazon_crawl() method.</p>
<p>Good luck with this. Please publish your Amazon example here so that other readers have more examples to take inspiration from.</p>
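<p>A minimal Python 3 sketch of the shape such a site-specific crawl method might take: collect the item links on each listing page, then follow the "Next Page" link until it disappears or a cap is reached. The names <code>crawl_listing</code>, <code>fetch</code> and <code>FAKE_SITE</code> are hypothetical, and the in-memory site only stands in for real HTTP requests made with mechanize or another client.</p>

```python
import re


def crawl_listing(fetch, start_url, item_regex, next_regex, max_items=500):
    """Walk a paginated listing and return the URLs of the items found.

    fetch(url) must return the page body as a string; in a real crawler it
    would wrap mechanize or another HTTP client (assumption, not shown).
    item_regex and next_regex must each capture a URL in group 1.
    """
    items = []
    seen_pages = set()  # guards against listings that loop back on themselves
    url = start_url
    while url is not None and url not in seen_pages:
        seen_pages.add(url)
        body = fetch(url)
        for match in re.finditer(item_regex, body):
            if len(items) >= max_items:
                return items  # cap reached: stop crawling
            items.append(match.group(1))
        next_match = re.search(next_regex, body)
        url = next_match.group(1) if next_match else None
    return items


# Hypothetical two-page site held in memory, standing in for a real one.
FAKE_SITE = {
    "/list?page=1": '<a href="/item/1">A</a> <a href="/item/2">B</a> '
                    '<a href="/list?page=2">Next Page</a>',
    "/list?page=2": '<a href="/item/3">C</a>',
}

found = crawl_listing(
    fetch=FAKE_SITE.__getitem__,
    start_url="/list?page=1",
    item_regex=r'href="(/item/\d+)"',
    next_regex=r'href="([^"]+)">Next Page',
)
print(found)
```

<p>Under these assumptions, adapting the sketch to another site would mostly mean changing the two regular expressions and the <code>fetch</code> callable.</p>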
]]></content:encoded>
	</item>
	<item>
		<title>By: Lin</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-138473</link>
		<dc:creator>Lin</dc:creator>
		<pubDate>Thu, 29 Jan 2009 16:58:48 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-138473</guid>
		<description>By the way, this is the website I used. 

http://www.amazon.com/s/ref=nb_ss_etk_ce_av_?url=node%3D1065836%2C172630&amp;field-keywords=&amp;x=12&amp;y=20

Could you help me out? Thank you a lot!</description>
		<content:encoded><![CDATA[<p>By the way, this is the website I used. </p>
<p><a href="http://www.amazon.com/s/ref=nb_ss_etk_ce_av_?url=node%3D1065836%2C172630&amp;field-keywords=&amp;x=12&amp;y=20">http://www.amazon.com/s/ref=nb_ss_etk_ce_av_?url=node%3D1065836%2C172630&amp;field-keywords=&amp;x=12&amp;y=20</a></p>
<p>Could you help me out? Thank you a lot!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lin</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-138472</link>
		<dc:creator>Lin</dc:creator>
		<pubDate>Thu, 29 Jan 2009 16:45:55 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-138472</guid>
		<description>Hi Sig, 

The example website is not working anymore. I tried to use amazon.com, but the code does not seem to work properly. 

lin</description>
		<content:encoded><![CDATA[<p>Hi Sig, </p>
<p>The example website is not working anymore. I tried to use amazon.com, but the code does not seem to work properly. </p>
<p>lin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mehdi</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-133176</link>
		<dc:creator>mehdi</dc:creator>
		<pubDate>Tue, 16 Sep 2008 14:51:33 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-133176</guid>
		<description>I am looking for an SDK or DLL that converts speech to phonetics, the way MBROLA converts phonetics to speech. Thanks a lot.</description>
		<content:encoded><![CDATA[<p>I am looking for an SDK or DLL that converts speech to phonetics, the way MBROLA converts phonetics to speech. Thanks a lot.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-132136</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 25 Aug 2008 12:07:37 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-132136</guid>
		<description>Sorry: no PHP version is available unless you code one! :)</description>
		<content:encoded><![CDATA[<p>Sorry: no PHP version is available unless you code one! :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sunny</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-132122</link>
		<dc:creator>sunny</dc:creator>
		<pubDate>Mon, 25 Aug 2008 07:04:05 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-132122</guid>
		<description>Can I have the code for PHP?</description>
		<content:encoded><![CDATA[<p>Can I have the code for PHP?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe Elizondo</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-99131</link>
		<dc:creator>Joe Elizondo</dc:creator>
		<pubDate>Wed, 11 Jul 2007 20:46:29 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-99131</guid>
		<description>No problem. If you ever make a version of the parser that works with the newest BeautifulSoup release, let me know! I am still trying to tweak that in my spare time.</description>
		<content:encoded><![CDATA[<p>No problem. If you ever make a version of the parser that works with the newest BeautifulSoup release, let me know! I am still trying to tweak that in my spare time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-99080</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Wed, 11 Jul 2007 13:49:27 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-99080</guid>
		<description>Hey Joe, this is great! Thanks a lot for that update.</description>
		<content:encoded><![CDATA[<p>Hey Joe, this is great! Thanks a lot for that update.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-83469</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Fri, 16 Mar 2007 01:54:09 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-83469</guid>
		<description>Tweaked code to work with the new versions of the scripts. Working as of March 15, 2007:

&lt;code&gt;
from mechanize import Browser,LinkNotFoundError
from ClientCookie import BaseHandler
from StringIO import StringIO
# import tidy
#
import sys
import re
from time import gmtime, strftime
#
# The following two lines are specific to the site you want to crawl
# it provides some capabilities to your crawler for it to be able
# to understand the meaning of the data it is crawling ;
# as an example for knowing the age of the crawled resource
#
from datetime import date
# from my_parser import parsed_resource
#
&quot;&quot;&quot;
 Let&#039;s declare some customized pre-processors.
 These are useful when the HTML you are crawling through is not clean enough for mechanize.
 When you crawl through bad HTML, mechanize often raises errors.
 So either you tidy it with a strict tidy module (see TidyProcessor)
 or you tidy some errors you identified &quot;by hand&quot; (see MyProcessor).
 Note that because the tidy module is quite strict on HTML, it may change the whole
 structure of the page you are dealing with. As an example, in bad HTML, you may encounter
 nested forms or forms nested in tables or tables nested in forms. Tidying them may produce
 unintended results such as closing the form too early or making it empty. This is the reason
 you may have to use MyProcessor instead of TidyProcessor.
&quot;&quot;&quot;
#
class FakeResponse:
      def __init__(self, resp, nudata):
          self._resp = resp
          self._sio = StringIO(nudata)
#
      def __getattr__(self, name):
          try:
              return getattr(self._sio, name)
          except AttributeError:
              return getattr(self._resp, name)
#
class TidyProcessor(BaseHandler):
      def http_response(self, request, response):
          options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding=&#039;utf8&#039;,
                   input_encoding=&#039;latin1&#039;,
                   force_output=1
                   )
          r = tidy.parseString(response.read(), **options)
          return FakeResponse(response, str(r))
      https_response = http_response
#
class MyProcessor(BaseHandler):
      def http_response(self, request, response):
          r = response.read()
          r = r.replace(&#039;&quot;image&quot;&quot;&#039;,&#039;&quot;image&quot;&#039;)
          r = r.replace(&#039;&quot;&#039;,&#039;&quot;&#039;)
          return FakeResponse(response, r)
      https_response = http_response
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
#
&quot;&quot;&quot;
 Let&#039;s declare some utility methods that will enhance mechanize browsing capabilities
&quot;&quot;&quot;
#
def find(response,searchst):
    response.seek(0)
    lr = response.read()
    return re.search(searchst, lr, re.I)
#
def save_response(response,kw=&#039;file&#039;):
    &quot;&quot;&quot;Saves last response to timestamped file&quot;&quot;&quot;
    name = strftime(&quot;%Y%m%d%H%M%S_&quot;,gmtime())
    name = name + kw + &#039;.html&#039;
    f = open(&#039;./&#039;+name,&#039;w&#039;)
    response.seek(0)
    f.write(response.read())
    f.close()
    return &quot;Response saved as %s&quot; % name
#
&quot;&quot;&quot;
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.
&quot;&quot;&quot;
#
def dvspot_crawl():
    &quot;&quot;&quot;
     Here starts the browsing session.
     For every move, I could have put as a comment an equivalent PBP command line.
     PBP is a nice scripting layer on top of mechanize.
     But it does not allow looping or conditional browsing.
     So I preferred scripting directly with mechanize instead of using PBP
     and then adding an additional layer of scripting on top of it.
    &quot;&quot;&quot;
#
    MAX_NR_OF_ITEMS_PER_SESSION = 500
    #
    # Go to home page
    #
    b.open(&quot;http://www.dvspot.com/reviews/cameraList.php?listall=1&amp;start=0&quot;)
    #
    # Navigate through the paginated list of cameras
    #
    next_page = 0
    items_crawled = 0
    while next_page == 0:
     #
     # Display and save details of every listed item
     #
     url = b.geturl()
     next_element = 0
     while next_element &gt;= 0:
      try:
       response1 = b.follow_link(url_regex=re.compile(r&quot;cameraDetail&quot;), nr=next_element)
       next_element = next_element + 1
       items_crawled = items_crawled + 1
       print save_response(response1,&quot;dvspot_camera_&quot;+str(next_element))
       b.open(url)
       # if you crawled too many items, stop crawling
       # (count items across pages: next_page stays 0 while looping)
       if items_crawled &gt; MAX_NR_OF_ITEMS_PER_SESSION:
          next_element = -1
          next_page = -1
      except LinkNotFoundError:
       # You reached the last item in this page
       next_element = -1
    #
     try:
      b.open(url)
      response2 = b.follow_link(text_regex=re.compile(r&quot;Next Page&quot;), nr=0)
      print &quot;processing Next Page&quot;
     except LinkNotFoundError:
      # You reached the last page of the listing of items
      next_page = -1
    #
    return
#
#
#
if __name__ == &#039;__main__&#039;:
#
    &quot;&quot;&quot; Note that you may need to specify your proxy first.
    On windows, you do :
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080
    &quot;&quot;&quot;
    #
    dvspot_crawl()
&lt;/code&gt;</description>
		<content:encoded><![CDATA[<p>Tweaked code to work with the new versions of the scripts. Working as of March 15, 2007:</p>
<p><code><br />
from mechanize import Browser,LinkNotFoundError<br />
from ClientCookie import BaseHandler<br />
from StringIO import StringIO<br />
# import tidy<br />
#<br />
import sys<br />
import re<br />
from time import gmtime, strftime<br />
#<br />
# The following two lines are specific to the site you want to crawl<br />
# it provides some capabilities to your crawler for it to be able<br />
# to understand the meaning of the data it is crawling ;<br />
# as an example for knowing the age of the crawled resource<br />
#<br />
from datetime import date<br />
# from my_parser import parsed_resource<br />
#<br />
"""<br />
 Let's declare some customized pre-processors.<br />
 These are useful when the HTML you are crawling through is not clean enough for mechanize.<br />
 When you crawl through bad HTML, mechanize often raises errors.<br />
 So either you tidy it with a strict tidy module (see TidyProcessor)<br />
 or you tidy some errors you identified "by hand" (see MyProcessor).<br />
 Note that because the tidy module is quite strict on HTML, it may change the whole<br />
 structure of the page you are dealing with. As an example, in bad HTML, you may encounter<br />
 nested forms or forms nested in tables or tables nested in forms. Tidying them may produce<br />
 unintended results such as closing the form too early or making it empty. This is the reason<br />
 you may have to use MyProcessor instead of TidyProcessor.<br />
"""<br />
#<br />
class FakeResponse:<br />
      def __init__(self, resp, nudata):<br />
          self._resp = resp<br />
          self._sio = StringIO(nudata)<br />
#<br />
      def __getattr__(self, name):<br />
          try:<br />
              return getattr(self._sio, name)<br />
          except AttributeError:<br />
              return getattr(self._resp, name)<br />
#<br />
class TidyProcessor(BaseHandler):<br />
      def http_response(self, request, response):<br />
          options = dict(output_xhtml=1,<br />
                   add_xml_decl=1,<br />
                   indent=1,<br />
                   output_encoding='utf8',<br />
                   input_encoding='latin1',<br />
                   force_output=1<br />
                   )<br />
          r = tidy.parseString(response.read(), **options)<br />
          return FakeResponse(response, str(r))<br />
      https_response = http_response<br />
#<br />
class MyProcessor(BaseHandler):<br />
      def http_response(self, request, response):<br />
          r = response.read()<br />
          r = r.replace('"image""','"image"')<br />
          r = r.replace('"','"')<br />
          return FakeResponse(response, r)<br />
      https_response = http_response<br />
#<br />
# Open a browser and optionally choose a customized HTML pre-processor<br />
b = Browser()<br />
b.add_handler(MyProcessor())<br />
#<br />
"""<br />
 Let's declare some utility methods that will enhance mechanize browsing capabilities<br />
"""<br />
#<br />
def find(response,searchst):<br />
    response.seek(0)<br />
    lr = response.read()<br />
    return re.search(searchst, lr, re.I)<br />
#<br />
def save_response(response,kw='file'):<br />
    """Saves last response to timestamped file"""<br />
    name = strftime("%Y%m%d%H%M%S_",gmtime())<br />
    name = name + kw + '.html'<br />
    f = open('./'+name,'w')<br />
    response.seek(0)<br />
    f.write(response.read())<br />
    f.close()<br />
    return "Response saved as %s" % name<br />
#<br />
"""<br />
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.<br />
"""<br />
#<br />
def dvspot_crawl():<br />
    """<br />
     Here starts the browsing session.<br />
     For every move, I could have put as a comment an equivalent PBP command line.<br />
     PBP is a nice scripting layer on top of mechanize.<br />
     But it does not allow looping or conditional browsing.<br />
     So I preferred scripting directly with mechanize instead of using PBP<br />
     and then adding an additional layer of scripting on top of it.<br />
    """<br />
#<br />
    MAX_NR_OF_ITEMS_PER_SESSION = 500<br />
    #<br />
    # Go to home page<br />
    #<br />
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&amp;start=0")<br />
    #<br />
    # Navigate through the paginated list of cameras<br />
    #<br />
    next_page = 0<br />
    items_crawled = 0<br />
    while next_page == 0:<br />
     #<br />
     # Display and save details of every listed item<br />
     #<br />
     url = b.geturl()<br />
     next_element = 0<br />
     while next_element &gt;= 0:<br />
      try:<br />
       response1 = b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)<br />
       next_element = next_element + 1<br />
       items_crawled = items_crawled + 1<br />
       print save_response(response1,"dvspot_camera_"+str(next_element))<br />
       b.open(url)<br />
       # if you crawled too many items, stop crawling<br />
       # (count items across pages: next_page stays 0 while looping)<br />
       if items_crawled &gt; MAX_NR_OF_ITEMS_PER_SESSION:<br />
          next_element = -1<br />
          next_page = -1<br />
      except LinkNotFoundError:<br />
       # You reached the last item in this page<br />
       next_element = -1<br />
    #<br />
     try:<br />
      b.open(url)<br />
      response2 = b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)<br />
      print "processing Next Page"<br />
     except LinkNotFoundError:<br />
      # You reached the last page of the listing of items<br />
      next_page = -1<br />
    #<br />
    return<br />
#<br />
#<br />
#<br />
if __name__ == '__main__':<br />
#<br />
    """ Note that you may need to specify your proxy first.<br />
    On windows, you do :<br />
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080<br />
    """<br />
    #<br />
    dvspot_crawl()<br />
</code></p>
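<p>The <code>FakeResponse</code> class above works by delegation: reads are served from the rewritten body, and any other attribute falls through to the original response. Here is a minimal Python 3 sketch of the same pattern; <code>OriginalResponse</code> is a hypothetical stand-in for the response object an HTTP library would return, and the sample markup is made up.</p>

```python
from io import StringIO


class FakeResponse:
    """Serve a rewritten body while delegating everything else."""

    def __init__(self, resp, nudata):
        self._resp = resp
        self._sio = StringIO(nudata)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so _resp and
        # _sio themselves are found without recursion.
        try:
            return getattr(self._sio, name)
        except AttributeError:
            return getattr(self._resp, name)


class OriginalResponse:
    """Hypothetical stand-in for a real HTTP response object."""

    def __init__(self, body, url):
        self._body = body
        self._url = url

    def read(self):
        return self._body

    def geturl(self):
        return self._url


original = OriginalResponse('<input type="image"" name="go">', "http://example.com/")
# Fix one known markup error "by hand", as MyProcessor does.
cleaned = original.read().replace('"image""', '"image"')
wrapped = FakeResponse(original, cleaned)

body = wrapped.read()        # served by the StringIO copy
wrapped.seek(0)              # StringIO also makes the body rewindable
body_again = wrapped.read()
url = wrapped.geturl()       # not on StringIO, so delegated to the original
print(body, url)
```

<p>The rewindable body is what lets helpers such as <code>find()</code> and <code>save_response()</code> call <code>seek(0)</code> before reading.</p>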
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-83467</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Fri, 16 Mar 2007 01:06:22 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-83467</guid>
		<description>BaseProcessor seems to be causing problems in the code. Does it still exist? Is there a workaround?</description>
		<content:encoded><![CDATA[<p>BaseProcessor seems to be causing problems in the code. Does it still exist? Is there a workaround?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: AkaSig &#187; Blog Archive &#187; Web scraping, web mashing</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-82549</link>
		<dc:creator>AkaSig &#187; Blog Archive &#187; Web scraping, web mashing</dc:creator>
		<pubDate>Thu, 08 Mar 2007 11:51:34 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-82549</guid>
		<description>[...] 5 Ways to Mix, Rip, and Mash Your Data introduces promising web and desktop applications that extract structured data feeds from web sites and mix them together into something possibly useful to you. Think of things like getting filtered Monster job ads as a convenient RSS feed, along with job ads from your other favorite job sites. This reminds me of my Python hacks for automating web crawling and web scraping. Sometimes, I wish I could find time to work a bit further on that&#8230; [...]</description>
		<content:encoded><![CDATA[<p>[...] 5 Ways to Mix, Rip, and Mash Your Data introduces promising web and desktop applications that extract structured data feeds from web sites and mix them together into something possibly useful to you. Think of things like getting filtered Monster job ads as a convenient RSS feed, along with job ads from your other favorite job sites. This reminds me of my Python hacks for automating web crawling and web scraping. Sometimes, I wish I could find time to work a bit further on that&#8230; [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-40657</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 23 Aug 2005 07:01:25 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-40657</guid>
		<description>Alex, I think your SWEA is great. Your screencast is really impressive. I wish we had an equivalent capability to use Mozilla&#039;s Gecko engine from Python. People preferring GUIs to raw python scripts should definitely have a look at SWEA. Thanks for your feedback, Alex.</description>
		<content:encoded><![CDATA[<p>Alex, I think your SWEA is great. Your screencast is really impressive. I wish we had an equivalent capability to use Mozilla&#8217;s Gecko engine from Python. People preferring GUIs to raw python scripts should definitely have a look at SWEA. Thanks for your feedback, Alex.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Alex</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-40647</link>
		<dc:creator>Alex</dc:creator>
		<pubDate>Thu, 18 Aug 2005 17:26:50 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-40647</guid>
		<description>Take a look at SWExplorerAutomation (http://home.comcast.net/~furmana/SWIEAutomation.htm)(SWEA). SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from the Web pages and the expressions can be visually defined using SWEA designer. </description>
		<content:encoded><![CDATA[<p>Take a look at SWExplorerAutomation (<a href="http://home.comcast.net/~furmana/SWIEAutomation.htm">http://home.comcast.net/~furmana/SWIEAutomation.htm</a>) (SWEA). SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from the Web pages and the expressions can be visually defined using SWEA designer.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: agb</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-40625</link>
		<dc:creator>agb</dc:creator>
		<pubDate>Fri, 05 Aug 2005 15:04:37 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-40625</guid>
		<description>I think the reason for using mechanoid is that it seems to be more self-contained. Looking at the mechanize docs, installation requires a few more packages in addition to mechanize itself. So mechanoid seems easier. 

But this is the perspective of a guy that&#039;s researching session scraping techniques and has yet to use either package, so ymmv.</description>
		<content:encoded><![CDATA[<p>I think the reason for using mechanoid is that it seems to be more self-contained. Looking at the mechanize docs, installation requires a few more packages in addition to mechanize itself. So mechanoid seems easier. </p>
<p>But this is the perspective of a guy that&#8217;s researching session scraping techniques and has yet to use either package, so ymmv.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lethalman</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-24774</link>
		<dc:creator>Lethalman</dc:creator>
		<pubDate>Sun, 03 Apr 2005 12:35:13 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-24774</guid>
		<description>I&#039;m using mechanize too and I would like to try mechanoid, but I didn&#039;t find any examples of how to use it... Does someone know where to find very simple examples of using mechanoid?</description>
		<content:encoded><![CDATA[<p>I&#8217;m using mechanize too and I would like to try mechanoid, but I didn&#8217;t find any examples of how to use it&#8230; Does someone know where to find very simple examples of using mechanoid?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/comment-page-1/#comment-20735</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 21 Mar 2005 13:40:14 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-20735</guid>
		<description>Thank you Nick, it seems that the upgrade process of my weblog engine was not that smooth. I&#039;ll have to fix that... :-(</description>
		<content:encoded><![CDATA[<p>Thank you Nick, it seems that the upgrade process of my weblog engine was not that smooth. I&#8217;ll have to fix that&#8230; :-(</p>
]]></content:encoded>
	</item>
</channel>
</rss>
