<?xml version="1.0" encoding="ISO-8859-15"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Web scraping with python (part 1 : crawling)</title>
	<atom:link href="http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/</link>
	<description>Innovate, serve, be enterprising.</description>
	<pubDate>Thu, 04 Dec 2008 01:28:27 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
		<item>
		<title>By: mehdi</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-133176</link>
		<dc:creator>mehdi</dc:creator>
		<pubDate>Tue, 16 Sep 2008 14:51:33 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-133176</guid>
		<description>I want an SDK or DLL that converts speech to phonetics, the way MBROLA converts phonetics to speech. Thanks a lot.</description>
		<content:encoded><![CDATA[<p>I want an SDK or DLL that converts speech to phonetics, the way MBROLA converts phonetics to speech. Thanks a lot.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-132136</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 25 Aug 2008 12:07:37 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-132136</guid>
		<description>Sorry: no PHP version is available unless you code one! :)</description>
		<content:encoded><![CDATA[<p>Sorry: no PHP version is available unless you code one! :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sunny</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-132122</link>
		<dc:creator>sunny</dc:creator>
		<pubDate>Mon, 25 Aug 2008 07:04:05 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-132122</guid>
		<description>Can I have the code for PHP?</description>
		<content:encoded><![CDATA[<p>Can I have the code for PHP?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe Elizondo</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-99131</link>
		<dc:creator>Joe Elizondo</dc:creator>
		<pubDate>Wed, 11 Jul 2007 20:46:29 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-99131</guid>
		<description>No problem. If you ever make a version of the parser that works with the newest Beautiful Soup release, let me know! I'm still trying to tweak that in my spare time.</description>
		<content:encoded><![CDATA[<p>No problem. If you ever make a version of the parser that works with the newest Beautiful Soup release, let me know! I'm still trying to tweak that in my spare time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-99080</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Wed, 11 Jul 2007 13:49:27 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-99080</guid>
		<description>Hey Joe, this is great! Thanks a lot for that update.</description>
		<content:encoded><![CDATA[<p>Hey Joe, this is great! Thanks a lot for that update.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-83469</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Fri, 16 Mar 2007 01:54:09 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-83469</guid>
		<description>Tweaked the code to work with the new versions of the scripts. Working as of March 15, 2007:

&lt;code&gt;
from mechanize import Browser,LinkNotFoundError
from ClientCookie import BaseHandler
from StringIO import StringIO
# import tidy
#
import sys
import re
from time import gmtime, strftime
#
# The following two lines are specific to the site you want to crawl.
# They give your crawler the ability to understand the meaning of
# the data it is crawling, for example to know the age of the
# crawled resource.
#
from datetime import date
# from my_parser import parsed_resource
#
"""
 Let's declare some customized pre-processors.
 These are useful when the HTML you are crawling through is not clean enough for mechanize.
 When you crawl through bad HTML, mechanize often raises errors.
 So either you tidy it with a strict tidy module (see TidyProcessor)
 or you tidy some errors you identified "by hand" (see MyProcessor).
 Note that because the tidy module is quite strict on HTML, it may change the whole
 structure of the page you are dealing with. As an example, in bad HTML, you may encounter
 nested forms or forms nested in tables or tables nested in forms. Tidying them may produce
 unintended results such as closing the form too early or making it empty. This is the reason
 you may have to use MyProcessor instead of TidyProcessor.
"""
#
class FakeResponse:
      def __init__(self, resp, nudata):
          self._resp = resp
          self._sio = StringIO(nudata)
#
      def __getattr__(self, name):
          try:
              return getattr(self._sio, name)
          except AttributeError:
              return getattr(self._resp, name)
#
class TidyProcessor(BaseHandler):
      def http_response(self, request, response):
          options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding='latin1',
                   force_output=1
                   )
          r = tidy.parseString(response.read(), **options)
          return FakeResponse(response, str(r))
      https_response = http_response
#
class MyProcessor(BaseHandler):
      def http_response(self, request, response):
          r = response.read()
          r = r.replace('"image""','"image"')
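          # NOTE: as posted, the next replace is a no-op; its search string
          # was probably altered by the blog engine's handling of quotes.
          r = r.replace('"','"')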
          return FakeResponse(response, r)
      https_response = http_response
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
#
"""
 Let's declare some utility methods that will enhance mechanize browsing capabilities
"""
#
def find(response,searchst):
    response.seek(0)
    lr = response.read()
    return re.search(searchst, lr, re.I)
#
def save_response(response,kw='file'):
    """Saves last response to timestamped file"""
    name = strftime("%Y%m%d%H%M%S_",gmtime())
    name = name + kw + '.html'
    f = open('./'+name,'w')
    response.seek(0)
    f.write(response.read())
    f.close()
    return "Response saved as %s" % name
#
"""
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.
"""
#
def dvspot_crawl():
    """
     Here starts the browsing session.
     For every move, I could have put as a comment an equivalent PBP command line.
     PBP is a nice scripting layer on top of mechanize.
     But it does not allow looping or conditional browsing.
     So I preferred scripting directly with mechanize instead of using PBP
     and then adding an additional layer of scripting on top of it.
    """
#
    MAX_NR_OF_ITEMS_PER_SESSION = 500
    #
    # Go to home page
    #
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&#38;start=0")
    #
    # Navigate through the paginated list of cameras
    #
    next_page = 0
    items_crawled = 0
    while next_page == 0:
     #
     # Display and save details of every listed item
     #
     url = b.geturl()
     next_element = 0
     while next_element &#62;= 0:
      try:
       response1 = b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
       next_element = next_element + 1
       items_crawled = items_crawled + 1
       print save_response(response1,"dvspot_camera_"+str(next_element))
       b.open(url)
       # if you crawled too many items, stop crawling
       # (count items across pages: next_element*next_page was always 0 here)
       if items_crawled &#62; MAX_NR_OF_ITEMS_PER_SESSION:
          next_element = -1
          next_page = -1
      except LinkNotFoundError:
       # You reached the last item in this page
       next_element = -1
    #
     try:
      b.open(url)
      response2 = b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
      print "processing Next Page"
     except LinkNotFoundError:
      # You reached the last page of the listing of items
      next_page = -1
    #
    return
#
#
#
if __name__ == '__main__':
#
    """ Note that you may need to specify your proxy first.
    On windows, you do :
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080
    """
    #
    dvspot_crawl()
&lt;/code&gt;</description>
		<content:encoded><![CDATA[<p>Tweaked the code to work with the new versions of the scripts. Working as of March 15, 2007:</p>
<p><code><br />
from mechanize import Browser,LinkNotFoundError<br />
from ClientCookie import BaseHandler<br />
from StringIO import StringIO<br />
# import tidy<br />
#<br />
import sys<br />
import re<br />
from time import gmtime, strftime<br />
#<br />
# The following two lines are specific to the site you want to crawl.<br />
# They give your crawler the ability to understand the meaning of<br />
# the data it is crawling, for example to know the age of the<br />
# crawled resource.<br />
#<br />
from datetime import date<br />
# from my_parser import parsed_resource<br />
#<br />
"""<br />
 Let's declare some customized pre-processors.<br />
 These are useful when the HTML you are crawling through is not clean enough for mechanize.<br />
 When you crawl through bad HTML, mechanize often raises errors.<br />
 So either you tidy it with a strict tidy module (see TidyProcessor)<br />
 or you tidy some errors you identified "by hand" (see MyProcessor).<br />
 Note that because the tidy module is quite strict on HTML, it may change the whole<br />
 structure of the page you are dealing with. As an example, in bad HTML, you may encounter<br />
 nested forms or forms nested in tables or tables nested in forms. Tidying them may produce<br />
 unintended results such as closing the form too early or making it empty. This is the reason<br />
 you may have to use MyProcessor instead of TidyProcessor.<br />
"""<br />
#<br />
class FakeResponse:<br />
      def __init__(self, resp, nudata):<br />
          self._resp = resp<br />
          self._sio = StringIO(nudata)<br />
#<br />
      def __getattr__(self, name):<br />
          try:<br />
              return getattr(self._sio, name)<br />
          except AttributeError:<br />
              return getattr(self._resp, name)<br />
#<br />
class TidyProcessor(BaseHandler):<br />
      def http_response(self, request, response):<br />
          options = dict(output_xhtml=1,<br />
                   add_xml_decl=1,<br />
                   indent=1,<br />
                   output_encoding='utf8',<br />
                   input_encoding='latin1',<br />
                   force_output=1<br />
                   )<br />
          r = tidy.parseString(response.read(), **options)<br />
          return FakeResponse(response, str(r))<br />
      https_response = http_response<br />
#<br />
class MyProcessor(BaseHandler):<br />
      def http_response(self, request, response):<br />
          r = response.read()<br />
          r = r.replace('"image""','"image"')<br />
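          # NOTE: as posted, the next replace is a no-op; its search string<br />
          # was probably altered by the blog engine's handling of quotes.<br />
          r = r.replace('"','"')<br />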
          return FakeResponse(response, r)<br />
      https_response = http_response<br />
#<br />
# Open a browser and optionally choose a customized HTML pre-processor<br />
b = Browser()<br />
b.add_handler(MyProcessor())<br />
#<br />
"""<br />
 Let's declare some utility methods that will enhance mechanize browsing capabilities<br />
"""<br />
#<br />
def find(response,searchst):<br />
    response.seek(0)<br />
    lr = response.read()<br />
    return re.search(searchst, lr, re.I)<br />
#<br />
def save_response(response,kw='file'):<br />
    """Saves last response to timestamped file"""<br />
    name = strftime("%Y%m%d%H%M%S_",gmtime())<br />
    name = name + kw + '.html'<br />
    f = open('./'+name,'w')<br />
    response.seek(0)<br />
    f.write(response.read())<br />
    f.close()<br />
    return "Response saved as %s" % name<br />
#<br />
"""<br />
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.<br />
"""<br />
#<br />
def dvspot_crawl():<br />
    """<br />
     Here starts the browsing session.<br />
     For every move, I could have put as a comment an equivalent PBP command line.<br />
     PBP is a nice scripting layer on top of mechanize.<br />
     But it does not allow looping or conditional browsing.<br />
     So I preferred scripting directly with mechanize instead of using PBP<br />
     and then adding an additional layer of scripting on top of it.<br />
    """<br />
#<br />
    MAX_NR_OF_ITEMS_PER_SESSION = 500<br />
    #<br />
    # Go to home page<br />
    #<br />
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&amp;start=0")<br />
    #<br />
    # Navigate through the paginated list of cameras<br />
    #<br />
    next_page = 0<br />
    items_crawled = 0<br />
    while next_page == 0:<br />
     #<br />
     # Display and save details of every listed item<br />
     #<br />
     url = b.geturl()<br />
     next_element = 0<br />
     while next_element &gt;= 0:<br />
      try:<br />
       response1 = b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)<br />
       next_element = next_element + 1<br />
       items_crawled = items_crawled + 1<br />
       print save_response(response1,"dvspot_camera_"+str(next_element))<br />
       b.open(url)<br />
       # if you crawled too many items, stop crawling<br />
       # (count items across pages: next_element*next_page was always 0 here)<br />
       if items_crawled &gt; MAX_NR_OF_ITEMS_PER_SESSION:<br />
          next_element = -1<br />
          next_page = -1<br />
      except LinkNotFoundError:<br />
       # You reached the last item in this page<br />
       next_element = -1<br />
    #<br />
     try:<br />
      b.open(url)<br />
      response2 = b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)<br />
      print "processing Next Page"<br />
     except LinkNotFoundError:<br />
      # You reached the last page of the listing of items<br />
      next_page = -1<br />
    #<br />
    return<br />
#<br />
#<br />
#<br />
if __name__ == '__main__':<br />
#<br />
    """ Note that you may need to specify your proxy first.<br />
    On windows, you do :<br />
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080<br />
    """<br />
    #<br />
    dvspot_crawl()<br />
</code></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-83467</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Fri, 16 Mar 2007 01:06:22 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-83467</guid>
		<description>BaseProcessor seems to be causing problems in the code. Does it still exist? Is there a workaround?</description>
		<content:encoded><![CDATA[<p>BaseProcessor seems to be causing problems in the code. Does it still exist? Is there a workaround?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: AkaSig &#187; Blog Archive &#187; Web scraping, web mashing</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-82549</link>
		<dc:creator>AkaSig &#187; Blog Archive &#187; Web scraping, web mashing</dc:creator>
		<pubDate>Thu, 08 Mar 2007 11:51:34 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-82549</guid>
		<description>[...] 5 Ways to Mix, Rip, and Mash Your Data introduces promising web and desktop applications that extract structured data feeds from web sites and mix them together into something possibly useful to you. Think of things like getting filtered Monster job ads as a convenient RSS feed, along with job ads from your other favorite job sites. This reminds me of my Python hacks for automating web crawling and web scraping. Sometimes, I wish I could find time for working a bit further on that&#8230; [...]</description>
		<content:encoded><![CDATA[<p>[...] 5 Ways to Mix, Rip, and Mash Your Data introduces promising web and desktop applications that extract structured data feeds from web sites and mix them together into something possibly useful to you. Think of things like getting filtered Monster job ads as a convenient RSS feed, along with job ads from your other favorite job sites. This reminds me of my Python hacks for automating web crawling and web scraping. Sometimes, I wish I could find time for working a bit further on that&#8230; [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-40657</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Tue, 23 Aug 2005 07:01:25 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-40657</guid>
		<description>Alex, I think your SWEA is great. Your screencast is really impressive. I wish we had an equivalent capability to use Mozilla's Gecko engine from Python. People preferring GUIs to raw python scripts should definitely have a look at SWEA. Thanks for your feedback, Alex.</description>
		<content:encoded><![CDATA[<p>Alex, I think your SWEA is great. Your screencast is really impressive. I wish we had an equivalent capability to use Mozilla&#8217;s Gecko engine from Python. People preferring GUIs to raw python scripts should definitely have a look at SWEA. Thanks for your feedback, Alex.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Alex</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-40647</link>
		<dc:creator>Alex</dc:creator>
		<pubDate>Thu, 18 Aug 2005 17:26:50 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-40647</guid>
		<description>Take a look at SWExplorerAutomation (http://home.comcast.net/~furmana/SWIEAutomation.htm)(SWEA). SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from the Web pages and the expressions can be visually defined using SWEA designer. </description>
		<content:encoded><![CDATA[<p>Take a look at SWExplorerAutomation (http://home.comcast.net/~furmana/SWIEAutomation.htm)(SWEA). SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from the Web pages and the expressions can be visually defined using SWEA designer.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: agb</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-40625</link>
		<dc:creator>agb</dc:creator>
		<pubDate>Fri, 05 Aug 2005 15:04:37 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-40625</guid>
		<description>I think the reason for using mechanoid is that it seems to be more self-contained.  Looking at the mechanize docs, installation requires a few more packages in addition to mechanize. So, mechanoid seems easier. 

But this is the perspective of a guy who's researching session scraping techniques and has yet to use either package, so ymmv.</description>
		<content:encoded><![CDATA[<p>I think the reason for using mechanoid is that it seems to be more self-contained.  Looking at the mechanize docs, installation requires a few more packages in addition to mechanize. So, mechanoid seems easier. </p>
<p>But this is the perspective of a guy who&#8217;s researching session scraping techniques and has yet to use either package, so ymmv.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lethalman</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-24774</link>
		<dc:creator>Lethalman</dc:creator>
		<pubDate>Sun, 03 Apr 2005 12:35:13 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-24774</guid>
		<description>I'm using mechanize too and I would like to try mechanoid, but I didn't find examples of how to use it... Does anyone know where to find very simple examples of using mechanoid?</description>
		<content:encoded><![CDATA[<p>I&#8217;m using mechanize too and I would like to try mechanoid, but I didn&#8217;t find examples of how to use it&#8230; Does anyone know where to find very simple examples of using mechanoid?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-20735</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Mon, 21 Mar 2005 13:40:14 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-20735</guid>
		<description>Thank you Nick, it seems that the upgrade process of my weblog engine was not that smooth. I'll have to fix that... :-(</description>
		<content:encoded><![CDATA[<p>Thank you Nick, it seems that the upgrade process of my weblog engine was not that smooth. I&#8217;ll have to fix that&#8230; :-(</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nick</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-20459</link>
		<dc:creator>Nick</dc:creator>
		<pubDate>Sun, 20 Mar 2005 00:18:18 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-20459</guid>
		<description>The source code is messed up a bit by the blog engine. All quote symbols were hidden by an oblique stroke. </description>
		<content:encoded><![CDATA[<p>The source code is messed up a bit by the blog engine. All quote symbols were hidden by an oblique stroke.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sig</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-19564</link>
		<dc:creator>Sig</dc:creator>
		<pubDate>Wed, 16 Mar 2005 09:08:24 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-19564</guid>
		<description>Thank you Phil and Roman for your tips.

Phil, if you ever remember why you thought that Mechanoid was better than Mechanize, please tell us!

-- Sig</description>
		<content:encoded><![CDATA[<p>Thank you Phil and Roman for your tips.</p>
<p>Phil, if you ever remember why you thought that Mechanoid was better than Mechanize, please tell us!</p>
<p>&#8211; Sig</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Phil</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-19563</link>
		<dc:creator>Phil</dc:creator>
		<pubDate>Wed, 16 Mar 2005 05:12:31 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-19563</guid>
		<description>I've used Mechanoid, a fork of Mechanize, with success. (I can't remember why I chose it over Mechanize, but just thought it was worth pointing out.)

http://mechanoid.sourceforge.net/

--Phil.</description>
		<content:encoded><![CDATA[<p>I&#8217;ve used Mechanoid, a fork of Mechanize, with success. (I can&#8217;t remember why I chose it over Mechanize, but just thought it was worth pointing out.)</p>
<p><a href="http://mechanoid.sourceforge.net/">http://mechanoid.sourceforge.net/</a></p>
<p>&#8211;Phil.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roman</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-19562</link>
		<dc:creator>Roman</dc:creator>
		<pubDate>Tue, 15 Mar 2005 23:03:55 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-19562</guid>
		<description>
You might get some additional ideas from this commercial product:

Lixto Visual Wrapper 

This component is responsible for locating and extracting desired data from the Web. While working with Lixto Visual Wrapper, an operator generates a wrapper program which converts relevant information from HTML to XML. Lixto Visual Wrapper generates a so-called XML companion for given Web pages.


http://www.lixto.com/show.php?page=vw_architecture&#38;lg=EN

Cheers,
Roman</description>
		<content:encoded><![CDATA[<p>You might get some additional ideas from this commercial product:</p>
<p>Lixto Visual Wrapper </p>
<p>This component is responsible for locating and extracting desired data from the Web. While working with Lixto Visual Wrapper, an operator generates a wrapper program which converts relevant information from HTML to XML. Lixto Visual Wrapper generates a so-called XML companion for given Web pages.</p>
<p><a href="http://www.lixto.com/show.php?page=vw_architecture&amp;lg=EN">http://www.lixto.com/show.php?page=vw_architecture&amp;lg=EN</a></p>
<p>Cheers,<br />
Roman</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: AkaSig  &#187; Blog Archive   &#187; Web scraping with Python (part II)</title>
		<link>http://www.akasig.org/2004/12/29/web-scraping-with-python-part-1-crawling/#comment-18942</link>
		<dc:creator>AkaSig  &#187; Blog Archive   &#187; Web scraping with Python (part II)</dc:creator>
		<pubDate>Fri, 11 Mar 2005 10:31:44 +0000</pubDate>
		<guid isPermaLink="false">http://sig.levillage.org/?p=588#comment-18942</guid>
		<description>[...] Web scraping with Python (part II) The first part of this article dealt with retrieving HTML pages from the web with the help of a mechanize-propell [...]</description>
		<content:encoded><![CDATA[<p>[...] Web scraping with Python (part II) The first part of this article dealt with retrieving HTML pages from the web with the help of a mechanize-propell [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
