Archive for the ‘Architecture’ Category
Neutral to positive on Ecartype?
Thursday, March 31st, 2005
Ecartype is an original kind of weblog: it is a weblog of stock market advice. One more entry for the long list of innovative uses of weblogs. I am neutral to positive on Ecartype as long as they stay within their bullish triangle in terms of innovative uses. And, while I am busy discovering abstruse jargon: beware of the pullback at the neckline, it can give you a stiff neck. When will we see an Ecartype that gives stock-style advice about the market of open source communities (“I am neutral to positive on Drupal”, “Plone is in a long bullish triangle”, etc.)?
SOA, you won’t get me!
Tuesday, March 29th, 2005
I’m not the one saying it:
SOA: new acronym, old problem.
The CMS pseudo-stock market
Wednesday, March 23rd, 2005
The Drupal people produced insightful stock-market-like statistics about the popularity of open source CMS packages (via the precious Amphi-Gouri). But their analysis mixes content management systems (Drupal, Plone) with blog engines (Wordpress) and bulletin boards (phpBB). Anyway, it shows that:
- “The popularity of most Free and Open Source CMS tools is in an upward trend.”
- Bulletin boards are the most popular category, and maybe the most mature one; phpBB is the strong leader in this category
- In the CMS category, Mambo, Xoops, Drupal and Plone are direct competitors; Mambo is ahead in terms of popularity; Plone is behind its PHP competitors, which certainly benefit from the popularity of PHP compared to Python; PHP-Nuke and PostNuke are quickly losing ground.
- Wordpress is the most dynamic open source blog engine in terms of growth of popularity; its community is exploding
My conclusion:
- if you want an open source bulletin board/community forum, then choose phpBB without hesitation
- if you want a real content management system and are not religiously opposed to Python, then choose Plone, else stick with PHP and go Mambo (or Xoops?)
- if you want an open source blog engine, then enjoy Wordpress
I feel that producing this kind of statistical analysis about the dynamics of open source communities is extremely valuable for organizations and people considering several open source options (cf. the activity percentile indicated on sourceforge projects as an example). I would tend to say that the strength of an open source community, measured in terms of growth and size, is the single most important criterion to rely on when choosing an open source product.
Nowadays, the (real) stock market relies strongly on rating agencies. There must be room (and thus a business opportunity) for an open source rating agency that would produce strong evidence about the relative strength of project communities.
What do you think?
Web scraping with Python (part II)
Friday, March 11th, 2005
The first part of this article dealt with retrieving HTML pages from the web with the help of a mechanize-propelled web crawler. Now your HTML pieces are safely saved locally on your hard drive and you want to extract structured data from them. This is part 2: HTML parsing with Python. For this task, I adopted a slightly more imaginative approach than for my crawling hacks. I designed a data extraction technology based on HTML templates. Maybe this could be called “reverse-templating” (or something like template-based reverse-web-engineering).
You may be used to HTML templates for producing HTML pages. An HTML template plus structured data can be transformed into a set of HTML pages with the help of a proper templating engine. One famous technology for HTML templating is called Zope Page Templates (because this kind of template is used within the Zope application server). ZPTs use a special set of additional HTML tags and attributes referred to by the “tal:” namespace. One advantage of ZPT (over competing technologies) is that ZPTs are nicely rendered in WYSIWYG HTML editors. Thus web designers produce HTML mockups of the screens to be generated by the application. Web developers insert tal: attributes into these HTML mockups so that the templating engine knows which parts of the HTML template have to be replaced by which pieces of data (usually pumped from a database). As an example, web designers will write <title>Camcorder XYZ</title>, then web developers will modify this into <title tal:content="camcorder_name">Camcorder XYZ</title>, and the templating engine will produce <title>Camcorder Canon MV6iMC</title> when it processes the “MV6iMC” record in your database (it replaces the content of the title element with the value of the camcorder_name variable as it is retrieved from the current database record). This technology is used to merge structured data with HTML templates in order to produce Web pages.
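To make the forward direction concrete, here is a minimal sketch of what a templating engine does with a tal:content attribute. It is not the real ZPT engine: the `render` function and its regular expression are assumptions made for illustration, it is written for Python 3, and it only handles the simple, non-nested one-element case described above.

```python
import re

# Matches <tag ... tal:content="variable" ...>placeholder</tag>
# for simple, non-nested elements only (an illustrative simplification).
TAL_CONTENT = re.compile(
    r'<(?P<tag>\w+)(?P<before>[^>]*?)\s+tal:content="(?P<var>\w+)"(?P<after>[^>]*)>'
    r'(?P<placeholder>.*?)</(?P=tag)>',
    re.S,
)

def render(template, record):
    """Replace the placeholder content of each tal:content element with record data."""
    def substitute(match):
        value = record[match.group("var")]
        attrs = match.group("before") + match.group("after")
        tag = match.group("tag")
        return "<%s%s>%s</%s>" % (tag, attrs, value, tag)
    return TAL_CONTENT.sub(substitute, template)

template = '<title tal:content="camcorder_name">Camcorder XYZ</title>'
record = {"camcorder_name": "Camcorder Canon MV6iMC"}
html = render(template, record)
print(html)
```

The mockup text (“Camcorder XYZ”) is simply thrown away and replaced by the value taken from the record, which is all that tal:content asks for in this simple case.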
I took inspiration from this technology to design parsing templates. The idea here is to reverse the use of HTML templates. In the parsing context, HTML templates are still produced by web developers, but the templating engine is replaced by a parsing engine (known as web_parser.py, see below for the code of this engine). This engine takes HTML pages (the ones you previously crawled and retrieved) plus ZPT-like HTML templates as input. It then outputs structured data. First your crawler saved <title>Camcorder Canon MV6iMC</title>. Then you wrote <title tal:content="camcorder_name">Camcorder XYZ</title> into a template file. Eventually the engine will output camcorder_name = "Camcorder Canon MV6iMC".
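The reversal can be sketched in a few lines. The code below is only an illustration of the idea under strong assumptions: it uses a regular expression and Python 3 rather than BeautifulSoup, it handles a single non-nested element, and `template_to_regex` and `extract` are hypothetical names that do not come from web_parser.py.

```python
import re

def template_to_regex(template):
    """Turn a one-element template into a regex that captures the element content."""
    match = re.search(
        r'<(?P<tag>\w+)\s+tal:content="(?P<var>\w+)"\s*>.*?</(?P=tag)>',
        template,
        re.S,
    )
    tag, var = match.group("tag"), match.group("var")
    # The placeholder text of the template is ignored:
    # only the tag name and the variable name matter here.
    return re.compile(r"<%s[^>]*>(?P<%s>.*?)</%s>" % (tag, var, tag), re.S)

def extract(template, page):
    """Return a dict mapping variable names to the values found in the page."""
    found = template_to_regex(template).search(page)
    return found.groupdict() if found else {}

template = '<title tal:content="camcorder_name">Camcorder XYZ</title>'
page = "<html><head><title>Camcorder Canon MV6iMC</title></head><body></body></html>"
data = extract(template, page)
print(data)
```

The same template that would have produced the page is used here to read the value back out of it, which is the whole point of reverse-templating.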
In order to trigger the engine, you just have to write a small launch script that defines several setup variables such as:
- the URL of your template file,
- the list of URLs of the HTML files to be parsed,
- whether or not you want to pre-process these files with an HTML tidying library (this is useful when the engine complains about badly formed HTML),
- an arbitrary keyword defining the domain of your parsing operation (it may be the name of the web site your HTML files come from),
- the charset these HTML files are encoded with (no automatic detection at the moment, sorry…),
- the output format (csv-like file or semantic web document),
- an optional separator character or string if you chose the csv-like output format
The easiest way to go is to copy and modify my example launch script (parser_dvspot.py) included in the ZIP distribution of this web_parser.
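As a rough idea of what such a launch script has to define, here is a sketch of the setup variables gathered in a dictionary, with a small sanity check. Every name and value below (`template_url`, `page_urls`, the file names other than template_dvspot.zpt, and so on) is hypothetical; the real variable names are the ones found in parser_dvspot.py.

```python
# Hypothetical setup variables; the real names live in parser_dvspot.py.
settings = {
    "template_url": "file:template_dvspot.zpt",
    "page_urls": ["file:dvspot_camera_1.html", "file:dvspot_camera_2.html"],
    "tidy_html": False,        # pre-process pages with an HTML tidying library?
    "domain": "dvspot",        # arbitrary keyword naming the parsing domain
    "charset": "latin1",       # no automatic detection, so it must be given
    "output_format": "csv",    # "csv" or "owl"
    "separator": "\t",         # only used for the csv-like output
}

def check_settings(settings):
    """Return a list of problems found in the setup variables (empty if none)."""
    problems = []
    if settings["output_format"] not in ("csv", "owl"):
        problems.append("output_format must be 'csv' or 'owl'")
    if settings["output_format"] == "csv" and not settings.get("separator"):
        problems.append("a separator is required for csv output")
    if not settings["page_urls"]:
        problems.append("page_urls must list at least one HTML file")
    return problems

print(check_settings(settings))
```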
Let’s summarize the main steps to go through:
- install utidylib into your python installation
- copy and save my modified version of BeautifulSoup into your python libraries directory (usually …/Lib/site-packages)
- copy and save my engine (web_parser.py) into your local directory or into your python libraries directory
- choose a set of HTML files on your hard drive or directly on a web site,
- save one of these files as your template,
- edit this template file and insert the required pseudotal attributes (see below for pseudotal instructions, and see the example dvspot template template_dvspot.zpt),
- copy and edit my example launch script so that you define the proper setup variables in it (the example parser_dvspot.py contains more detailed instructions than above), save it as my_script.py
- launch your script with python my_script.py > output_file.owl (or python my_script.py > output_file.csv)
- enjoy yourself and your fresh output_file.owl or output_file.csv (import it into Excel)
- give me some feedback about your reverse-templating experience (preferably as a comment on this blog)
This is just my first attempt at building such an engine and I don’t want to create confusion between real (and mature) tal attributes and my pseudo-tal instructions. So I adopted pseudotal as my main namespace. At some point in the future, when the specification of these reverse-templating instructions is somewhat more stable (and if the “tal” guys agree), I might adopt tal as the namespace. Please also note that the engine is somewhat badly written: the code and its internals are rather clumsy. There is much room for future improvement and refactoring.
The current version of this reverse-templating engine supports the following template attributes/instructions (see the source code for further updates and documentation):
- pseudotal:content gives the name of the variable that will contain the content of the current HTML element
- pseudotal:replace gives the name of the variable that will contain the entire current HTML element
- (NOT SUPPORTED YET) pseudotal:attrs gives the name of the variable that will contain the (specified?) attribute(s?) of the current HTML element
- pseudotal:condition is a list of arguments; it gives the condition(s) that must be satisfied for the parser to be sure that the current HTML element is the one being looked for. This condition is built as a list modeled on BeautifulSoup fetch arguments: a python dictionary giving detailed conditions on the HTML attributes of the current HTML element, some content to be found in the current HTML element, and the scope of the search for the current HTML element (recursive search or not)
- pseudotal:from_anchor gives the name of the pseudotal:anchor that is used to build the relative path that leads to the current HTML element; when no from_anchor is specified, the path used to locate the current HTML element is calculated from the root of the HTML file
- pseudotal:anchor specifies a name for the current HTML element; this element can be used by a pseudotal:from_anchor attribute as the starting point for building the path to the element that carries this pseudotal:from_anchor; usually used in conjunction with a pseudotal:condition; the default anchor is the root of the HTML file.
- pseudotal:option describes some optional behavior of the HTML parser; it is a list of constants; it contains NOTMANDATORY if the parser should not raise an error when the current element is not found (it does by default); it contains FULL_CONTENT when the data being looked for is the whole content of the current HTML element (the default is the last part of the content of the current HTML element, i.e. either the last HTML tags or the last string included in the current element)
- pseudotal:is_id_part: a special ‘id’ variable is automatically built for every parsed resource; this id variable is made of several parts that are concatenated; pseudotal:is_id_part gives the index at which the current variable will be used when building the id of the current resource; usually used in conjunction with pseudotal:content, pseudotal:replace or pseudotal:attrs
- (NOT SUPPORTED YET) pseudotal:repeat specifies the scope of the HTML tree that describes ONE resource (useful when several resources are described in one given HTML file, such as in a list of items); the value of this attribute gives the name of a class that will instantiate the parsed resource scope, plus the name of a list containing all the parsed resources
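The id-building rule attached to pseudotal:is_id_part can be pictured with a small sketch. This is an assumption-laden illustration, not code taken from web_parser.py: the `build_id` helper, the dictionary that maps each index to a parsed value, and the underscore separator are all hypothetical.

```python
def build_id(id_parts, separator="_"):
    """Concatenate id parts in index order; id_parts maps an index to a parsed value."""
    ordered = [str(id_parts[index]) for index in sorted(id_parts)]
    return separator.join(ordered)

# Values as they might be captured by template attributes carrying is_id_part;
# the indexes decide the order, whatever order the values were parsed in.
parts = {1: "MV6iMC", 0: "Canon"}
resource_id = build_id(parts)
print(resource_id)
```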
The current version of the engine can output structured data either as a CSV-like output (tab-delimited for example) or as an RDF/OWL document (of Semantic-Web fame). Both formats can easily be imported and further processed with Excel. The RDF/OWL format gives you the ability to process it with all the powerful tools that are emerging from the Semantic Web effort. If you feel adventurous, you may thus import your RDF/OWL file into Stanford’s Protege semantic modeling tool (or into Eclipse with its SWEDE plugin) and further process your data with the help of a SWRL rules-based inference engine. The future Semantic Web Rules Language will help with further processing of this output, so that you can powerfully compare RDF data coming from distinct sources (web sites). In order to be more productive in terms of fancy buzz-words, let’s say that this reverse-templating technology is some sort of web semantizer. It produces semantically-rich data out of flat web pages.
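For the CSV-like side, here is a minimal sketch of how parsed resources could be written out as tab-delimited text with the standard csv module (Python 3). The `to_delimited` function and the sample resource are assumptions for illustration; this is not the output code of web_parser.py.

```python
import csv
import io

def to_delimited(resources, separator="\t"):
    """Serialize a list of dicts as delimited text with a header row."""
    # Use the union of all keys so that resources with missing fields still fit.
    fieldnames = sorted({key for resource in resources for key in resource})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            delimiter=separator, lineterminator="\n")
    writer.writeheader()
    writer.writerows(resources)
    return buffer.getvalue()

# One hypothetical parsed resource.
resources = [
    {"id": "Canon_MV6iMC", "camcorder_name": "Camcorder Canon MV6iMC"},
]
text = to_delimited(resources)
print(text)
```

A file written this way opens directly in a spreadsheet, with one row per parsed resource and one column per template variable.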
The current version of the engine makes extensive use of BeautifulSoup. Maybe it should have been based on a more XMLish approach instead (using XML paths?). But that would have implied turning the HTML templates and the HTML files to be processed into XHTML. The problem is that I would then have relied on utidylib, and this library breaks some malformed HTML pages so badly that they are no longer usable.
Current known limitation: there is currently no way to properly handle some situations where you need to tell two similar anchors apart. In some cases, two HTML elements that you want to use as distinct anchors have in fact exactly the same attributes and content. This is not a problem as long as these two anchors are always positioned at the same place in all the HTML pages that you will parse. But, as soon as one of the anchors is not mandatory or is located after a non-mandatory element, the engine can get lost and either confuse the two anchors or complain that one is missing. At the moment, I don’t know how to handle this kind of situation. Example: long lists of specifications with similar names where some specifications are optional (see canon camcorders as an example: the difference between the lcd number of pixels and the viewfinder number of pixels). The worst-case scenario would be a flat list of HTML paragraphs. The engine will try to identify these risks and should output some warnings in this kind of situation.
Here are the contents of the ZIP distribution of this project (distributed under the General Public License):
- web_parser.py: this is the web parser engine.
- parser_dvspot.py: this is an example launch script to be used if you want to parse HTML files coming from the dvspot.com web site.
- template_dvspot.zpt: this is the example template file corresponding to the excellent dvspot.com site
- BeautifulSoup.py: this is MY version of BeautifulSoup. Indeed, I had to modify Leonard Richardson’s official one and I have not obtained any answer from him so far regarding my suggested modifications. I hope he will soon answer me and maybe include my modifications in the official version, or help me overcome my temptation to fork. My modifications are based on the official 1.2 release of beautifulsoup: I added “center” as a nestable tag and added the ability to match the content of an element with the help of wildcards. You should save this BeautifulSoup.py file into the “Lib\site-packages” folder of your python installation.
- README.html is the file you are currently reading, also published on my blog.
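To give an idea of what matching element content with wildcards means, here is a small sketch based on the standard fnmatch module (Python 3). It is not the patch applied to BeautifulSoup.py: the `content_matches` helper is hypothetical and the patterns use shell-style wildcards, which may differ from the wildcard syntax of the modified library.

```python
from fnmatch import fnmatchcase

def content_matches(text, pattern):
    """Return True when the element text matches a shell-style wildcard pattern."""
    return fnmatchcase(text.strip(), pattern)

# Two similar specification labels, as in the camcorder example above.
print(content_matches("LCD number of pixels", "LCD*"))
print(content_matches("Viewfinder number of pixels", "LCD*"))
```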
P2P + Semantic Web + Social Networks + Office Software = ?
Wednesday, March 9th, 2005
Take an ounce of peer-to-peer, three cubits of semantic web, two pounds of office software and a pennyweight of social networks, knead vigorously and you get… the “Networked Semantic Desktop”. Now that is convergence, or I don’t know what is… It is a research project, so move along, there is nothing to download! Also seen here.
Identity matching and social networks
Wednesday, March 9th, 2005
IBM recently acquired SRD, the vendor of a piece of software that matches:
- various computerized descriptions of the same individual (advanced identity matching),
- the relationships established between individuals based on what they have in common (building social networks)
In short, a piece of software perfectly suited to anyone who wants to put together the complete little Big Brother kit.
This technology seems somewhat similar to the meta-directories that specialize in the simple joining of electronic identities. But it seems to find its main applications in data mining for marketing and CRM, in credit risk analysis, and in state intelligence and private security. It reminds me of Semagix which, while targeting similar fields of application (CRM, detection of fraudulent financial transactions and dismantling of criminal social networks), bets on a “semantic web” approach.
It would certainly be interesting to know which technological approach SRD relies on, and whether this technology is robust, reliable and automated enough to contribute solutions to very operational electronic identity management problems.
Web-SSO : A CAS client for Zope
Thursday, February 24th, 2005
The Central Authentication Service (aka CAS) is an open source lightweight framework that provides Web Single Sign On to big organizations (universities, agencies, corporations). It seems to be widely used and is seen as being as mature and reliable as the struts framework.
An existing server can benefit from CAS WebSSO features if its technology is supported by a CAS client. So, please welcome Zope’s CAS User Folder, which SSOizes Zope within complex SSO infrastructures.
Gartner consecrates blogs and wikis
Thursday, February 24th, 2005
The Gartner Group recognizes a strong potential for the enterprise in wikis, blogs, social networking software and RSS. The attention this analyst firm is paying to the “grass-roots” knowledge management movement helps give it the legitimacy (the consecration?) that will allow it to gain a foothold in the private sector.
For a few months, I had felt it coming: my boss talks to me more and more often about blogs and RSS (“what is it?”, “what is it for?”, “how could I try it?”). At first, maybe it was partly to please me? But no, he even asked me to install an RSS aggregator on his workstation. Ah! Something concrete! Add to this all the buzzwording from Gartner and the likes of MetaGroup on the subject (“blogs and wikis are the third-generation collaboration tools”), and we can say that, at last, large companies are turning the attention of their IT departments to this subject (it was about time). Now, we will still have to wait a while before we see usages take root across the whole enterprise. In the meantime, let’s all blog and aggregate in chorus!
Zemantic: a Zope Semantic Web Catalog
Monday, February 14th, 2005
Zemantic is an RDF module for Zope (read its announcement). From what I have read (I have not tested it yet), it implements services similar to zope catalogs and enables universal management of references (like the Archetypes reference engine, but in a more sustainable way). It is based on RDFLib, like ROPE.
I feel enthusiastic about this product since it sounds to me like a good future-proof solution for the management of metadata, references and structured data within content management systems and portals. Plus Zemantic would sit well in my vision of Plone as a semantic aggregator.
Modernizing the electronic identity management processes of Saint-Gobain
Saturday, January 1st, 2005
[This is a summary of one of my professional achievements. I use it to advertise myself in the hope of appealing to future partners. More information on this in the account of my professional career.]
In 2000, Saint-Gobain’s IT is so decentralized and heterogeneous that central security does not know who is or is not part of the group. The information systems department puts me in charge of leading the modernization of the electronic identities of the Group’s 200,000 people. I analyze the key processes to be modernized: arrival and departure of people. I lead the change with the stakeholders concerned (HR, IT and general services). I lead the design and deployment of a technical identity management infrastructure. In 2005, 50,000 of the 200,000 people are registered and have a unique IT identifier. The identity of 30,000 of them and their membership of the Group is kept up to date in real time in the Group Directory. The rest of the deployment is planned for the next three years.
Web scraping with python (part 1 : crawling)
Wednesday, December 29th, 2004
Example One: I am looking for my next job. So I subscribe to many job sites in order to receive email notifications of new job ads (example = Monster…). But I’d rather check these in my RSS aggregator instead of my mailbox. Or in some sort of aggregating Web platform. Thus, I would be able to do many filtering/sorting/ranking/comparison operations in order to navigate through these numerous job ads.
Example Two: I want to buy a digital camcorder. So I want to compare the available models. Such a comparison implies that I rank the most common models according to their characteristics. Unfortunately, the many sites providing reviews or comparisons of camcorders are seldom comprehensive, and they don’t let me compare models according to my own way of ranking and weighting the camcorder features (example = dvspot). So I would prefer to pump all the technical stuff from these sites and manipulate this data locally on my computer. Unfortunately, this data is embedded within HTML. And it may be complex to extract it automatically from all the presentation code.
These are common situations: interesting data spread all over the web and embedded in HTML presentation code. How can you consolidate this data so that you can analyze and process it with your own tools? In the near future, I expect this data will be published so that it is directly processable by computers (this is what the Semantic Web intends to do). Until now, I used to do it with Excel (importing Web data, then cleaning it and the like) and I must admit that Excel is fairly good at it. But I’d like some more automation for this process. I’d like some more scripting for this operation so that I don’t end up inventing complex Excel macros or formulas just to automate Web site crawling, HTML extraction and data cleaning. With such an itch to scratch, I tried to address this problem with python.
This series of messages introduces my current hacks that automate web site crawling and data extraction from HTML pages. The current output of these scripts is a bunch of CSV files that can be further processed… in Excel. I wish I could output RDF instead of CSV. So there remains much room for further improvement (see RDF Web Scraper for a similar but different approach). Anyway… Here is part One: how to crawl complex web sites with Python? The next part will deal with data extraction from the retrieved web pages, involving much HTML cleansing and parsing.
My crawlers are fully based on John J. Lee’s mechanize framework for python. There are other tools available in Python. And several other approaches are available when you want to automate the crawling of web sites. Note that you can also try to scrape the screens of legacy terminal-based applications with the help of python (this is called “screen scraping”). Some approaches to web crawling automation rely on recording the behaviour of a user equipped with a web browser and then reproducing this same behaviour in an automated session. That is an attractive and futuristic approach. But this implies that you find a way to guess what the intended automatic crawling behaviour will be from a simple example. In other words, with this approach, either you ask the user to click on every web link (all the job postings…), and then the automation of the task brings no value; or your system “guesses” what automatic behaviour is expected just by recording a sample of what a human agent would do. Too complex… So I preferred a more down-to-earth solution, which implies that you write simple crawling scripts “by hand”. (You may still be interested in automatically recording user sessions in order to be more productive when producing your crawling scripts.) As a summary: my approach is fully based on mechanize, so you may consider the following code as examples of uses of mechanize in “real-world” situations.
For the sake of clarity, let’s first focus on the part of the code that is specific to your crawling session (to the site you want to crawl). Let’s take the example of the dvspot.com site, which you may try to crawl in order to download detailed descriptions of camcorders:
# Go to home page
#
b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
#
# Navigate through the paginated list of cameras
#
nr_of_items = 0   # total number of items saved during this session
next_page = 0
while next_page == 0:
    #
    # Display and save details of every listed item
    #
    url = b.response.url
    next_element = 0
    while next_element >= 0:
        try:
            b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
            next_element = next_element + 1
            nr_of_items = nr_of_items + 1
            print save_response(b,"dvspot_camera_"+str(next_element))
            # go back to the current listing page
            b.open(url)
            # if you crawled too many items, stop crawling
            if nr_of_items >= MAX_NR_OF_ITEMS_PER_SESSION:
                next_element = -1
                next_page = -1
        except LinkNotFoundError:
            # You certainly reached the last item in this page
            next_element = -1
    #
    try:
        b.open(url)
        b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
        print "processing Next Page"
    except LinkNotFoundError:
        # You reached the last page of the listing of items
        next_page = -1
You will have noticed that the structure of this code (conditional loops) depends on the organization of the site you are crawling (paginated results, …). You also have to specify the rules that will trigger “clicks” from your crawler. In the above example, your script first follows every link containing “cameraDetail” in its URL (url_regex). Then it follows the first link containing “Next Page” in the hyperlink text (text_regex).
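The shape of that loop can be tried out without any network access. The sketch below (Python 3, standard library only) replaces the real site with a small in-memory dictionary; the `SITE` structure, its page names and the `crawl` function are all made up for the illustration and have nothing to do with mechanize itself.

```python
import re

# A tiny in-memory "site": each listing page has links and maybe a next page.
SITE = {
    "list?start=0": {"links": ["cameraDetail?id=1", "cameraDetail?id=2"],
                     "next": "list?start=2"},
    "list?start=2": {"links": ["cameraDetail?id=3", "about.html"],
                     "next": None},
}

def crawl(start, url_regex, max_items):
    """Visit every listing page, collecting the links whose URL matches url_regex."""
    visited = []
    page = start
    while page is not None and len(visited) < max_items:
        for link in SITE[page]["links"]:
            if url_regex.search(link):       # plays the role of url_regex in follow_link
                visited.append(link)
                if len(visited) >= max_items:
                    break                    # item cap reached, stop crawling
        page = SITE[page]["next"]            # plays the role of following "Next Page"
    return visited

items = crawl("list?start=0", re.compile(r"cameraDetail"), max_items=500)
print(items)
```

The outer loop walks the pagination, the inner loop walks the items of one page, and the cap on the number of items bounds the whole session, exactly the three concerns of the dvspot script.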
This kind of script is usually easy to design and write, but it can become complex when the web site is improperly designed. There are two sources of difficulty. The first one is bad HTML. Bad HTML may crash the mechanize framework. This is the reason why you often have to pre-process the HTML, either with the help of an HTML tidying library or with simple string substitutions when your tidy library breaks the HTML too much (this may be the case when the web designer improperly decided to use nested HTML forms). Designing the proper HTML pre-processor for the Web site you want to crawl can be tricky, since you may have to dive into the faulty HTML and the mechanize error tracebacks in order to identify the HTML mistakes and work around them. I hope that future versions of mechanize will implement more robust HTML parsing capabilities. The ideal solution would be to integrate the Mozilla HTML parsing component, but I guess this would be hard work. Let’s cross our fingers.
Here are useful examples of pre-processors (as introduced by some other mechanize users and developers):
class TidyProcessor(BaseProcessor):
    def http_response(self, request, response):
        options = dict(output_xhtml=1,
                       add_xml_decl=1,
                       indent=1,
                       output_encoding='utf8',
                       input_encoding='latin1',
                       force_output=1
                       )
        r = tidy.parseString(response.read(), **options)
        return FakeResponse(response, str(r))
    https_response = http_response
#
class MyProcessor(BaseProcessor):
    def http_response(self, request, response):
        r = response.read()
        r = r.replace('"image""','"image"')
        r = r.replace('"','"')
        return FakeResponse(response, r)
    https_response = http_response
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
The second source of difficulty comes from non-RESTful sites. As an example, the APEC site (a French Monster-like job site) is based on a proprietary web framework which implies that you cannot rely on link URLs to automate your browsing session. It took me some time to understand that, once logged in, every time you click on a link you are presented with a new frameset referring to the URLs that contain the interesting data you are looking for. And these URLs seem to depend on your session. No permalinks, if you prefer. This makes the crawling process even trickier. In order to deal with this source of difficulty when you write your crawling script, you have to open both your favorite text editor (to write the script) and your favorite web browser (Firefox of course!). One key piece of knowledge is mechanize’s “find_link” capabilities. These capabilities are documented in the _mechanize.py source code, in the doc string of the find_link method. They are the arguments you will provide to b.follow_link in order to automate your crawler’s “clicks”. For convenience, let me reproduce them here:
- text: link text between link tags: <a href="blah">this bit</a> (as returned by pullparser.get_compressed_text(), ie. without tags but with opening tags “textified” as per the pullparser docs) must compare equal to this argument, if supplied
- text_regex: link text between tag (as defined above) must match the regular expression object passed as this argument, if supplied
- name, name_regex: as for text and text_regex, but matched against the name HTML attribute of the link tag
- url, url_regex: as for text and text_regex, but matched against the URL of the link tag (note this matches against Link.url, which is a relative or absolute URL according to how it was written in the HTML)
- tag: element name of opening tag, eg. “a”
- predicate: a function taking a Link object as its single argument, returning a boolean result, indicating whether the links
- nr: matches the nth link that matches all other criteria (default 0)
Links include anchors (a), image maps (area), and frames (frame,iframe).
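The selection criteria listed above can be illustrated without mechanize at all. The following sketch (Python 3, standard html.parser module) mimics only three of them, url_regex, text_regex and nr, and only for anchors. It is a stand-in written for this explanation, not mechanize code: the sample links are invented, and where mechanize raises LinkNotFoundError this sketch simply returns None.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (url, text) pairs for every <a href=...> element."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def find_link(html, url_regex=None, text_regex=None, nr=0):
    """Return the nr-th link matching all supplied criteria, or None."""
    collector = LinkCollector()
    collector.feed(html)
    matches = [
        (url, text) for url, text in collector.links
        if (url_regex is None or url_regex.search(url))
        and (text_regex is None or text_regex.search(text))
    ]
    return matches[nr] if nr < len(matches) else None

# Invented sample page with two detail links and one pagination link.
page = ('<a href="cameraDetail.php?id=1">Camera A</a>'
        '<a href="cameraDetail.php?id=2">Camera B</a>'
        '<a href="cameraList.php?start=10">Next Page</a>')
print(find_link(page, url_regex=re.compile(r"cameraDetail"), nr=1))
print(find_link(page, text_regex=re.compile(r"Next Page")))
```

Incrementing nr until no link is found is what lets a crawling script walk through every matching link of a page, one “click” at a time.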
Enough with explanations. Now comes the full code that automatically downloads camcorder descriptions from dvspot.com. I distribute this code here under the GPL (legally speaking, I don’t own the copyright on this entire code since it is based on several snippets I gathered from the web and from the wwwsearch mailing list). Anyway, please copy-paste-taste!
from mechanize import Browser,LinkNotFoundError
from ClientCookie import BaseProcessor
from StringIO import StringIO
# import tidy
#
import sys
import re
from time import gmtime, strftime
#
# The following two lines are specific to the site you want to crawl:
# they give your crawler the ability to understand the meaning
# of the data it is crawling, for example to know the age of the
# crawled resource
#
from datetime import date
# from my_parser import parsed_resource
#
"""
Let's declare some customized pre-processors.
These are useful when the HTML you are crawling through is not clean enough for mechanize.
When you crawl through bad HTML, mechanize often raises errors.
So either you tidy it with a strict tidy module (see TidyProcessor)
or you tidy some errors you identified "by hand" (see MyProcessor).
Note that because the tidy module is quite strict on HTML, it may change the whole
structure of the page you are dealing with. As an example, in bad HTML, you may encounter
nested forms or forms nested in tables or tables nested in forms. Tidying them may produce
unintended results such as closing the form too early or making it empty. This is the reason
you may have to use MyProcessor instead of TidyProcessor.
"""
#
class FakeResponse:
    def __init__(self, resp, nudata):
        self._resp = resp
        self._sio = StringIO(nudata)
    #
    def __getattr__(self, name):
        try:
            return getattr(self._sio, name)
        except AttributeError:
            return getattr(self._resp, name)
#
class TidyProcessor(BaseProcessor):
    def http_response(self, request, response):
        options = dict(output_xhtml=1,
                       add_xml_decl=1,
                       indent=1,
                       output_encoding='utf8',
                       input_encoding='latin1',
                       force_output=1
                       )
        r = tidy.parseString(response.read(), **options)
        return FakeResponse(response, str(r))
    https_response = http_response
#
class MyProcessor(BaseProcessor):
    def http_response(self, request, response):
        r = response.read()
        r = r.replace('"image""','"image"')
        r = r.replace('"','"')
        return FakeResponse(response, r)
    https_response = http_response
#
# Open a browser and optionally choose a customized HTML pre-processor
b = Browser()
b.add_handler(MyProcessor())
#
""""
Let's declare some utility methods that will enhance mechanize browsing capabilities
"""
#
def find(b,searchst):
b.response.seek(0)
lr = b.response.read()
return re.search(searchst, lr, re.I)
#
def save_response(b,kw='file'):
    """Saves last response to timestamped file"""
    name = strftime("%Y%m%d%H%M%S_",gmtime())
    name = name + kw + '.html'
    f = open('./'+name,'w')
    b.response.seek(0)
    f.write(b.response.read())
    f.close()
    return "Response saved as %s" % name
#
"""
Hereafter is the only (and somewhat big) script that is specific to the site you want to crawl.
"""
#
def dvspot_crawl():
    """
    Here starts the browsing session.
    For every move, I could have put as a comment an equivalent PBP command line.
    PBP is a nice scripting layer on top of mechanize.
    But it does not allow looping or conditional browsing.
    So I preferred scripting directly with mechanize instead of using PBP
    and then adding an additional layer of scripting on top of it.
    """
    #
    MAX_NR_OF_ITEMS_PER_SESSION = 500
    # Total number of items saved so far, across all pages
    nr_of_items = 0
    #
    # Go to home page
    #
    b.open("http://www.dvspot.com/reviews/cameraList.php?listall=1&start=0")
    #
    # Navigate through the paginated list of cameras
    #
    next_page = 0
    while next_page == 0:
        #
        # Display and save details of every listed item
        #
        url = b.response.url
        next_element = 0
        while next_element >= 0:
            try:
                b.follow_link(url_regex=re.compile(r"cameraDetail"), nr=next_element)
                next_element = next_element + 1
                nr_of_items = nr_of_items + 1
                print save_response(b,"dvspot_camera_"+str(next_element))
                b.open(url)
                # if you crawled too many items, stop crawling
                if nr_of_items >= MAX_NR_OF_ITEMS_PER_SESSION:
                    next_element = -1
                    next_page = -1
            except LinkNotFoundError:
                # You reached the last item in this page
                next_element = -1
        #
        try:
            b.open(url)
            b.follow_link(text_regex=re.compile(r"Next Page"), nr=0)
            print "processing Next Page"
        except LinkNotFoundError:
            # You reached the last page of the listing of items
            next_page = -1
    #
    return
#
#
#
if __name__ == '__main__':
    #
    """ Note that you may need to specify your proxy first.
    On windows, you do :
    set HTTP_PROXY=http://proxyname.bigcorp.com:8080
    """
    #
    dvspot_crawl()
In order to run this code, you will have to install mechanize 0.0.8a, pullparser 0.0.5b, clientcookie 0.4.19, clientform 0.0.16 and utidylib. I used Python 2.3.3. I think the latest version of clientcookie was to be integrated into Python 2.4. To install mechanize, pullparser, clientcookie and clientform, just proceed the usual way:
python setup.py build
python setup.py install
python setup.py test
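The script above is Python 2 and relies on 2004-era library versions. As a rough sketch only, here is how its site-independent ideas (repairing known HTML glitches by hand, searching a page case-insensitively, and saving under a timestamped name) might look in Python 3 with the standard library alone. The function names `repair_html`, `find_in_page` and `timestamped_name` are hypothetical and are not part of mechanize.

```python
import re
from time import gmtime, strftime


def repair_html(html):
    """Fix markup errors identified by hand before the page is parsed."""
    # Same kind of fix as MyProcessor: a stray doubled quote after an attribute value
    return html.replace('"image""', '"image"')


def find_in_page(html, pattern):
    """Case-insensitive regular expression search, like find() in the script."""
    return re.search(pattern, html, re.I)


def timestamped_name(kw="file", when=None):
    """Build a UTC-timestamped file name, like save_response() in the script."""
    when = gmtime() if when is None else when
    return strftime("%Y%m%d%H%M%S_", when) + kw + ".html"


if __name__ == "__main__":
    page = repair_html('<input type="image"" src="go.gif"><a>Next Page</a>')
    print(page)
    print(bool(find_in_page(page, "next page")))
    print(timestamped_name("dvspot_camera_1"))
```

Keeping these helpers free of any browser object makes them easy to check on their own before wiring them into a crawl.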
Last but not least: you should be aware that you may be breaking some terms of service of the website you are trying to crawl. Thanks to dvspot for providing such valuable camcorder data to us!
The next part will deal with processing the downloaded HTML pages and extracting useful data from them.
RDF as seen by a poet
Thursday, December 9th, 2004
RDF is the basic building block of the semantic web. In a barn, a poet reminds you that back in nursery school you already knew how to do RDF with tightrope-walking bears.
Weblogs in the enterprise: continued
Monday, October 18th, 2004
As a follow-up to my last two posts on weblogs in the enterprise, do not miss the related articles by Padawan and Loic Le Meur. In passing, hello to Gilles (the one who is "en vrac" and "en ligne"), who takes a Québécois interest in the same topics! :)
Portals / CMS in J2EE
Tuesday, October 5th, 2004
To build an enterprise portal on J2EE, you can either buy an expensive proprietary portal (IBM or BEA, to name only the leaders in J2EE application servers) or turn to an open source J2EE portal. But while the open source offering in J2EE application servers (JBoss, Jonas) has reached a maturity that makes it credible for large-scale projects, the open source offering in J2EE portals looks largely immature. This seems to close the portal and content management market of large companies to open source for many years to come.
In the eyes of the J2EE community, of the consulting firms in this sector and of the big vendors, the best product on the market will necessarily be the one that supports at least the two standards of the moment: JSR 168, to guarantee that portlets are portable from one product to another, and WSRP, to guarantee the interoperability of remote portlets between their application server and the portal that aggregates and publishes them. So in this product range there is a race to be the most fashionably "SOA" (Service-Oriented Architecture). Liferay and Exo are the open source J2EE portals most often cited. This open source offering is no stranger to the SOA swagger (products do have to be marketed, after all…). As a result, the development effort of open source J2EE portals seems to go more into climbing the SOA stack than into implementing useful features. This is surely what leads the J2EE community to observe that open source J2EE portals still lack a lot of maturity and functional richness, especially when compared with Plone, the leading open source portal / CMS. Indeed, Plone is built on a Python application server (Zope), not a Java one (let alone J2EE); it therefore stands outside the JSR 168 race and seems to royally ignore the WSRP bluff.
Many companies strive to make J2EE an internal doctrine for application architecture. When they have to choose a portal, they therefore quickly rule out the open source J2EE offering (not mature enough). And, rather than choosing a non-J2EE portal recognized as more mature, richer in features and cheaper, they prefer to stick to their J2EE ideology on the pretext that there is no salvation outside J2EE/.Net. Not buzzword compliant enough, my son… Pfff, don't ask who I have in mind… :-(
Blogs, klogs, plogs… in the enterprise
Wednesday, September 29th, 2004
They are commonly called weblogs, or blogs for short, or "carnets web" to sound more French. As some weblogs have specialized, neologisms have multiplied: klogs are weblogs dedicated to knowledge sharing (knowledge logs); plogs are the logbooks of project teams (project logs); moblogs are weblogs updated from a PDA or a mobile phone (mobile logs). Not to mention photologs and vlogs (video logs). And since weblogs are entering the business world, we come back to saying that blog = business-log.
A journalist from CIO magazine (aimed at IT directors) confirms that, among the largest companies in the world (Fortune 500), a significant number use blogs within their IT departments, notably as project logs to coordinate and comment on the progress of IT projects.
In his article he cites these companies' motivations, and his readers add a few more:
- "[their] mix of critical comments is seen as constructive more than the reverse"
- "if I were a sales manager at a pharmaceutical giant, I would appreciate being able, from time to time, to browse the weblog of my IT counterpart who is rolling out a sales force automation system"
- "[one can] hardly imagine a better way to anchor the new members of an IT department in a shared context"
- "a plog gives a leader the opportunity to observe as a whole the 'ongoing storyboard' [of his project] to assess whether the actions or thinking under way will produce the deliverables expected of the project". He can then react as a film or stage director would.
- unlike the usual top-down approaches to knowledge management, "plogs and their cousins let knowledge stay close to the creator of that knowledge". Weblogs are individual tools that value the individual's contribution rather than drowning it in the mass.
- blogs are the electronic successor to the coffee machine
The author of this article, and those who commented on it, also cite various risks that must be managed intelligently when adopting weblogs in the enterprise:
- stay constructive: the motivation of a project log's readers should be curiosity (knowing where the project stands, for example) rather than the desire to interfere
- prevent undue interference: when reading a plog, the temptation can be great to become a micromanager who meddles unduly in ongoing business
- prevent bouts of blogorrhea: "the line between free expression and self-indulgence is frightfully thin", and bloggers may tend to drift into self-promotion or to drown their potential readers in self-centered prose that interests only themselves
- avoid communicating instead of working: by dint of enjoying communicating with colleagues, one can lose one's sense of priorities!
- do not be fooled by polished communication: a plog can become a tool of office politics, an echo chamber for those who know that their manager cannot tell braggarts from effective contributors
- be effective: so that weblogs are not "yet another attempt at managing knowledge", their adoption should be driven by pragmatism and by what pilot users actually do with them, not by concepts or tools
Dave Pollard, for his part, had gathered on his weblog a number of (good) pieces of advice for implementing a blogging policy in a company:
- Blogs are individual (no to team weblogs)
- A blog's taxonomy must remain its author's own (and not get lost in a bureaucratic or technocratic policy of classifying/categorizing concepts!). It will typically mirror the way the author organizes the "My Documents" folder on his workstation, or his mailbox, or more simply his filing cabinet.
- The best candidates for blogging in the enterprise are those who already publish abundantly in the company: newsletter editors, experts, communicators. They are the people who come to mind spontaneously when answering the question: "which of your colleagues has the files whose content would be most useful to you in your work?"
- For each weblog owner, ask your IT staff to convert all of his office documents to HTML and put them online in his weblog, to build an archive that brings immediate value to his readers.
- With the help of your marketing teams, make your customers want access to some of your employees' weblogs, as if they were a privileged channel for their relationship with the company.
The CIO.com journalist believes that
"IT organizations that use blogs effectively as management tools (or as communication resources) are probably [human] development environments that take people and ideas seriously."
Finally, he believes that
"when a developer, a manager or a customer support representative manages to produce a plog that attracts attention, raises awareness and brings about change, that is a skill that deserves recognition and reward."
McDonald Bradley's vision of the semantic web
Friday, September 17th, 2004
I was very impressed by the quality of the vision of McDonald Bradley's chief scientist regarding the semantic web. Not only does he offer very apt illustrations of Tim Berners-Lee's vision, he also places it very pertinently in the general context of how computing has evolved over the last decades, notably through the prospect of concrete enterprise applications. His declaration of data independence bodes very well for the new IT discipline that is information architecture. I find McDonald Bradley all the more interesting as a company because it positions itself on clearly delimited vertical markets within the public sector (hence pioneers in open source): intelligence services, defense, security, public finance and local government. To be compared with Kendall Grant Clark's questions about the adoption of the semantic web by free software communities? Unfortunately, I fear there is no equivalent company in France…
XML.com: Implementing REST Web Services: Best Practices and Guidelines
Monday, August 30th, 2004
I am sometimes criticized for being a bit too theoretical. So, regarding the REST architectural style, here are some best practices and guidelines for building Web services that conform to the REST style.
Plone as a semantic aggregator
Thursday, August 12th, 2004
Here is an output of my imagination (no code, sorry, just a speech): what if a CMS such as Plone could be turned into a universal content aggregator? It would become able to retrieve any properly packaged content/data from the Web and import it so that it can be reused, enhanced, and processed with the help of Plone content management features. As a universal content aggregator, it would be able to "import" (or "aggregate") any content, whatever its structure and semantics may be. Buzzwords ahead: Plone would be a schema-agnostic aggregator. It would be a semantic-enabled aggregator.
Example: On site A, beer-lovers gather. Site A's webmaster has set up a specific data schema for the description of beers, beer flavours, beer makers, beer drinkers, and so on. Since site A is rich in terms of content and its community of users is enthusiastic, plenty of beers have been described there. Then site B, powered by a semantic aggregator (and CMS), is interested in any data regarding beverages and the impact of beverages on human health. So site B retrieves beer data from site A. In fact it retrieves both the descriptions of beer1, beer2, beerdrinker1, … and the description of what a beer is, how data is structured when it describes a beer, and what the relationship is between a beer and a beer drinker. So site B now knows many things about beer in general (data structure = schema) and about many beers specifically (beer data). All this beer data on site B is presented and handled as specific content types. Site B's users are now able to handle beer descriptions as content items, to process them through workflows, to rate them, to blog on them, and so on. And finally to republish site B's own output in such a way that it can be aggregated again by other sites. That would be the definitive birth of the semantic web!
There are many news aggregators (RSSBandit, …) that know how to retrieve news items from remote sites. But they are only able to aggregate news data. They only know one possible schema for retrievable data : the structure of a news item (a title + a link + a description + a date + …). This schema is specified in the (many) RSS standard(s).
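To make the "one possible schema" point concrete, here is a minimal sketch of what a news-only aggregator does. It assumes an RSS 2.0 style feed; the sample feed, the `NEWS_FIELDS` tuple and the function name `parse_news_items` are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed used for illustration only
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>Sample channel</description>
    <item>
      <title>First post</title>
      <link>http://example.org/first</link>
      <description>Hello</description>
      <pubDate>Thu, 12 Aug 2004 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

# The only schema such an aggregator knows: the fields of a news item
NEWS_FIELDS = ("title", "link", "description", "pubDate")


def parse_news_items(rss_text):
    """Return one dict per <item>, restricted to the fixed news schema."""
    root = ET.fromstring(rss_text)
    return [
        {field: item.findtext(field) for field in NEWS_FIELDS}
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for news in parse_news_items(RSS_SAMPLE):
        print(news)
```

Anything in the feed that is not one of the hard-coded fields is simply ignored, which is the limitation a schema-agnostic aggregator would lift.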
But now that CMSs such as Plone are equipped with schema management engines (called "Archetypes" for Plone), they are able to learn new data schemas specified in XML files. Currently, Plone's Archetypes is able to import any schema specified in the form of an XMI file output by any UML modeling tool.
But XMI files are not that common on the Web. And the W3C published some information showing that any UML schema (class diagram, I mean) is the equivalent of an RDF-S schema. And there even is a testbed converter from RDF-S to XMI. And there even are web directories inventorying existing RDF schemas as RDF-S files. Plus RSS 1.0 is based on RDF. Plus the Atom designers designed it in such a way that it is easily converted to RDF.
So here is my easy speech (no code): let's build an RDF aggregator product for Plone. This product would retrieve any RDF file from any web site. (It could store it in Plone's triplestore, called ROPE, for instance.) It would then retrieve the associated RDF-S file (and store it in the same triplestore). It would convert it to an XMI file and import it as an Archetypes content type with the help of the ArchGenXML feature. Then it would import the RDF data as AT items conforming to the newly created AT content type. Here is a diagram summarizing this: [diagram of the pipeline described above]
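As a toy sketch of that idea, and under heavy assumptions, the two essential steps (learn content types from a schema, then import data as items of those types) could be modeled as follows. Triples are plain in-memory tuples; fetching RDF from the Web, the RDF-S to XMI conversion, ROPE and ArchGenXML are all left out, and every name below (`SCHEMA`, `DATA`, `learn_content_types`, `import_items`, the `ex:` identifiers) is hypothetical.

```python
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDFS_DOMAIN = "rdfs:domain"

# Hypothetical schema triples, as site A might publish them (RDF-S in spirit)
SCHEMA = [
    ("ex:Beer", RDF_TYPE, RDFS_CLASS),
    ("ex:name", RDFS_DOMAIN, "ex:Beer"),
    ("ex:flavour", RDFS_DOMAIN, "ex:Beer"),
    ("ex:BeerDrinker", RDF_TYPE, RDFS_CLASS),
    ("ex:likes", RDFS_DOMAIN, "ex:BeerDrinker"),
]

# Hypothetical data triples describing actual beers and drinkers
DATA = [
    ("ex:beer1", RDF_TYPE, "ex:Beer"),
    ("ex:beer1", "ex:name", "Trappist Blonde"),
    ("ex:beer1", "ex:flavour", "malty"),
    ("ex:drinker1", RDF_TYPE, "ex:BeerDrinker"),
    ("ex:drinker1", "ex:likes", "ex:beer1"),
]


def learn_content_types(schema):
    """Derive content types (class name -> sorted field names) from schema triples."""
    types = {s: [] for s, p, o in schema if p == RDF_TYPE and o == RDFS_CLASS}
    for s, p, o in schema:
        if p == RDFS_DOMAIN and o in types:
            types[o].append(s)
    return {name: sorted(fields) for name, fields in types.items()}


def import_items(data, content_types):
    """Turn data triples into content items of the previously learned types."""
    type_of = {s: o for s, p, o in data if p == RDF_TYPE and o in content_types}
    items = {s: {"type": t, "fields": {}} for s, t in type_of.items()}
    for s, p, o in data:
        # Keep only the properties that the learned content type declares
        if s in items and p in content_types[items[s]["type"]]:
            items[s]["fields"][p] = o
    return items


if __name__ == "__main__":
    content_types = learn_content_types(SCHEMA)
    print(content_types)
    print(import_items(DATA, content_types))
```

The point of the sketch is only that nothing in the importing code mentions beers: the content types come entirely from the schema that was retrieved.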
By the way, Gillou (from Ingeniweb) did not wait for my imagination output to propose a similar project. He called it ATXChange. The only differences I see between his proposal and what is said above are, first, that Gillou might not be aware of RDF and RDF-S capabilities (so he might end up with an Archetypes-specific aggregator inputting and outputting content to and from Plone sites only) and, second, that Gillou must be able to provide code sooner or later whereas I may not be!
Last but not least: WordPress is somewhat going in the same direction. The semweb community is showing some interest in WP structured blogging features. And some plugins are appearing that try to incorporate more RDF features in WP (see also seeAlso).
Is the Semantic Web stratospheric enough?
Friday, August 6th, 2004
Did you think the Semantic Web was a stratospheric concept for people smoking too many HTTP connections? If so, don't even try to understand what Pierre Levy intends to do. He and the associated network of people say they are preparing the next step after the Semantic Web. Well… In fact, I even heard Pierre Levy saying he is preparing the next step in the evolution of mankind, so this is not such a surprise. The worst point in this story is that his ambitious work may be extremely relevant and insightful for all of us, mortals. :)
RSS: The Next Big Thing Online
Friday, July 23rd, 2004
Here is a white paper (yes, really…) about RSS (again via the excellent Outils Froids). At last, a document that presents the RSS ecosystem in marketing terms that the IT department of a large company can understand… or so I hope. I will test this document on my colleagues and managers when I get back from vacation.
