Archives for the « Open source » category

Daisy vs. Plone, feature fighting

A friend of mine, Gouri, recently pointed me to Daisy, « a wiki/structured/XML/faceted CMS », as he put it. I answered that it may be a nice product but not attractive enough for me at the moment to spend much time analyzing it. Nevertheless, as Gouri asked, let’s browse Daisy’s features and try to compare them with their Plone equivalents (given that I have never tried Daisy).

The Daisy project encompasses two major parts: a featureful document repository

Plone is based on an object-oriented repository (Zope’s ZODB) rather than a document oriented repository.

and a web-based, wiki-like frontend.

Plone has its own web-based frontend. Wiki features are provided by an additional product (Zwiki).

If you have different frontend needs than those covered by the standard Daisy frontend, you can still benefit hugely from building upon its repository part.

Plone’s frontend is easily customizable, either with your own CSS, by inheriting from existing ZPT skins, or with a WYSIWYG skin module such as CPSSkin.

Daisy is a Java-based application

Plone is Python-based.

, and is based on the work of many valuable open source packages, without which Daisy would not have been possible. All third-party libraries or products we redistribute are unmodified (unforked) copies.

Same for Plone. Daisy seems to be based on Cocoon. Plone is based on Zope.

Some of the main features of the document repository are:
* Storage and retrieval of documents.

Documents are one of the numerous object classes available in Plone. The basic unit in Plone is an object, and an object is not fully extensible by itself unless it was designed to be so. Plone content types are more user-oriented than generic documents (they implement specialized behaviours such as security rules, workflows, displays, …). They will become very extensible when the next version of the underlying « Archetypes » layer is released: it includes a through-the-web schema management feature that allows web users to extend any existing content type.

* Documents can consist of multiple content parts and fields; document types define what parts and fields a document should have.

Plone’s perspective is different because of its object orientation. Another Zope product called Silva is more similar to Daisy’s document orientation.

Fields can be of different data types (string, date, decimal, boolean, …) and can have a list of values to choose from.

Same for Archetypes-based content types in Plone.
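
To make this concrete, here is a hedged sketch of what such an Archetypes schema looks like; it is written from memory of the Archetypes API of that time (field and class names are illustrative, and newer releases of registerType take an extra package-name argument):

    from Products.Archetypes.public import BaseContent, BaseSchema, \
        Schema, StringField, DateTimeField, BooleanField, \
        SelectionWidget, registerType

    schema = BaseSchema + Schema((
        StringField('brand',
                    vocabulary=('Canon', 'Sony', 'Panasonic'),  # list of values to choose from
                    widget=SelectionWidget(label='Brand')),
        DateTimeField('release_date'),   # date field
        BooleanField('discontinued'),    # boolean field
    ))

    class Camcorder(BaseContent):
        """A content type whose fields, widgets and validation are
        declared in its schema rather than coded by hand."""
        schema = schema

    registerType(Camcorder)   # older Archetypes signature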

Parts can contain arbitrary binary data, but the document type can limit the allowed mime types. So a document (or more correctly a part of a document) could contain XML, an image, a PDF document, … Part upload and download is handled in a streaming manner, so the size of parts is only limited by the available space on your filesystem (and, for uploading, a configurable upload limit).

I imagine that Daisy allows the upload and download of documents having any structure, with no constraint. In Plone, you are constrained by the object model of your content types. As said above, this model can be extended at run time (schema management), but at the moment the usual way is to define your model at design time and then comply with it at run time. Even without schema management, you can still add custom metadata at run time, or upload additional attached files if your content type supports them.

* Versioning of the content parts and fields. Each version can have a state of ‘published’ or ‘draft’. The most recent version with the state ‘published’ is the ‘live’ version, i.e. the version that is displayed by default (this depends on the behaviour of the frontend application, of course).

The default behaviour of Plone does not include real versioning, but document workflows: a given content item can be in state ‘draft’ or ‘published’ and go from one state to another according to a pre-defined workflow (with security conditions, event triggering and so on). But a given object has only one version by default.
There are, however, additional Plone products that make Plone support versioning. These products are to be merged into a future Plone distribution, because versioning has been a long-awaited feature. Note that, at the moment, you can have several versions of a document to support multi-language sites (one version per language).

* Documents can be marked as ‘retired’, which makes them appear as deleted; they won’t show up unless explicitly requested. Documents can also be deleted permanently.

Plone’s workflow mechanism is much more advanced. The default workflow includes a similar retired state. But the admin can define new workflows and modify the default one, always with reference to user roles. Plone’s security model is quite advanced and is the underlying layer of every Plone functionality.

* The repository doesn’t care much what kind of data is stored in its parts, but if it is « HTML-as-well-formed-XML », some additional features are provided:
o link extraction is performed, which allows searching for the referrers of a document
o a summary (first 300 characters) is extracted to display in search results
o (these features could potentially be supported for other formats also)

There is no such thing in Plone. Maybe in Silva? Plone’s reference engine allows you to define associations between objects. These associations are indexed by Plone’s search engine (the « catalog ») and can be searched.
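
As a hedged sketch (the tool and method names follow the Archetypes conventions as I remember them, so double-check them against your Plone version), defining and browsing such an association looks roughly like this:

    from Products.CMFCore.utils import getToolByName

    # 'context', 'article' and 'photo' stand for existing content objects
    rc = getToolByName(context, 'reference_catalog')
    rc.addReference(article, photo, relationship='illustrated_by')

    # associations are indexed, so they can be traversed and searched
    for ref in rc.getReferences(article, 'illustrated_by'):
        print ref.getTargetObject().Title()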

* all documents are stored in one « big bag », there are no directories.

Physically, the ZODB repository can take many forms (RDBMS, …). The default ZODB storage is a single flat file that can get quite big: Data.fs.

Each document is identified by a unique ID (an ever-increasing sequence number starting at 1), and has a name (which does not need to be unique).

Each object has an ID, but it is not globally unique at the moment. Objects are unfortunately stored in a hierarchical structure (Zope’s tree). Some Zope/Plone developers wished for « placeless content » to be implemented. Daisy must still be superior to Plone in that field.

Hierarchical structure is provided by the frontend through the possibility of creating hierarchical navigation trees.

Zope’s tree is the most important structure for objects in a Plone site; arguably too important. You can still create navigation trees with shortcuts. But in fact, the usual solution for getting maximum flexibility in navigation trees is to use the « Topic » content type. Topics are folder-like objects that contain a dynamic list of links to objects matching the Topic’s pre-defined query. Topics are like persistent searches displayed as folders. As an example, a Topic may display the list of all « Photo »-typed objects that are in « draft » state in a specific part (tree branch) of the site, etc.
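
As a hedged illustration (the criterion type names varied between Plone releases, so treat these as assumptions), building such a Topic through code looks roughly like this:

    # a Topic listing all « Photo » objects in 'draft' state, as above
    folder.invokeFactory('Topic', id='draft_photos', title='Draft photos')
    topic = getattr(folder, 'draft_photos')
    crit = topic.addCriterion('portal_type', 'String Criterion')
    crit.edit(value='Photo')
    state = topic.addCriterion('review_state', 'String Criterion')
    state.edit(value='draft')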

* Documents can be combined in so-called « collections ». Collections are sets of documents. One document can belong to multiple collections; in other words, collections can overlap.

Topics too? I regret that Plone does not easily offer a default way to display a whole set of objects on just one page. As an example, I would have enjoyed displaying a « book » of all the contents of my Plone site as if it were one single object (so that I can print it…). But there are some additional Plone products (extensions) that support similar functionality. I often use « Content Panels » to build a page by defining its global layout (columns and lines) and filling it with « views » of existing Plone objects (especially Topics). Content Panels mixed with Topics allow high flexibility in your site. But this flexibility has its limits too.

* possibility to take exclusive locks on documents for a limited or unlimited time. Checking for concurrent modifications (optimistic locking) happens automatically.

See versioning above.

* documents are automatically full-text indexed (Jakarta Lucene based). Currently supports plain text, XML, PDF (through PDFBox), MS-Word, Excel and Powerpoint (through Jakarta POI), and OpenOffice Writer.

Same for Plone, except that Plone’s search engine is not Lucene, and I don’t know whether Plone can read OpenOffice Writer documents. Note that you will need additional platform-dependent modules in order to index Microsoft files.

* repository data is stored in a relational database. Our main development happens on MySQL/InnoDB, but the provisions are there to add support for new databases; for example, PostgreSQL support is now included.

Everything is in the ZODB, stored by default as a single file, but it can also be stored in a relational database (though this is usually pointless). You can also transparently mix several repositories in the same Plone instance. Furthermore, instead of having Plone write directly to the ZODB file, you can configure Plone to go through a ZEO client-server setup, so that several Plone instances share a common database (load balancing). Even better, there is a commercial product, ZRS, that transparently replicates ZODBs, so that several Plone instances set up with ZEO can use several redundant ZODBs (no single point of failure).

The part content is stored in normal files on the file system (to offload the database). The use of these familiar, open technologies, combined with the fact that the Daisywiki frontend stores plain HTML, means that your valuable content is easily accessible with minimal « vendor » lock-in.

Everything’s in the ZODB. This can be seen as lock-in, but it is not really, because 1/ the product is open source and you can script a full export with Python with minimal effort; 2/ there are default WebDAV + FTP services that can be combined with Plone’s Marshall extension (soon to be included in Plone’s default distribution) to get your content out of your Plone site. Even better, you can also upload your structured semantic content with Marshall plus additional hacks, as I mentioned elsewhere.
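
As an illustration of point 1/, here is a minimal sketch of such an export script, assuming a « zopectl debug » session with a Plone site living at app.plone (object and method names are illustrative):

    import os

    if not os.path.isdir('export'):
        os.mkdir('export')

    site = app.plone                      # the Plone site inside the ZODB
    for brain in site.portal_catalog():   # an empty query returns everything
        obj = brain.getObject()
        name = brain.getPath().strip('/').replace('/', '__') + '.html'
        body = getattr(obj, 'getText', lambda: '')()   # HTML body, if any
        open(os.path.join('export', name), 'w').write(body)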

* a high-level, SQL-like query language provides flexible querying without knowing the details of the underlying SQL database schema. The query language also allows combining full-text (Lucene) and metadata (SQL) searches. Search results are filtered to only contain documents the user is allowed to access (see also access control). The content of parts (if HTML-as-well-formed-XML) can also be selected as part of a query, which is useful to retrieve e.g. the content of an « abstract » part of a set of documents.

No such thing in Plone as far as I know. You may have to Pythonize, my friend… Except that Plone’s tree gives a URL to every object, so you can access any part of the site, though not with a granularity similar to Daisy’s. See Silva for more document orientation.

* Access control: instead of attaching an ACL to each individual document, there is a global ACL which allows specifying the access rules for sets of documents by selecting those documents based on expressions. This allows, for example, defining access control rules for all documents of a certain type, or for all documents in a certain collection.

Access control is based on Plone’s tree, with inheritance (similar to the Windows security model in some ways). I suppose Plone’s access control is more sophisticated and maintainable than Daisy’s, but it would require more investigation to explain why.

* The full functionality of the repository is available via an HTTP+XML protocol, thus providing language and platform independent access. The documentation of the HTTP interface includes examples on how the repository can be updated using command-line tools like wget and curl.

Unfortunately, Plone is not RESTful enough at the moment. But there is some hope the situation will change with Zope 3 (Zope’s next major release, which is coming soon). Note that Zope (and so Plone) supports XML-RPC over HTTP as a generic web service protocol. But this is nothing near real RESTful web services…
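
For the record, here is a minimal sketch of calling a Plone site through Zope’s XML-RPC support (the URL and the invoked method are illustrative):

    import xmlrpclib

    # any Zope object is addressable by its URL, and its published
    # methods can be invoked remotely
    portal = xmlrpclib.ServerProxy('http://localhost:8080/plone')
    print portal.Title()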

* A high-level, easy to use Java API, available both as an « in-JVM » implementation for embedded scenarios or services running in the daisy server VM, as well as an implementation that communicates transparently using the HTTP+XML protocol.

Say Python and XML-RPC here.

* For various repository events, such as document creation and update, events are broadcast via JMS (currently we include OpenJMS). The content of the events is XML messages. Internally, this is used for updating the full-text index, sending notification mails and clearing remote caches. Logging all JMS events gives a full audit log of all updates that happened to the repository.

No such mechanism as far as I know. But Plone of course offers fully detailed audit logs of any of its events.

* Repository extensions can provide additional services, included are:
o a notification email sender (which also includes the management of the subscriptions), allowing subscribing to individual documents, collections of documents or all documents.

No such generic feature by default in Plone. You can add scripts that send notifications on any workflow transition, but you need to write one or two lines of Python, and the management of subscriptions is not implemented by default. Folder-like objects do support RSS syndication, though, so you can aggregate Plone’s new content in your favorite news aggregator.
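
For illustration, those « one or two lines of Python » typically live in a workflow transition script; here is a hedged sketch (addresses and names are made up) using the state_change object that DCWorkflow passes to such scripts:

    ## Script (Python) "notify_on_publish"
    ##parameters=state_change
    # attach this script to the 'publish' transition of a workflow
    obj = state_change.object
    context.MailHost.send(
        'Document published: %s' % obj.absolute_url(),
        mto='editors@example.com',
        mfrom='site@example.com',
        subject='[site] %s published' % obj.Title())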

o a navigation tree management component and a publisher component, which plays hand-in-hand with our frontend (see further on)

I’ll see further on… :)

* A JMX console allows some monitoring and maintenance operations, such as optimization or rebuilding of the fulltext index, monitoring memory usage, document cache size, or database connection pool status.

You have several places to look at for this kind of monitoring within Zope/Plone (no centralized monitoring). An additional Plone product helps centralize maintenance operations. Still some room for progress here.

The « Daisywiki » frontend
The frontend is called the « Daisywiki » because, just like wikis, it provides a mixed browsing/editing environment with a low entry barrier. However, it also differs hugely from the original wikis, in that it uses wysiwyg editing, has a powerful navigation component, and inherits all the features of the underlying daisy repository such as different document types and powerful querying.

Well, then we can just say the same for Plone and rename its skins the Plonewiki frontend… It supports WYSIWYG editing too, with a customizable navigation tree, etc.

* wysiwyg HTML editing
o supports recent Internet Explorer and Mozilla/Firefox (Gecko) browsers, with a fallback to a textarea on other browsers. The editor is a customized version of HTMLArea (customized through plugins, not a fork).

Same for Plone (except it is not an extension of HTMLArea but of a similar product).

o We don’t allow arbitrary HTML, but limit it to a small, structural subset of HTML, so that it’s future-safe, output-medium independent, secure and easily transformable. It is possible to have special paragraph types such as ‘note’ or ‘warning’. The stored HTML is always well-formed XML, and nicely laid out. Thanks to a powerful (server-side) cleanup engine, the stored HTML is exactly the same whether edited with IE or Mozilla, allowing source-based diffs.

No such validity control within Plone. In fact, the structure of a Plone document is always valid because it is managed by Plone according to a specific object model. But a given object may contain an HTML part (a document’s body, for example) that may not be valid. If your documents are to have a recurring inner structure, you are invited to make this structure an extension of an object class so that it is no longer handled as mere document structure. See what I mean?

o insertion of images by browsing the repository or upload of new images (images are also stored as documents in the repository, so can also be versioned, have metadata, access control, etc)

Same with Plone, except for versioning. Note that Plone’s Photo content type supports automatic server-side resizing of images.

o easy insertion of document links by searching for a document

Sometimes yes, sometimes no. It depends on the type of link you are creating.

o a heartbeat keeps the session alive while editing

I don’t know how it works here.

o an exclusive lock is automatically taken on the document, with an expiry time of 15 minutes, and the lock is automatically refreshed by the heartbeat

I never tried the Plone extension for versioning, so I can’t say. I know that you can use the WebDAV interface to edit a Plone object with your favorite text-processing package if you want, and I suppose this interface properly manages this kind of issue. But I never tried.

o editing screens are built dynamically for the document type of the document being edited.

Of course.

* Version overview page, from which the state of versions can be changed (between published and draft), and diffs can be requested.
* Nice version diffs, including highlighting of actual changes in changed lines (ignoring re-wrapping).

You can easily move any object through its associated workflow (from one state to another, through transitions). But no versioning. Note that you can use Plone’s wiki extension, which supports diffs and some versioning features, but this is not available for every Plone content type.

* Support for includes, i.e. the inclusion of one document in another (includes are handled recursively).

No.

* Support for embedding queries in pages.

You can use Topics (persistent queries). You can embed them in Content Panels.

* A hierarchical navigation tree manager. As many navigation trees as you want can be created.

One and only one navigation tree by default. But Topics can be nested, so you can have one main navigation tree plus one or more alternatives built with Topics (though these alternatives are limited in some respects).

Navigation trees are defined as XML and stored in the repository as documents, thus access control (for authoring them, read access is public), versioning etc. apply. One navigation tree can import another one. The nodes in the navigation tree can be listed explicitly, but also dynamically inserted using queries. When a navigation tree is generated, the nodes are filtered according to the access control rules for the requesting user. Navigation trees can be requested in « full » or « contextualized » form, the latter meaning that only the nodes leading to a certain document are expanded. The navigation tree manager produces XML; the visual rendering is up to XSL stylesheets.

This is nice. Plone cannot do that easily. But what Plone can do is still done with respect to its security model and access control, of course.

* A navigation tree editor widget allows easy editing of the navigation trees without knowledge of XML. The navigation tree editor works entirely client-side (Mozilla/Firefox and Internet Explorer), without annoying server-side roundtrips to move nodes around, and full undo support.

Yummy.

* Powerful document-publishing engine, supporting:
o processing of includes (works recursively, with detection of recursive includes)
o processing of embedded queries
o document type specific styling (XSLT-based), also works nicely combined with includes, i.e. each included document will be styled with its own stylesheet depending on its document type.

OK

* PDF publishing (using Apache FOP), with all the same features as the HTML publishing, thus also document type specific styling.

Plone’s document-like content types offer PDF views too.

* search pages:
o fulltext search
o searching using Daisy’s query language
o display of referers (« incoming links »)

Fulltext search is available. No query language for the user. Display of referrers is only available for content types that are either wiki pages or have been given the ability to receive references from other objects.

* Multiple-site support, allows to have multiple perspectives on top of the same daisy repository. Each site can have a different navigation tree, and is associated with a default collection. Newly created documents are automatically added to this default collection, and searches are limited to this default collection (unless requested otherwise).

It might be possible with Plone but I am not sure when this would be useful.

* XSLT-based skinning, with reusable ‘common’ stylesheets (in most cases you’ll only need to adjust one ‘layout’ XSLT, unless you want to customise heavily). Skins are configurable on a per-site basis.

Plone’s skins use the Zope Page Templates technology. This is a very nice and simple HTML templating technology. Plone’s skins make extensive use of CSS, and in fact most of the layout and look-and-feel of a site now lives in CSS objects. These skins are managed as objects, with inheritance, overriding of skins and other sophisticated mechanisms to configure them.

* User self-registration (with the possibility to configure which roles are assigned to users after self-registration) and password reminder.

Same is available from Plone.

* Comments can be added to documents.

Available too.

* Internationalization: the whole front-end is localizable through resource bundles.

Idem.

* Management pages for managing:
o the repository schema (the document types)
o the users
o the collections
o access control

Idem.

* The frontend currently doesn’t perform any caching; all pages are published dynamically, since this also depends on the access rights of the current user. For publishing high-traffic, public (i.e. all public access as the same user), read-only sites, it is probably best to develop a custom publishing application.

Zope includes caching mechanisms that take access rights into account. For very high-traffic public sites, a Squid frontend is usually recommended.

* Built on top of Apache Cocoon (an XML-oriented web publishing and application framework), using Cocoon Forms, Apples (for stateful flow scenarios), and the repository client API.

By default, Zope uses its own embedded web server. But the usual setup for production-grade sites is to put an Apache reverse proxy in front of it.

My conclusion: Daisy looks like a nice product when you have a very document-oriented project, with complex documents whose structures vary a lot from one document to another; its equivalent in Zope’s world would be Silva. But Plone is much more appropriate for everyday CMS sites. Its object orientation offers both great flexibility for the developer and more ease of use for the Joe-six-pack webmaster. Plone still lacks some important technical features for its future, namely RESTful web service interfaces and the placeless content paradigm. Versioning is expected soon.

This article was written in one sitting, late at night, and reviewed once thanks to Gouri. It may be wrong or badly lacking information on some points. So your comments are most welcome!

From OWL to Plone

I found a working path to transform an OWL ontology into a working Plone content type. Here is my recipe:

  1. Choose any existing OWL ontology.
  2. With Protege equipped with its OWL plugin, create a new project from your OWL file.
  3. Still within Protege, with the help of its UML plugin, convert your OWL Protege project into a UML classes project. You get an XMI file.
  4. Load this XMI file into a UML project with Poseidon. Save this project in Poseidon’s .zuml format.
  5. From Poseidon, export your classes as a new XMI file. It will be Plone-friendly.
  6. With a text editor, delete the accented characters that Poseidon might have added to your file (for example, the Frenchy Poseidon adds a badly accented « Modele sans titre » attribute to your XMI), because the next step won’t appreciate them.
  7. python Archgenxml.py -o YourProduct yourprojectfile.xmi turns your XMI file into a valid Plone product. This requires the latest stable versions of Plone and Archetypes (see doc), plus ArchgenXML head from the Subversion repository.
  8. Launch your Plone instance and install YourProduct as a new product from your Plone control panel. Enjoy YourProduct!
  9. Eventually, populate it with an appropriate marshaller.

Now you are not far from using Plone as a semantic aggregator.

An open source foundation for Fortune 500 companies

Another unused business idea to recycle… please follow me:

  • Fortune 500 companies produce a lot of in-house developments, reinventing the wheel again and again, especially in the field of non-critical, non-business applications: reporting applications, technical asset management, collaborative tools, IT security, identity management, content management… Each of them is sitting on a sizeable catalog of custom-made « commodity » applications that they develop and maintain on their own.
  • Most of these non-critical-non-business-in-house developments have a limited value but a high cost for businesses.
  • The main priority of corporate IT departments is to reduce their costs; most of them rely on outsourcing applicative developments to some extent.
  • Open source is now somewhat trendy, even among Fortune 500 corporations; it is reaching a high level of visibility and acceptability.
  • The open source model proposes an optimization of the costs sketched above by sharing them among several users of commodity applications, i.e. by open sourcing them.
  • The « open sourcing » of custom-made applications by Fortune 500 companies would be an alternative to the classical outsourcing of these development and maintenance costs.
  • The offshore outsourcing of corporate development is seen as a threat to IT programmers’ jobs in northern countries; the open sourcing of this software could be seen as more acceptable.
  • These IT departments are currently converging on common technological frameworks: .Net, J2EE, open source scripting; that movement enhances their capability to absorb « foreign » developments, and the standardization of their architectures tends to enhance the reusability of their in-house developments.
  • Open source foundations are legal entities designed to own the intellectual property of open source applications, to guarantee that the open source licence they are distributed under will be enforced, and to promote these applications so that their communities thrive and the applications gain in reliability, quality and sustainability.
  • The distribution of software under open source licences is said to represent the highest value transfer ever from rich countries to developing countries.
  • The open sourcing of these Fortune 500 applications would be a positive change both for the big corporations themselves and for smaller companies, especially in third-world countries.
  • Social entrepreneurship is becoming a hot topic today, even in mainstream media; this initiative might qualify as a social entrepreneurship initiative, and the public usefulness of such a move might justify the legal creation of a foundation in France.
  • Recently, in France, foundations have gained acceptance among big businesses since a new tax law offers higher opportunities for tax reductions.
  • I still work as an IT manager in the global IT department of a Fortune 100 company in France; our CIO would see such an open source foundation as a positive initiative but, as a cautious manager, he is dubious about the willingness of other CIOs of Fortune 500 companies (in France) to share their custom code under an open source licence.

Do you work in/for a big corporation willing to reality-check this idea? What do you think?

The CMS pseudo-stock market

The Drupal people have produced insightful, stock-market-like statistics about the popularity of open source CMS packages (via the precious Amphi-Gouri). Their analysis mixes content management systems (Drupal, Plone) with blog engines (WordPress) and bulletin boards (phpBB), but anyway, it shows that:

  • « The popularity of most Free and Open Source CMS tools is in an upward trend. »
  • Bulletin boards are the most popular category, maybe the most mature one, and phpBB is the strong leader in this category.
  • In the CMS category, Mambo, Xoops, Drupal and Plone are direct competitors; Mambo is ahead in terms of popularity, and Plone is behind its PHP competitors, which certainly benefit from the popularity of PHP compared to Python; PHP-Nuke and PostNuke are quickly losing ground.
  • WordPress is the most dynamic open source blog engine in terms of popularity growth; its community is exploding.

My conclusion:

  • if you want an open source bulletin board/community forum, choose phpBB with no hesitation
  • if you want a real content management system and are not religiously opposed to Python, choose Plone; otherwise stick with PHP and go Mambo (or Xoops?)
  • if you want an open source blog engine, enjoy WordPress

I feel that producing this kind of statistical analysis about the dynamics of open source communities is extremely valuable for organizations and people weighing several open source options (cf. the activity percentile indicated on SourceForge projects, for example). I would tend to say that the strength of an open source community, measured in terms of growth and size, is the single most important criterion to rely on when choosing an open source product.

Nowadays, the (real) stock market relies heavily on rating agencies. There must be room (and thus a business opportunity) for an open source rating agency that would produce strong evidence about the relative strength of project communities.

What do you think?

Web scraping with Python (part II)

The first part of this article dealt with retrieving HTML pages from the web with the help of a mechanize-propelled web crawler. Now your HTML pieces are safely saved locally on your hard drive, and you want to extract structured data from them. This is part 2: HTML parsing with Python. For this task, I adopted a slightly more imaginative approach than for my crawling hacks: I designed a data extraction technology based on HTML templates. Maybe this could be called « reverse-templating » (or something like template-based reverse web engineering).
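
As a reminder, the crawling side of part I boiled down to something like this minimal mechanize sketch (the URL and file names are illustrative):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)      # crawl manually, ignoring robots.txt
    html = br.open('http://www.dvspot.com/reviews/cam001.html').read()
    open('pages/cam001.html', 'w').write(html)   # saved for the parsing step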

You may be used to HTML templates for producing HTML pages. An HTML template plus structured data can be transformed into a set of HTML pages with the help of a proper templating engine. One famous technology for HTML templating is called Zope Page Templates (because this kind of template is used within the Zope application server). ZPTs use a special set of additional HTML tags and attributes referred to by the « tal: » namespace. One advantage of ZPT (over competing technologies) is that ZPTs are nicely rendered in WYSIWYG HTML editors. Thus web designers produce HTML mockups of the screens to be generated by the application. Web developers then insert tal: attributes into these HTML mockups so that the templating engine knows which parts of the HTML template have to be replaced by which pieces of data (usually pumped from a database). As an example, web designers will write <title>Camcorder XYZ</title>, then web developers will modify this into <title tal:content="camcorder_name">Camcorder XYZ</title>, and the templating engine will produce <title>Camcorder Canon MV6iMC</title> when it processes the « MV6iMC » record in your database (it replaces the content of the title element with the value of the camcorder_name variable as retrieved from the current database record). This technology is used to merge structured data with HTML templates in order to produce web pages.

I took inspiration from this technology to design parsing templates. The idea here is to reverse the use of HTML templates. In the parsing context, HTML templates are still produced by web developers, but the templating engine is replaced by a parsing engine (known as web_parser.py; see below for the code of this engine). This engine takes as input HTML pages (the ones you previously crawled and retrieved) plus ZPT-like HTML templates. It then outputs structured data. First your crawler saved <title>Camcorder Canon MV6iMC</title>. Then you wrote <title tal:content="camcorder_name">Camcorder XYZ</title> into a template file. Eventually the engine outputs camcorder_name = « Camcorder Canon MV6iMC ».

In order to trigger the engine, you just have to write a small launch script that defines several setup variables such as:

  • the URL of your template file,
  • the list of URLs of the HTML files to be parsed,
  • whether or not to pre-process these files with an HTML tidying library (this is useful when the engine complains about badly formed HTML),
  • an arbitrary keyword defining the domain of your parsing operation (it may be the name of the web site your HTML files come from),
  • the charset of these HTML files (no automatic detection at the moment, sorry…),
  • the output format (csv-like file or semantic web document),
  • an optional separator character or string if you chose the csv-like output format.

The easiest way to go is to copy and modify my example launch script (parser_dvspot.py), included in the ZIP distribution of this web_parser.
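
To give an idea, here is a hedged sketch of such a launch script, in the spirit of parser_dvspot.py; the variable names and the engine entry point are assumptions, so refer to the example script shipped in the ZIP for the exact interface:

    import web_parser

    template_url  = 'template_dvspot.zpt'    # the reverse-template
    html_urls     = ['pages/cam%03d.html' % i for i in range(1, 21)]
    tidy_first    = True              # pre-process with the HTML tidying library
    domain        = 'dvspot'          # arbitrary keyword naming this parsing run
    charset       = 'iso-8859-1'      # charset of the crawled pages
    output_format = 'csv'             # or 'owl' for an RDF/OWL document
    separator     = '\t'              # only used for csv-like output

    # hypothetical entry point standing in for whatever parser_dvspot.py calls
    web_parser.run(template_url, html_urls, tidy_first, domain,
                   charset, output_format, separator)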

Let’s summarize the main steps to go through:

  1. install utidylib into your Python installation
  2. copy and save my modified version of BeautifulSoup into your Python libraries directory (usually …/Lib/site-packages)
  3. copy and save my engine (web_parser.py) into your local directory or into your Python libraries directory
  4. choose a set of HTML files on your hard drive or directly on a web site
  5. save one of these files as your template
  6. edit this template file and insert the required pseudotal attributes (see below for pseudotal instructions, and see the example dvspot template template_dvspot.zpt)
  7. copy and edit my example launch script so that you define the proper setup variables in it (the example parser_dvspot.py contains more detailed instructions than the above); save it as my_script.py
  8. launch your script with python my_script.py > output_file.owl (or python my_script.py > output_file.csv)
  9. enjoy yourself and your fresh output_file.owl or output_file.csv (import it within Excel)
  10. give me some feedback about your reverse-templating experience (preferably as a comment on this blog)

This is just my first attempt at building such an engine, and I don’t want to create confusion between real (and mature) tal attributes and my pseudo-tal instructions. So I adopted pseudotal as my main namespace. At some point in the future, when the specification of these reverse-templating instructions is somewhat more stable (and if ever the « tal » guys agree), I might adopt tal as the namespace. Please also note that the engine is somewhat badly written: the code and internals are rather clumsy. There is much room for future improvement and refactoring.

The current version of this reverse-templating engine supports the following template attributes/instructions (see the source code for further updates and documentation):

  • pseudotal:content gives the name of the variable that will contain the content of the current HTML element
  • pseudotal:replace gives the name of the variable that will contain the entire current HTML element
  • (NOT SUPPORTED YET) pseudotal:attrs gives the name of the variable that will contain the (specified?) attribute(s?) of the current HTML element
  • pseudotal:condition is a list of arguments; it gives the condition(s) that must be verified for the parser to be sure that the current HTML element is the one looked for. This condition is constructed as a list following BeautifulSoup fetch arguments: a Python dictionary giving detailed conditions on the HTML attributes of the current HTML element, some content to be found in the current HTML element, and the scope of the search for the current HTML element (recursive search or not); see the hedged sketch after this list
  • pseudotal:from_anchor gives the name of the pseudotal:anchor that is used to build the relative path leading to the current HTML element; when no from_anchor is specified, the path used to position the current HTML element is calculated from the root of the HTML file
  • pseudotal:anchor specifies a name for the current HTML element; this element can be used by a pseudotal:from_anchor tag as the starting point for building the path to the element specified by pseudotal:from_anchor; usually used in conjunction with a pseudotal:condition; the default anchor is the root of the HTML file
  • pseudotal:option describes some optional behavior of the HTML parser; it is a list of constants; it contains NOTMANDATORY if the parser should not raise an error when the current element is not found (it does by default); it contains FULL_CONTENT when the data looked for is the whole content of the current HTML element (the default is the last part of the content of the current HTML element, i.e. either the last HTML tags or the last string included in the current element)
  • pseudotal:is_id_part: a special ‘id’ variable is automatically built for every parsed resource; this id variable is made of several concatenated parts, and pseudotal:is_id_part gives the index at which the current variable will be used when building the id of the current resource; usually used in conjunction with pseudotal:content, pseudotal:replace or pseudotal:attrs
  • (NOT SUPPORTED YET) pseudotal:repeat specifies the scope of the HTML tree that describes ONE resource (useful when several resources are described in one given HTML file, such as in a list of items); the value of this tag gives the name of a class that will instantiate the parsed resource scope, plus the name of a list containing all the parsed resources
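
As announced in the pseudotal:condition item above, here is a hedged sketch of how such a condition maps onto a BeautifulSoup « fetch » query (fetch is the old name of what later became findAll; the attribute values are illustrative):

    from BeautifulSoup import BeautifulSoup

    html = open('pages/cam001.html').read()
    soup = BeautifulSoup(html)   # some releases need: soup = BeautifulSoup(); soup.feed(html)
    # rough equivalent of a pseudotal:condition: a dictionary of HTML
    # attribute conditions, some expected content, and a search scope
    cells = soup.fetch('td', {'class': 'specname'})
    lcd = [cell for cell in cells if 'LCD' in str(cell)][0]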

The current version of the engine can output structured data either as CSV-like output (tab-delimited, for example) or as an RDF/OWL document (of Semantic Web fame). Both formats can easily be imported and further processed with Excel. The RDF/OWL format gives you the ability to process it with all the powerful tools emerging around the Semantic Web effort. If you feel adventurous, you may thus import your RDF/OWL file into Stanford’s Protege semantic modeling tool (or into Eclipse with its SWEDE plugin) and further process your data with the help of a SWRL rules-based inference engine. The future Semantic Web Rules Language will help further process this output so that you can powerfully compare RDF data coming from distinct sources (web sites). In order to be more productive in terms of fancy buzzwords, let’s say that this reverse-templating technology is some sort of web semantizer: it produces semantically rich data out of flat web pages.

The current version of the engine makes extensive use of BeautifulSoup. Maybe it should have been based on a more XML-ish approach instead (using XML paths?). But that would have implied that the HTML templates and the HTML files to be processed be turned into XHTML first. The problem is that I would then have had to rely on utidylib, and this library breaks some malformed HTML pages so badly that they are no longer usable.

Current known limitation: there is currently no proper way to handle situations where you need to distinguish between two similar anchors. In some cases, two HTML elements that you want to use as distinct anchors have exactly the same attributes and content. This is not a problem as long as these two anchors are always positioned at the same place in all the HTML pages you will parse. But as soon as one of the anchors is not mandatory, or is located after a non-mandatory element, the engine can get lost and either confuse the two anchors or complain that one is missing. At the moment, I don’t know how to handle this kind of situation. Example: long lists of specifications with similar names where some specifications are optional (see Canon camcorders as an example: the difference between the LCD number of pixels and the viewfinder number of pixels). The worst-case scenario is a flat list of HTML paragraphs. The engine tries to identify these risks and should output warnings in this kind of situation.


Here are the contents of the ZIP distribution of this project (distributed under the General Public License):

  • web_parser.py: this is the web parser engine.
  • parser_dvspot.py: this is an example launch script, to be used if you want to parse HTML files coming from the dvspot.com web site.
  • template_dvspot.zpt: this is the example template file corresponding to the excellent dvspot.com site.
  • BeautifulSoup.py: this is MY version of BeautifulSoup. Indeed, I had to modify Leonard Richardson’s official one, and I haven’t yet been able to obtain an answer from him regarding my suggested modifications. I hope he will answer me soon and maybe include my modifications in the official version, or help me overcome my temptation to fork. My modifications are based on the official 1.2 release of BeautifulSoup: I added « center » as a nestable tag and added the ability to match the content of an element with the help of wildcards. You should save this BeautifulSoup.py file into the « Lib\site-packages » folder of your Python installation.
  • README.html is the file you are currently reading, also published on my blog.

The economy of communion: using business to make the world a better place

The economy of communion is a concept born of a movement of the Catholic Church called the Focolare. While the Focolare have a slightly hippie-catholic/communitarian/charismatic flavour that may frighten or amuse, the Economy of Communion (EoC) is a concept I find very striking and relevant, especially when put in perspective with the open source phenomenon, with which it shares many common points, at least from an ideological standpoint. It is all the more striking in that it has already been adopted and implemented in more than 700 SMEs in Italy, Brazil and elsewhere.

The ambitions of the EoC

The School for Entrepreneurs of the Economy of Communion proposes to help company executives understand

how it [has been] possible for them to […] make the right choices: by being the first to love others, thereby inducing reciprocal love within the firm, which in turn draws the attention of heaven and the action of our hidden and divine partner directly into the company

Quite a theologico-economic programme! But what lies beneath this Catholic jargon? The entrepreneurs of the EoC set themselves the objective of

demonstrating that it is indeed possible to apply communion in the economy, and of showing thereby that this new economic behaviour rests on a broader rationality which anticipates a modus operandi that will become inevitable in a reasonable future.

According to professor Bruni,

The EoC hopes to transform business structures from within, by managing to illuminate the internal and external relations of companies in the light of a lifestyle based on communion, that is, on the ‘reciprocal gift’ this term implies […] The challenge of the EoC is to re-read everyday organizational practices in the light of this notion of gift and communion.

All of this really seems to have something in common with the challenges (and the ideology) of the open source economy: a gift economy, community practices, economic behaviour and rationality that are paradoxical yet promised wide success… Is the economy of communion the open source of the Catholic Church?

The difficulties the EoC faces

If companies as we know them are not always little nests of love, it is because

people’s moral reasoning [there] is cut off from the deep realities of their lives, thereby creating what many people call « a fractured life ». This is, I think, one of the main reasons why building authentic communities at work is so difficult. […] There is a strong temptation in the organization to see everything professional […] as a mere instrument of access to profit or individual success. […] The ubiquity [of this kind of rationality] excludes any form of moral rationality […] This instrumental rationality tends to focus the social responsibility of the company on […] giving profits to the poor [(charitable sponsorship)], giving personal time to charitable activities, providing various benefits to employees, etc., to the detriment of the way [daily] work actually gets done: the way people are paid, the way jobs are designed, decision-making processes, marketing, ownership [(shareholding)] structures, strategy, corporate governance, etc. The ambient instrumentalism prevents the moral and spiritual transformation of the way each person approaches his work and his way of working.

From a theoretical standpoint, the proponents of the EoC argue that economic theory belongs to the broader field of moral theories; that the idea of an economic rationality guided by the individual’s optimization of his strict self-interest has its limits; that it is not theoretically impossible for a whole economic system to gradually see paradoxical forms of economic rationality become commonplace; and finally that an economic rationality based on the Christian notion of communion (or, more broadly, on the reciprocal gift) is entirely… rational. Ultimately, from a theoretical standpoint, the EoC tries to send homo economicus back to the moral responsibility that necessarily guides his economic behaviour.

The EoC in practice

EoC companies distribute part of their profits to charitable works. But Lorna Gold explains that

the logic of communion [that underlies the EoC] is not limited to this distributive dimension. It concerns the way customers are treated, the pricing structure, crisis management, debtor management, etc. Quite clearly, overall efficiency is essential, but the « case by case » approach dominates and is guided by the desire to understand one’s neighbour’s needs.

Taking the example of compensation policies, Naughton notes that

compensation generates dissatisfaction, not satisfaction. Compensation in itself cannot build a community, but it can prevent one. […] [From the Christian point of view] work can never be reduced to the wage paid for it. […] It is better to avoid speaking of wages as an [economic] exchange, and to speak instead of elements of a working relationship between employer and employee, a relationship which has at its centre a gift dimension that can serve to strengthen a professional community. [However, it should be noted that] some jobs are designed so badly, in such an idiotic and bureaucratic way, that it becomes very difficult [for the employee] to show any spirit of gift in such a situation. [According to the EoC, three principles should guide compensation decisions:] meeting employees’ needs (a minimum wage), recognizing their contributions (an equitable wage), and allowing a sustainable economic order for the company (a sustainable wage).

For Michael Naughton, this example of compensation policies nicely illustrates, from the EoC point of view,

the art of intermediate-level thinking that links the theology of communion to the operational, everyday practices of the company

Leo Andringa illustrates the question of hierarchical relations within the company and of organizational models by evoking the fact that

the organization of companies is a « residue of feudal society ». The revolutionary ideas of liberty and equality influenced the Church, the family and institutions but […] by contrast did not touch the capitalist essence of the company system. […] From the standpoint of organization theory, it is clear that an organization […] with a single purpose expressed in financial targets (turnover, profits, cash flow, shareholder value) can be relatively simple and […] very hierarchical. [But] the main motivation of the EoC entrepreneur is to live communion in a business environment. […] The more complex an organization’s objective […] the more complex its organizational form will be.

Benedetto Gui specifies that, in the EoC,

being an entrepreneur (or a manager, or anyone with responsibilities in the company) is seen as a true vocation: the vocation of reaching high values (even spiritual values) through the accomplishment of secular tasks.

Mr Andringa cites his personal experience and his relations, as a boss, with his assistant:

Every time I had an important decision to make for the company, I shared my motivations and arguments with him. He was a kind of mirror for me. When I laid out my arguments to him, I immediately sensed whether or not they held up. As a director, it was a special experience to make decisions in [communion]. It was a reality I was already living in my private life and in my family, but I was transposing it for the first time into the reality of running a company. […] In practice, we see that a great number of entrepreneurs want to confront their important decisions with others. […] It is clear that such a vision of the company cannot materialize without the cooperation of the majority of shareholders and the cooperation or awareness of most of the employees. It is only in companies where communion exists at every level that those who hold the reins of the company can be the expression of solidarity rather than of their personal vision.

The link between the EoC and corporate social responsibility

Leo Andringa reminds us that

although many multinationals have kept their blinkers fixed on profit growth, many of them have become involved in increasing their corporate social responsibility. […] The philosophy of the EoC does not coincide with what is now called « Corporate Social Responsibility »: in fact, the EoC carries an environmental responsibility « by vocation » and not for purposes of communication, image or [response to] social pressure. [The EoC] demands much more. […] How can the company reconcile the interests of all those who depend on it: shareholders, customers, suppliers, civil society? From the [EoC] point of view, there is no theoretical answer to this problem. When it is impossible to solve this methodological problem at a material level, a solution must be found at another level [(spiritual, moral)].

The assets of EoC companies

Benedetto Gui explains that

the [ethical] concerns of EoC companies place an additional burden on managers, who feel implicitly obliged to guarantee their employees not only good jobs but also opportunities to develop positive interpersonal relations and to engage in their professional activities in accordance with their moral values. However, this drawback has a counterpart: a surplus of motivation and a mobilization of voluntary resources. It is thanks to this phenomenon that many EoC companies survive or even succeed, despite the « handicap » represented by their adherence to behavioural principles such as respect for the environment, for the law, etc.

A testimony in the EoC magazine illustrates this phenomenon: Marcelle, who runs a very small farm in Côte d’Ivoire, recounts her surprise on discovering that her workers came to take care of the plantations during their holidays, or when political events kept her away from her farm…

Have you already spotted French companies practicing the economy of communion?

Web-SSO : A CAS client for Zope

The Central Authentication Service (aka CAS) is an open source, lightweight framework that provides web single sign-on to big organizations (universities, agencies, corporations). It seems to be widely used, and is seen as as mature and reliable as the Struts framework.

An existing server can benefit from CAS WebSSO features if its technology is supported by a CAS client. So please welcome Zope’s CAS User Folder, which SSOizes Zope within complex SSO infrastructures.

Zemantic: a Zope Semantic Web Catalog

Zemantic is an RDF module for Zope (read its announcement). From what I have read (I have not tested it yet), it implements services similar to Zope catalogs and enables universal management of references (like the Archetypes reference engine, but in a more sustainable way). It is based on RDFLib, as is ROPE.

I feel enthusiastic about this product, since it sounds to me like a good, future-proof solution for the management of metadata, references and structured data within content management systems and portals. Plus, Zemantic would sit well in my vision of Plone as a semantic aggregator.
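
To give an idea of the underlying triple model, here is a minimal RDFLib sketch (the entry class moved around between RDFLib releases, and older ones exposed a TripleStore class instead; the URIs are illustrative):

    from rdflib import URIRef, Literal
    from rdflib.Graph import Graph   # older rdflib: from rdflib.TripleStore import TripleStore

    g = Graph()
    doc = URIRef('http://example.com/plone/my-document')
    title = URIRef('http://purl.org/dc/elements/1.1/title')
    g.add((doc, title, Literal('My document')))     # store a triple
    for s, p, o in g.triples((None, title, None)):  # query by pattern
        print s, o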

Firefox takes 5% of MS Internet Explorer’s market share

The Firefox web browser, from the Mozilla foundation, has taken 5% of the web browser market from its competitor, Microsoft Internet Explorer. And this only a few weeks after version 1.0 of Firefox was released. Firefox’s superiority over IE is already widely praised by the press, which is accelerating the migration…
These market share figures are published by OneStat, the « world leader in real-time web statistics », in short an a priori reliable source. More comments on Slashdot.

J2EE portals / CMS

To build a corporate portal in J2EE, the choice is between buying a costly proprietary portal (IBM or BEA, to mention only the leaders in J2EE application servers) or resorting to an open source J2EE portal. But while the open source offering in J2EE application servers (JBoss, Jonas) has reached a maturity that makes it credible for large-scale projects, the open source offering in J2EE portals still seems largely immature. This seems to close the large-enterprise portal and content management market to open source for many years to come.

In the eyes of the J2EE community, of the sector’s consulting firms and of the big vendors, the best product on the market will necessarily be the one that supports at least the two standards of the moment: JSR 168, to guarantee the portability of portlets from one product to another, and WSRP, to guarantee the interoperability of remote portlets between their application server and the portal that aggregates and publishes them. So in this product range there is a race to be the most fashionable in terms of « SOA » (Service-Oriented Architecture). Liferay and Exo are the most frequently cited open source J2EE portals. This open source offering is no stranger to the SOA boasting (products do have to be marketed, after all…). As a result, the development effort on open source J2EE portals seems to go more into climbing the SOA stack than into implementing useful features. This is certainly what leads the J2EE community to observe that open source J2EE portals still lack much maturity and functional richness, especially when compared with Plone, the leader among open source portals / CMS. Indeed, Plone relies on a Python application server (Zope), not a Java one (let alone J2EE); it therefore sits outside the JSR 168 race and seems to royally ignore the WSRP bluff.

Many companies strive to make J2EE an internal doctrine of application architecture. Confronted with the choice of a portal, they quickly eliminate the open source J2EE offering (not mature enough). And rather than choosing a non-J2EE portal recognized as more mature, richer in features and less costly, they prefer to stay within their J2EE ideology on the pretext that there is no salvation outside J2EE/.Net. Not buzzword-compliant enough, my son… Pfff, don’t follow my gaze… :-(

Linux as seen by Gartner

The Gartner Group recently surveyed data center managers. Here are a few extracts from the results: 42% of respondents are still experimenting with Linux to evaluate what it could bring them, 34% have adopted Linux in their data center after recognizing its advantages and maturity, and 9% are already at a stage where they draw tangible benefits from it. More surprisingly, 30% of them plan to deploy Linux to support departmental or line-of-business applications (and not merely as an OS for infrastructure servers). Linux adoption comes mainly at the expense of proprietary Unixes, but also partly at the expense of Windows servers. The three preferred service providers for supporting Linux deployments in data centers are IBM, Red Hat and HP. The main obstacles to Linux adoption in data centers are above all the lack of staff skills, but also the lack of applications and the lack of administration and monitoring tools (running data centers implies strong needs in this area).

An open source identity management solution

Linagora offers a complete electronic identity management solution built on open source software: InterLDAP (notably relying on AACLS).
Advantages: very broad functional coverage (WebSSO, mailing-list management, delegated administration, a directory infrastructure with replication, extreme flexibility in access control rules, user provisioning…), zero license cost, cautious and « industrial » technology choices (J2EE, OpenLDAP, adherence to open standards…).
Drawbacks: still barely « packaged » from a marketing point of view, no big reference in the private sector, hard for a private player to buy (see below), hard to compare with a competing proprietary product.

To pick up on themes discussed with Nicolas Chauvat of Logilab during a recent conversation, this kind of offering suffers from flaws common to most open source product offerings when it comes to selling them, particularly to the private sector. How can buyers at large CAC 40 companies measure their performance (and hence their year-end bonus?), when they are used to measuring it by the gap between the public license price and the negotiated price? The software purchasing practices of large private players are really not suited to buying open source solutions. Since the license cost is zero, only an assessment of total cost of ownership can allow a comparison between an open source offering and a proprietary one. Yet, besides the fact that TCO models are generally unreliable, or else very costly to set up, it is hard to predict the cost of custom development and customization. Buying a fixed-price development normally assumes that the buyer can provide a detailed and stable functional specification of the need, so that competing suppliers can commit to a projected cost and bear the risk attached to that projection. The problem is that the buyer is usually unable to produce such a document, because IT departments are generally too constrained by deadlines and by the unpredictability of business stakeholders to formalize this specification under good conditions. This leads to purchasing practices in which one compares, on one side, a license and support cost and, on the other, a man-day cost for custom development. How, then, can one compare a proprietary product whose purchase cost is mostly the license with an open source product whose purchase cost is mostly the integration?
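To make the apples-and-oranges problem visible, here is a deliberately naive back-of-the-envelope comparison; every figure in it is hypothetical, invented purely for illustration.

```python
# Entirely hypothetical figures: they only show why the two offers resist
# comparison (one cost is mostly license, the other is mostly man-days).
YEARS = 3
DAY_RATE = 600  # hypothetical man-day rate, in euros

proprietary = 200_000 + YEARS * 40_000             # license + yearly support
open_source = 0 + 150 * DAY_RATE + YEARS * 20_000  # no license + integration + support

print(proprietary, open_source)  # 320000 vs 150000... as long as the 150 days hold
```

The catch, of course, is the 150 days: with imprecise and unstable specifications, that is precisely the figure nobody can commit to.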

Given the margin of uncertainty attached to functional specifications (imprecise, unstable), the margin of uncertainty attached to the TCO model, and the margin of uncertainty attached to assessing how well a product fits the functional need, it seems rather futile for private players to try to consider open source solutions in their calls for tender. Only when the nature of the customer-supplier relationship is itself a strategic decision factor does the open source choice seem to become obvious. It is when the customer is looking for a solution that gives it more autonomy from the vendor that it can see the value of an open source offering. But even in that case, can this value really be quantified? Can it be defended before private-sector buyers and decision makers? Probably. Yet I have never seen a marketing argument, backed by a solid financial model and recognized by the trade press, that manages to demonstrate this value. French companies still believe that open source means either free of charge, or cobbled together by non-professionals, or « non-industrial », or, at best, they believe that open source means « supposedly really cheaper ». Even more, I think the prevailing mindset is to consider open source as « unmanageable ». When this mindset has changed, it will mean that companies in the open source sector have succeeded.

PS: To these difficulties, one must add the fact that most SS2L (French open source service companies) gather technology enthusiasts who do not necessarily have the marketing and sales skills that proprietary vendors usually bring together. But this last point is probably just a youthful sin?

Who will profit from open source?

During a very stimulating lunch with Nicolas Chauvat of Logilab, we hacked our way, machete-style, through the various business models for open source and the amount of innovation carried by each model (« does open source enable innovation? »):

  • like proprietary vendors: develop once, sell many times => but this assumes that neither the company nor its customers publish the code, so some benefits of the open source model are lost (open scrutiny, shared maintenance costs, reputation, …); moreover, the goal then becomes milking the existing assets (« cash cows ») before, or rather than, developing anything new, so not much innovation to expect
  • like conventional IT service firms (SSII): sell man-days => but this brings no essential difference from selling man-days on proprietary technologies; the company capitalizes neither on the code nor on the community, and there is no innovation from a company that merely « responds » to the customer's need; is this really the case at a Cap Gemini or a Unilog, for instance, which have tended to sell open source man-days lately?
  • develop a product (capital), sell man-days to integrate and customize it (or even develop it), then maintain and support it => this is the model the French SS2L seem to follow: IdealX, Ingeniweb, Nuxeo, Linagora, Clever Age, Logilab and the others; innovation, yes, but… still hard to sell (except, perhaps, to the public sector, and even then)!
  • use open source tools to support an innovative offering of non-IT services => example: many Internet service providers, such as Free; innovation, yes, but is there a real open source contribution (or, on the contrary, only private modifications of the code)?

So, still with my question in mind: « which company should one join to ride the (future?) open source wave? ». A proprietary vendor: not taking enough advantage of the open source model? A conventional SSII: no innovation strategy? An SS2L: too early to have a business with sufficient volume? A service provider exploiting open source: are there any respectable ones that have chosen open source?

I think the future will look rosy for the last two categories: the SS2L, once they have learned to sell their offering to the CAC 40 (and the CAC 40 has learned to buy it from them!), and the innovative service providers building on open source, once I know whether such a thing exists. These two models seem to be the most « stable », as Nicolas puts it; the most « durable », I would say.

Here is an idea that goes plop: the development and maturation of open source starts from the infrastructure (Linux, Apache, …) and « climbs » toward applications (Evolution, Plone, …); many ISPs have chosen open source to build an infrastructure on which to build innovative service offerings; which companies will choose open source to equip themselves with the applications on which to build innovative service offerings? In other words, who will put a Plone to the same use that a Free puts Linux to (or whichever BSD they use…)? Will a Sharing Knowledge decide to open source its software tools? Could an Ingeniweb hold such a position?

Computer Associates has bet on Zope. Why?

Computer Associates has bet on Zope, notably as a way to position itself in the current open source trend. From every point of view, I suppose this is a very good thing (for Zope, for Plone, for CA, …).

Yet I am surprised by what I hear from my colleagues about Computer Associates. The company is described to me as a firm of sharks: in France, a den of long-toothed salespeople for whom only the numbers matter; and above all a company whose strategy would consist in buying up innovative players to smother them in the cradle. In short, the various echoes reported to me paint the caricature of a big, soulless vendor pushing to the extreme the anti-competitive practices so often decried in this sector.

I imagine these echoes carry a good deal of exaggeration and perhaps an ounce of malice (they can't be that bad, can they???). Still, there must also be some truth in them. So what to make of a player of this type moving closer to the open source world? How can a long-toothed vendor hope to benefit from open source communities?

My question is not whether CA is « nice » or « nasty », « good » or « bad ». My question is rather: can a player of this type really build profitable and lasting ties with an open source community? Can a vendor like CA really find its place in an open source ecosystem without degrading it or, more likely, getting gently ejected from it over time? Or, beyond the announcement effect, have the people at Computer Associates really grasped the specificities and the strategic value of open source? Are they ready to make the sacrifices (investments and internal change management) needed to really seize the opportunities of open source? In other words: is CA a free rider making a fashion-driven move, or is CA a visionary?

Plone as a semantic aggregator

Here is an output of my imagination (no code, sorry, just talk): what if a CMS such as Plone could be turned into a universal content aggregator? It would become able to retrieve any properly packaged content/data from the Web and import it so that it can be reused, enhanced and processed with the help of Plone's content management features. As a universal content aggregator, it would be able to « import » (or « aggregate ») any content, whatever its structure and semantics may be. Buzzwords ahead: Plone would be a schema-agnostic aggregator. It would be a semantic-enabled aggregator.

Example: on site A, beer lovers gather. Site A's webmaster has set up a specific data schema for the description of beers, beer flavours, beer makers, beer drinkers, and so on. Since site A is rich in content and its community of users is enthusiastic, plenty of beers have been described there. Then site B, powered by a semantic aggregator (and CMS), is interested in any data regarding beverages and their impact on human health. So site B retrieves beer data from site A. In fact it retrieves both the descriptions of beer1, beer2, beerdrinker1, … and the description of what a beer is: how data is structured when it describes a beer, what the relationship is between a beer and a beer drinker. So site B now knows many things about beer in general (data structure = schema) and about many specific beers (beer data). All this beer data on site B is presented and handled as specific content types. Site B's users are now able to handle beer descriptions as content items, to process them through workflows, to rate them, to blog about them, and so on. And finally to republish site B's own output in such a way that it can be aggregated again by other sites. That would be the definitive birth of the semantic web!
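As a minimal sketch of site B's first step, assuming Python with the rdflib library; the URLs and the beer vocabulary are hypothetical, invented for this illustration.

```python
# Site B fetches both site A's beer data and the schema describing it.
# All URLs and the BEER vocabulary are hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

BEER = Namespace("http://site-a.example/ns/beer#")

g = Graph()
g.parse("http://site-a.example/beers.rdf")         # instance data: beer1, beer2, ...
g.parse("http://site-a.example/beer-schema.rdfs")  # the schema: what a beer *is*

# classes site B has just learned (data structure = schema)
for cls in g.subjects(RDF.type, RDFS.Class):
    print("learned class:", cls)

# the individual beers described on site A (beer data)
for beer in g.subjects(RDF.type, BEER.Beer):
    print("beer:", g.value(beer, RDFS.label))
```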

There are many news aggregators (RSSBandit, …) that know how to retrieve news items from remote sites. But they are only able to aggregate news data. They only know one possible schema for the retrievable data: the structure of a news item (a title + a link + a description + a date + …). This schema is specified in the (many) RSS standard(s).
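By contrast, here is the hard-coded world of a classic aggregator, shown with a library such as feedparser (the feed URL is hypothetical): whatever the feed, every item is squeezed into the same handful of fields.

```python
# A classic news aggregator only understands the fixed news-item schema.
import feedparser

d = feedparser.parse("http://example.org/feed.rss")  # hypothetical feed URL
for item in d.entries:
    # title + link + description + date: the one and only schema it knows
    print(item.title, item.link, item.get("summary", ""), item.get("updated", ""))
```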

But now that CMSes such as Plone are equipped with schema management engines (called « Archetypes » in Plone's case), they are able to learn new data schemas specified in XML files. Currently, Plone's Archetypes is able to import any schema specified in the form of an XMI file output by any UML modeling editor.

But XMI files are not that common on the Web. And the W3C has published material showing that any UML schema (class diagram, I mean) has an RDF-S equivalent. There is even a testbed converter from RDF-S to XMI. And there are even web directories inventorying existing RDF schemas as RDF-S files. Plus, RSS 1.0 is based on RDF. Plus, Atom's designers designed it so that it is easily converted to RDF.

So here is my easy talk (no code): let's build an RDF aggregator product from Plone. This product would retrieve any RDF file from any web site (it would store it in Plone's triplestore, called ROPE for instance). It would then retrieve the associated RDF-S file (and store it in the same triplestore). It would convert it to an XMI file and import it as an Archetypes content type with the help of the ArchGenXML feature. Then it would import the RDF data as AT items conforming to the newly created AT content type. Here is a diagram summarizing this: Plone as a semantic aggregator
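In pseudocode, the proposed pipeline could look like the sketch below; every helper name is hypothetical and merely labels one of the steps just described.

```python
# Hypothetical orchestration of the proposed RDF aggregator; none of these
# helpers exist as such, they only name the steps described above.
def aggregate(rdf_url, rdfs_url):
    store_in_triplestore(rdf_url)             # 1. fetch the RDF data (into ROPE, say)
    schema = store_in_triplestore(rdfs_url)   # 2. fetch the associated RDF-S schema
    xmi = rdfs_to_xmi(schema)                 # 3. convert RDF-S to XMI (a UML class diagram)
    content_type = archgenxml_import(xmi)     # 4. generate an Archetypes content type
    create_at_items(rdf_url, content_type)    # 5. import the RDF data as AT items
```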

By the way, Gillou (from Ingeniweb) did not wait for my imagination output to propose a similar project. He called it ATXChange. The only differences I see between his proposal and what is said above are, first, that Gillou might not be aware of RDF and RDF-S capabilities (so he might end up with an Archetypes-specific aggregator inputting and outputting content to and from Plone sites only) and, second, that Gillou must be able to provide code sooner or later, whereas I may not be!

Last but not least: WordPress is somewhat going in the same direction. The semweb community is showing some interest in WP's structured blogging features. And some plugins are appearing that try to incorporate more RDF features into WP (see also seeAlso).

Free Text-to-speech technologies

Here is a short review of freely available (open source or not) « text-to-speech » technologies. I dug into this topic because I wanted to check whether anyone had invented a software package that turns my RSS aggregator into a personalized radio. More precisely, while I am doing some other task (feeding one of my kids, brushing my teeth, having breakfast, …), I would like to be able to check my favorite blogs for news without having to read anything. My conclusion: two packages come close to the expected result.

Feature-wise, the most advanced one is NewsAloud from nextup.com. It acts as a simple and limited news aggregator combined with a text-to-speech engine that reads selected newsfeeds aloud. But it still lacks some important features (loading my OPML subscription file so that I don't have to enter my favorite RSS feeds one by one, displaying a scrolling text as it is read, …) and, worst of all, it is NOT open source.

The second nice-looking package going in the expected direction is just a nice hack called BlogTalker, which enables any IBlogExtension-compatible .Net aggregator (RSSBandit, NewsGator…) to read any blog entry aloud. But it is just a proof of concept, since it cannot be set up to read a whole newsfeed, let alone a set of newsfeeds. It seems to me that adding TTS abilities to existing news aggregators is the way to go (as opposed to NewsAloud, which starts from TTS technologies and tries to build a news aggregator from there). And BlogTalker successfully passes the « is it open source? » test.

Both packages depend on third-party text-to-speech engines (the « voices » you install on your system). As such, they are dependent on the quality of the underlying TTS engine. For example, if you are a Windows user, you can freely use some Microsoft voices (Mike, Mary, Sam, robot voices, …), the Lernout & Hauspie voices, or many other freely available TTS engines that support the Microsoft Speech API version 4 or 5 (or above?). The problem is that these voices do not sound good enough to me. As a native French speaker, I am comfortable with the LH Pierre or LH Veronique French voices, even if they still sound like robot voices. But for listening to English newsfeeds in the long run, the MS, LH and other voices are not good enough. Fortunately, AT&T invented its « natural voices », which sound extremely… natural, according to the samples provided online. Unfortunately, you have to purchase them. I will wait for this new kind of natural voice to become commoditized.
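For Windows users who want to audition what is already installed, here is a small sketch; it assumes Python with the pywin32 package and at least one SAPI 5 voice on the system, and uses the standard SAPI.SpVoice COM object.

```python
# List the installed SAPI 5 voices and make one of them speak.
# Assumes Windows, pywin32 and at least one SAPI 5 voice.
import win32com.client

speaker = win32com.client.Dispatch("SAPI.SpVoice")
voices = speaker.GetVoices()
for i in range(voices.Count):
    print(voices.Item(i).GetDescription())  # e.g. "Microsoft Sam", "LH Pierre", ...

speaker.Voice = voices.Item(0)  # pick the first installed voice
speaker.Speak("Does this voice sound natural enough to you?")
```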

Meanwhile, I have to admit that TTS-enabled news aggregators are not ready for end users. You can assemble a nice proof of concept, but the quality is still lacking on three fronts: aggregators are not fully mature (from a usability point of view), high-quality TTS engines are still rare, and nobody has yet managed to integrate the two well with each other. With the maturation of audio streaming technologies, I expect some hacker to TTS-enable my favorite CMS, Plone, some day. With the help of some of the Plone aggregation modules (CMFFeed, CMFSin, …), it would be able to stream a personalized audio newsfeed directly to WinAmp… Does it sound like a dream? Not sure…
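For the record, here is roughly what that dreamed-of « personalized radio » could look like as a proof of concept, glued together from feedparser and SAPI; Windows, pywin32 and the (hypothetical) feed URLs are all my assumptions.

```python
# Proof-of-concept personalized radio: read my favorite newsfeeds aloud.
# Assumes Windows with a SAPI 5 voice, plus the pywin32 and feedparser packages.
import feedparser
import win32com.client

FEEDS = ["http://example.org/blog1.rss", "http://example.org/blog2.rss"]  # hypothetical

speaker = win32com.client.Dispatch("SAPI.SpVoice")
for url in FEEDS:
    feed = feedparser.parse(url)
    speaker.Speak("News from %s." % feed.feed.get("title", url))
    for item in feed.entries[:5]:  # read the five most recent items of each feed
        speaker.Speak(item.title)
        speaker.Speak(item.get("summary", ""))
```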

During my tests, I came across several other TTS utilities that are open source (or free, or included in Windows):

  • Windows Narrator is a nice feature that reads any Windows message box aloud, for better accessibility. It seems to be bundled with all recent Windows releases. Windows TTS features are also delivered through the friendly-but-useless Microsoft Agents.
  • Speakerdaemon's concept is simple: it monitors any set of local files or URLs and speaks a predefined message whenever the local or remote resource changes (« Your favorite weblog has been updated! »). Too bad it cannot read the content or excerpts (think regular expressions) of these resources.
  • SayzMe sits in your system tray and reads any text that Windows pastes into the clipboard. Limited but easy.
  • Clip2Speech offers the same simple set of features as SayzMe, plus it lets you convert text to .WAV files.
  • Voxx Open Source is somewhat ambitious. It offers both TTS features (read any highlighted text when you hit Ctrl-3, read message boxes, read any text file, convert text to .WAV or .MP3, …) and speech recognition. Once again, it is « just » a packaging (front end) of third-party speech recognition engines. As such, it uses by default the Microsoft Speech recognizer, which is not available in French (but in U.S. English, Chinese and Japanese, if I remember properly). I have yet to try it in U.S. English with a headset microphone, since my laptop's microphone catches too much noise for it to be usable. The speech recognition feature lets the user dictate text or command Voxx or Windows by voice. So it is an open source competitor to IBM ViaVoice or ScanSoft Dragon NaturallySpeaking.
  • PhantomSpeech is middleware that plugs into TTS engines and lets application developers add TTS capabilities to their applications. It is said to be distributed with add-ins for Office 2000. Indeed, I could display a PhantomSpeech toolbar in Word 2003. It could read a text, but only with the female Microsoft voice, and the toolbar had unexpected behaviors and errors within Office. Not reliable as a front-end application. In any case, the use and configuration of speech engines is really a mess. The result is that PhantomSpeech does not look really intended for end users, but maybe just for developers.
  • CHIPSpeaking is a nice utility for « the vocally disabled » (people who cannot speak). It lets the user type sentences on a virtual keyboard and record predefined sentences that are then read aloud with one click.
  • ReadPlease (the free version) is just a nice, simple text reader made by developers who played too much Warcraft (click on the faces and you'll see why). The word being read is highlighted. Simple options let users change voices with one click (which is cool when you switch between several languages) or customize the size of the text, …
  • Spacejock's yRead is another text reader; it includes a pronunciation editor (read « km » as « kilometers », please) and also lets you download public-domain texts from Project Gutenberg. The phrase being read is highlighted, and you can easily switch from one voice (and language) to another. Too bad its window always steals the focus when it reads a new phrase.
  • For the *nix-inclined, I should also mention the famous Festival suite of TTS components (Festival, FLite, Festvox); a small Festival sketch follows this list. For the Java-inclined, don't miss the FreeTTS engine (which is based on Festival Lite!) and its associated end-user applications. An example of an end-user application based on Festival is the CMU Communicator; see its sample conversation as a demo.
  • Last but not least, do not miss Euler and the underlying MBROLA package. Euler is a simple open source reading machine based on MBROLA that implements a huge number of voices in many, many languages; moreover, these voices can include very natural intonations and vocal stresses. Euler and MBROLA were produced by an academic research program. They are free for non-commercial use and their source code is available (by the way, it is said that MBROLA could not be distributed under an open source license because of a France Telecom software patent!). Beware: the installation of MBROLA can be quite tricky. First download the MBROLATools Windows binaries package, download patch #1 and read the included instructions (I had problems when trying patch #2, so I did not use it), download as many MBROLA voices as you want (wow! that many supported languages!), then download Win Euler (or any other MBROLA-compatible TTS engine from third parties; note that MBROLA is supported by Festival).
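As promised in the Festival item above, here is a minimal sketch of driving Festival from Python; it assumes the festival binary is installed and on the PATH, and simply pipes text to `festival --tts`, which speaks whatever arrives on standard input.

```python
# Minimal sketch: speak a sentence through the Festival TTS engine.
# Assumes the `festival` binary is installed and on the PATH.
import subprocess

text = "Your favorite weblog has been updated."
subprocess.run(["festival", "--tts"], input=text, text=True, check=True)
```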

Further ranting about TTS engines: I feel the ecosystem of speech engines is really not mature enough. Sure, several vendors provide speech engines. But they are not uniformly supported by the OS. There was a Microsoft SAPI version 4 (SDK available here), which is now at version 5.1, but people even mention a v6 (included in the U.S. version of Office 2003?) and a v7 to be included in Longhorn (note that there is also another TTS API, the Java Speech API 1.0 (JSAPI), as implemented by FreeTTS… bridgeable with MS SAPI?). But like any Microsoft standard, these APIs are… not that standardized (i.e. they seem to be Microsoft-specific). Even worse, they seem rather unstable, since installing various speech engines gives strange results: some software detects most of the installed TTS engines, some detects only SOME of the SAPI 4 engines, and some displays a mix of some of your SAPI 4 engines and some of your SAPI 5 engines… In order to use SAPI 5 engines and their control panel, I had to install Microsoft Reader and TTS-enable it (an additional download). What a mess! The result is that you cannot easily control which voices you will be using on your computer (which ones will be supported or not?). As a further example, I could not install and use the free CMU Kal Diphone voice, and I still don't know why. Is it the API's fault? The engine's fault? I don't know… One last remark on this point: Festival seems to be the main open source stream in the field of TTS technologies, but it does not seem fully mature yet, and end-user applications based on it seem quite rare. Let's wait a few more years for it to become a mainstream, user-friendly and free technology.

More precisely, the TTS puzzle seems to be made of the following parts (a toy sketch in code follows the list):

  • a TTS engine made of three parts:
    • a text processing system that takes text as input and produces phonetic and prosodic commands (phoneme durations and a piecewise linear description of pitch)
    • a speech synthesizer that transforms phonemes plus prosody (think « speech melody ») into speech
    • a « voice », i.e. a reference database that allows the speech to be synthesized with the tone, characteristics and accent of a given voice
  • an API that hosts this engine, publishes its features to end-user applications and may provide some general features such as a control panel
  • an end-user application (a reading machine, a file monitor with audio alerts, an audio news aggregator, …) that invokes the dedicated speech API
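As announced above the list, here is a toy Python decomposition of these parts. There is no real engine behind it; the classes only name the interfaces, and every value they return is fake.

```python
# Toy decomposition of the TTS puzzle: three engine stages, a hosting API,
# and an "end-user application" on top. Purely illustrative, nothing real.
class TextProcessor:
    def process(self, text):
        # a real engine returns phonemes plus prosody (durations, pitch curve);
        # here we fake one (phoneme, duration_ms, pitch_hz) tuple per character
        return [(ch, 80, 120.0) for ch in text]

class Synthesizer:
    def synthesize(self, phonemes, voice):
        # a real synthesizer shapes audio using the voice database; we fake bytes
        return ("[%s]" % voice).encode() + bytes(len(phonemes))

class SpeechAPI:
    """The layer end-user applications talk to (think SAPI or JSAPI)."""
    def __init__(self, processor, synthesizer, voices):
        self.processor = processor
        self.synthesizer = synthesizer
        self.voices = voices  # voice name -> voice database

    def speak(self, text, voice_name):
        phonemes = self.processor.process(text)
        return self.synthesizer.synthesize(phonemes, self.voices[voice_name])

# the "end-user application" is then a two-line reading machine:
api = SpeechAPI(TextProcessor(), Synthesizer(), {"LH Pierre": "fr-voice-db"})
audio = api.speak("Bonjour", "LH Pierre")
```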

You can get more detailed information from the MBROLA project site.

These were my notes and rants about text-to-speech technologies. Please drop me a comment if you feel my explanations are wrong or biased, as I don't know this field in detail and may have made a lot of errors here. Thanks!

Why Jonas rather than JBoss

Here is a list of arguments comparing two open source J2EE application servers: JBoss and Jonas. I find this list interesting because it highlights the issues that seem most important to me for a corporate project built around an open source product: availability of support, of documentation, and so on. In my opinion, this line of argument should matter more to the reader than the choice itself.

Thanks, Thomas, for pointing me to this link!

Decentralized organizations centralizing their IT architecture = 0.1% chance of success

Reinout van Rees says:

I’ve had enough of all those pictures in powerpoint presentations showing the One Central Database Or Application that would solve all communication problems in a building project.

Is it a coincidence that I feel the same and work in a similar industry? It is a hard job to convince this industry of the benefits of spontaneous integration and of the adequacy of the open source model to support it! Come on, Reinout, let's build the Spontaneously Integrated Front of Really Loosely Coupled Architects for the Building Industry! ;-) We need to find a better name: SIFRLCABI does not sound good enough, even in French or in Dutch.

Ingeniweb reinvents RSS

The Zope Service Provider Ingeniweb has released RSSSearch, a new Plone component that turns Plone's search feature into a meta search engine: the displayed results come not only from the queried site but also from several remote sites. OK.

The funny thing is that, along the way, Ingeniweb invents yet another meaning for the RSS acronym. Originally RSS meant Rich Site Summary; then some renamed it Really Simple Syndication, others RDF Site Summary. And now Ingeniweb adds its own twist: Research Support System! Decidedly, everyone invents their own RSS. Hmm… Is this a joke?