Archives for the « Informatique » category

Daisy vs. Plone, feature fighting

A Gouri-friend of mine recently pointed me to Daisy, « a wiki/structured/XML/faceted CMS », as he put it. I answered that it might be a nice product, but not attractive enough for me to spend much time analyzing it at the moment. Nevertheless, as Gouri asked, let’s browse Daisy’s features and try to compare them with their Plone equivalents (keeping in mind that I have never tried Daisy).

The Daisy project encompasses two major parts: a featureful document repository

Plone is based on an object-oriented repository (Zope’s ZODB) rather than a document oriented repository.

and a web-based, wiki-like frontend.

Plone has its own web-based frontend. Wiki features are provided by an additional product (Zwiki).

If you have different frontend needs than those covered by the standard Daisy frontend, you can still benefit hugely from building upon its repository part.

Plone’s frontend is easily customizable, either with your own CSS, by inheriting from existing ZPT skins, or with a WYSIWYG skin module such as CPSSkin.

Daisy is a Java-based application

Plone is Python-based.

, and is based on the work of many valuable open source packages, without which Daisy would not have been possible. All third-party libraries or products we redistribute are unmodified (unforked) copies.

Same for Plone. Daisy seems to be based on Cocoon. Plone is based on Zope.

Some of the main features of the document repository are:
* Storage and retrieval of documents.

Documents are one of the numerous object classes available in Plone. The basic unit in Plone is… an object, and an object is not fully extensible by itself unless it was designed to be so. Plone content types are more user-oriented than generic documents (they implement specialized behaviours such as security rules, workflows, displays, …). They will become very extensible when the next versions of the underlying « Archetypes » layer are released (they include a through-the-web schema management feature that allows web users to extend any existing content type).
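To make this concrete, here is a minimal sketch of what an Archetypes-based content type looks like (the « Camcorder » type and its field are invented for the example; the Schema/StringField API is the Archetypes 1.x one, as far as I recall):

    # Hypothetical "Camcorder" content type, a hedged sketch of the Archetypes API.
    from Products.Archetypes.public import (BaseContent, BaseSchema, Schema,
                                            StringField, SelectionWidget,
                                            registerType)

    schema = BaseSchema + Schema((
        StringField('manufacturer',
                    required=True,
                    vocabulary=('Canon', 'Sony', 'Panasonic'),  # list of values to choose from
                    widget=SelectionWidget(label='Manufacturer')),
    ))

    class Camcorder(BaseContent):
        """A specialized content type carrying its own schema, security and workflow."""
        schema = schema

    registerType(Camcorder, 'MyProduct')  # second argument: the containing product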

* Documents can consist of multiple content parts and fields; document types define what parts and fields a document should have.

Plone’s perspective is different because of its object orientation. Another Zope product called Silva is more similar to Daisy’s document orientation.

Fields can be of different data types (string, date, decimal, boolean, …) and can have a list of values to choose from.

Same for Archetypes based content types in Plone.

Parts can contain arbitrary binary data, but the document type can limit the allowed mime types. So a document (or more correctly a part of a document) could contain XML, an image, a PDF document, … Part upload and download is handled in a streaming manner, so the size of parts is only limited by the available space on your filesystem (and, for uploading, a configurable upload limit).

I imagine that Daisy allows the upload and download of documents having any structure, with no constraint. In Plone, you are constrained by the object model of your content types. As said above, this model can be extended at run time (schema management), but at the moment the usual way is to define your model at design time and then comply with it at run time. At run time (even without schema management), you can still add custom metadata or upload additional attached files if your content type supports attached files.

* Versioning of the content parts and fields. Each version can have a state of ‘published’ or ‘draft’. The most recent version which has the state published is the ‘live’ version, i.e. the version that is displayed by default (this depends on the behaviour of the frontend application, of course).

The default behaviour of Plone does not include real versioning but document workflows. It means that a given content item can be in state ‘draft’ or ‘published’ and go from one state to another according to a pre-defined workflow (with security conditions, event triggering and so on). But a given object has only one version by default.
There are additional Plone products, however, that make Plone support versioning. These products are to be merged into a future Plone distribution, because versioning has been a long-awaited feature. Note that, at the moment, you can have several versions of a document to support multi-language sites (one version per language).

* Documents can be marked as ‘retired’, which makes them appear as deleted; they won’t show up unless explicitly requested. Documents can also be deleted permanently.

Plone’s workflow mechanism is much more advanced. The default workflow includes a similar retired state, but the admin can define new workflows and modify the default one, always with reference to user roles. Plone’s security model is quite advanced and is the underlying layer of every Plone functionality.

* The repository doesn’t care much what kind of data is stored in its parts, but if it is « HTML-as-well-formed-XML », some additional features are provided:
o link-extraction is performed, which allows searching for the referers of a document.
o a summary (first 300 characters) is extracted to display in search results
o (these features could potentially be supported for other formats also)

There is no such thing in Plone. Maybe in Silva? Plone’s reference engine allows you to define associations between objects. These associations are indexed by Plone’s search engine (the « catalog ») and can be searched.
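For instance, associating two objects with the Archetypes reference engine is a one-liner (a hedged sketch; the « relatesTo » relationship name is arbitrary):

    # Sketch of Plone's reference engine (Archetypes API).
    article.addReference(photo, relationship='relatesTo')  # create an association
    related = article.getReferences('relatesTo')           # traverse it back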

* all documents are stored in one « big bag », there are no directories.

Physically, the ZODB repository can take many forms (RDBMS, …). The default ZODB storage is a single flat file that can get quite big: Data.fs.

Each document is identified by a unique ID (an ever-increasing sequence number starting at 1), and has a name (which does not need to be unique).

Each object has an ID, but it is not globally unique at the moment. Objects are unfortunately stored in a hierarchical structure (Zope’s tree). Some Zope/Plone developers wished for « placeless content » to be implemented. Daisy must still be superior to Plone in that field.

Hierarchical structure is provided by the frontend, through the possibility to create hierarchical navigation trees.

Zope’s tree is the most important structure for objects in a Plone site. Too important, in fact. You can still create navigation trees with shortcuts. But the usual solution for maximum flexibility in navigation trees is to use the « Topic » content type. Topics are folder-like objects that contain a dynamic list of links to objects matching the Topic’s pre-defined query. Topics are like persistent searches displayed as folders. As an example, a Topic may display the list of all « Photo »-typed objects that are in « draft » state in a specific part (tree branch) of the site, etc.

* Documents can be combined into so-called « collections ». Collections are sets of documents. One document can belong to multiple collections; in other words, collections can overlap.

Topics too? I regret that Plone does not easily offer a default way to display a whole set of objects on just one page. As an example, I would have liked to display a « book » of all the contents in my Plone site as if it were one single object (so that I can print it…). But there are some additional Plone products (extensions) that support similar functionality. I often use « Content Panels » to build a page by defining its global layout (columns and lines) and by filling it with « views » of existing Plone objects (especially Topics). Content Panels mixed with Topics allow great flexibility in your site. But this flexibility has its limits too.

* possibility to take exclusive locks on documents for a limited or unlimited time. Checking for concurrent modifications (optimistic locking) happens automatically.

See versioning above.

* documents are automatically full-text indexed (Jakarta Lucene based). Currently supports plain text, XML, PDF (through PDFBox), MS-Word, Excel and Powerpoint (through Jakarta POI), and OpenOffice Writer.

Same for Plone, except that Plone’s search engine is not Lucene and I don’t know whether Plone can read OpenOffice Writer documents. Note that you will need additional modules, depending on your platform, in order to index Microsoft files.

* repository data is stored in a relational database. Our main development happens on MySQL/InnoDB, but the provisions are there to add support for new databases; for example, PostgreSQL support is now included.

Everything is in the ZODB, stored by default as a single file, but it can also be stored in a relational database (though this is usually pointless). You can also transparently mix several repositories in the same Plone instance. Furthermore, instead of having Plone write directly to the ZODB’s file, you can configure Plone to go through a ZEO client-server setup so that several Plone instances can share a common database (load balancing). Even better, there is a commercial product, ZRS, that allows you to transparently replicate ZODBs so that several Plone instances set up with ZEO can use several redundant ZODBs (no single point of failure).

The part content is stored in normal files on the file system (to offload the database). The use of these familiar, open technologies, combined with the fact that the Daisywiki frontend stores plain HTML, means that your valuable content is easily accessible with minimal « vendor » lock-in.

Everything’s in the ZODB. This can be seen as lock-in. But it is not really, because 1/ the product is open source and you can script a full export in Python with minimal effort, 2/ there are default WebDAV + FTP services that can be combined with Plone’s Marshall extension (soon to be included in Plone’s default distribution) to get your content out of your Plone site. Even better, you can also upload your structured semantic content with Marshall plus some additional hacks, as I mentioned somewhere else.
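As a sketch of how little effort such a scripted export takes (assuming a standalone Data.fs and the ZODB 3 / Python 2 API of that era):

    # Minimal sketch: walking a ZODB file directly from Python (ZODB 3, Python 2).
    from ZODB import FileStorage, DB

    storage = FileStorage.FileStorage('Data.fs')  # the single flat file mentioned above
    db = DB(storage)
    connection = db.open()
    root = connection.root()
    for name, obj in root.items():                # top-level objects of the database
        print name, obj                           # export them however you like
    connection.close()
    db.close()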

* a high-level, SQL-like query language provides flexible querying without knowing the details of the underlying SQL database schema. The query language also allows combining full-text (Lucene) and metadata (SQL) searches. Search results are filtered to only contain documents the user is allowed to access (see also access control). The content of parts (if HTML-as-well-formed-XML) can also be selected as part of a query, which is useful to retrieve e.g. the content of an « abstract » part of a set of documents.

No such thing in Plone as far as I know. You may have to Pythonize, my friend… Except that Plone’s tree gives a URL to every object, so you can access any part of the site, though not with a granularity similar to Daisy’s supposed one. See Silva for more document-orientation.
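For completeness, the closest Plone counterpart is a catalog query mixing metadata and full-text criteria, e.g. from a Script (Python) inside the site (a hedged sketch; the index names are the standard Plone ones):

    # A portal_catalog query: metadata plus full-text criteria.
    results = context.portal_catalog(
        portal_type='Document',        # metadata criterion
        review_state='published',      # workflow-state criterion
        SearchableText='camcorder',    # full-text criterion
    )
    titles = [brain.Title for brain in results]  # « brains » are lightweight result records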

* Access control: instead of attaching an ACL to each individual document, there is a global ACL which allows specifying access rules for sets of documents by selecting those documents with expressions. This allows, for example, defining access control rules for all documents of a certain type, or for all documents in a certain collection.

Access control is based on Plone’s tree, with inheritance (similar to the Windows security model in some ways). I suppose Plone’s access control is more sophisticated and maintainable than Daisy’s, but it would require more investigation to explain why.

* The full functionality of the repository is available via an HTTP+XML protocol, thus providing language and platform independent access. The documentation of the HTTP interface includes examples on how the repository can be updated using command-line tools like wget and curl.

Unfortunately, Plone is not ReSTful enough at the moment. But there is some hope the situation will change with Zope 3 (Zope’s next major release, coming soon). Note that Zope (hence Plone) supports XML-RPC over HTTP as a generic web service protocol. But this is nothing near real ReSTful web services…
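For what it is worth, here is what the XML-RPC road looks like from Python (the URL and credentials are made up; any published Zope object can serve as an endpoint):

    # Calling a Zope/Plone site through XML-RPC with the Python 2 standard library.
    import xmlrpclib

    site = xmlrpclib.ServerProxy('http://admin:secret@localhost:8080/plone')
    print site.Title()  # invokes the Title() method of the published object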

* A high-level, easy to use Java API, available both as an « in-JVM » implementation for embedded scenarios or services running in the daisy server VM, as well as an implementation that communicates transparently using the HTTP+XML protocol.

For Plone, say Python and XML-RPC here.

* For various repository events, such as document creation and update, events are broadcasted via JMS (currently we include OpenJMS). The content of the events are XML messages. Internally, this is used for updating the full-text index, notification-mail sending and clearing of remote caches. Logging all JMS events gives a full audit log of all updates that happened to the repository.

No such mechanism as far as I know. But Plone of course offers fully detailed audit logs of all its events.

* Repository extensions can provide additional services, included are:
o a notification email sender (which also includes the management of the subscriptions), allowing subscribing to individual documents, collections of documents or all documents.

There is no such generic feature by default in Plone. You can add scripts to send notifications on any workflow transition, but you need to write a line or two of Python, and the management of subscriptions is not implemented by default. Folder-like objects do support RSS syndication, though, so you can aggregate Plone’s new objects in your favorite news aggregator.
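Those one or two lines of Python typically live in a script attached to a workflow transition; a hedged sketch (the addresses are made up; MailHost is Zope’s standard mail service and state_change is the object DCWorkflow passes to such scripts):

    # Sketch of a DCWorkflow "after transition" script sending a notification.
    obj = state_change.object            # the content object going through the transition
    context.MailHost.send(
        'The document "%s" was just published.' % obj.Title(),
        mto='subscribers@example.com',   # hypothetical recipient list
        mfrom='plone@example.com',
        subject='[Plone] content published',
    )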

o a navigation tree management component and a publisher component, which plays hand-in-hand with our frontend (see further on)

I’ll see further on… :)

* A JMX console allows some monitoring and maintenance operations, such as optimization or rebuilding of the fulltext index, monitoring memory usage, document cache size, or database connection pool status.

You have several places to look at for this monitoring within Zope/Plone (no centralized monitoring). An additional Plone product helps centralize maintenance operations. There is still some room for progress here.

The « Daisywiki » frontend
The frontend is called the « Daisywiki » because, just like wikis, it provides a mixed browsing/editing environment with a low entry barrier. However, it also differs hugely from the original wikis, in that it uses wysiwyg editing, has a powerful navigation component, and inherits all the features of the underlying daisy repository such as different document types and powerful querying.

Well, then we can just say the same for Plone and rename its skins « the Plonewiki frontend »… It supports WYSIWYG editing too, with a customizable navigation tree, etc.

* wysiwyg HTML editing
o supports recent Internet Explorer and Mozilla/Firefox (Gecko) browsers, with fallback to a textarea on other browsers. The editor is a customized version of HTMLArea (through plugins, not a fork).

Same for Plone (except it is not an extension of HTMLArea but of a similar product).

o We don’t allow arbitrary HTML, but limit it to a small, structural subset of HTML, so that it’s future-safe, output-medium independent, secure and easily transformable. It is possible to have special paragraph types such as ‘note’ or ‘warning’. The stored HTML is always well-formed XML, and nicely laid out. Thanks to a powerful (server-side) cleanup engine, the stored HTML is exactly the same whether edited with IE or Mozilla, allowing source-based diffs.

There is no such validity control within Plone. In fact, the structure of a Plone document is always valid because it is managed by Plone according to a specific object model. But a given object may contain an HTML part (a document’s body, for example) that may not be valid. If your documents are to have a recurring inner structure, then you are invited to make this structure an extension of an object class so that it is no longer handled as document structure. See what I mean?

o insertion of images by browsing the repository or upload of new images (images are also stored as documents in the repository, so can also be versioned, have metadata, access control, etc)

Same with Plone, except for versioning. Note that Plone’s Photo content type supports automatic server-side resizing of images.

o easy insertion of document links by searching for a document

Sometimes yes, sometimes no. It depends on the type of link you are creating.

o a heartbeat keeps the session alive while editing

I don’t know how this is handled in Plone.

o an exclusive lock is automatically taken on the document, with an expiry time of 15 minutes, and the lock is automatically refreshed by the heartbeat

I never tried the Plone extension for versioning, so I can’t say. I know that you can use the WebDAV interface to edit a Plone object with your favorite text-processing package if you want, and I suppose this interface properly manages this kind of issue. But I never tried.

o editing screens are built dynamically for the document type of the document being edited.

Of course.

* Version overview page, from which the state of versions can be changed (between published and draft), and diffs can be requested.
* Nice version diffs, including highlighting of actual changes in changed lines (ignoring re-wrapping).

You can easily move any object within its associated workflow (from one state to another, through transitions). But no versioning. Note that you can use Plone’s wiki extension, which supports diffs and some versioning features. But this is not available for every Plone content type.

* Support for includes, i.e. the inclusion of one document in another (includes are handled recursively).

No.

* Support for embedding queries in pages.

You can use Topics (persistent queries). You can embed them in Content Panels.

* A hierarchical navigation tree manager. As many navigation trees as you want can be created.

One and only one navigation tree by default. But Topics can be nested. So you can have one main navigation tree plus one or more alternatives built with Topics (though these alternatives are limited in some ways).

Navigation trees are defined as XML and stored in the repository as documents, thus access control (for authoring them; read access is public), versioning etc. applies. One navigation tree can import another one. The nodes in the navigation tree can be listed explicitly, but also dynamically inserted using queries. When a navigation tree is generated, the nodes are filtered according to the access control rules for the requesting user. Navigation trees can be requested in « full » or « contextualized » form, the latter meaning that only the nodes leading to a certain document are expanded. The navigation tree manager produces XML; the visual rendering is up to XSL stylesheets.

This is nice. Plone cannot do that easily. But what Plone can do is still done with respect to its security model and access control, of course.

* A navigation tree editor widget allows easy editing of the navigation trees without knowledge of XML. The navigation tree editor works entirely client-side (Mozilla/Firefox and Internet Explorer), without annoying server-side roundtrips to move nodes around, and full undo support.

Yummy.

* Powerful document-publishing engine, supporting:
o processing of includes (works recursively, with detection of recursive includes)
o processing of embedded queries
o document type specific styling (XSLT-based), also works nicely combined with includes, i.e. each included document will be styled with its own stylesheet depending on its document type.

OK

* PDF publishing (using Apache FOP), with all the same features as the HTML publishing, thus also document type specific styling.

Plone’s document-like content types offer PDF views too.

* search pages:
o fulltext search
o searching using Daisy’s query language
o display of referers (« incoming links »)

Full-text search is available. There is no query language for the user. Display of referers is only available for content types that are either wiki pages or have been given the ability to include references to other objects.

* Multiple-site support, allows to have multiple perspectives on top of the same daisy repository. Each site can have a different navigation tree, and is associated with a default collection. Newly created documents are automatically added to this default collection, and searches are limited to this default collection (unless requested otherwise).

It might be possible with Plone but I am not sure when this would be useful.

* XSLT-based skinning, with reusable ‘common’ stylesheets (in most cases you’ll only need to adjust one ‘layout’ XSLT, unless you want to customise heavily). Skins are configurable on a per-site basis.

Plone’s skins use the Zope Page Templates technology, a very nice and simple HTML templating technology. Plone’s skins make extensive use of CSS, and in fact most of the layout and look-and-feel of a site now lives in CSS objects. These skins are managed as objects, with inheritance, overriding of skins and other sophisticated mechanisms to configure them.

* User self-registration (with the possibility to configure which roles are assigned to users after self-registration) and password reminder.

Same is available from Plone.

* Comments can be added to documents.

Available too.

* Internationalization: the whole front-end is localizable through resource bundles.

Idem.

* Management pages for managing:
o the repository schema (the document types)
o the users
o the collections
o access control

Idem.

* The frontend currently doesn’t perform any caching; all pages are published dynamically, since this also depends on the access rights of the current user. For publishing high-traffic, public (i.e. all public access as the same user), read-only sites, it is probably best to develop a custom publishing application.

Zope includes caching mechanisms that take access rights into account. For very high-traffic public sites, a Squid frontend is usually recommended.

* Built on top of Apache Cocoon (an XML-oriented web publishing and application framework), using Cocoon Forms, Apples (for stateful flow scenarios), and the repository client API.

By default, Zope uses its own embedded web server. But the usual setup for production-grade sites is to put an Apache reverse-proxy in front of it.

My conclusion: Daisy looks like a nice product when you have a very document-oriented project, with complex documents whose structures vary a lot from one document to another; its equivalent in Zope’s world would be Silva. But Plone is much more appropriate for everyday CMS sites. Its object orientation offers both great flexibility for the developer and more ease of use for the Joe-six-pack webmaster. Plone still lacks some important technical features for its future, namely ReSTful web service interfaces and the placeless content paradigm. Versioning is expected soon.

This article was written in just one go, late at night, and reviewed once thanks to Gouri. It may be wrong or badly lacking information on some points. So your comments are most welcome!

Semantic Web reports for corporate social responsibility

With that amount of buzzwords in the title, I must be ringing some warning bells in your minds. You would be right to be cautious about what I am going to say here, because this is pure speculation. I would like to imagine what annual (quarterly?) corporate reports should look like in some near future.

In my opinion, they should follow the current trend of emphasizing corporate social responsibility. In order to do so, they should both embrace innovative reporting standards and methodologies and support these methodologies by implementing them with « semantic web »-like technologies. In such a future, financial analysts (and eventually stakeholders) would be able to browse specialized web sites aggregating the meaningful data published in these corporate reports. On such specialized web sites, investors would be able to compare comparable data, marks and ratings for their favorite corporations. They would be given functionalities like the ones you find in multidimensional analysis tools (business intelligence), even if they are as simplified as in interactive purchase guides [via Fred]. In such a future, I would be able to subscribe to such a web service and state my preferences and filters in financial, social and environmental terms. This service would give me a snapshot of how the selected corporations compare to one another with regard to my preferences and filters. Moreover, I would receive an RSS alert whenever a new report is published or when some performance threshold is crossed by one of the corporations I monitor.

Some technological issues still stand in the way of such a future, but they are fading away. A huge number of methodological and political issues remain, though… What if such technologies come to maturity? Would they push corporations, rating agencies, analysts and stakeholders to change their minds and move in the right direction?

Projet Internet de rue: a call to our Romanian friends!

Are there any Romanians from Romania in the room? In France, there are Romanians in the street who do not necessarily have many resources but are bold enough to taste new technologies. [Let me add, after the fact, that these are Romanian Gypsies, since a Romanian reader felt offended by the absence of this precision, cf. the discussion below.] They would love to communicate over the Internet with their family back home, to exchange a few photos of the son they have not seen for two years… The problem: back home, who could give (or lend?) this family access to the Internet? Pass this call along to your contacts in Romania, it could be a nice thing to do. While you are at it, discover the wonderful Projet Internet de rue.

Consultant-in-a-box

I am currently working with consultants from a very large Indian IT services company, in view of the offshore outsourcing of an IT project. The consultant I work with on site has been practicing yoga for many years. He does not yet know how to get into a box, he admitted to me. But the concept of a « consultant-in-a-box », or a « Commercial-Off-The-Shelf » consultant, is quite seductive! Just imagine: you go to your favorite e-business site, say consultantinabox.com, and there you tick the skills of the Indian consultant you need: a bit of J2EE here, a bit of security middleware integration there, etc. You validate your quote online and pay with your corporate credit card. Then UPS delivers, within 24 hours, by plane, a box with your yogi consultant inside. Not bad, eh? Of course, at the end of the assignment, the consultant knows how to repackage himself all by himself and goes back to Bangalore to carry on with the assignment remotely (in offshore mode, it costs far less than on site).

Joking aside, here are a few rough notes gleaned from a discussion with Pivolis, a company specialized in supporting European companies that want to work with Indian ones (cf. also this older article quoting Mr Pivolis, aka Vincent Massol).

Support companies of the Pivolis kind can take on several roles:

  • logistics: infrastructure (VPN, …), trips and visits, arrivals and departures (identity management, provisioning of project newcomers)
  • mediating/facilitating/organizing communications between the French and the Indians
  • methodology (and tools) consulting

Typically, for a project representing 100 full-time equivalents (50 French and 50 Indians), the coordination load carried by this kind of company may represent 2 full-time equivalents (i.e. 2%). Moreover, within the French team, some people are also dedicated to supporting the Indian team, i.e. to facilitating knowledge transfer. In a first project phase, this can represent one person for every 6 Indians. Then things become more routine and the ratio stabilizes at one French person supporting the offshore team for every 10 Indians.

Contrary to what one might expect, one should not try too hard to specialize the French and Indian teams. It is better to organize and maintain a certain redundancy of roles on site and offshore, at least during a (long) skills-transfer period. Communication then takes place at multiple levels: French project leader to Indian project leader, French lead developer to Indian lead developer. This similarity of roles, and the multiplicity of communication channels that follows from it, are key success factors for French-Indian cooperation on an IT project. The first success factor, according to Pivolis, is striving to keep people happy on both sides. The second main success factor lies in the rigor and discipline of the teams, notably in the use of the coordination tools (mainly an issue tracker, a build management tool and a wiki).

Many kinds of tools are desirable to make remote teams collaborate. They can be analyzed by frequency of use and relative effectiveness. Regarding frequency, one finds, for example, on a 100-person project:

  • Trips (from France to India AND vice versa): 3 people every 6 weeks
  • Exchange of office documents: one important document written per month
  • Telephone: 1.5 conference calls per week per team of 10
  • Wiki: 1 edit per day per team of 10
  • Email: several per day per person
  • Instant messaging: continuous and ubiquitous

These tools can roughly be grouped into 4 categories:

  • frequently used and very well suited to this kind of collaboration: issue tracker, mailing list, wiki
  • frequently used but less effective: individual emails, instant messaging, conference calls
  • more rarely used but very rich and effective: documentation, on-site trips
  • rarely used and relatively ineffective: video conferencing

Three cultural differences between Indians and French people, to be taken into account to ease mutual understanding:

  1. Indians have a very deep respect for the person they are talking to, a respectful attitude that is not necessarily compatible with the French propensity to engage in cockfights and to attempt a French Revolution at every meeting; the French may feel they are working with people who lack assertiveness or frankness, while the Indians may feel they are working with people who are impolite, negative or aggressive
  2. Indians have a strong culture of hospitality and welcome, and specific culinary habits (vegetarians…); are the French ready to welcome their Indian counterparts into their families for the weekend, or to spend their whole weekend showing them around the château de Versailles? and how do you eat a real vegetarian meal in a Sodexho canteen?
  3. Indians appreciate formal congratulations and thanks, and signs of gratitude (or of welcome, such as gifts); the French may find this superfluous or fussy, and come across as indifferent or even ungrateful

These last few weeks of work with Indian consultants have already allowed me to verify several of these tips and observations shared by Pivolis. And I feel it would be more than wise, to go further, to call on the services of this kind of company for serious long-term support.

Unique identifiers get Febrl

When dealing with identity management, one really appreciates having unique identifiers for describing individual identities. Having a common unique identifier available across a whole information system is a terrific asset for managing IT security. The problem is that information systems usually lack such a global naming convention, or their naming conventions are too weak to ensure the uniqueness and permanence of these identities. The usual solution is to define a cleverer naming convention, to invent a new unique identifier, and to associate all individual data with it so that people’s identities can be managed.

But then the deployment of these unique identifiers raises a new problem: how to guarantee that a given data record describing a person is really related to the person you already know under a given unique identifier. You have to decide matches and non-matches across your data. The art of making such decisions was named « record linkage » by the biomedical community, because it is a common issue in health information systems, for example in epidemiological studies.

This community therefore developed several approaches (deterministic or probabilistic) to record linkage that can also be applied to the field of identity management. Febrl is a very nice open source package that implements state-of-the-art record linkage methods and that may be applied to the deployment of unique identifiers in IT security systems.
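To give an idea of the probabilistic flavour of record linkage, here is an illustrative toy (my own sketch, not Febrl’s actual API): compare records field by field, sum weighted similarities, and decide match/non-match against a threshold.

    # Toy probabilistic record linkage: weighted field similarities + threshold.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def link_score(rec1, rec2, weights):
        return sum(w * similarity(rec1[f], rec2[f]) for f, w in weights.items())

    weights = {'name': 2.0, 'birthdate': 3.0, 'city': 1.0}   # more weight on discriminating fields
    r1 = {'name': 'Jon Smith', 'birthdate': '1970-01-02', 'city': 'Paris'}
    r2 = {'name': 'John Smith', 'birthdate': '1970-01-02', 'city': 'Paris'}
    is_match = link_score(r1, r2, weights) > 5.0             # threshold tuned on known pairs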

From OWL to Plone

I found a working path to transform an OWL ontology into a working Plone content type. Here is my recipe:

  1. Choose any existing OWL ontology
  2. With Protege equipped with its OWL plugin, create a new project from your OWL file.
  3. Still within Protege, with the help of its UML plugin, convert your OWL-Protege project into a UML classes project. You get an XMI file.
  4. Load this XMI file into a UML project with Poseidon. Save this project in the .zuml Poseidon format.
  5. From Poseidon, export your classes as a new XMI file. It will be Plone-friendly.
  6. With a text editor, delete any accented characters that Poseidon might have added to your file (for example, the Frenchy Poseidon adds a badly accented « Modele sans titre » attribute to your XMI), because the next step won’t appreciate them.
  7. python Archgenxml.py -o YourProduct yourprojectfile.xmi turns your XMI file into a valid Plone product. Requires the latest stable versions of Plone and Archetypes (see doc) plus the ArchgenXML head from the Subversion repository.
  8. Launch your Plone instance and install YourProduct as a new product from your Plone control panel. Enjoy YourProduct !
  9. Eventually, populate it with an appropriate marshaller.

Now you are not far from using Plone as a semantic aggregator.

Identification and naming practices

States have been precursors in building registers of persons. Here are some national practices and projects in civil registration systems, vital statistics and other administrative identity systems :

Say you are building an international directory of persons: you know that you will record names, surnames and given names. But what does « surname » mean? Will you be understood when you ask a foreigner for his given name? Here come culture, society and naming practices:

You have existing databases that you want to link to your fresh new directory of persons. But how do you build that link? How do you match records coming from different databases when they don’t share a common unique identifier? This is the art of record linkage:

Transliteration

I wanted to give Alban and others pointers to my resources on the topic of transliteration. But I can’t find my transliteration documents any more! Anyway, my experience is that transliteration is a tough problem; after having thought about it a little, we decided not to automate the transliteration of individual names, but to let people input their name according to their own habits, in a more or less transliterated form. It would have been great to be able to automate transliteration (maybe with the help of a virtual Unicode keyboard?). The main advantage of standardized transliteration is that it is supposed to give you a standardized representation of a person’s name; you might then rely on these standardized naming elements to build a unique identifier. But the problem is that many language transliterations are not standardized, and these standards evolve too much. A Greek colleague of mine told me that his name had been transliterated in many different ways over the years (creating problems at the airport, as you can imagine). Transliteration definitely remains a problem for strong identity management. At the moment, you should just try to work around it until transliteration standards are more robust and more widely adopted…

Anyway, here are some pointers on how to process non-Latin documents and maybe transliterate them:

A voice « en vrac »

The philosophy of weblogs is to write « with your real voice ». Gilles (the one who is « en vrac ») does even better: he speaks on his blog, with his real voice. Here, then, is the first French-speaking blogger (that I know of) to take up screencasting, thanks to the Camtasia software. On the English-speaking side, it is Jon Udell who opened the way (the voice?) for screencasting. Don’t miss Jon Udell’s excellent screencast demonstration of Wikipedia.
Gilles, you have a delightfully Québécois voice!

Neutral to positive on Ecartype?

Ecartype is an original form of weblog: a blog of stock-market advice. Add it to the long list of innovative uses of weblogs. I am neutral to positive on Ecartype, as long as they stay within their ascending triangle of innovative usage. And while I am at it, let me discover some obscure jargon: watch out for the pull-back at the neckline, it can give you a stiff neck. When will we see an ecartype giving stock-market advice about the market of open source communities (« I am neutral to positive on Drupal », « Plone is in a long ascending triangle », etc.)?

The CMS pseudo-stock market

The Drupal people have produced insightful stock-market-like statistics about the popularity of open source CMS packages (via the precious Amphi-Gouri). Their analysis mixes content management systems (Drupal, Plone) with blog engines (WordPress) and bulletin boards (phpBB). Anyway, it shows that:

  • « The popularity of most Free and Open Source CMS tools is in an upward trend. »
  • Bulletin boards are the most popular category, maybe the most mature one, and phpBB is the strong leader in this category
  • In the CMS category, Mambo, Xoops, Drupal and Plone are direct competitors; Mambo is ahead in terms of popularity, and Plone is behind its PHP competitors, which certainly benefit from the popularity of PHP compared to Python; PHP-Nuke and PostNuke are quickly losing ground
  • WordPress is the most dynamic open source blog engine in terms of popularity growth; its community is exploding

My conclusion :

  • if you want an open source bulletin board/community forum, choose phpBB with no hesitation
  • if you want a real content management system and are not religiously opposed to Python, choose Plone; otherwise stick with PHP and go Mambo (or Xoops?)
  • if you want an open source blog engine, enjoy WordPress

I feel that producing this kind of statistical analysis about the dynamics of open source communities is extremely valuable for organizations and people considering several open source options (cf. the activity percentile indicated on SourceForge projects as an example). I would tend to say that the strength of an open source community, measured in terms of growth and size, is the single most important criterion to rely on when choosing an open source product.

Nowadays, the (real) stock market relies strongly on rating agencies. There must be room (and thus a business opportunity) for an open source rating agency that would produce strong evidence about the relative strength of project communities.

What do you think ?

Web scraping with Python (part II)

The first part of this article dealt with retrieving HTML pages from the web with the help of a mechanize-propelled web crawler. Now your HTML pieces are safely saved locally on your hard drive and you want to extract structured data from them. This is part 2: HTML parsing with Python. For this task, I adopted a slightly more imaginative approach than for my crawling hacks: I designed a data extraction technique based on HTML templates. Maybe this could be called « reverse-templating » (or something like template-based reverse web engineering).

You may be familiar with HTML templates for producing HTML pages. An HTML template plus structured data can be transformed into a set of HTML pages with the help of a proper templating engine. One famous templating technology is Zope Page Templates (so named because this kind of template is used within the Zope application server). ZPTs use a special set of additional HTML tags and attributes referred to by the « tal: » namespace. One advantage of ZPT (over competing technologies) is that ZPTs are nicely rendered in WYSIWYG HTML editors. Web designers produce HTML mockups of the screens to be generated by the application, then web developers insert tal: attributes into these mockups so that the templating engine knows which parts of the HTML template have to be replaced by which pieces of data (usually pumped from a database). As an example, a web designer will write <title>Camcorder XYZ</title>, then a web developer will modify this into <title tal:content="camcorder_name">Camcorder XYZ</title>, and the templating engine will produce <title>Camcorder Canon MV6iMC</title> when it processes the « MV6iMC » record in your database (it replaces the content of the title element with the value of the camcorder_name variable as retrieved from the current database record). This technology is used to merge structured data with HTML templates in order to produce web pages.

I took inspiration from this technology to design parsing templates. The idea here is to reverse the use of HTML templates. In the parsing context, HTML templates are still produced by web developers, but the templating engine is replaced by a parsing engine (known as web_parser.py, see below for the code of this engine). This engine takes HTML pages (the ones you previously crawled and retrieved) plus ZPT-like HTML templates as input, and outputs structured data. First your crawler saved <title>Camcorder Canon MV6iMC</title>. Then you wrote <title tal:content="camcorder_name">Camcorder XYZ</title> into a template file. Eventually the engine will output camcorder_name = « Camcorder Canon MV6iMC ».
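To make the idea concrete, here is a deliberately naive toy version of the principle (this is not the actual web_parser.py engine; it uses the first()/string calls of the old BeautifulSoup 1.x API):

    # Toy reverse-templating: read the variable name from the template's
    # attribute, then bind it to the text found at the same spot in the page.
    from BeautifulSoup import BeautifulSoup

    template = '<html><head><title tal:content="camcorder_name">Camcorder XYZ</title></head></html>'
    page = '<html><head><title>Camcorder Canon MV6iMC</title></head></html>'

    var_name = BeautifulSoup(template).first('title')['tal:content']  # 'camcorder_name'
    value = BeautifulSoup(page).first('title').string                 # 'Camcorder Canon MV6iMC'
    data = {var_name: value}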

In order to trigger the engine, you just have to write a small launch script that defines several setup variables such as :

  • the URL of your template file,
  • the list of URLs of the HTML files to be parsed,
  • whether or not to pre-process these files with an HTML tidying library (useful when the engine complains about badly formed HTML),
  • an arbitrary keyword defining the domain of your parsing operation (maybe the name of the web site your HTML files come from),
  • the charset of these HTML files (no automatic detection at the moment, sorry…),
  • the output format (csv-like file or semantic web document),
  • an optional separator character or string if you choose the csv-like output format

The easiest way to go is to copy and modify my example launch script (parser_dvspot.py) included in the ZIP distribution of this web_parser.
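For illustration only, such a launch script might read as follows (the run() entry point and its argument names are hypothetical, chosen to mirror the list of setup variables above; refer to parser_dvspot.py for the engine’s real interface):

    # Hypothetical launch script; see parser_dvspot.py for the real thing.
    import web_parser

    web_parser.run(
        template='template_dvspot.zpt',    # URL of the template file
        pages=['cam1.html', 'cam2.html'],  # HTML files to be parsed
        tidy=True,                         # pre-process with the HTML tidying library
        domain='dvspot',                   # keyword naming the parsing operation
        charset='iso-8859-1',              # charset of the HTML files
        output='csv',                      # 'csv' or 'owl'
        separator='\t',                    # only used for the csv-like format
    )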

Let’s summarize the main steps to go through :

  1. install utidylib into your Python installation
  2. copy and save my modified version of BeautifulSoup into your Python libraries directory (usually …/Lib/site-packages)
  3. copy and save my engine (web_parser.py) into your local directory or into your Python libraries directory
  4. choose a set of HTML files on your hard drive or directly on a web site,
  5. save one of these files as your template,
  6. edit this template file and insert the required pseudotal attributes (see below for the pseudotal instructions, and see the example dvspot template, template_dvspot.zpt),
  7. copy and edit my example launch script so that you define the proper setup variables in it (the example parser_dvspot.py contains more detailed instructions than above), save it as my_script.py
  8. launch your script with python my_script.py > output_file.owl (or python my_script.py > output_file.csv)
  9. enjoy yourself and your fresh output_file.owl or output_file.csv (import it into Excel)
  10. give me some feedback about your reverse-templating experience (preferably as a comment on this blog)

This is just my first attempt at building such an engine, and I don’t want to create confusion between the real (and mature) tal attributes and my pseudo-tal instructions. So I adopted pseudotal as my main namespace. At some point in the future, when the specification of these reverse-templating instructions is somewhat more stabilized (and if ever the « tal » guys agree), I might adopt tal as the namespace. Please also note that the engine is somewhat badly written: the code and internals are rather clumsy. There is much room for future improvement and refactoring.

The current version of this reverse-templating engine now supports the following template attributes/instructions (see source code for further updates and documentation) :

  • pseudotal:content gives the name of the variable that will contain the content of the current HTML element
  • pseudotal:replace gives the name of the variable that will contain the entire current HTML element
  • (NOT SUPPORTED YET) pseudotal:attrs gives the name of the variable that will contain the (specified?) attribute(s?) of the current HTML element
  • pseudotal:condition is a list of arguments; it gives the condition(s) that has/have to be verified so that the parser is sure the current HTML element is the one looked for. This condition is constructed as a list following BeautifulSoup fetch arguments: a Python dictionary giving detailed conditions on the HTML attributes of the current element, some content to be found in the current element, and the scope of the search for the current element (recursive or not)
  • pseudotal:from_anchor gives the name of the pseudotal:anchor that is used to build the relative path leading to the current HTML element; when no from_anchor is specified, the path used to position the current HTML element is calculated from the root of the HTML file
  • pseudotal:anchor specifies a name for the current HTML element; this element can be used by a pseudotal:from_anchor attribute as the starting point for building the path to the element specified by pseudotal:from_anchor; usually used in conjunction with a pseudotal:condition; the default anchor is the root of the HTML file
  • pseudotal:option describes some optional behavior of the HTML parser; it is a list of constants; it contains NOTMANDATORY if the parser should not raise an error when the current element is not found (it does by default); it contains FULL_CONTENT when the data looked for is the whole content of the current HTML element (the default is the last part of the content of the current HTML element, i.e. either the last HTML tags or the last string included in the current element)
  • pseudotal:is_id_part: a special ‘id’ variable is automatically built for every parsed resource; this id variable is made of several concatenated parts, and pseudotal:is_id_part gives the index at which the current variable will be used when building the id of the current resource; usually used in conjunction with pseudotal:content, pseudotal:replace or pseudotal:attrs
  • (NOT SUPPORTED YET) pseudotal:repeat specifies the scope of the HTML tree that describes ONE resource (useful when several resources are described in one given HTML file, such as in a list of items); the value of this attribute gives the name of a class that will instantiate the parsed resource scope, plus the name of a list containing all the parsed resources

The current version of the engine can output structured data either as CSV-like output (tab-delimited, for example) or as an RDF/OWL document (of Semantic Web fame). Both formats can easily be imported and further processed with Excel. The RDF/OWL format gives you the ability to process the output with all the powerful tools emerging from the Semantic Web effort. If you feel adventurous, you may thus import your RDF/OWL file into Stanford’s Protege semantic modeling tool (or into Eclipse with its SWEDE plugin) and further process your data with the help of a SWRL rule-based inference engine. The future Semantic Web Rules Language will help in further processing this output so that you can powerfully compare RDF data coming from distinct sources (web sites). In order to be more productive in terms of fancy buzzwords, let’s say that this reverse-templating technology is some sort of web semantizer: it produces semantically rich data out of flat web pages.
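As a hint of what « further processing » can look like, the RDF/OWL output can be loaded in a few lines (a sketch using rdflib’s present-day API; rdflib is not a dependency of the engine itself):

    # Loading the engine's RDF/OWL output and iterating over its triples.
    from rdflib import Graph

    g = Graph()
    g.parse('output_file.owl')   # the engine's RDF/OWL output
    triples = list(g)            # (subject, predicate, object) tuples, ready for querying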

The current version of the engine makes extensive use of BeautifulSoup. Maybe it should have been based on a more XML-ish approach instead (using XML paths?). But that would have implied turning the HTML templates and the HTML files to be processed into XHTML first. The problem is that I would then have had to rely on utidylib, and this library breaks some malformed HTML pages so badly that they are no longer usable.

Current known limitation: there is currently no way to properly handle situations where you need to distinguish between two similar anchors. In some cases, two HTML elements that you want to use as distinct anchors have exactly the same attributes and content. This is not a problem as long as these two anchors are always positioned at the same place in all the HTML pages that you parse. But as soon as one of the anchors is not mandatory, or is located after a non-mandatory element, the engine can get lost and either confuse the two anchors or complain that one is missing. At the moment, I don’t know how to handle this kind of situation. Example: long lists of specifications with similar names where some specifications are optional (see Canon camcorders as an example: the difference between the LCD number of pixels and the viewfinder number of pixels). The worst-case scenario would be a flat list of HTML paragraphs. The engine tries to identify these risks and should output warnings in such situations.


Here are the contents of the ZIP distribution of this project (distributed under the General Public License) :

  • web_parser.py : this is the web parser engine.
  • parser_dvspot.py : this is an example launch script, to be used if you want to parse HTML files coming from the dvspot.com web site.
  • template_dvspot.zpt : this is the example template file corresponding to the excellent dvspot.com site
  • BeautifulSoup.py : this is MY version of BeautifulSoup. Indeed, I had to modify Leonard Richardson’s official version, and I haven’t obtained any answer from him yet regarding my suggested modifications. I hope he will answer me soon and maybe include my modifications in the official version, or help me overcome my temptation to fork. My modifications are based on the official 1.2 release of BeautifulSoup: I added « center » as a nestable tag and added the ability to match the content of an element with wildcards. You should save this BeautifulSoup.py file into the « Lib\site-packages » folder of your Python installation.
  • README.html is the file you are currently reading, also published on my blog.

P2P + Semantic Web + Social Networks + Office Software = ?

Take an ounce of peer-to-peer, three cubits of semantic web, two pounds of office software and a pennyworth of social networking, knead energetically, and you get… the « Networked Semantic Desktop ». Now that is convergence, or I know nothing about it… It is a research project; move along, there is nothing to download! Also seen here.

Identity joining and social networks

IBM recently acquired SRD, the publisher of a piece of software that matches:

  • various IT descriptions of one and the same individual (advanced identity joining),
  • the relationships established between individuals based on what they have in common (building social networks)

In short, a perfectly adequate piece of software for anyone who wants to put together the complete little Big Brother toolkit.
This technology seems somewhat similar to meta-directories, which have specialized in the simple joining of electronic identities. But it seems to find its main applications in data mining for marketing and CRM, in credit risk analysis, and in state intelligence and private security. It reminds me of Semagix which, while targeting similar application fields (CRM, detection of fraudulent financial operations and the dismantling of mafia social networks), bets on a « semantic web » approach.
It would no doubt be interesting to know which technological approach SRD relies on, and whether this technology is robust, reliable and automated enough to provide elements of a solution to very operational problems of electronic identity management.

Measuring the quality of the Web while surfing

Compliance with the open standards defined by the W3C is an important element of the quality of the Web as we know it. For those of you who are web developers and want to check the compliance of the pages you visit, nothing could be simpler: with the HTML Validator extension for Firefox, a small icon at the bottom of the Firefox window tells you whether or not the page has problems. Even better, the extension tells you how to correct the problem, if you are in a position to modify the page!

Web-SSO : A CAS client for Zope

The Central Authentication Service (aka CAS) is an open source, lightweight framework that provides Web Single Sign-On to big organizations (universities, agencies, corporations). It seems to be widely used and regarded as just as mature and reliable as the Struts framework.

An existing server can benefit from CAS WebSSO features if its technology is supported by a CAS client. So please welcome Zope’s CAS User Folder, which SSO-izes Zope within complex SSO infrastructures.

Gartner consecrates blogs and wikis

The Gartner Group acknowledges the strong enterprise potential of wikis, blogs, social networking software and RSS. The attention paid by the analyst firm to the « grass-roots » knowledge management movement helps bring it the legitimacy (the consecration?) that will allow it to gain a foothold in the private sector.

For a few months now, I had felt which way the wind was blowing: my boss talks to me more and more often about blogs and RSS (« what is it? », « what is it for? », « how could I try it? »). At the beginning, maybe it was partly to please me? But no: he even asked me to install an RSS aggregator on his workstation. Ah! Something concrete! Add to that all the buzzwording from Gartner, MetaGroup and the like on the subject (« blogs and wikis are the third generation of collaboration tools »), and you can say that large companies are finally turning their IT attention to this subject (it was about time). Now we will have to wait a little longer before seeing usage take root at the scale of the whole enterprise. In the meantime, let’s all blog and aggregate in unison!

Zemantic: a Zope Semantic Web Catalog

Zemantic is an RDF module for Zope (read its announcement). From what I read (I have not tested it yet), it implements services similar to Zope catalogs and enables universal management of references (like the Archetypes reference engine, but in a more sustainable way). It is based on RDFLib, like ROPE.

I feel enthusiastic about this product since it sounds like a good, future-proof solution for the management of metadata, references and structured data within content management systems and portals. Plus, Zemantic would sit well in my vision of Plone as a semantic aggregator.

Modernizing Saint-Gobain’s electronic identity management processes

[This is a summary of one of my professional achievements. I use it for self-promotion, in the hope of attracting future partners. More about this in the account of my professional career.]

In 2000, Saint-Gobain’s IT was so decentralized and heterogeneous that central security did not know who was or was not part of the Group. The information systems department put me in charge of modernizing the electronic identities of the Group’s 200,000 people. I analyzed the key processes to be modernized: the arrival and departure of people. I drove the change with the actors involved (HR, IT and general services). I led the design and deployment of a technical identity management infrastructure. By 2005, 50,000 of the 200,000 people were registered and had a unique IT identifier. The identity of 30,000 of them, and their membership of the Group, was kept up to date in real time in the Group Directory. The rest of the deployment is planned over the next three years.