Archives for the category My wishlist

Web scraping with Python

Here is a set of resources for scraping the web with the help of Python. The best solution seems to be Mechanize plus Beautiful Soup.
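To make this concrete, here is a minimal, dependency-free sketch of the parsing half of the job, using the standard library's html.parser in place of Beautiful Soup (Mechanize or urllib would supply the HTML in a real scraper; the HTML string here is just a stand-in):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real scraper the HTML would come from Mechanize or urllib;
# a literal string keeps the sketch self-contained.
html = '<html><body><a href="http://a.example/">A</a> <a href="http://b.example/">B</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

Beautiful Soup does the same job with a friendlier API and much better tolerance for broken markup, which is why the combination above remains the better choice in practice.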

See also :

Off-topic : Proxomitron looks like a nice (Python-friendly ?) filtering proxy.

Plone as a semantic aggregator

Here is an output of my imagination (no code, sorry, just a speech) : what if a CMS such as Plone could be turned into a universal content aggregator ? It would become able to retrieve any properly packaged content/data from the Web and import it so that it can be reused, enhanced, and processed with the help of Plone's content management features. As a universal content aggregator, it would be able to « import » (or « aggregate ») any content, whatever its structure and semantics may be. Buzzwords ahead : Plone would be a schema-agnostic, semantic-enabled aggregator.

Example : On site A, beer-lovers gather. Site A’s webmaster has set up a specific data schema for the description of beers, beer flavours, beer makers, beer drinkers, and so on. Since site A is rich in content and its community of users is enthusiastic, plenty of beers have been described there. Then site B, powered by a semantic aggregator (and CMS), is interested in any data regarding beverages and their impact on human health. So site B retrieves beer data from site A. In fact it retrieves both the descriptions of beer1, beer2, beerdrinker1, … and the description of what a beer is : how data is structured when it describes a beer, and what the relationship is between a beer and a beer drinker. So site B now knows many things about beer in general (data structure = schema) and about many beers specifically (beer data). All this beer data on site B is presented and handled as specific content types. Site B’s users are now able to handle beer descriptions as content items, to process them through workflows, to rate them, to blog about them, and so on. And finally to republish site B’s own output in such a way that it can be aggregated again by other sites. That would be the definitive birth of the semantic web !

There are many news aggregators (RSSBandit, …) that know how to retrieve news items from remote sites. But they are only able to aggregate news data. They only know one possible schema for retrievable data : the structure of a news item (a title + a link + a description + a date + …). This schema is specified in the (many) RSS standard(s).

But now that CMSes such as Plone are equipped with schema management engines (called « Archetypes » in Plone's case), they are able to learn new data schemas specified in XML files. Currently, Plone's Archetypes can import any schema specified in the form of an XMI file output by any UML modeling editor.

But XMI files are not that common on the Web. The W3C has published material showing that any UML schema (class diagram, I mean) has an equivalent RDF-S schema. There is even a testbed converter from RDF-S to XMI, and there are web directories inventorying existing RDF schemas as RDF-S files. Plus, RSS 1.0 is based on RDF, and Atom was designed in such a way that it is easily converted to RDF.

So here is my easy speech (no code) : let's build an RDF aggregator product for Plone. This product would retrieve any RDF file from any web site (storing it, for instance, in a Plone triplestore such as ROPE). It would then retrieve the associated RDF-S file (and store it in the same triplestore), convert it to an XMI file, and import it as an Archetypes content type with the help of the ArchGenXML feature. Then it would import the RDF data as AT items conforming to the newly created AT content type. Here is a diagram summarizing this : Plone as a semantic aggregator
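As a toy illustration of the first steps of this pipeline, here is a sketch of an in-memory triple store; the real product would use a proper triplestore such as ROPE, and every URI and class name below is illustrative, not an actual Plone API:

```python
# Toy in-memory triple store sketching the aggregation pipeline:
# fetch RDF triples, store them, then query them by pattern.
class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the given pattern (None = wildcard)."""
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (predicate is None or t[1] == predicate)
                and (obj is None or t[2] == obj)]

store = TripleStore()
# Data triples retrieved from site A (illustrative URIs)
store.add("http://siteA.example/beer1", "rdf:type", "http://siteA.example/schema#Beer")
store.add("http://siteA.example/beer1", "schema:flavour", "hoppy")
# Schema triple retrieved from the associated RDF-S file
store.add("http://siteA.example/schema#Beer", "rdf:type", "rdfs:Class")

# The aggregator can now ask: which resources are beers?
beers = store.query(predicate="rdf:type", obj="http://siteA.example/schema#Beer")
print(beers)
```

Once the schema triples are in the store, the next step (RDF-S to XMI to Archetypes) would turn the rdfs:Class statements into content type definitions.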

By the way, Gillou (from Ingeniweb) did not wait for my imagination's output to propose a similar project. He called it ATXChange. The only differences I see between his proposal and what is said above are, first, that Gillou might not be aware of RDF and RDF-S capabilities (so he might end up with an Archetypes-specific aggregator inputting and outputting content to and from Plone sites only) and, second, that Gillou will likely be able to provide code sooner or later, whereas I may not be !

Last but not least : WordPress is somewhat going in the same direction. The semweb community is showing some interest in WP's structured blogging features, and some plugins are appearing that try to bring more RDF features into WP (see also seeAlso).

Experimental programs to build some treemaps from graphs

Here is the zipfile containing the treemap programs I mentioned earlier.
And here are zipped sets of screenshots produced by these programs : « typic » screenshots, smoothed « typic » screenshots, the « try » graph screenshots, and the smoothed « try » graph screenshots. Each set of screenshots corresponds to a specific graph. The « typic » graph is made of 8 nodes grouped into two tightly linked subsets (A-B-C-D and E-F-G-H). The « smoothed typic » graph is the typic graph once the weights of the arcs have been smoothed by a specific smoothing algorithm (ask if you want to know more). The « try » graph is another simple graph whose nodes represent some concepts related to me (« family », « video », …), linked to one another according to their affinity. And the « smoothed try » graph is… guess what.
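For reference, the « typic » graph can be written down as a weighted edge list; the sketch below is illustrative (the actual weights used by the programs may differ):

```python
# The "typic" graph: 8 nodes in two tightly linked clusters.
# Weights are illustrative: heavy inside a cluster, light across.
edges = {
    ("A", "B"): 5, ("A", "C"): 5, ("A", "D"): 5,
    ("B", "C"): 5, ("B", "D"): 5, ("C", "D"): 5,
    ("E", "F"): 5, ("E", "G"): 5, ("E", "H"): 5,
    ("F", "G"): 5, ("F", "H"): 5, ("G", "H"): 5,
    ("D", "E"): 1,  # a single weak bridge between the two clusters
}

def neighbours(node):
    """Neighbours of a node, with the weight of the connecting arc."""
    result = {}
    for (a, b), w in edges.items():
        if a == node:
            result[b] = w
        elif b == node:
            result[a] = w
    return result

print(neighbours("A"))
print(neighbours("D"))
```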

Treemaps

Here is some bloggy information about treemaps. Treemaps are cool when you want to visualize a weighted tree, such as the tree of folders and files on your hard drive weighted according to their size. An excellent free utility to visualize your disk space occupation is Sequoia View.

Beyond Sequoia View, why am I so interested in treemaps ? Well… because I just realized I sort of reinvented and explored this concept several years ago without knowing the proper term. Treemaps… My experiments dealt with the use of treemaps for the visualization of graphs of information (networks composed of nodes and arcs).

For instance, let’s take the following graph composed of 8 nodes labelled from A to H. a typic graph with 8 nodes

When you « sit » on node A and look through the arcs at the rest of the graph, what can you see ? Answer : the treemap below (each node is associated with a specific color).
the classic treemap of the typic graph

What if you consider that the whole space (rectangle) of a given node should be divided into subrectangles for the associated nodes ? Then your treemap becomes somewhat fractal, and you get this kind of visualization for the same 8-node graph seen from node A :
the treemap of the typic graph
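A minimal sketch of this recursive idea, my own reconstruction rather than the original program: starting from a node, split its rectangle among its neighbours in proportion to the arc weights, then recurse into each sub-rectangle.

```python
def treemap(graph, node, x, y, w, h, depth, parent=None, horizontal=True):
    """Recursively split the rectangle (x, y, w, h) among the neighbours
    of `node`, proportionally to arc weights; returns a list of
    (node, rectangle, remaining_depth) cells."""
    cells = [(node, (x, y, w, h), depth)]
    if depth == 0:
        return cells
    # Do not walk back through the arc we arrived by.
    neigh = {n: wt for n, wt in graph.get(node, {}).items() if n != parent}
    total = sum(neigh.values())
    if total == 0:
        return cells
    offset = 0.0
    for n, wt in sorted(neigh.items()):
        share = wt / total
        if horizontal:
            sub = (x + offset * w, y, share * w, h)
        else:
            sub = (x, y + offset * h, w, share * h)
        cells += treemap(graph, n, *sub, depth - 1,
                         parent=node, horizontal=not horizontal)
        offset += share
    return cells

# A tiny symmetric graph (adjacency lists with arc weights), purely illustrative.
graph = {
    "A": {"B": 2, "C": 1},
    "B": {"A": 2},
    "C": {"A": 1},
}
cells = treemap(graph, "A", 0.0, 0.0, 1.0, 1.0, depth=1)
print(cells)
```

Alternating the split direction at each level and letting `depth` grow is what produces the « fractal » look described above.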

Or maybe you prefer the circle version of this treemap, which may look more readable. Here it is with a limited depth (exploring no more than 3 arcs from the starting node) :
circle version of the treemap for the typic graph
But it looks even nicer if you explore a high number of arcs :
hi-depth version of the circle treemap

I will soon post the programs that produce these treemaps and some more screenshots.

WordPress is going semantic (a little bit)…

WordPress, the famous weblog engine (powering this site), is getting equipped (in its CVS HEAD version) with a new feature allowing webloggers to post small pieces of metadata (key + value pairs) with each of their blog entries. WordPress is going the same way Charles Nepote went with his semantic wiki prototype. It won't be long before someone comes up with a real semantic bloki. It must be a matter of months.

By the way, Archetypes is a new masterpiece of Plone, and its reference management engine allows the weaving of semantic relationships between content objects. It just lacks the ability to publish its schemas and data as RDF files through Plone URLs… Anyway, Archetypes should soon provide the ability to extend object schemas at runtime, through the web. That means users will be able to add metadata to objects. These features can already be tested with PloneCollectorNG in its latest version (test the CVS version if you can).

« God writes straight with crooked lines »

God is a pattern of a fractal reality. This is the reason why « God writes straight with crooked lines ». Forget the « God » part of this proverb if you're not comfortable with it. The idea I want to present here is that we poor human beings struggle a lot with our brains in order to make sense out of reality. As an example, we face a phenomenon (the NASDAQ is up x points) and try to identify a trend in it (will it last long ?). Day one, reality goes in one direction. The day after, the trend seems to reverse. The lines of reality are quite crooked.

Reality is such a mess

You can sometimes model reality in order to identify some underlying, more global trends. Long-term trends mix with shorter-term trends. Long-term economic cycles mix with shorter-term cycles. And the result is a rather chaotic series of economic indicators. The mix of scales makes such a mess.

Is this mess chaotic ? Or is there a more complex order behind it ? Unfortunately, science does not always give definite answers to this question, and it is most often a question of faith ! What makes me comfortable with the idea that there must be a hidden order behind the chaos of reality is seeing some sort of fractal patterns in reality. And these patterns are quite common when dealing with complex systems (living systems, social networks, …). If these patterns are really fractal, reality gains some interesting properties. It first implies that just by looking at micro-patterns in reality, you also see the macro-patterns : micro-patterns and macro-patterns are similar (this is the definition of fractality). Theologically, it is an attractive idea : one says that God does not sit on a cloud but resides everywhere in reality. If God is a pattern in a fractal reality, then you are in relation with God in your everyday life (micro-level). You don't need any spiritual elevator in order to reach a bigger-scale God, since He is the same (or at least similar enough to believe He is unique) at all scales.

Furthermore, and somewhat paradoxically, if God is a pattern in a fractal reality, it also means that you may never be able to predict the global long-term trend that He may have defined. Let's suppose the crooked lines above result from the mixing of several cosine curves layered one on top of another in a fractal logic (see the attached Excel file as an example : you can make the scale vary by modifying the yellow underlined numbers).
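The same « fractal mixing » can be sketched without Excel; the toy formula below is mine, not the one from the attached file: sum a few cosine components whose period and amplitude both grow by a constant factor at each scale, so the largest scale sets the trend while the smaller scales wiggle around it.

```python
import math

def crooked_line(t, scales=4, factor=4.0):
    """Sum `scales` cosine components; each successive component has a
    `factor` times longer period and a `factor` times larger amplitude,
    so the long-term trend dominates but short-term wiggles obscure it."""
    return sum((factor ** k) * math.cos(t / (factor ** k))
               for k in range(scales))

# Sampled locally, the curve looks erratic; the largest-scale
# component alone carries the hidden long-term trend.
samples = [round(crooked_line(t / 10.0), 3) for t in range(5)]
print(samples)
```

An observer who only sees a short window of samples has no way to tell which scale they are looking at, which is exactly the prediction problem discussed below.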

Underlying trends are revealed

But let's suppose also that you don't have the mathematical formula allowing the calculation of this curve. Then the question is : what long-term trend can be predicted ? If I identify a global trend (« Eureka ! We are sitting on a global, nearly cosine curve ! », i.e. the dark purple curve on the above diagram), isn't there a more global trend that will, over a longer term, predict future results opposite to those predicted by the more local trend (« Maybe the global cosine curve I identified is just a minor variation of a longer-term, larger cosine curve… ») ? Most often, I think there must be such more-global trends that escape our ability to model reality. It reminds me that we should always be humble in our struggle to make sense out of reality. « God's Ways Are Higher Than Our Ways ». This has something to do with the epistemology of modelling too : a given model or theory should never claim to be an absolute truth. A model is just a way to rationally handle a given set of problems. But should your set of problems be extended, you may have to deeply and humbly revise your model.

Items for a wishlist

Which programs do I wish would be developed for me if I ever had a magic wand ?

  • A semantic cache-based aggregator (see that) which would act as a sort of P2P metadata sharing system based on REST-like web services (instead of an ad hoc P2P protocol)
  • A semantic Bloki
  • A contact management product for Zope
  • An e-machine that would let users vote on priorities so that it sorts them democratically ; the votes would be published on each user's weblog as a semantic feed (some specific RDF file), and the semantic cache-based aggregator would retrieve them, merge them, and publish the sorted list of priorities for a given community or group ; beyond priorities, this would ease the building of distributed reputation management systems
  • A full web rewriting of my concept visualization hack

Concept mapping

Some large companies are taking an interest in concept mapping… a subject that is apparently becoming fashionable, but in practice it often remains a gadget, with a tendency to evolve into a sophisticated working tool for professional information analysts (though its practical value, and its very practicality, remain to be demonstrated). In any case, to learn more, beyond the links I already mentioned, it is worth taking a look at the blog outils froids du Web, which devotes a recent « post » to this subject.
And while I'm at it, here are some personal toys attached. The image below is an example of a 2D map, without a legend, of a graph (of concepts, carrots, cabbages, triples, whatever you like…). The « mountains » (light areas) indicate regions of the graph that are strongly connected to one another. The « valleys » (dark areas) indicate sets of links that bridge dense but mutually distant clusters of nodes. This image was produced by a program I developed (a long time ago) in Delphi. The graph is specified as a Paradox database (one table describing the nodes, another the links). The program works as follows : from a graph, it produces a 2D map and lets you locate each node of the graph on that map simply by hovering with your mouse. As you move your mouse over the 2D map in the program's interface, it tells you which nodes are present in the region you are hovering over.
The only problem : the program is not packaged ; to make it run, you have to install the Borland Database Engine and configure a database alias I no longer remember. In practice, this means it can only be run if you have Delphi and know how to read the source under Delphi to configure the BDE properly. So please find the source of this program, named « visurezo », in the attached ZIP, all distributed under the GPL license.
The algorithm I invented and implemented in this program is known elsewhere as the « treemap » representation. The treemap idea was also invented independently by others (and implemented notably in the very useful utility Sequoiaview). The program also relies on an unorthodox implementation of a classic elliptical smoothing algorithm, topped off with a computation of boundary density, with a bit of cunning and mischief (c'est la vie de … ?). To summarize the general algorithm : take any graph, map it in 2D as a treemap, represent the boundary density between the elementary cells of the treemap, apply elliptical smoothing, and there you go : you obtain a 3D map, projected onto 2D, showing the dense clusters of nodes in the graph. Got it ? No ? Never mind… It pleased me to show off my pseudo-science. :)
So if this description and the image beside it interest you, if you have a version of Delphi at hand and feel like learning more, don't hesitate to contact me for further information if needed. And let me know if you ever get the thing running ; it would make my day !
Example of a 2D map of a graph
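For the curious, the elliptical smoothing step can be sketched as follows (my own reconstruction in Python, not the visurezo Delphi source): average each cell of the 2D density grid over an elliptical neighbourhood.

```python
def elliptic_smooth(grid, rx=2, ry=1):
    """Average each cell over an elliptical neighbourhood of radii
    (rx, ry); cells falling outside the grid are simply ignored."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total, count = 0.0, 0
            for di in range(-ry, ry + 1):
                for dj in range(-rx, rx + 1):
                    # keep only offsets inside the ellipse
                    if (dj / rx) ** 2 + (di / ry) ** 2 <= 1.0:
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            total += grid[ii][jj]
                            count += 1
            out[i][j] = total / count
    return out

# A single spike gets spread into a soft elliptical bump,
# which is what turns hard treemap boundaries into "mountains".
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
smoothed = elliptic_smooth(grid)
print(smoothed[2][2])
```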

Semantic Wiki

This is an attempt to translate this other post into English.
Charles Nepote, as he notified me, built a prototype of a wiki engine based on some of the Semantic Web technologies and providing some somewhat « semantic » features. I liked this prototype a lot. It lets me imagine what the wikis of the future might look like. Here are some pieces of a dream on this topic. The wikis of tomorrow would…

  • …provide the semantic features of Charles Nepote’s prototype : complying with the REST style, one URL for every page, publishing both HTML and RDF representations of that page (it would be better to provide RDF within the XHTML rather than beside it, wouldn’t it ?)
  • … be Blikis (also known as Wikilogs) with pingback and/or trackback and so on
  • …implement syndication of recent changes ; for that, they should produce a distinct URL for every update (every « diff » : as an example, a URL like http://www.monsite.toto/2004/01/01/12/WikiPage for the 12th modification of the WikiPage, on 1st January 2004, whereas the URL of the WikiPage itself remains http://www.monsite.toto/WikiPage).
  • …« wikify » remote semantic data ; thus the page http://www.monsite.toto/DcTitle would contain an RDF statement (an OWL one ?) meaning that this URL is equivalent to the dc:title property of the Dublin Core
  • …allow users to easily produce metadata with the help of an extension of the wiki syntax ; as an example, when a WikiPage contains the statement « DcKeywords::MyKeyword », the wiki engine automatically adds an RDF triple whose subject is the page itself (its URL), whose predicate is the URL of the « keywords » property as defined by the Dublin Core, and whose object is the URL http://www.monsite.toto/MyKeyword.
  • …have search engines allowing users to explore the wiki with a navigation mode similar to sourceforge’s Trove Map, based on the semantic data published by the wiki ; as an example, a user will be able to display all the pages that are related to MyKeyword and are written in French (because they contain the RDF statements equivalent to the following explicit statements made by users within the page : DcKeyword::MyKeyword and IsWritten::InFrench)
  • …have search engines allowing users to save their queries and explorations as agents (with their own WikiNames) so that other users can browse the same path of exploration as the user who defined the agent
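The « DcKeywords::MyKeyword » convention described above is easy to sketch; the predicate-to-URL mapping below is illustrative (Dublin Core has no literal « keywords » property, so dc:subject stands in for it):

```python
import re

# Illustrative mapping from wiki predicate names to RDF property URIs.
PREDICATES = {
    "DcKeywords": "http://purl.org/dc/elements/1.1/subject",
}

WIKI_BASE = "http://www.monsite.toto/"
STATEMENT = re.compile(r"\b(\w+)::(\w+)")

def extract_triples(page_name, text):
    """Turn every `Key::Value` statement found in a wiki page into an
    RDF triple (page URL, property URL, value URL); unknown predicates
    fall back to wiki-local URLs."""
    subject = WIKI_BASE + page_name
    triples = []
    for key, value in STATEMENT.findall(text):
        predicate = PREDICATES.get(key, WIKI_BASE + key)
        triples.append((subject, predicate, WIKI_BASE + value))
    return triples

triples = extract_triples(
    "WikiPage",
    "A page about beer. DcKeywords::MyKeyword IsWritten::InFrench")
print(triples)
```

The fallback to a wiki-local predicate URL is what lets pages like http://www.monsite.toto/DcTitle later declare their equivalence to an external vocabulary, as suggested above.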

I implemented these last two features as a draft of a micro-application pompously called RDFNavigator for Zope, based on the RDFGrabber product. It is very drafty, hence very incomplete, undocumented, and unstable (because RDFGrabber is not particularly stable itself in my environment), so it may be difficult to make it run. Nevertheless, if someone dares to try it, I am eager to hear comments and criticism ! :) By the way, I hope some day to make a more honorable product based on the integration of rdflib into Zope, i.e. based on Rope, unless the Ticle project quickly provides a similar result within Plone.

Wiki Sémantique

Charles Nepote, as he had notified me, built a prototype of a wiki engine relying on some of the Semantic Web technologies and thus implementing some « semantic » features. I found this prototype particularly interesting. It gives me a glimpse of what the wikis of the future could be. Here are a few slices of dreams on the subject. The wikis of the future…

  • …will have the semantic features of Charles Nepote's prototype : in line with the REST style, one URL for each page, combining an HTML representation and an RDF representation (it would actually be preferable to have the RDF inside the XHTML rather than beside it, at another URL, wouldn't it ?)
  • … will be Blikis (also called Wikilogs) with pingback and/or trackback and all the trimmings
  • …will implement syndication of recent changes and, for that, will produce distinct URLs for each page update (each « diff » : for example URLs like http://www.monsite.toto/2004/01/01/12/WikiPage for the twelfth modification of the WikiPage, on 1 January 2004, while the URL of the WikiPage remains http://www.monsite.toto/WikiPage).
  • …will make it possible to « wikify » remote semantic data ; thus the page http://www.monsite.toto/DcTitle will contain an RDF statement (or rather OWL ?) saying that this URL is equivalent to that of the dc:title property of the Dublin Core
  • …will let users easily produce metadata through an extension of the wiki syntax ; for example, when a WikiPage contains the sentence « DcKeywords::MonMotClef », the wiki engine will automatically add to it the RDF triple whose subject is the page itself (its URL), whose predicate is the URL of the keywords property defined by the Dublin Core, and whose object is the URL http://www.monsite.toto/MonMotClef.
  • …will have internal search engines allowing users to explore them with the same navigation mode as sourceforge's Trove Map, relying solely on the semantic data produced by the wiki ; for example, a user will be able to display all the pages related to MonMotClef, adding a filter so that only the pages written in French appear (because they contain something like IsWritten::InFrench)
  • …will have search engines letting a user save one of his queries as an agent with its own WikiName, so that other users can follow the same path of exploration as the user who defined the agent

I implemented these last two features as a draft of a micro-application pompously called RDFNavigator for the Zope environment, based on the RDFGrabber product. It is very drafty, hence very incomplete, undocumented, and unstable (RDFGrabber itself is not exemplary in that respect), and therefore hard to get running. Still, if anyone has the courage to try it, I am eager for comments and criticism ! :) Besides, I hope one day to make a more honorable product out of it, based on the integration of rdflib into Zope, that is, on Rope, unless the Ticle project quickly delivers a similar result under Plone.

Plone-ing for the semantic web

Here is a little set of inconsistent slides about the future (the Semantic Web) and the present (Plone), and how you can tie one to the other. In a few words : there seems to be a need for a universal model for knowledge/RDF caches ; the production/transformation of knowledge and content should go through a workflow ; and Plone should ease the implementation of such a workflow. It's all about a link between knowledge management and content management.
These slides are displayed below but are also available as a Powerpoint presentation.




A thingamajig

Here is an idea for something to invent :

  • – who is it for ? a loi 1901 nonprofit association
  • – what does it act on ? the visibility of its added value (social, economic, environmental, …)
  • – what does it enable ? growing its human resources (volunteers) and its economic resources (grants, donations, …)

The context =

  1. a sponsor, who wants to strengthen its brand image with its consumers in order to win/keep market share,
  2. a politician, who wants to strengthen his brand image with his voters in order to win votes,
  3. a patron, who wants to please himself, have a clear conscience, feel emotions,
  4. an institution, which wants to justify its existence and affirm its vocation.

Some avenues for thought :

    • 1+2 => requires media coverage => requires producing stories to tell journalists.
    • 3 => requires staging : a relational and emotional figurehead, a flag, a symbol, a character to invest in emotionally
    • 4 => grants recognition after the fact => ensures sustainability