What is the goal of PE2?

Our ultimate goal was to use the entire RSS feed of 2014 taken from and use linguistic annotation programs to identify and extract word patterns which we then illustrate graphically. This last step allows us to see the relation between words and which ones appear together.

parcours-BAO

What is RSS?

RSS-icon Although this is a familiar icon to many, few are able to explain what it represents. RSS, which is the acronym for ‘Really Simple Syndication ’ or ‘Rich Site Summary ’, as it is more comprehensible from the latter extension, is a format for delivering regularly changing web content. Many news-related sites, weblogs and other online publishers syndicate their content as an RSS Feed to whoever wants it.

There are two kinds of users for RSS. For the casual users, it is a means of getting automatic updates of a website's content without having to search through it manually; It is generally done by special tools like RSS reader or RSS aggregator. In the other hand there are webmasters for whom RSS is used to syndicate their needed content for their website.


Corpus

As mentioned earlier, our input data are the RSS feeds of 2014 downloadable from the course website. The files are arranged in a folder hierarchy sorted in days and months. They are organized into sections, each containing the titles and articles descriptions which are what we are going to focus on.

                
<rss xmlns:itunes="" version="2.0"><channel><atom:link href="http://www.lemonde.fr/international/rss_full.xml" rel="self" type="application/rss+xml" /> <title>International : Toute l'actualité sur Le Monde.fr.</title><link>http://www.lemonde.fr/international/rss_full.xml</link> <description>International - Découvrez gratuitement tous les articles, les vidéos et les infographies de la rubrique International sur Le Monde.fr.</description> <language>en</language><copyright>Copyright Le Monde.fr</copyright><pubDate>Fri, 06 Jun 2014 16:45:58 GMT</pubDate><lastBuildDate>Fri, 06 Jun 2014 16:45:58 GMT</lastBuildDate><ttl>2</ttl><image> <title>International : Toute l'actualité sur Le Monde.fr.</title> <url>http://www.lemonde.fr</url><link>http://www.lemonde.fr/international/rss_full.xml</link></image> <item><title>Yémen : 60 migrants africains se noient au large des côtes</title><link>http://www.lemonde.fr/proche-orient/arti</link> <description>Soixante migrants somaliens et éthiopiens, ainsi que deux Yéménites membres de l'équipage, se sont noyés le 31 mai au large des côtes du Yémen, </description>

Tools

We used a variety of resources throughout our project. Here is a brief description of each of them.

Perl

PERL, or Practical Extraction and Report Language is a programming language created by Larry Wall in 1987. It is quite well suited for operating on textual files and has many built-in functions to this effect. (More information of its official website).

When we started working on this project, none of us have not experienced coding in Perl. You may find out which references we have used and what kind of problems we have confronted with during our implementation

TreeTagger

TreeTagger is a an annotation program developed by Helmut Schmid. It identifies the part of speech and lemma for each word of a given text. For more information there is the official website

Cordial

Coridal provides the lemma and morpho-syntactic category for each word of a given text. For more information there is the official website

Patron2graphe

This program provides graphical representations of the relations between words that have previously been annotated by programs such as TreeTagger or Cordial. More information can be found on the webpage.