Toolbox II

In this section, we will analyse two annotation tools that tag the titles and descriptions we extracted earlier from the Le Monde RSS feeds.

Brief description of the work to be done for this phase:


  1. Tagging the complete concatenation of the texts extracted in the first toolbox using Cordial. Cordial's output has 3 columns (form, lemma, category): the first contains the tokens, the second the lemmas and the last the morphosyntactic tags. These tags include the grammatical category, gender, number and, for verbs, the person. An example of the output is as follows:
     
    					
    Après     après    PREP
    vingt     vingt    ADJNUM
    ans       an       NCMP
    passés    passer   VPARPMP
    à         à        PREP
    le        le       DETDMS
    Premiers  premier  ADJMP
    clients   client   NCMP
  2. Similarly, tagging the texts obtained with BaO1 using TreeTagger. An example of the output is as follows:
     
    					
    Retrouvez[retrouver|VER:pres] l[L|NUM] '['|PUN] ensemble[ensemble|ADV] des[du|PRP:det] dépêches[dépêche|NOM] sur[sur|PRP] http[http|NOM] :[:|PUN] /[/|PUN] /[/|PUN] www[www|NOM] .[.|SENT] lemonde[lemonde|NOM] .[.|SENT] fr[fr|NOM]
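To make this output easier to reuse, the bracketed TreeTagger lines above can be split back into (form, lemma, tag) triples with a short regex. Here is a minimal Perl sketch, assuming the token[lemma|tag] layout shown in the sample (the helper name parse_treetagger is ours, not part of TreeTagger):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":encoding(utf-8)");

# Split a line of the form "token[lemme|etiquette] ..." into triples.
# The layout is taken from the sample output above; adjust the regex
# if your pipeline produces a different format.
sub parse_treetagger {
    my $ligne = shift;
    my @triples;
    while ($ligne =~ /(\S+?)\[([^|\]]*)\|([^\]]*)\]/g) {
        push @triples, [$1, $2, $3];    # (forme, lemme, étiquette)
    }
    return @triples;
}

foreach my $t (parse_treetagger("Retrouvez[retrouver|VER:pres] dépêches[dépêche|NOM]")) {
    print join("\t", @$t), "\n";
}
```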

Cordial:


Cordial is a trademark of Synapse Développement, a French company founded in 1994 that develops applications for content enhancement and the analysis of textual data. It takes as input files encoded in ISO-8859-1. To obtain efficient tagging with Cordial, it is important to use the following settings, shown in the figure below (some options have to be checked or unchecked to match what is displayed):

[Figure: Cordial tagging settings]

The tagging process requires these settings for every file we analyse, so we have to re-apply them each time we process a new text file.
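Since Cordial only accepts ISO-8859-1 input while our BaO1 outputs are in UTF-8, the text files may need re-encoding first. Here is a minimal sketch of that conversion using Perl's IO encoding layers; the file names and the helper utf8_vers_latin1 are hypothetical, not part of our pipeline:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Re-encode a UTF-8 text file to ISO-8859-1 so Cordial can read it.
# Characters outside Latin-1 (e.g. "œ") would need replacing beforehand.
sub utf8_vers_latin1 {
    my ($entree, $sortie) = @_;
    open(my $in,  "<:encoding(utf-8)",      $entree) or die "Can't open $entree: $!";
    open(my $out, ">:encoding(iso-8859-1)", $sortie) or die "Can't open $sortie: $!";
    print $out $_ while <$in>;
    close($in);
    close($out);
}

# Demo on a small file written on the fly.
open(my $tmp, ">:encoding(utf-8)", "demo_utf8.txt") or die $!;
print $tmp "Après vingt ans passés\n";
close($tmp);
utf8_vers_latin1("demo_utf8.txt", "demo_latin1.txt");
```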

Here is a sample of the final output of Cordial for the section "Voyage":

[Figure: sample Cordial output for the "Voyage" section]

Download the complete output for this section.

TreeTagger:


According to what we can find on its official website:

The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Latin, Estonian, Polish and old French texts and is adaptable to other languages if a lexicon and a manually tagged training corpus are available.
Download TreeTagger here.

For PEII, we use TreeTagger in two phases. In the first phase, we run TreeTagger manually to convert the content of the .txt output into POS tags; this phase needs some simple initial settings, which are described below. In the second phase, used in this Toolbox, TreeTagger is called directly from the Perl code.

In the version of TreeTagger used in this phase, tagging was done from the Cygwin command line, without a graphical interface. Everything needed to make it work is explained in detail in its manual (here are the installation hints for French UTF-8), but we ran into a small problem with our encoding. It was solved by adding "french-utf8.par" to the langues-treetagger folder. It may also be worth mentioning the two commands we used:

  1. Segmentation: perl5.20.1 tokenise-utf8.pl What_To_Be_Segmented > The_Output_File.txt
  2. TreeTagger: perl5.20.1 tokenise-utf8.pl What_To_Be_Tagged | ../bin/tree-tagger.exe ../langues-treetagger/french-utf8.par -lemma -token -no-unknown -sgml > The_Output_File.txt

Note that the directory paths may vary according to your file system. In addition, if your Perl is not installed as perl5.20.1 like ours, you can drop the 5.20.1 from the commands. However, the volume of the files led us to call TreeTagger directly from the Perl code explained below, using this command: perl5.20.1 bao2RSS.pl 2014 3208.

The code:


    # After detecting the relevant files and file streams.
    # This part of the code is already implemented in the first Toolbox.
    &directoryBrowsing($rep);

    # Close the root element of each output file.
    foreach my $rub (@liste_rubriques) {
        my $output1 = $sortie.$rub.".xml";
        my $output3 = $resBAO2.$rub."_tagged.xml";
        if (!open(OUTXML, ">>:encoding(utf-8)", $output1)) { die "Can't open file $output1"; }
        if (!open(OUTTAG, ">>:encoding(utf-8)", $output3)) { die "Can't open file $output3"; }
        print OUTXML "</EXTRACTION>\n";
        print OUTTAG "</EXTRACTION>\n";
        close(OUTXML);
        close(OUTTAG);
    }
    exit;

    # Already defined in the first Toolbox; here is the completed version.
    sub directoryBrowsing {
        my $path = shift(@_);
        opendir(DIR, $path) or die "Can't open $path: $!\n";
        my @files = readdir(DIR);
        closedir(DIR);
        foreach my $file (@files) {
            next if $file =~ /^\.\.?$/;
            $file = $path."/".$file;
            if (-d $file) {
                &directoryBrowsing($file);
            }
            if (-f $file) {
                if (($file =~ /$reper.+\.xml$/) && ($file !~ /\/fil.+\.xml$/)) {
                    print "$file\n";
                    open(FILE, $file);
                    print "Traitement de : $file\n";
                    my $texte = "";
                    while (my $ligne = <FILE>) {
                        $ligne =~ s/\n//g;
                        $ligne =~ s/\r//g;
                        $texte .= $ligne;
                    }
                    close(FILE);
                    # Detect the encoding declared in the XML header.
                    $texte =~ /encoding ?= ?[\'\"]([^\'\"]+)[\'\"]/i;
                    my $encodage = $1;
                    print "ENCODAGE : $encodage\n";
                    if ($encodage ne "") {
                        my $texteXML = "<file>\n";
                        $texteXML .= "<name>$file</name>\n";
                        $texteXML .= "<items>\n";
                        my $XMLtagged = "<file>\n";
                        $XMLtagged .= "<name>$file</name>\n";
                        $XMLtagged .= "<items>\n";
                        # Re-read the file with the detected encoding.
                        open(FILE, "<:encoding($encodage)", $file);
                        $texte = "";
                        while (my $ligne = <FILE>) {
                            $ligne =~ s/\n//g;
                            $ligne =~ s/\r//g;
                            $texte .= $ligne;
                        }
                        close(FILE);
                        # Extract and normalise the section (rubrique) name.
                        $texte =~ /(?:<channel>|<atom[^>]*>)\s*<title>([^<]+)<\/title>/;
                        my $rub = $1;
                        $rub =~ s/Le ?Monde\.fr ?://g;
                        $rub =~ s/ ?: ?Toute l'actualité sur Le Monde\.fr\.//g;
                        $rub =~ s/\x{E8}/e/g;
                        $rub =~ s/\x{E0}/a/g;
                        $rub =~ s/\x{E9}/e/g;
                        $rub =~ s/\x{C9}/e/g;
                        $rub =~ s/ //g;
                        $rub = uc($rub);
                        $rub =~ s/-LEMONDE\.FR//g;
                        $rub =~ s/:TOUTEL'ACTUALITESURLEMONDE\.FR\.//g;
                        print "RUBRIQUE : $rub\n";

                        #----------------------------------------
                        my $output1 = $sortie.$rub.".xml";
                        my $output2 = $sortie.$rub.".txt";
                        my $output3 = $resBAO2.$rub."_tagged.xml";
                        if (!open(OUTXML, ">>:encoding(utf-8)", $output1)) { die "Can't open file $output1"; }
                        if (!open(OUTTXT, ">>:encoding(iso-8859-1)", $output2)) { die "Can't open file $output2"; }
                        if (!open(OUTTAG, ">>:encoding(utf-8)", $output3)) { die "Can't open file $output3"; }
                        my $rss = new XML::RSS;
                        $rss->parsefile($file);
                        foreach my $item (@{$rss->{'items'}}) {
                            my $titre = $item->{'title'};
                            my $desc  = $item->{'description'};
                            if (uc($encodage) ne "UTF-8") {
                                utf8($titre);
                                utf8($desc);
                            }
                            $titre = HTML::Entities::decode($titre);
                            $desc  = HTML::Entities::decode($desc);
                            $titre = &clean($titre);
                            $desc  = &clean($desc);
                            # Skip items whose title or description was already seen.
                            if (!(exists $dicoTITRES{$titre}) and !(exists $dicoDESC{$desc})) {
                                $dicoTITRES{$titre}++;
                                $dicoDESC{$desc}++;
                                my $titre_tag = &etiquetage($titre);
                                my $desc_tag  = &etiquetage($desc);
                                $texteXML  .= "<item>\n<title>$titre</title>\n<description>$desc</description>\n</item>\n";
                                $XMLtagged .= "<item>\n<title>\n$titre_tag</title>\n<description>\n$desc_tag</description>\n</item>\n";
                                print OUTTXT "$titre\n";
                                print OUTTXT "$desc\n";
                            }
                        }
                        $texteXML  .= "</items>\n</file>\n";
                        $XMLtagged .= "</items>\n</file>\n";
                        print OUTXML $texteXML;
                        print OUTTAG $XMLtagged;
                        close(OUTXML);
                        close(OUTTXT);
                        close(OUTTAG);
                    } else {
                        print "$file ==> encodage non détecté\n";
                    }
                }
            }
        }
    }

    sub clean {
        # Already defined in Toolbox 1.
    }

    sub browser {
        # Already defined in Toolbox 1.
    }

    sub etiquetage {
        my $texte = shift;
        # Temporary file to hold the text to be tagged.
        my $temp = "fichier_temp.txt";
        open(TEMP, ">:encoding(utf-8)", $temp);
        print TEMP $texte;
        close(TEMP);
        # Tokenise, then tag with TreeTagger.
        system("perl5.20.1 tokenise-utf8.pl $temp | tree-tagger.exe french-utf8.par -lemma -token -no-unknown -sgml > $reper-etiquetage.txt");
        # Convert the TreeTagger output to XML.
        system("perl5.20.1 treetagger2xml.pl $reper-etiquetage.txt");
        open(TAGGEDOUT, "<:encoding(utf-8)", "$reper-etiquetage.txt.xml");
        my $texte_tag = "";
        while (my $ligne = <TAGGEDOUT>) {
            $texte_tag .= $ligne;
        }
        close(TAGGEDOUT);
        return $texte_tag;
    }

Note that the only difference between the code of this Toolbox and the first one is essentially the "etiquetage" function, which triggers TreeTagger. If you reuse it on your system, pay attention to the directory paths.

It is perhaps worth noting that each section took more than 6 hours to process! So sleep tight and let your PC do its best (the photo was taken while the processing ran in the dead of night).