Toolbox I

Posted on 6 March 2015 by Fahad - Michael - Sina

This first toolbox extracts and cleans the useful information in a corpus of RSS news-feed XML files. By cleaning, we mean stripping the content we want to keep of hypertext links and any other material meant to be read by a machine rather than a human. What remains is the title and description of each article published or updated on the Le Monde website.

In the rest of this page, we explain our approach to implementing Toolbox I. As an academic exercise, we coded the project in the Perl scripting language (which doesn't mean we are in love with Perl! Read even more about how we found Perl).

Here are the tasks of Toolbox I:

Accessing each directory of the XML tree of the corpus
Identifying names of the existing blocks (sport, culture, etc.)
Extracting the title and the description of each XML file (including cleaning measures)
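
The first task in this list, walking the corpus tree, can be sketched with Perl's core File::Find module instead of a hand-written recursion. This is a minimal sketch under stated assumptions, not the code used in the project: the helper name list_feed_files is hypothetical, and its "fil*.xml" filter looks only at the file name, which is slightly stricter than the pattern in the script further down.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the sorted list of feed files found under $root.
# Keeps ".xml" files and skips those whose file name starts with "fil".
sub list_feed_files {
    my ($root) = @_;
    my @found;
    find(
        {
            no_chdir => 1,    # with no_chdir, $_ holds the full path of each entry
            wanted   => sub {
                return unless -f $_;             # plain files only
                return unless /\.xml$/;          # must end in ".xml"
                return if /\/fil[^\/]+\.xml$/;   # skip "fil*.xml" files
                push @found, $_;
            },
        },
        $root
    );
    my @sorted = sort @found;
    return @sorted;
}
```

Calling list_feed_files on the corpus root returns every feed path in one list, which makes it easy to check how many files will be processed before any extraction starts.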

During the sessions, we tried two ways of handling this phase, which are shown below:

Raw documents are the documents in the corpus, in their default structure. For the rest of the project, we keep only the text inside the "description" and "title" tags, as shown below:

The code

Regular Expression Solution

This first part of the script walks the directory structure that holds the files we want to extract from. It looks for files ending in ".xml" and, when it finds one, reads its content, provided that the encoding declared in the file can be detected. The section name found in each file is then counted in the %sections hash.

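# Walk the corpus recursively and count the sections (rubriques) found in the feeds.
# Relies on the global hash %sections.
sub browser {
    my $path = shift(@_);
    opendir(DIR, $path) or die "can't open $path: $!\n";
    my @files = readdir(DIR);
    closedir(DIR);
    foreach my $file (@files) {
        next if $file =~ /^\.\.?$/;
        $file = $path."/".$file;
        if (-d $file) {
            &browser($file);
        }
        if (-f $file) {
            # Keep the .xml feeds, except the "fil*.xml" files
            if (($file=~/\.xml$/) && ($file!~/\/fil.+\.xml$/)) {
                # First pass: raw read, only to find the declared encoding
                open(FILE, $file) or die "can't open $file: $!\n";
                my $texte="";
                while (my $ligne=<FILE>) {
                    $ligne =~ s/\n//g;
                    $ligne =~ s/\r//g;
                    $texte .= $ligne;
                }
                close(FILE);
                $texte=~s/> *</></g;
                # Use $1 only when the match succeeded
                my $encodage = ($texte=~/encoding ?= ?[\'\"]([^\'\"]+)[\'\"]/i) ? $1 : "";
                if ($encodage ne "") {
                    # Second pass: re-read the file, decoded with the declared encoding
                    open(FILE,"<:encoding($encodage)", $file) or die "can't open $file: $!\n";
                    $texte="";
                    while (my $ligne=<FILE>) {
                        $ligne =~ s/\n//g;
                        $ligne =~ s/\r//g;
                        $texte .= $ligne;
                    }
                    close(FILE);
                    $texte =~ s/> *</></g;

                    # The section name is the feed-level <title>, right after <channel>
                    # or an <atom...> tag. A group is needed here: square brackets
                    # would build a character class, not an alternation.
                    if ($texte=~ /(?:<channel>|<atom[^>]+>)<title>([^<]+)<\/title>/) {
                        my $rub=$1;
                        # Normalize: drop the "Le Monde.fr" boilerplate, accents and spaces
                        $rub=~s/Le ?Monde.fr ?://g;
                        $rub=~s/ ?: ?Toute l'actualité sur Le Monde.fr.//g;
                        $rub=~s/\x{E8}/e/g;
                        $rub=~s/\x{E0}/a/g;
                        $rub=~s/\x{E9}/e/g;
                        $rub=~s/\x{C9}/e/g;
                        $rub=~s/ //g;
                        $rub=uc($rub);
                        $rub=~s/-LEMONDE.FR//g;
                        $rub=~s/:TOUTEL'ACTUALITESURLEMONDE.FR.//g;
                        $sections{$rub}++;
                    }
                }
                else {
                    print "$file ==> encodage non détecté \n";
                }
            }
        }
    }
}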
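
The two regular-expression steps above, reading the declared encoding and turning the feed title into a section key, can be tried on plain strings without touching the corpus. The following is a minimal sketch: the helper names declared_encoding and section_key are hypothetical, and the sample titles in the usage note are illustrative guesses at what a feed title looks like, not values taken from the corpus.

```perl
use strict;
use warnings;

# Hypothetical helper: return the encoding declared in the XML prolog,
# or an empty string when none is found.
sub declared_encoding {
    my ($xml) = @_;
    return $xml =~ /encoding ?= ?['"]([^'"]+)['"]/i ? $1 : "";
}

# Hypothetical helper: apply the same normalization steps as the script,
# in the same order, to turn a feed-level title into a section key.
sub section_key {
    my ($title) = @_;
    $title =~ s/Le ?Monde\.fr ?://g;
    $title =~ s/ ?: ?Toute l'actualit\x{E9} sur Le Monde\.fr\.?//g;
    $title =~ s/\x{E8}/e/g;    # e with grave accent
    $title =~ s/\x{E0}/a/g;    # a with grave accent
    $title =~ s/\x{E9}/e/g;    # e with acute accent
    $title =~ s/\x{C9}/e/g;    # capital E with acute accent
    $title =~ s/ //g;          # drop all spaces
    $title = uc($title);
    $title =~ s/-LEMONDE\.FR//g;
    return $title;
}
```

With these definitions, section_key("Sport - Le Monde.fr") gives "SPORT", and declared_encoding returns an empty string for a document without an XML prolog, which is the case the script reports as an undetected encoding.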

The next step separates the content by section: the title and description of each item are appended to one XML file and one text file per section. This requires another regular expression.

# Walk the corpus recursively and append each feed's items to one XML file and
# one text file per section. Relies on the globals $sortie (output path prefix),
# %titles and %descriptions (items already written).
sub directoryBrowsing {
    my $path = shift(@_);
    opendir(DIR, $path) or die "Can't open $path: $!\n";
    my @files = readdir(DIR);
    closedir(DIR);
    foreach my $file (@files) {
        next if $file =~ /^\.\.?$/;
        $file = $path."/".$file;
        if (-d $file) {
            &directoryBrowsing($file);
        }
        if (-f $file) {
            if (($file=~/\.xml$/) && ($file!~/\/fil.+\.xml$/)) {
                # First pass: raw read, only to find the declared encoding
                open(FILE, $file) or die "Can't open $file: $!\n";
                print "Traitement de : $file\n";
                my $texte="";
                while (my $ligne=<FILE>) {
                    $ligne =~ s/\n//g;
                    $ligne =~ s/\r//g;
                    $texte .= $ligne;
                }
                close(FILE);
                $texte=~s/> *</></g;
                # Use $1 only when the match succeeded
                my $encodage = ($texte=~/encoding ?= ?[\'\"]([^\'\"]+)[\'\"]/i) ? $1 : "";
                print "ENCODAGE : $encodage\n";
                if ($encodage ne "") {
                    my $texteXML="<file>\n";
                    $texteXML.="<name>$file</name>";
                    $texteXML.="<items>\n";
                    my $texteBRUT="";
                    # Second pass: re-read the file, decoded with the declared encoding
                    open(FILE,"<:encoding($encodage)", $file) or die "Can't open $file: $!\n";
                    $texte="";
                    while (my $ligne=<FILE>) {
                        $ligne =~ s/\n//g;
                        $ligne =~ s/\r//g;
                        $texte .= $ligne;
                    }
                    close(FILE);
                    $texte=~s/> *</></g;
                    # Section name: feed-level <title> after <channel> or an <atom...> tag
                    # (a group, not square brackets, is needed for the alternation)
                    my $rub = ($texte=~/(?:<channel>|<atom[^>]+>)<title>([^<]+)<\/title>/) ? $1 : "";
                    $rub=~s/Le ?Monde.fr ?://g;
                    $rub=~s/ ?: ?Toute l'actualité sur Le Monde.fr.//g;
                    $rub=~s/\x{E8}/e/g;
                    $rub=~s/\x{E0}/a/g;
                    $rub=~s/\x{E9}/e/g;
                    $rub=~s/\x{C9}/e/g;
                    $rub=~s/ //g;
                    $rub=uc($rub);
                    $rub=~s/-LEMONDE.FR//g;
                    $rub=~s/:TOUTEL'ACTUALITESURLEMONDE.FR.//g;
                    print "RUBRIQUE : $rub\n";
                    my $output1=$sortie.$rub.".xml";
                    my $output2=$sortie.$rub.".txt";
                    if (!open (OUTXML,">>:encoding(utf-8)", $output1)) { die "Can't open file $output1"};
                    if (!open (OUTTXT,">>:encoding(iso-8859-1)", $output2)) { die "Can't open file $output2"};

                    # Regular expression solution: match each item's title and description
                    while ($texte =~ /<item><title>(.+?)<\/title>.+?<description>(.+?)<\/description>/g) {
                        my $titre=$1;
                        my $desc=$2;
                        # The XML::RSS solution replaces this regex loop with a loop
                        # over the parsed items:
                        #   my $rss = XML::RSS->new;
                        #   $rss->parsefile($file);
                        #   foreach my $item (@{$rss->{'items'}}) {
                        #       my $titre = $item->{'title'};
                        #       my $desc = $item->{'description'};

                        # utf8() comes from a library import that is not shown in this excerpt
                        if (uc($encodage) ne "UTF-8"){
                            utf8($titre);
                            utf8($desc);
                        }
                        $titre = HTML::Entities::decode($titre);
                        $desc = HTML::Entities::decode($desc);
                        $titre = &nettoyage($titre);
                        $desc = &nettoyage($desc);

                        # Write an item only if neither its title nor its description was seen before
                        if (!(exists $titles{$titre}) and (!(exists $descriptions{$desc}))){
                            $titles{$titre}++;
                            $descriptions{$desc}++;
                            $texteXML.="<item>\n<title>$titre</title>\n<description>$desc</description>\n</item>\n";
                            print OUTTXT "$titre\n";
                            print OUTTXT "$desc\n";
                        }
                    }
                    $texteXML.="</items>\n</file>\n";

                    print OUTXML $texteXML;
                    print OUTTXT $texteBRUT;

                    close(OUTXML);
                    close(OUTTXT);
                }
                else {
                    print "$file ==> Erreur d'encodage\n";
                }
            }
        }
    }
}
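
The item-matching loop and its duplicate check can be isolated and run on a short in-memory feed. This is a minimal sketch, assuming the feed text has already been flattened to one line with no whitespace between tags, as the script does before matching; the helper name extract_items is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return [title, description] pairs from flattened feed
# text. An item is skipped when its title or its description was already
# seen, which is the same rule as the %titles / %descriptions test above.
sub extract_items {
    my ($xml) = @_;
    my (%seen_title, %seen_desc, @items);
    while ($xml =~ m{<item><title>(.+?)</title>.+?<description>(.+?)</description>}g) {
        my ($title, $desc) = ($1, $2);
        next if exists $seen_title{$title} or exists $seen_desc{$desc};
        $seen_title{$title}++;
        $seen_desc{$desc}++;
        push @items, [$title, $desc];
    }
    return @items;
}
```

One limitation of this pattern is worth keeping in mind: because the matches are lazy but unanchored, an item with no description would be paired with the description of the following item. Parsing the feed with an XML library avoids that case.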

Most of the processing relies on a function called "nettoyage" (French for "cleaning"), which strips leftover markup, removes unexpected encodings and non-standard characters, and substitutes specific French characters with standard ones. Here is the function "nettoyage":

sub nettoyage {
my $texte=shift;
$texte=~s/<img[^>]+>//g;
$texte=~s/<a href[^>]+>//g;
$texte=~s/<\/a>//g;
$texte=~s/<[^>]+>//g;
$texte=~s/&/et/g;
$texte=~s/\x{201c}/“/g;
$texte=~s/\x{201d}/”/g;
$texte=~s/\x{2019}/’/g;
$texte=~s/\x{2018}/‘/g;
$texte=~s/\x{2013}/-/g;
$texte=~s/\x{2192}/→/g;
$texte=~s/\x{2026}/.../g;
$texte=~s/\x{0153}/œ/g; 
$texte=~s/\x{0152}/Œ/g;
$texte=~s/\x{20ac}/€/g;
return $texte;
}
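
To see the effect of this kind of cleaning on a single description, here is a minimal sketch. The helper name clean_text is hypothetical. It reuses the tag-stripping and "&" rules from "nettoyage", and the ASCII mapping for typographic characters is an assumption chosen for illustration, not the mapping used in the function above.

```perl
use strict;
use warnings;

# ASSUMPTION: illustrative ASCII replacements, not the project's own mapping.
my %ascii_for = (
    "\x{201c}" => '"',      # left double quotation mark
    "\x{201d}" => '"',      # right double quotation mark
    "\x{2018}" => "'",      # left single quotation mark
    "\x{2019}" => "'",      # right single quotation mark
    "\x{2013}" => "-",      # en dash
    "\x{2026}" => "...",    # horizontal ellipsis
);

# Hypothetical helper: strip markup, replace "&" with the French word "et",
# then fold the typographic characters listed above to ASCII.
sub clean_text {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;    # any tag, including <img ...> and <a href=...>
    $text =~ s/&/et/g;
    $text =~ s/([\x{2013}\x{2018}\x{2019}\x{201c}\x{201d}\x{2026}])/$ascii_for{$1}/g;
    return $text;
}
```

For example, clean_text applied to 'Lire <a href="http://example.org/x">la suite</a> & plus' returns 'Lire la suite et plus'. Folding to ASCII is one way to keep the text file writable in ISO-8859-1, which has no code points for curly quotes or the ellipsis character.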

The code we wrote for each part of the project contains details that are not shown in the boxes above; the full version may be downloaded from this link. In particular, the parts that seemed redundant (reading from and writing to files, loading the libraries, etc.) are left out.

And this is what the output looks like:

View the output of Voyage
View the output of Sport