Toolbox III

Posted on 18 March 2015 by Fahad - Michael - Sina

In this section, we work on the tagged files and search them for syntactic patterns. In the filtered files obtained in Toolbox II (BAO2), we will locate the following sequences: NOM ADJ (NOM: Noun, ADJ: Adjective) and NOM-PREP-NOM (Noun-Preposition-Noun). We apply several methods to obtain the desired results:

Cordial output processing (plain text): For this method, we will use three Perl programs that use lists.
Treetagger2xml output processing: For this part, we will use XPath.

Plain Texts:

The three methods proposed here are based on the same principle: each program takes as input a Cordial file and a file containing the patterns to be extracted, and places the POS tags in lists in order to compare them with the patterns.

Although the program can take as input a file containing several patterns, we run the extractions separately here in order to get one output file per pattern. We will thus have one pattern file for NOM ADJ and another for NOM-PREP-NOM.

The basic program takes only one file at a time as input (thus forcing the user to relaunch it for each Cordial output), so we created a program to automate this task, which reuses the technique from Toolbox I.

The code

Note that the code, as structured here, is meant to extract the sequence "NOM ADJ NOM".


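use strict;
use warnings;

open(FILE, '<', $ARGV[0]) or die "can't open $ARGV[0]: $!\n";
my @lignes=<FILE>;
close(FILE);
while (@lignes) {
    my $ligne=shift(@lignes);
    chomp $ligne;
    my $sequence="";
    my $longueur=0;
    # First element: a common noun (NC)
    if ( $ligne =~ /^([^\t]+)\t[^\t]+\tNC.*/) {
        my $forme=$1;
        $sequence.=$forme;
        $longueur=1;
        # Look ahead at the next two lines without consuming them
        my $nextligne=$lignes[0] // "";
        my $nextligne2=$lignes[1] // "";
        if ( $nextligne =~ /^([^\t]+)\t[^\t]+\tADJ.*/) {
            my $forme=$1;
            $sequence.=" ".$forme;
            $longueur=2;
            if ( $nextligne2 =~ /^([^\t]+)\t[^\t]+\tNC.*/) {
                my $forme=$1;
                $sequence.=" ".$forme;
                $longueur=3;
            }
        }
    }
    # Print the sequence only when all three elements matched
    print "$sequence\n" if $longueur == 3;
}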

The Second Method:

In this method (Mr. Daube's method), we read the text file up to a strong punctuation mark. In our Cordial files, this punctuation is annotated as PCTF. We can then check whether the sentence read so far contains a sequence of grammatical categories that we are looking for. The patterns that we give to our programme are as follows:

NOM ADJ: NC.. ADJ..
NOM PREP NOM: NC.. PREP NC..

The result is a single file per rubric, containing both patterns. To process all the Cordial output files rather than one file at a time, we added a command for that purpose: we replaced the name of the programme in the system() command and ran the extraction through that programme.
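The automation step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the script name extract_patterns.pl, the pattern file name patterns.txt and the .txt extension of the Cordial files are all hypothetical, not the names used in our project.

```perl
use strict;
use warnings;

# Hypothetical names (assumptions, not the original project's files)
my $extractor = "extract_patterns.pl";
my $patterns  = "patterns.txt";

# Build the command, as an argument list, for one Cordial file.
sub build_command {
    my ($script, $cordial_file, $pattern_file) = @_;
    return ("perl", $script, $cordial_file, $pattern_file);
}

# Run the extraction programme on every .txt file of a directory.
sub run_all {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "can't open $dir: $!\n";
    my @files = sort grep { /\.txt$/ } readdir($dh);
    closedir($dh);
    foreach my $file (@files) {
        my @cmd = build_command($extractor, "$dir/$file", $patterns);
        # system() returns 0 when the command succeeded
        system(@cmd) == 0 or warn "extraction failed on $file\n";
    }
}

# Only run when a directory is given on the command line
run_all($ARGV[0]) if @ARGV;
```

Since the extraction programme prints to standard output, the output of the wrapper can be redirected to one file per rubric from the shell.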

The code

The following programme enabled us to obtain the required result:

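#!/usr/bin/perl
# Arguments: the Cordial file, then the pattern file
open(FIC, '<', $ARGV[0]) or die "Problem : $! \n";
my @token=(); my @lemme=(); my @pos=();

while (my $ligne=<FIC>) {
    chomp $ligne;
    $ligne=~s/\r//g;
    my @liste=split(/\t/,$ligne);
    next if @liste < 3;    # skip blank or malformed lines
    if ($liste[2]!~/PCTF/) {
        # Accumulate the sentence until a strong punctuation mark (PCTF)
        push(@token,$liste[0]);
        push(@lemme,$liste[1]);
        push(@pos,$liste[2]);
    }
    else {
        # End of sentence: match each pattern against the POS string
        my $suite_pos=join(" ",@pos);
        open(PATRON, '<', $ARGV[1]) or die "Problem : $! \n";
        while (my $patron=<PATRON>) {
            chomp $patron;
            $patron=~s/\r//g;
            next if $patron eq "";    # ignore empty pattern lines
            while ($suite_pos=~/$patron/g) {
                # Spaces before the match give the index of the first token
                my $avant=$`;
                my $j=0;
                $j++ while ($avant=~/ /g);
                # Spaces in the pattern give the number of following tokens
                my $k=0;
                $k++ while ($patron=~/ /g);
                print "@token[$j..$j+$k] \n";
            }
        }
        close(PATRON);
        @token=(); @lemme=(); @pos=();
    }
}

close(FIC);
exit;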
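
To make the index arithmetic of this method easier to follow, here is a small worked sketch on an invented sentence. It is a variant, not the programme above: it uses the @- array (start offset of the last match) instead of $`, and the function name find_sequences and the toy tags are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Return every token sequence whose POS tags match $patron (a regex such as
# "NC.. ADJ.."), given parallel lists of tokens and POS tags for one sentence.
sub find_sequences {
    my ($token, $pos, $patron) = @_;
    my $suite_pos = join(" ", @$pos);
    my @found;
    while ($suite_pos =~ /$patron/g) {
        # $-[0] is the offset at which the current match starts
        my $avant = substr($suite_pos, 0, $-[0]);
        my $j = ($avant  =~ tr/ //);   # spaces before the match = first token index
        my $k = ($patron =~ tr/ //);   # spaces in the pattern = following tokens
        push(@found, join(" ", @{$token}[$j .. $j + $k]));
    }
    return @found;
}

# Invented toy sentence with Cordial-style tags (illustration only)
my @token = ("le", "chat", "noir", "mange", "une", "pomme", "de", "terre");
my @pos   = ("DETDMS", "NCMS", "ADJMS", "VINDP3S", "DETIFS", "NCFS", "PREP", "NCFS");

print "$_\n" for find_sequences(\@token, \@pos, "NC.. ADJ..");      # chat noir
print "$_\n" for find_sequences(\@token, \@pos, "NC.. PREP NC..");  # pomme de terre
```

In the first call, one space precedes the match in the POS string, so the sequence starts at token 1; the pattern contains one space, so two tokens are printed.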

The Third Method:

This method was designed by Rashid for our course this semester. Here we work with n-grams, where n is the number of POS tags contained in a pattern of our pattern file. For a pattern file containing several patterns, the programme outputs a single file that contains all the matches, each preceded by the name of its pattern. In this way, we can tell them apart while keeping them in the same file.
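
The n-gram idea can be sketched as follows. This is not Rashid's programme, only a minimal illustration under our own assumptions: the function name match_ngrams, the prefix comparison of tags and the toy sentence are all invented for the example.

```perl
use strict;
use warnings;

# Slide a window of n POS tags over the sentence, where n is the number of
# tags in the pattern, and keep the tokens of every window that matches.
sub match_ngrams {
    my ($tokens, $pos, $pattern) = @_;
    my @wanted = split(/ /, $pattern);
    my $n = scalar(@wanted);
    my @found;
    for my $start (0 .. $#{$pos} - $n + 1) {
        my $ok = 1;
        for my $offset (0 .. $n - 1) {
            # Assumption: a pattern tag matches when the POS tag starts with it
            if ($pos->[$start + $offset] !~ /^\Q$wanted[$offset]\E/) {
                $ok = 0;
                last;
            }
        }
        push(@found, join(" ", @{$tokens}[$start .. $start + $n - 1])) if $ok;
    }
    return @found;
}

# Invented toy sentence with Cordial-style tags (illustration only)
my @tokens = ("le", "chat", "noir", "mange", "une", "pomme", "de", "terre");
my @pos    = ("DETDMS", "NCMS", "ADJMS", "VINDP3S", "DETIFS", "NCFS", "PREP", "NCFS");

# Each match is preceded by the name of its pattern
for my $pattern ("NC ADJ", "NC PREP NC") {
    print "$pattern\t$_\n" for match_ngrams(\@tokens, \@pos, $pattern);
}
```

Prefixing each line with the pattern name keeps all the results in one file while still allowing them to be separated afterwards.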

We noticed that a few messages were displayed on standard output during the execution of this programme, so we removed this step, which also removed the creation of the file resultat.txt. To process all the files, we followed the same procedure as in the second method, that is, we changed the name of the programme in the system() command.

XML OUTPUT:

XPath:

In this method, we use XPath in our Perl programme via the XML::LibXML module, as in the listing below. The programme takes as input an XML file that we obtained in Toolbox II and a text file that contains the sequence(s) we are looking for. It creates one output text file per sequence. Again, we modified the programme so that it does not have to be run manually for every file.

The programme has two main functions: &construit_XPath and &extract_pattern. The first one builds the path from the list of tokens, one for each element of the desired pattern. This list is obtained with the split command.

We start the XPath expression by selecting the first element that has, among its children, a data node with the attribute type. To take the remaining components of the pattern into account, we build the expression as a string. &extract_pattern then calls this function and finds the elements whose tags match the pattern (NOM, ADJ …).

#!/usr/bin/perl
use strict;
use utf8;
use XML::LibXML;
binmode STDIN,  ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';

if($#ARGV!=1){
	print "usage : perl $0 fichier_tag fichier_motif";
	exit;
}
my $tag_file= shift @ARGV;
my $patterns_file = shift @ARGV;
my $xp = XML::LibXML->new(XML_LIBXML_RECOVER => 2); 
$xp->recover_silently(1);
my $dom= $xp->load_xml( location => $tag_file );
my $root= $dom->getDocumentElement();
my $xpc= XML::LibXML::XPathContext->new($root);
open(PATTERNSFILE, $patterns_file) or die "can't open $patterns_file: $!\n";
while (my $ligne = <PATTERNSFILE>) {
	&extract_pattern($ligne);
}
close(PATTERNSFILE);

sub construit_XPath{
	my $local_ligne=shift @_;
	my $search_path="";
	chomp($local_ligne);
	$local_ligne=~ s/\r$//;
	my @tokens=split(/ /,$local_ligne);
	$search_path="//element[contains(data[\@type=\"type\"],\"$tokens[0]\")]";
	my $i=1;
	while ($i < $#tokens) {
		$search_path.="[following-sibling::element[1][contains(data[\@type=\"type\"],\"$tokens[$i]\")]";
		$i++;
	}
	my $search_path_suffix="]";
	$search_path_suffix=$search_path_suffix x $i;
	$search_path.="[following-sibling::element[1][contains(data[\@type=\"type\"],\"".$tokens[$#tokens]."\")]".$search_path_suffix;
	return ($search_path,@tokens);
}

sub extract_pattern{
	my $ext_pat_ligne= shift @_;
	my ($search_path,@tokens) = &construit_XPath($ext_pat_ligne);
	my $match_file = "res_extract-".join('_', @tokens).".txt";
	open(MATCHFILE,">:encoding(UTF-8)", "$match_file") or die "can't open $match_file: $!\n";
	my @nodes=$root->findnodes($search_path);
	foreach my $noeud ( @nodes) {
		my $form_xpath="";
		$form_xpath="./data[\@type=\"string\"]";
		my $following=0;
		print MATCHFILE $xpc->findvalue($form_xpath,$noeud);
		
		while ( $following < $#tokens) {
			$following++;
			my $following_elmt="following-sibling::element[".$following."]";			
			$form_xpath=$following_elmt."/data[\@type=\"string\"]";
			print MATCHFILE " ",$xpc->findvalue($form_xpath,$noeud);
		}
		print MATCHFILE "\n";
	}
	close(MATCHFILE);
}
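
To see what kind of expression &construit_XPath produces, the short sketch below rebuilds the same string outside the programme and prints it. The helper names tag_condition and build_xpath are ours, introduced only for this illustration; the element and data names are those used in the listing above.

```perl
use strict;
use warnings;

# Condition on one element: its data[@type="type"] child contains the tag.
sub tag_condition {
    my ($tag) = @_;
    return "contains(data[\@type=\"type\"],\"$tag\")";
}

# Build the nested XPath for a list of tags, innermost condition first.
sub build_xpath {
    my @tags  = @_;
    my $first = shift @tags;
    my $nested = "";
    foreach my $tag (reverse @tags) {
        $nested = "[following-sibling::element[1][" . tag_condition($tag) . "]"
                . $nested . "]";
    }
    return "//element[" . tag_condition($first) . "]" . $nested;
}

print build_xpath("NOM", "ADJ"), "\n";
# //element[contains(data[@type="type"],"NOM")][following-sibling::element[1][contains(data[@type="type"],"ADJ")]]
```

Each additional tag adds one more following-sibling::element[1] condition nested inside the previous one, which is why the closing brackets accumulate at the end of the expression.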

The time it takes to process the BAO2 files is "mind-blowing": at least 30-45 minutes for a single pattern.

View the output of Voyage
View the output of Sport