La vie multilingue du mot "fête"

Script writing and Results

The first operation our program performs is to download ("aspirate") each URL with the following command:

wget --no-check-certificate -O ./PAGES-ASPIREES/$j/$i.html $ligne;

At this stage, the table has three columns: the URLs, the downloaded pages, and retourwget (the value returned by wget).
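As a sketch of this first step, the function below wraps the download and records its outcome. It assumes that retourwget holds the exit status of wget, which the text does not state explicitly. The names aspirer and FETCH are hypothetical and do not come from the original script.

```shell
#!/usr/bin/env bash
# Downloader command, kept in an array so it can be swapped or stubbed
# (an assumption of this sketch, not part of the original script).
FETCH=(wget --no-check-certificate -q -O)

# aspirer URL DEST: save URL to DEST and store the downloader's
# exit status in the variable retourwget.
aspirer() {
    local url=$1 dest=$2
    mkdir -p "$(dirname "$dest")"
    if "${FETCH[@]}" "$dest" "$url"; then
        retourwget=0
    else
        retourwget=$?
    fi
}

# Example call, reusing the variable names from the command above:
# aspirer "$ligne" "./PAGES-ASPIREES/$j/$i.html"
# echo "$retourwget"   # 0 means wget reported success
```

Storing the status in a variable makes it easy to write it into the third column and to skip the later steps for pages that failed to download.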

In the second step, we run the following command:

lynx -dump -nolist -display_charset="$encodage" ./PAGES-ASPIREES/$j/$i.html > ./DUMP-TEXT/$j/$i-utf8.txt;

This gives us the content of the downloaded pages in plain-text format.

We then run the following command, which reads the charset declared in the page's HTML (the kind of value that display_charset expects):

encodageMeta=$(egrep -io "<meta[^>]+>" ./PAGES-ASPIREES/$j/$i.html | egrep -io "charset *=[^ \>]+" | cut -d= -f 2 | tr -d \" | tr -d \' | tr -d \> | tr -d " " | tr -d \/ | sort -u);

Knowing this value enables us to save the pages in the encoding we want.
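The charset-detection step can be tried in isolation on a small sample page. The sketch below is an illustration under assumptions: the function name detect_meta_charset and the sample HTML are invented, the first filter is assumed to select the meta tags of the page, and the lower-casing step is an addition that makes later comparisons easier.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)

# Hypothetical sample page, created only for this illustration.
cat > "$tmpdir/sample.html" <<'EOF'
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Sample page</title>
</head><body><p>Some text</p></body></html>
EOF

# Print the charset declared in the <meta> tags of an HTML file.
detect_meta_charset() {
    grep -E -io '<meta[^>]+>' "$1" \
        | grep -E -io 'charset *=[^ >]+' \
        | cut -d= -f2 \
        | tr -d "\"'/> " \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u
}

encodageMeta=$(detect_meta_charset "$tmpdir/sample.html")
echo "$encodageMeta"
```

If a page declares no charset, the variable is simply empty, so a script using this approach would probably need a fallback for that case.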

The Contextes column gathers only the lines of the UTF-8 dump files that interest us. To identify these lines, we use the egrep command followed by our pattern:

egrep -i "([Cc]elebrat)|([Ff][êe]t)|(جشن)" ./DUMP-TEXT/$j/$i-utf8.txt > ./CONTEXTES/$j/$i-utf8.txt;

The problem we encountered with this command was that if we separated our Urdu pattern with a space, the word was no longer recognised.

The Contextes part of the table also contains two columns: one for the plain-text context files and one for the HTML contexts. The latter were created by the Perl program "minigrepmultilingue.pl". The result is moved into CONTEXTES with the following commands:

perl ./PROGRAMMES/minigrepmultilingue-v2.2-regexp/minigrepmultilingue.pl "UTF-8" ./DUMP-TEXT/$j/$i-utf8.txt ./PROGRAMMES/minigrepmultilingue-v2.2-regexp/motif-regexp.txt; mv ./resultat-extraction.html ./CONTEXTES/$j/$i-utf8.html;
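The whitespace problem described in this section can be reproduced on a small file. One plausible explanation, which the text does not confirm, is that a space typed inside the regular expression becomes a literal character that must also be present in the line. In the sketch below, the dump file and the function name count_lines are invented for illustration, and the alternation (ê|e) replaces the bracket [êe] so that the pattern behaves the same whether or not the locale is UTF-8.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)

# Hypothetical dump file, invented for this illustration.
cat > "$tmpdir/dump.txt" <<'EOF'
La fête de la musique
A celebration in the street
جشن مبارک
An unrelated line
EOF

# Count the lines of a file that match an extended regular expression.
# grep exits with status 1 when nothing matches, hence the "|| true".
count_lines() {
    grep -E -i -c -- "$1" "$2" || true
}

# Pattern without stray spaces: the three relevant lines match.
ok=$(count_lines '(celebrat)|(f(ê|e)t)|(جشن)' "$tmpdir/dump.txt")

# Same pattern with a space before the Urdu word: the space must now
# appear in the text, so the line that starts with the word is missed.
ko=$(count_lines '(celebrat)|(f(ê|e)t)|( جشن)' "$tmpdir/dump.txt")

echo "without stray space: $ok"
echo "with stray space: $ko"
```

Comparing the two counts on a known sample is a quick way to check a pattern before running it over the whole corpus.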

The indexes are the dictionaries of the words contained in the UTF-8 dump files. We used the following command to generate these dictionaries in our tables:

egrep -o "\w+" 1.txt | sort | uniq -c | sort -r
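The indexing pipeline can be sketched on a tiny sample as follows. The file contents and the function name build_index are invented for illustration. This variant sorts with -rn to make the numeric ordering explicit and uses awk to tidy the output, both of which are choices of this sketch rather than the original command. Note that what \w matches depends on the locale, so accented or non-Latin words are only kept whole under a UTF-8 locale.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)

# Hypothetical dump file, invented for this illustration.
printf '%s\n' 'the party and the feast' 'the party goes on' > "$tmpdir/1.txt"

# Split the text into words, count each distinct word, and sort by
# decreasing frequency. Output format: count<TAB>word.
build_index() {
    grep -E -o '\w+' "$1" | sort | uniq -c | sort -rn | awk '{print $1 "\t" $2}'
}

build_index "$tmpdir/1.txt" > "$tmpdir/index.txt"

# Show the two most frequent words.
head -n 2 "$tmpdir/index.txt"
```

Words with the same count can appear in any relative order, so only the ranking between different counts should be relied on.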

You can consult our Table menu for more details about the table output.

Finally, after making the word clouds, we used the software Le Trameur, which determines the co-occurrences of our word with other words in the contexts. For more details, follow the Trameur link in the menu above.

Bash Script

To view the full code, click on this link.