L’attractivité de l'économie de la Birmanie

Le script PERL

Télécharcher le script

making it better:

  • make more independent subprograms (html_table is bit messy, but I bit lazy to do it now)
  • reading the keywords from a file or standard input from user (60)
  • maybe using references in perl it will be whole faster
  • ...

problems and solutions:

  •  (60) keyword in burmese have should be in this format: "x{1005}\x{102E}\x{1038}\x{1015}\x{103C}\x{102C}\x{1038}\x{1031}\x{101B}\x{1038}" not as "\xe1\x80\x85\xe1\x80\xae\xe1\x80\xb8\xe1\x80\x95\xe1\x80\xbc
  • \xe1\x80"(which was used in bash file) I'm not that sure why, but probably because I'm reading files as utf8, but I'm not sure. But anyway, like this it works prefectely.
  • I've got a typic problem with encoding, but I think it's more than good resolved using two modules in perl (Decode and Guess) and after all converting html entities into utf-8 (using Entities)
  • if a link is not downloaded its skipped and program pass onto next one
  • there are some cases when occurence wasn't find in dumped filed, it's caused probably by bad stuction of webpage (link inside link or too dynamic content, I don't know) usually is downloaded just title of page

Code source Comments
  1. #!/usr/bin/perl
  2. #use warnings;

  3. use LWP::Simple;
  4. use Encode;
  5. require Encode::Detect;
  6. use Encode::Guess;
  7. use Data::Dumper;
  8. use Time::Stopwatch;
  9. use HTML::TreeBuilder;
  10. use HTML::Entities;

  11. mkdir("../CONTEXTS");
  12. mkdir("../DUMP_TEXT");
  13. mkdir("../IMAGES");
  14. mkdir("../PAGES_ASPIREES");
  15. mkdir("../TABLEAUX");

  16. opendir(DIR, "../URLS") or die "I cannot open dir: $!";
  17. open(ERROR,">:utf8","error_reading_url.txt") or die "I'm not able to open a file.";

  18. print "\noperation started:\n";

  19. while (my $file = readdir(DIR)) { #open directory with urls
  20.         next unless ($file =~ m/\.txt$/); #open each file.txt
  21.         print "\n-> processing urls (".$file.")";
  22.         tie my $timer, 'Time::Stopwatch';
  23.         html_table($file);                #processing html file
  24.       print "\n    done in $timer s";
  25.      
  26.     }
  27.    
  28. close DIR;
  29. close ERROR;

  30. print "\n-> dumping results together";

  31. dump_globaux();
  32. dump_globaux_context();
  33. tabs_globaux();

  34. print "\n\nsuccesfully done\n";
  35. print "(list of wrong urls here: error_reading_url.txt)\n";

  36. ##################################################
  37. # SUBPROGRAMS                                    #
  38. ##################################################

  39. sub html_table {
  40. # create a html-table with urls and downloaded links
  41. # ARGUMENTS: file
  42.     my $file=$_[0];
  43.     my $lang=substr($file,0,-4);
  44.        my $n=1;
  45.     my $nf=0;
  46.     my $occurences;
  47.    
  48.     my %motivy = ("allemand" => "inve[rs]t", "anglais" => "inve[rs]t", "espagnol" => "inve[rs]t", "francais" => "invest", "myanmar" => "\x{1005}\x{102E}\x{1038}\x{1015}\x{103C}\x{102C}\x{1038}\x{1031}\x{101B}\x{1038}", "norvegien" => "invest", "tcheque" => "invest");
  49.     
  50.     my $motif = $motivy{$lang};  #searched word
  51.    
  52.     open(INPUT,"<:utf8","../URLS/".$file) or die "I'm not able to open a file.";  #open file with urls
  53.     open(EXP,">:utf8","../TABLEAUX/".$lang.".html") or die "I'm not able to open a file."; #create html file named by language
  54.     mkdir("../PAGES_ASPIREES/".$lang); #create directory for downloaded pages
  55.     mkdir("../DUMP_TEXT/".$lang); #create directory for dumped pages
  56.    
  57.     print EXP "<html>\n\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/htm;charset=utf-8\">\n<title>$lang</title>\n</head>\n\n<body bgcolor=\"#003\">\n\n<table align=\"center\" border=\"1\">\n<tr align=\"center\" bgcolor=\"#D2E6FF\"><td width=\"20px\"><b>n&deg</b></td><td><b>URL</b></td><td width=\"90px\"><b>PAGES<br>ASPIREES</b></td><td width=\"100px\"><b>DUMP initial</b></td><td width=\"100px\"><b>DUMP utf-8</b></td><td width=\"100px\"><b>CONTEXTE utf-8</b><br>motif: <blue>$motif</blue></td><td width=\"100px\"><b>CONTEXTE HTML</b><br>motif: <blue>$motif</blue></td><td width=\"60px\"><b>OCCS</b></td></tr>\n";
  58.     while(<INPUT>) {  
  59.     #reading each url file and processing it
  60.            chomp($_);
  61.            my $url=$_;
  62.           
  63.            $tmp=substr($_,0, 30);
  64.           
  65.            my $charset = download_html($_,$lang,$n); #download and charset
  66.                              
  67.            my $occs = context_txt_html($n,$lang,$motif);
  68.            $occurences = $occurences + $occs;
  69.             
  70.            if ($charset ne "not downloaded") {
  71.            #if page is not downloaded: row in table is not created, file is owerwrited by next successfully downloaded page
  72.          
  73.                 print EXP "<tr align=\"center\" bgcolor=\"#D2E6FF\"><td>$n</td><td><a href=\"$url\">$tmp</a></td><td><a href=\"../PAGES_ASPIREES/$lang/$n.html\">$n.html</a></td><td>";
  74.                 if ($charset ne "UTF-8" or $charset ne "utf8") {
  75.                     print EXP "<a href=\"../DUMP_TEXT/$lang/$n.txt\">$n.txt</a><br><small>($charset)</small>";
  76.                 } 
  77.                
  78.                 print EXP "</td><td><a href=\"../DUMP_TEXT/$lang/$n-utf8.txt\">$n-utf8.txt</a></td>";
  79.                
  80.                 if ($occs != 0) {
  81.                     print EXP "<td><a href=\"../CONTEXTS/$lang/$n-utf8.txt\">$n-utf8.txt</a></td><td><a href=\"../CONTEXTS/$lang/$n-utf8.html\">$n-utf8.html</a></td><td>$occs</td></tr>\n";
  82.                     $nf++;
  83.                    
  84.                 }               
  85.                 else{
  86.                     print EXP "<td bgcolor=\"#B0C4DE\">not found</td><td bgcolor=\"#B0C4DE\">not found</td><td bgcolor=\"#B0C4DE\">$occs</td></tr>\n";
  87.                 }
  88.                
  89.                 $n++;
  90.              
  91.               #svetlejsi #D2E6FF
  92.               #tmavsi    #B0C4DE
  93.              
  94.           }     
  95.       }
  96.     
  97.    print EXP "<tr align=\"center\" bgcolor=\"#B0C4DE\"><td></td><td></td><td></td><td></td><td></td><td bgcolor=\"#D2E6FF\"><a href=\"../FICHIERGLOBAUX/DUMP-GLOBAUX_$lang.txt\">Fichier DUMP global</a><br><small>$nf fichier(s)</small></td><td bgcolor=\"#D2E6FF\"><a href=\"../FICHIERGLOBAUX/CONTEXTES-GLOBAUX_$lang.txt\">Fichier CONTEXTES global</a><br><small>$nf fichier(s)</small></td><td bgcolor=\"#D2E6FF\"><b>$occurences</b></td></tr>\n";
  98.    print EXP "</table>\n\n</body></html>";
  99.   
  100.    close INPUT;
  101.    close EXP;
  102.   
  103. }
  104. sub download_html {
  105. # download each html file which got from url and return its encoding
  106. # ARGUMENTS: url, language, number
  107. # RETURNS: encoding
  108.     my $content = get($_[0]) or print ERROR $_[2]."\t".$_[1]."\t".$_[0]."\n";  #download of content of url
  109.     my $enc;
  110.    
  111.     if ($content) {
  112.     #saving of html file and getting encoding
  113.          
  114.           open(S,">:utf8","../PAGES_ASPIREES/".$_[1]."/".$_[2].".html") or die "I'm not able to open a file.";
  115.           print S $content; #saved to file
  116.           close S;
  117.          
  118.           $enc= Encode::Detect::Detector::detect($content);   #getting encoding
  119.        
  120.           if ($enc eq "") {   #guessing encoding
  121.                     $x = Dumper(guess_encoding($content)->name);
  122.                 $x =~ /\'(.*)\'/;
  123.                 $enc = $1;
  124.           }
  125.          
  126.           txt_save($content,$_[1],$_[2],$enc);
  127.           #dumping file (utf8 and if necessary originall)     
  128.     }
  129.        
  130.     else {
  131.         $enc = "not downloaded";
  132.     }
  133.    
  134.     return $enc; #returns encoding
  135.        
  136. }
  137. sub txt_save {
  138. # saves a dumped text
  139. # ARGUMENTS: html-content, language, number, charset                   
  140.    
  141.     open(TXTU,">:utf8","../DUMP_TEXT/".$_[1]."/".$_[2]."-utf8.txt") or die "I'm not able to create dumped txt-utf8 file.";
  142.    
  143.     if ($_[3] ne "UTF-8" and $_[3] ne "utf8") {
  144.     #non utf-8 => save and convert
  145.         open(TXT,">:utf8","../DUMP_TEXT/".$_[1]."/".$_[2].".txt") or die "I'm not able to create dumped txt file.";   
  146.         print TXT delete_html($_[0]);
  147.         $_[0]= decode($_[3], $_[0] );   
  148.     }
  149.    
  150.     print TXTU decode_entities(delete_html($_[0]));
  151.    
  152.     close TXTU;            
  153. }
  154. sub delete_html{
  155. # delete every html tag
  156. # ARGUMENTS: html-content
  157. # RETURNS: content-without-html
  158.     my $tree = HTML::TreeBuilder->new;
  159.     $tree->parse($_[0]);
  160.     $non_html = $tree->as_text();
  161.     return $non_html;
  162. }
  163. sub context_txt_html {
  164. # finds a KWIC and print it into TXT, HTML
  165. # ARGUMENTS: number, language, motif
  166. # RETURNS: number-of-occurence
  167.     my $n=$_[0];
  168.     my $lang=$_[1];
  169.     my $motif=$_[2];
  170.     my $context=200;
  171.     my $x = 0;
  172.    
  173.     #print "\ncislo souboru: $n\tjazyk: $lang\tmotiv: $motif";
  174.    
  175.     mkdir("../CONTEXTS/".$lang); #create directory for contexts
  176.     #print "\nadresar vytvoren";
  177.     open(UTF_TXT,"<:utf8","../DUMP_TEXT/".$lang."/".$n."-utf8.txt"); #or print "\n\tI'm not able to read dumped file $n-utf.8.txt, but I keep working, you don't need to be worry about."; #pokud problem, tak smazat ten posledni konstrukt
  178.     open(CONT_TXT,">:utf8","../CONTEXTS/".$lang."/".$n."-utf8.txt") or die "I'm not able to create context $n-utf.8.txt.";
  179.     open(CONT_HTML,">:utf8","../CONTEXTS/".$lang."/".$n."-utf8.html") or die "I'm not able to create context $n-utf.8.html.";
  180.    
  181.     #print "\nsoubory otevreny";
  182.    
  183.     print CONT_HTML "<html>\n<head>\n<title>minigrep : $motif ($lang)</title>\n";
  184.     print CONT_HTML "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf8\" />\n";
  185.     print CONT_HTML "<body bgcolor=\"white\">\n";
  186.     print CONT_HTML "<p align=\"justify\"><b>Input file :</b> $n-utf.txt<br>\n";
  187.     print CONT_HTML "<b>KeyWord :</b> $motif<br>\n";
  188.     print CONT_HTML "<b>Language :</b> $lang<br>\n";
  189.     print CONT_HTML "<b>Number of context-characters :</b> +- $context</p>\n<hr><hr>\n";
  190.    
  191.     print CONT_TXT "LEFT CONTEXT\tKEY-WORD\tRIGHT CONTEXT\n";
  192.    
  193.     #print "\nposledni radka vytisknuta, samotne zpracovani UTF_TXT";
  194.    
  195.     while(<UTF_TXT>) {  
  196.         chomp($_);
  197.  
  198.         #print "oriznuto";
  199.                
  200.         while ($_ =~/($motif[^ ]*)/gi) {
  201.        
  202.         #searching for keyword
  203.              
  204.             #print "motiv nalezen";
  205.            
  206.             $x++;
  207.             my $cl = $context;
  208.             my $cr = $context;
  209.            
  210.             my $motif_ok= $1;        #potřebuju to tam?
  211.             my $left = $`;           #potřeba vypsat jenom xy znaků před a po keyword
  212.             my $right = $';
  213.            
  214.             ### left context
  215.             if (length($left)>$cl) {
  216.                 until (substr($left,-$cl-1,1) eq " ") {   #gets first whole word in a row
  217.                    $cl=$cl-1;
  218.                 }
  219.                 $left = substr($left,-$cl,$cl);      #print xy first characters, beging with whole word
  220.             }
  221.            
  222.             ### right context
  223.             if (length($right)>$cr) {
  224.               until (substr($right,$cr+1,1) eq " ") {
  225.                 $cr=$cr-1;
  226.               }
  227.               $right = substr($right,0,$cr);
  228.             }
  229.            
  230.             ### printing
  231.             print CONT_TXT $left."\t".$motif_ok."\t".$right."\n"; 
  232.             print CONT_HTML "<p align=\"justify\">$left<font style=\"background-color: #FC3;\"><b>$motif_ok</b></font>$right</p>\n<hr>\n";
  233.            
  234.             }
  235.        
  236.         print CONT_HTML "</body></html>\n";
  237.         close CONT_TXT;
  238.         close CONT_HTML;
  239.         close UTF_TXT;
  240.     }
  241.     if ($x == 0){
  242.       unlink("../CONTEXTS/".$lang."/".$n."-utf8.txt");
  243.       unlink("../CONTEXTS/".$lang."/".$n."-utf8.html");
  244.     }
  245.     return $x;
  246. }
  247. sub dump_globaux {
  248. #makes global dumped file (from DUMP_TEXT) for every language (directory), containg every dumped file, saves into created file FISCHIERGLOBAUX
  249.   mkdir("../FICHIERGLOBAUX/");
  250.   opendir(DIR, "../DUMP_TEXT") or die "I cannot open dir: $!";
  251.  
  252.   while (my $dir = readdir(DIR)) { #open directory with urls
  253.      next unless ($dir !~ m/\./);
  254.           
  255.      opendir(DIRE, "../DUMP_TEXT/".$dir) or die "I cannot open dir: $!";
  256.      open(DUMP,">:utf8","../FICHIERGLOBAUX/DUMP-GLOBAUX_".$dir.".txt") or die "I'm not able to create dump_globaux for $dir.";
  257.        
  258.         while (my $file = readdir(DIRE)) { #open directory with urls
  259.            next unless ($file =~ m/\.txt$/);
  260.                         
  261.            open(INPUT,"<:utf8","../DUMP_TEXT/".$dir."/".$file) or die "I'm not able to open a file $file in directory $dir.";
  262.            while(<INPUT>) {
  263.               print DUMP $_;
  264.            }
  265.           
  266.            close INPUT;
  267.        
  268.         }
  269.     closedir DIRE;
  270.     close DUMP;
  271.          
  272.   }
  273.  
  274.   closedir DIR;
  275. }
  276. sub dump_globaux_context {
  277. #makes global dumped file (from CONTEXTS) for every language (directory), containg every context-file (whithout headder), saves into FISCHIERGLOBAUY
  278.   opendir(DIR, "../CONTEXTS") or die "I cannot open dir: $!";
  279.  
  280.   while (my $dir = readdir(DIR)) { #open directory with urls
  281.      next unless ($dir !~ m/\./);
  282.           
  283.      opendir(DIRE, "../CONTEXTS/".$dir) or die "I cannot open dir: $!";
  284.      open(DUMP,">:utf8","../FICHIERGLOBAUX/CONTEXTES-GLOBAUX_".$dir.".txt") or die "I'm not able to create dump_globaux for $dir.";
  285.        
  286.         while (my $file = readdir(DIRE)) { #open directory with urls
  287.            next unless ($file =~ m/\.txt$/);
  288.                         
  289.            open(INPUT,"<:utf8","../CONTEXTS/".$dir."/".$file) or die "I'm not able to open a file $file in directory $dir.";
  290.            while(<INPUT>) {
  291.               next unless ($_ !~ m/RIGHT CONTEXT$/);
  292.               print DUMP $_;
  293.            }
  294.            close INPUT;
  295.        
  296.         }
  297.  
  298.   closedir DIRE;
  299.   close DUMP;
  300.          
  301.   }
  302.   closedir DIR;     
  303. }
  304. sub tabs_globaux {
  305. #dumping tableaux files into one (taking every line which begin with "<TR"), giving header
  306.   opendir(DIR, "../TABLEAUX") or die "I cannot open dir: $!";
  307.   open(EXP,">:utf8","../TABLEAUX/tabs_globaux.html") or die "I'm not able to create dump_globaux for $dir.";
  308.   my $n=0;
  309.  
  310.   print EXP "<html>\n\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/htm;charset=utf-8\">\n<title>Tableux_globaux</title>\n</head>\n\n<body bgcolor=\"#4B0082\">\n\n";
  311.  
  312.   while (my $file = readdir(DIR)) { #open directory with urls
  313.      next unless ($file =~ m/\.html/ and $file ne "tabs_globaux.html");
  314.      my $fic=substr($file,0,-5);
  315.      $n++;
  316.     
  317.      print EXP "<table align=\"center\" border=\"1\">\n";
  318.      print EXP "<tr><td colspan=\"8\" align=\"center\" bgcolor=\"black\"><font color=\"white\"><b>Tableau n&deg $n</b>\<br><small>fichier : $fic.txt</small></font></td></tr>";
  319.                             
  320.      open(INPUT,"<:utf8","../TABLEAUX/".$file) or die "I'm not able to open a file $file in directory $dir.";
  321.      while(<INPUT>) {
  322.         next unless ($_ =~ m/<tr/);
  323.         print EXP $_;
  324.      }
  325.     
  326.      close INPUT;
  327.      print EXP "</table>\n\n<br><hr><br>\n\n";
  328.        
  329.   }
  330.  
  331.   print EXP "</body></html>";
  332.  
  333.   closedir DIR;
  334.   close EXP;       
  335. }  



    4-11 calling  used modules








    13-17 making directories which script uses, it expect being placed in directory "PROGRAMME"




    19 opening directory containing urls in text file
    20 creating a file for saving errors occured during reading the urls files


    24-31 reading every txt file (containging urls) in dir URL

    27 start counting the time
    28 runs subprogram html_table (49-114), gives variable contaning name of file
    29 stopping the time and printing it







    38-40 calling subprograms making global files (see bellow)







    ---------------------------------------
    SUBPROGRAM HTML_TABLE
    creates a html-file with links and runs subprograms which are getting and processing webpages
    ARGUMENT: file

    54 extract name of langue from name of file




    59 hash containg keywords which will be finaly searched in text



    61 keyword for actual language saved into variable
    63 opening a file containging urls
    64 making html page coresponding to language

    65-66 creating directories for saving the files named by language

    68 printing first part of html into html file








    70 reading text file containging urls




    77 calling download_html (arg: url, language, number) which saves it and returns encoding
    79 calling context_txt_html (arg: number, langue, keyword), return occurence in the document

    80 counting occurences for whole language

    81-106 if is page not downloaded skipped this part, number is not increased and I go for next file (not continuating with table)

    85 other part of row in table
    86-88 if charset isn't utf-8 printing link for non-converted file





    92-99 verifing occurences - if 0 printing not found, else prints links into table









    101 changing number, it means switching onto next line of url-file (next link)






    109-110 printing end of table and html document


    ---------------------------------------
    SUBPROGRAM DOWNLOAD_HTML
    download each html file which got from url and return its encoding
    ARGUMENTS: url, language, number
    RETURNS: encoding

    122 downloading pages content (using module LWP::Simple), if it's not downloaded, it prints an error into the error file.
    125-142 if web site downloaded:


    129 saving into a html file


    132 detecting content encoding (module Encode::Detect)
    134 if it's not detected it try to guess it (module Encode::Guess)




    140 calling a sub-program which makes a txt file from html.



    144-146 if not downloaded, encoding is set as "not downloaded"

    148 returns encoding/not downloaded


    ---------------------------------------
    SUBPROGRAM TXT_SAVE
    saves a dumped text
    ARGUMENTS: html-content, language, number, charset


    158-163 if it's encoding isn't utf-8 save a non converted copy without html tags (using delete_html subprogram)




    165 saves txt files in utf-8 without html tags (using delete_html subprogram)


    ---------------------------------------
    SUBPROGRAM DELETE_HTML
    delete every html tag
    ARGUMENTS: html-content
    RETURNS: content-without-html

    174-177 using features of HTML::TreeBuilder;



    ---------------------------------------
    SUBPROGRAM CONTEXT_TXT_HTML
    finds a KWIC and print it into TXT, HTML
    ARGUMENTS: number, language, motif
    RETURNS: number-of-occurence



    187 setting a length of context around key-word






    194-196 opening a source file and two export files (txt, html)





    200-206 printing header of html and additional info






    208 printing header of txt file



    212-261 reading of the source file and looking up for context and exporting, counting number of occs ($x)

















    232-237 and 239-244 looking up for first/last whole word in range of context length

























    258-261 if keyword wasnt found, deleting of files created in this case




    ---------------------------------------
    SUBPROGRAM DUMP_GLOBAUX
    makes global dumped file (from DUMP_TEXT) for every language (directory), containg every dumped file, saves into created file FISCHIERGLOBAUX

    270-290 browsing the directory and opening every txt file









    281 printing content of txt files into one dumped file











    ---------------------------------------
    SUBPROGRAM DUMP_GLOBAUX_CONTEXT
    same as subprogram above, just one difference, it skips first line in a file.



























    ---------------------------------------
    SUBPROGRAM TABS_GLOBAUX
    dumping tableaux files into one (taking every line which begin with "<TR"), giving header to the tab


    330 and 337-338 printing beggining of the global html file

    332-333 and 340 browsing directory and opening a html file









    341-344 reading a html file and every line which begins with "tr" is printed into global html file



    347 closing html-table



    351 closing html file