Can OpenOffice count words from the console?


I have a small problem: I need to count words from the console in doc, docx, pptx, ppt, xls, xlsx, odt and pdf files. Please don't suggest | wc -w or grep. They only work on plain text or console output and only count whitespace-separated tokens, while Japanese, Chinese, Arabic, Hindi and Hebrew use different delimiters, so the word count comes out wrong. I tried counting with these:

pdftotext file.pdf - | wc -w
/usr/local/bin/docx2txt.pl < file.docx | wc -w
/usr/local/bin/pptx2txt.pl < file.pptx | wc -w
antiword file.doc - | wc -w
antiword file.word - | wc -w

In some cases Microsoft Word or OpenOffice reports 1000 words while these counters return 10 or 300, when the language is Japanese, Chinese, Hindi, etc. With Latin-script text I have no issue: at worst the count is 3 lower in some cases, which is OK.

I tried to convert the files with soffice / openoffice and then run wc -w on the result, but I can't even get the conversion to work:

soffice --headless --nofirststartwizard --accept=socket,host=127.0.0.1,port=8100; --convert-to pdf some.pdf /var/www/domains/vocabridge.com/devel/temp_files/23/0/东京_1000_words_Docx.docx 

OR

 openoffice.org  --headless  --convert-to  ........

OR

openoffice.org3 --invisible 

So if anyone knows a way to count words correctly, or to display document statistics, with OpenOffice or anything else from the Linux console, please share it.

Thanks.


Solution

  • I found the answer: create a service that converts incoming files to PDF

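    #!/bin/sh
    #
    # chkconfig: 345 99 01
    #
    # description: your script is a test service
    #
    
    # Poll the input directory once a second, in the background.
    (while sleep 1; do
      # A glob instead of parsing ls output keeps odd file names intact.
      for file in pathwithfiles/in/*; do
        [ -f "$file" ] || continue   # skip the unmatched pattern and non-files
        libreoffice --headless --convert-to pdf "$file" --outdir pathwithfiles/out
        rm "$file"
      done
    done) &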
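
From the console, a file can be handed to a polling service like this by copying it into the input directory and waiting for the PDF to show up. A rough sketch, assuming the service writes NAME.pdf for an input called NAME.ext; IN_DIR, OUT_DIR and submit_and_wait are placeholder names made up for this example, not part of the original setup:

```shell
#!/bin/sh
# Sketch: hand one file to the polling service and wait for its PDF.
# IN_DIR, OUT_DIR and the function name are placeholders (assumptions).
IN_DIR=${IN_DIR:-pathwithfiles/in}
OUT_DIR=${OUT_DIR:-pathwithfiles/out}

# submit_and_wait FILE [TIMEOUT_SECONDS]
# Prints the PDF path on success; returns 1 if the copy fails or the wait times out.
submit_and_wait() {
  src=$1
  timeout=${2:-60}
  name=${src##*/}                        # file name without directories
  tmp="$IN_DIR/.$name.$$"                # hidden name: ls and * both skip dotfiles
  cp "$src" "$tmp" || return 1
  mv "$tmp" "$IN_DIR/$name" || return 1  # rename within one filesystem is atomic
  pdf="$OUT_DIR/${name%.*}.pdf"          # assumed output name: extension swapped for .pdf
  waited=0
  while [ ! -f "$pdf" ]; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  printf '%s\n' "$pdf"
}
```

Copying under a hidden name and renaming afterwards is meant to keep the service from picking up a half-written file. The sketch reuses the original file name, so two files with the same name would collide; a unique prefix would avoid that. The PDF may also still be in the middle of being written at the moment it first appears.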
    

    Then the PHP script I needed counts everything:

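    $ext = pathinfo($absolute_file_path, PATHINFO_EXTENSION);
    if ($ext !== 'txt' && $ext !== 'pdf') {
        // Hand the file to the conversion service under a unique name.
        // time() replaces the argument-less mktime(), which PHP 8 rejects.
        $tb = time() . mt_rand();
        $tempfile = 'locationofpdfs/in/' . $tb . '.' . $ext;
        copy($absolute_file_path, $tempfile);
        $absolute_file_path = 'locationofpdfs/out/' . $tb . '.pdf';
        $ext = 'pdf';
        // Blocks until the service has written the PDF (there is no timeout).
        while (!is_file($absolute_file_path)) sleep(1);
    }
    if ($ext !== 'txt') {
        // Convert the PDF to plain text; both paths are shell-escaped.
        $tempfile = tempnam(sys_get_temp_dir(), '');
        shell_exec('pdftotext ' . escapeshellarg($absolute_file_path) . ' ' . escapeshellarg($tempfile));
        $absolute_file_path = $tempfile;
        $ext = 'txt';
    }
    if ($ext === 'txt') {
        // Delimiters: whitespace and basic punctuation.
        $seq = '/[\s\.,;:!\? ]+/mu';
        $plain = file_get_contents($absolute_file_path);
        // Drop {{{...}}} spans before counting.
        $plain = preg_replace('#\{{{.*?\}}}#su', "", $plain);
        // $chars: characters left once every delimiter is removed.
        $str = preg_replace($seq, '', $plain);
        $chars = count(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));
        // $words: delimiter-separated tokens.
        $words = count(preg_split($seq, $plain, -1, PREG_SPLIT_NO_EMPTY));
        if ($words === 0) return $chars;
        // More than 10 characters per token on average suggests a script written
        // without spaces (e.g. Chinese, Japanese), so count characters instead.
        if ($chars / $words > 10) $words = $chars;
        return $words;
    }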
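
The same counting heuristic can be approximated directly in the shell, which is closer to what the question asked for. This is only a sketch under assumptions: the input is UTF-8 plain text on stdin, only ASCII whitespace and the punctuation . , ; : ! ? act as delimiters, and the {{{...}}} stripping from the PHP version is left out. The function names count_tokens, count_chars and estimate_words are invented for the example.

```shell
#!/bin/sh
# Sketch of the "tokens, or characters when tokens look too long" heuristic.
# Assumes UTF-8 text on stdin; all function names here are made up.

# Count tokens separated by ASCII whitespace or . , ; : ! ?
count_tokens() {
  awk '{ gsub(/[.,;:!?\r]/, " "); n += NF } END { print n + 0 }'
}

# Count UTF-8 code points that are not delimiters.
# Deleting continuation bytes (0x80-0xBF) leaves one byte per code point.
count_chars() {
  LC_ALL=C tr -d ' \t\n\r.,;:!?\200-\277' | wc -c | tr -d ' '
}

# Print the token count, or the character count when there are no tokens
# or the average token is longer than 10 characters.
estimate_words() {
  text=$(cat)
  tokens=$(printf '%s\n' "$text" | count_tokens)
  chars=$(printf '%s\n' "$text" | count_chars)
  if [ "$tokens" -eq 0 ] || [ "$chars" -gt $((tokens * 10)) ]; then
    echo "$chars"
  else
    echo "$tokens"
  fi
}
```

Once the functions are loaded into the current shell, something like pdftotext file.pdf - | estimate_words gives a count for a PDF. As in the PHP version, a text without spaces is counted per character, which is not the same as the word count a word processor reports, and non-ASCII delimiters such as the ideographic space or full-width punctuation are counted as characters here.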