Error reading a Tesseract OCR output file with fopen in PHP


I'm using fopen in PHP to open a text file produced by Tesseract OCR. The returned text contains <<<<<<, and it looks like fopen reads till it finds the first < character then stops.

File returned from OCR:

P<dsdasdasd<<dasd<adsda<dsada<<<<<<<<<<ec<
dasdasdsdasdasdasdasd<<<<<<<<<<<<<<06

£ y

The output of echo in the browser:

P

If I view the page source, I find the rest of the text there, shown in red.

Code I used:

<?php
file_put_contents("tmpFile.jpg",file_get_contents("1.jpg"));
$cmd = "tesseract tmpFile.jpg ee ";
exec($cmd);
$myfile = fopen("ee.txt", "r") or die("Unable to open file!");
$data= fread($myfile,100000000);
fclose($myfile);
echo $data;
?>

I pasted the text into this question and part of it was hidden there as well.

[Screenshot: the question editor, with part of the pasted text hidden]

[Screenshot: the browser output next to the page's view-source]


Solution

  • As far as I can see, the issue has nothing to do with tesseract or your input text file.

    fopen reads till it finds the first < character then stops

    I don't think that's true. Why would you see the rest of the text in "view source", then? fopen and fread read the whole file; the issue is with how that data is displayed in your browser.

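    You want to display characters which are reserved for HTML tags - in this case a < ("less than" symbol). The browser treats each < as the start of a tag, so everything after the first one is parsed as markup instead of being shown as text. That's why you get red text in "view source": the browser is flagging it as markup it cannot interpret.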

    As a first workaround, wrap your <?php block in a <textarea> element, whose content the browser shows as plain text, to view the data:

    <textarea><?php
    /* ...
    your regular code here
    ... */
    ?></textarea>
    

    The next step should be to encode those special characters before you give them to echo. Have a look at htmlspecialchars or htmlentities.
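
    A minimal sketch of that encoding step, assuming the OCR text is UTF-8 and should be shown as preformatted text. The helper name escape_for_html and the $sample string (a line copied from the question) are illustrative only; in the real script the text would come from ee.txt instead:

    ```php
    <?php
    // Hypothetical helper, not part of the original code: escape text so the
    // browser shows it literally instead of parsing it as HTML.
    function escape_for_html(string $text): string
    {
        // ENT_QUOTES also escapes single quotes; ENT_SUBSTITUTE replaces invalid
        // byte sequences instead of making the call return an empty string.
        return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    }

    // Stand-in for the contents of ee.txt (first line of the OCR output above).
    $sample = "P<dsdasdasd<<dasd<adsda<dsada<<<<<<<<<<ec<";

    // Each < is sent to the browser as &lt;, so the whole line becomes visible.
    echo '<pre>' . escape_for_html($sample) . '</pre>';
    ```

    htmlspecialchars only converts the few characters that are special in HTML (&, <, >, and quotes), which is enough here. htmlentities converts every character that has a named entity, which is rarely needed when the page is served as UTF-8.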

    You might also find useful information on the topic at: