I'm using fopen in PHP to open a file produced by Tesseract OCR. The returned text contains runs of < characters (for example <<<<<<), and fopen seems to read until it finds the first < character, then stops.
File returned from OCR:
P<dsdasdasd<<dasd<adsda<dsada<<<<<<<<<<ec<
dasdasdsdasdasdasdasd<<<<<<<<<<<<<<06
£ y
The echo from fopen:
P
If I view the page source, I find the rest of the text shown in red.
Code I used:
<?php
file_put_contents("tmpFile.jpg",file_get_contents("1.jpg"));
$cmd = "tesseract tmpFile.jpg ee ";
exec($cmd);
$myfile = fopen("ee.txt", "r") or die("Unable to open file!");
$data= fread($myfile,100000000);
fclose($myfile);
echo $data;
?>
When I pasted the text into this question, it was hidden as well.
Screenshot taken while typing the question, showing the text hidden:
Screenshot of the output and the page source:
As far as I can see, the issue has nothing to do with tesseract or your input text file.
"fopen reads till it finds the first < character then stops"
I don't think that's true. Why would you see the rest of the text in "view source", then? Your code reads the whole file; the issue is with displaying that information in your browser.
You are trying to display characters that are reserved for HTML markup, in this case < (the "less than" symbol). The browser treats whatever follows a < as the start of a tag, so the text disappears from the rendered page and shows up in red in "view source", because the browser cannot interpret it as valid HTML.
As a first workaround, just wrap your <?php block in a <textarea> tag to view the data:
<textarea><?php
/* ...
your regular code here
... */
?></textarea>
The next step should be to encode those special characters before you pass them to echo. Have a look at htmlspecialchars or htmlentities.
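As a rough illustration of that encoding step, here is a minimal sketch using htmlspecialchars. The helper name escapeForHtml and the sample string are my own assumptions for the example, not part of the original code. ENT_SUBSTITUTE is included because OCR output may contain bytes that are not valid UTF-8, in which case htmlspecialchars would otherwise return an empty string.

```php
<?php
// Hypothetical helper (not from the original post): converts characters that
// are special in HTML (<, >, &, and both quote types) into entities so the
// browser displays them as text instead of parsing them as markup.
function escapeForHtml(string $text): string
{
    // ENT_SUBSTITUTE replaces invalid UTF-8 sequences instead of
    // making the whole call return an empty string.
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Sample string modeled on the OCR output in the question (an assumption,
// used here so the example runs without the ee.txt file).
$sample = 'P<dsdasdasd<<dasd<adsda';

// <pre> keeps the line breaks of the OCR text visible in the browser.
echo '<pre>' . escapeForHtml($sample) . '</pre>';
```

In the script from the question, this would probably amount to replacing echo $data; with echo escapeForHtml($data);.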