I am generating text from a PDF file with the help of pdftotext.
My issue is not with pdftotext itself but with formatting the text it produces, which looks like this:
Salman Madhuri Mohnish Renuka Anupam
Khan Dixit Behl Shahane Kher
Prem Nisha Chou... Rajesh Pooja Chou... Prof. Siddh
Hum Aapke Hain Koun...! (1994) - Full cast and crew
www.imdb.com/title/tt0110076/fullcredits
Hum Aapke Hain Koun...! on IMDb: Movies, TV, Celebs, and more... ... IMDbPro.com
offers representation listings for over 120,000 individuals, including actors, ...
I need the output to be:
Salman Khan Prem
Madhuri Dixit Nisha Chou...
Mohnish Behl Rajesh
Renuka Shahane Pooja Chou...
Anupam Kher Prof.
Hum Aapke Hain Koun...! (1994) - Full cast and crew
www.imdb.com/title/tt0110076/fullcredits
Hum Aapke Hain Koun...! on IMDb: Movies, TV, Celebs, and more... ... IMDbPro.com
offers representation listings for over 120,000 individuals, including actors, ...
I'm not sure what your delimiters are, but you could do something like the following (kind of ugly, but it gets the job done):
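// Split the text into the name block and the remaining content (a blank line separates them).
$namesAndContent = explode("\r\n\r\n", $theString);
// One row per line of the name block.
$nameRows = explode("\r\n", $namesAndContent[0]);
$names = array();
foreach ($nameRows as $row) {
    // Columns are assumed to be separated by two or more whitespace characters.
    $items = preg_split('/\s{2,}/', $row);
    foreach ($items as $index => $namePart) {
        if (!array_key_exists($index, $names)) {
            $names[$index] = array();
        }
        // Collect the parts of each column so that every column becomes one line.
        $names[$index][] = $namePart;
    }
}
// Print each column as a single space-separated line.
foreach ($names as $name) {
    echo implode(' ', $name) . "\r\n";
}
echo "\r\n";
// Print the remaining content unchanged.
echo $namesAndContent[1];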
Demo: http://codepad.viper-7.com/Nr1Q4t
The above would format the data (provided the delimiters are correct), but I am wondering where the data originally comes from (before it ended up in the PDF), because I suspect there is a better way to solve your problem. Perhaps there is an API you can use directly.
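If the line endings are the uncertain part, a variant that does not depend on "\r\n" may be easier to adapt. The following is only a sketch under assumptions: the helper name transposeColumns and the sample string are made up for illustration, columns are still assumed to be separated by two or more whitespace characters, and it needs PHP 7.4 or later for the arrow function. It uses \R, which matches any line-break sequence in PCRE, so it works whether the text contains "\n" or "\r\n".

```php
<?php
// Hypothetical helper (not part of the code above): turns a block of
// column-aligned rows into one output line per column.
function transposeColumns(string $block): array
{
    $columns = [];
    // \R matches any line-break sequence (\n, \r\n, or \r).
    foreach (preg_split('/\R/', trim($block)) as $row) {
        // Assumption: columns are separated by two or more whitespace characters.
        $cells = preg_split('/\s{2,}/', trim($row));
        foreach ($cells as $index => $cell) {
            $columns[$index][] = $cell;
        }
    }
    // Join the pieces collected for each column with a single space.
    return array_map(fn(array $parts): string => implode(' ', $parts), $columns);
}

// Shortened, made-up sample with two spaces between columns.
$text = "Salman  Madhuri  Mohnish\nKhan  Dixit  Behl\nPrem  Nisha Chou...  Rajesh\n\n"
      . "Hum Aapke Hain Koun...! (1994) - Full cast and crew";

// Split off the first block only; everything after the first blank line is kept as-is.
$parts = preg_split('/\R{2,}/', $text, 2);
$names = transposeColumns($parts[0]);
$rest  = $parts[1] ?? '';

echo implode("\n", $names), "\n\n", $rest, "\n";
```

Passing 2 as the limit to preg_split means later blank lines in the remaining content are left untouched rather than producing extra pieces.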