
Puts additional space when extracting text from PDF using PyPDF2


I am working with a PDF file, using PyPDF2 for text extraction. When extracting text from this file, I get spurious spaces between characters of the same word.

from PyPDF2 import PdfReader

reader = PdfReader("00001926B.pdf")
page = reader.pages[80]
text = page.extract_text()
print(text)

The output is:

 2015 Microchip Technology Inc. DS00001926B-page 81LAN9354
9.2.2.6 100M Phase Lock Loop (PLL)
The 100M PLL locks onto the reference clock and generates the 125 MHz clock used to drive the 125 MHz logic and
the 100BASE-TX Transmitter.
9.2.3 100BASE-TX RECEIVE
The 100BASE-TX receive data path is shown in Figure 9-3 . Shaded blocks are those which are internal to the PHY.
Each major block is explained in the following sections.
9.2.3.1 100M Receive Input
The MLT-3 data from the cable is fed into the PHY on inputs RXPx and RXNx via a 1:1 ratio transformer. The ADC sam-
ples the incoming differential signal at a rate of 125M sa mples per second. Using a 64-level quantizer, 6 digital bits are
generated to represent each sample. The DSP adjusts the gain of the ADC according to the observed signal levels suchthat the full dynamic range of the ADC can be used.
9.2.3.2 Equalizer, BLW Correction and Clock/Data Recovery
The 6 bits from the ADC are fed into the DSP block. The equalizer in the DSP section compensates for phase and ampli-
tude distortion caused by the physical channel consisting of magnetics, connectors, and CAT- 5 cable. The equalizer
can restore the signal for any good-qua lity CAT-5 cable between 1m and 100m.
If the DC content of t he signal is such that the low-frequency comp onents fall below the low frequency pole of the iso-
lation transformer, then the droop characteristics of the transformer will become significant and Baseline Wander (BLW)
on the received signal will result. To prevent corruption of the received data, the PHY corrects for BLW and can receive
the ANSI X3.263-1995 FDDI TP-PMD defined “killer packet” with no bit errors.
The 100M PLL generates multiple phases of  the 125MHz clock. A multip lexer, controlled by the timing unit of the DSP,
selects the optimum phase for sampling the data. This is used as the received recovered clock. This clock is used to
extract the serial data from the received signal.
9.2.3.3 NRZI and MLT-3 Decoding
The DSP generates the MLT-3 recovered le vels that are fed to the MLT-3 converter. The MLT-3 is then converted to an
NRZI data stream.FIGURE 9-3: 100BASE-TX RECEIVE DATA PATH
Port x
MAC
A/D 
ConverterMLT-3 
ConverterNRZI 
Converter4B/5B 
Decoder
Magnetics CAT-5 RJ45100M 
PLL
Internal
MII 25MHz by 4 bitsInternal
MII Receive Clock
25MHz by
5 bits
NRZI
MLT-3 MLT-3 MLT-3
6 bit DataDescrambler 
and SIPO
125 Mbps Serial
DSP: Timing 
recovery, Equalizer 
and BLW CorrectionMLT-3MII MAC 
Interface25MHz
by 4 bits

Examples of the issue:

le vels instead of levels

good-qua lity instead of good-quality

Is there any way to solve this problem?


Solution

  • You need to understand why PDF content can be in "random" order. In this file the first page object stored is page 2, and its parent is object #33760 (so page 1 can be stored miles later).

    Looking at page 2, the text in the content stream looks like this:

    (LAN9354)
    

    Good, at least it starts with the first line, in large type. Then it draws a few graphics, such as lines, down the page to the bottom, and moves to the left for more text at the bottom of the page:

    [(DS0000192)-7.5(6B-page 2)]
    

    Next it jumps seemingly at random around the page and ends up on the right:

    [( 2015 Micr)-5.3(ochip T)-5.7(e)-.2(chnolo)-7.8(g)-.2(y Inc.)]
    

    The position then jumps high up to the centre of the page to print:

    (TO OUR VALUED CUSTOMERS)
    

    Then it goes to the left, for a column half the width of the page, breaking off to jump to and fro constantly between letters (they do not need to be printed in order either):

    [(It )7.5(is our int)-8.2(ention)-8.1( )7.5(to )7.5(pr)-5.6(ovide our valued customer)-5.6(s with)-8.1( )7.5(the bes)]TJ
    

    So all in all, the order of these computerised bits and blobs is not what you want to see in extracted text. What you need is an extractor that first buffers the page's text objects and then reads them top to bottom; the best for that can be pdftotext.

    The problem is that you need to tell it to scan for the other half of that line by working on the page as a whole (not by halves); otherwise you will not see the "t" that completes "bes" as "best":

    [(t )-7.5(documentation possible to)-7.1( ensure )]TJ
    

    nor the other fragments of that line:

    [(successful use of )-7.5(yo)-7.3(ur Micro)]TJ
    [(chi)6.1(p)]TJ
    

    Now we can combine all the fragments that make up the second line:

    [(produ)-8(cts. To this end, we will con)-8(t)-.6(inue )]TJ
    

    etc. etc.
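To make the fragments above more concrete, here is a minimal sketch of how one such `TJ` array could be reassembled. In a `TJ` array the strings are pieces of text and the numbers are position adjustments in thousandths of a text-space unit; a negative number widens the gap before the next piece. The function names and the `space_threshold` values are hypothetical, the parser assumes no escaped or nested parentheses, and this is not PyPDF2's actual algorithm. It only illustrates how a spacing heuristic that is too eager could turn kerning into gaps such as "con tinue".

```python
import re

# A token is either a literal string "(...)" or a number (position adjustment).
# Simplification: strings are assumed to contain no escaped or nested parentheses.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d*\.?\d+)")


def parse_tj(array_body):
    """Split the body of a TJ array into text pieces (str) and adjustments (float)."""
    items = []
    for text, number in TOKEN.findall(array_body):
        items.append(float(number) if number else text)
    return items


def join_tj(items, space_threshold):
    """Concatenate the text pieces, emitting a space only for large enough gaps.

    A negative adjustment widens the gap, so the gap size is -item.
    space_threshold is a hypothetical tuning knob, not a value from any library.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item > space_threshold:
            parts.append(" ")
    return "".join(parts)


line = "[(produ)-8(cts. To this end, we will con)-8(t)-.6(inue )]"
items = parse_tj(line)

# Treating the small adjustments as kerning keeps the words whole.
print(repr(join_tj(items, space_threshold=100)))
# -> 'products. To this end, we will continue '

# A threshold that is too low turns kerning into spurious spaces.
print(repr(join_tj(items, space_threshold=5)))
# -> 'produ cts. To this end, we will con tinue '
```

This handles only the pieces inside a single array. The harder part, as described above, is that the arrays themselves arrive in an order that does not match the reading order of the page.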

    Your question was about pages that look like they have only two sections of text; if we now go back to page 1, you can see a different need.

    Here all the pieces at the top need to be treated as one long line (or a few) to start, and only then collected down the left, then down the right. So you need a different profile in the application: horizontal across the top, then two vertical columns whose boundary is at an unknown location.
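As a rough illustration of such a "profile", the sketch below orders positioned text fragments as a full-width band across the top, then the left column, then the right column. Everything here is hypothetical: the `(x, y, text)` tuples, the `header_bottom` and `column_split` parameters, and the sample coordinates are made up, and in practice the two boundaries would have to be found per page.

```python
def reading_order(fragments, header_bottom, column_split):
    """Return fragment texts in reading order for a header plus two columns.

    fragments: list of (x, y, text), with y growing up the page as in PDF.
    header_bottom: y value at or above which a fragment counts as header.
    column_split: x value separating the left column from the right one.
    """
    header = [f for f in fragments if f[1] >= header_bottom]
    body = [f for f in fragments if f[1] < header_bottom]
    left = [f for f in body if f[0] < column_split]
    right = [f for f in body if f[0] >= column_split]

    def by_row(fragment):
        # Top of the page first (largest y), then left to right.
        return (-fragment[1], fragment[0])

    ordered = (
        sorted(header, key=by_row)
        + sorted(left, key=by_row)
        + sorted(right, key=by_row)
    )
    return [f[2] for f in ordered]


# Made-up fragments, deliberately listed out of order as a PDF might store them.
sample = [
    (320, 600, "right column, line 1"),
    (40, 580, "left column, line 2"),
    (40, 750, "page title"),
    (320, 580, "right column, line 2"),
    (40, 600, "left column, line 1"),
]

for text in reading_order(sample, header_bottom=700, column_split=300):
    print(text)
```

A page with a different layout would need different boundaries, or a different rule altogether.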

    There is no easy answer as EVERY PDF page can be different and needs Artificial Inspection.
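For the pdftotext route, a minimal sketch of calling the command-line tool from Python could look like this. It assumes the `pdftotext` executable (from Poppler or Xpdf) is installed and on the PATH. The helper names are hypothetical, and the use of `-layout` is an assumption about how the output below was produced. Note that pdftotext page numbers start at 1, so `reader.pages[80]` in the question corresponds to page 81 here.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page, layout=True):
    """Build a pdftotext command that extracts one page to standard output."""
    cmd = ["pdftotext", "-f", str(page), "-l", str(page)]
    if layout:
        # Ask pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    # A trailing "-" sends the text to stdout instead of a file.
    cmd += [pdf_path, "-"]
    return cmd


def extract_page(pdf_path, page, layout=True):
    """Run pdftotext and return the extracted text of a single page."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page, layout),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


print(build_pdftotext_cmd("00001926B.pdf", 81))
```

Calling `extract_page("00001926B.pdf", 81)` would then return the page text as a string, provided the tool and the file are available.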


    We can see the text is now in order, and the lines that belong at the bottom are truly placed at the end of the page, with "levels" and "good-quality" free of gaps. (It can still be wrong of course, but that is less likely than with other methods.)

    can restore the signal for any good-quality CAT-5 cable between 1m and 100m.
    If the DC content of the signal is such that the low-frequency components fall below the low frequency pole of the iso-
    lation transformer, then the droop characteristics of the transformer will become significant and Baseline Wander (BLW)
    on the received signal will result. To prevent corruption of the received data, the PHY corrects for BLW and can receive
    the ANSI X3.263-1995 FDDI TP-PMD defined “killer packet” with no bit errors.
    The 100M PLL generates multiple phases of the 125MHz clock. A multiplexer, controlled by the timing unit of the DSP,
    selects the optimum phase for sampling the data. This is used as the received recovered clock. This clock is used to
    extract the serial data from the received signal.
    
    9.2.3.3        NRZI and MLT-3 Decoding
    The DSP generates the MLT-3 recovered levels that are fed to the MLT-3 converter. The MLT-3 is then converted to an
    NRZI data stream.
    
    
    
    
     2015 Microchip Technology Inc.                                                                    DS00001926B-page 81