Get only third and sixth column from command output of pdffonts


I am using Poppler's pdffonts to list the fonts in a PDF document. Below is sample output:

$ pdffonts "some.pdf"
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman                        TrueType          WinAnsi          no  no  no      36  0
TimesNewRoman,Bold                   TrueType          WinAnsi          no  no  no      38  0
EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0
Arial                                TrueType          WinAnsi          no  no  no      43  0
Arial,Bold                           TrueType          WinAnsi          no  no  no      16  0

Now I want to get only the "encoding" and "uni" column values from the above output, but I am unable to because the spacing is inconsistent from row to row.

Methods tried (Python):

1) Split each line on whitespace, join with single spaces, and split again, so that the elements at indices 2 and 5 of the resulting list give the required values for each line. This approach fails because some row values (for example "CID TrueType" or "Type 1C") contain spaces themselves, which shifts the indices.

Code sample:

import os

for line in os.popen("pdffonts some.pdf").readlines():
    print(' '.join(line.split()).split())

output:

['name', 'type', 'encoding', 'emb', 'sub', 'uni', 'object', 'ID']
['------------------------------------', '-----------------', '----------------', '---', '---', '---', '---------']
['FMGLMO+MyriadPro-Bold', 'Type', '1C', 'Custom', 'yes', 'yes', 'yes', '127', '0']
['FMGMMM+MyriadPro-Semibold', 'Type', '1C', 'Custom', 'yes', 'yes', 'yes', '88', '0']
['Arial-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '90', '0']
['TimesNewRomanPSMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '92', '0']
['FMGMHL+TimesNewRomanPSMT', 'CID', 'TrueType', 'Identity-H', 'yes', 'yes', 'no', '95', '0']
['FMHBEE+Arial-BoldMT', 'CID', 'TrueType', 'Identity-H', 'yes', 'yes', 'no', '100', '0']
['TimesNewRomanPS-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '103', '0']

2) Use a regex to split each line of the output on runs of at least two spaces. This approach fails because the emb, sub, and uni values are sometimes separated by only one space, so they get clubbed together into a single element and index 5 is no longer reliable.

Code Sample:

import os
import re

for line in os.popen("pdffonts some.pdf").readlines():
    print(re.split(r'\s{2,}', line.strip()))

Output:

['name', 'type', 'encoding', 'emb sub uni object ID']
['------------------------------------ ----------------- ---------------- --- --- --- ---------']
['FMGLMO+MyriadPro-Bold', 'Type 1C', 'Custom', 'yes yes yes', '127', '0']
['FMGMMM+MyriadPro-Semibold', 'Type 1C', 'Custom', 'yes yes yes', '88', '0']
['Arial-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '90', '0']
['TimesNewRomanPSMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '92', '0']
['FMGMHL+TimesNewRomanPSMT', 'CID TrueType', 'Identity-H', 'yes yes no', '95', '0']
['FMHBEE+Arial-BoldMT', 'CID TrueType', 'Identity-H', 'yes yes no', '100', '0']
['TimesNewRomanPS-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '103', '0']

AWK: This fails because of the same spacing issue. Compare with the original output above: for the "CID TrueType" row, the third field is "TrueType" rather than the encoding.

$ pdffonts "some.pdf"|awk '{print $3}'

encoding
----------------
WinAnsi
WinAnsi
TrueType
WinAnsi
WinAnsi

Solution

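  • You can record the starting position of every column name in the header line (here the pdffonts output is assumed to have been saved to pdffonts.txt):

    with open('pdffonts.txt') as f:
        header = f.readline()   # first line holds the column names
        read_data = f.read()    # remaining lines: separator row, then font rows
    
    header_values = header.split()
    
    # Map each column name to the offset where it starts in the header
    positions = {}
    for name in header_values:
        positions[name] = header.index(name)
    print(positions)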
    

    This will give you the following example dictionary:

    {'name': 0, 'type': 37, 'encoding': 55, 'emb': 72, 'sub': 76, 'uni': 80, 'object': 84, 'ID': 91}
    

    After that you can specify the substring range to extract:

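    desired_columns = []
    # [1:] skips the dashed separator row below the header
    for line in read_data.splitlines()[1:]:
        # Each column runs from its own start up to the start of the next column
        encoding = line[positions['encoding']:positions['emb']].strip()
        uni = line[positions['uni']:positions['object']].strip()
        desired_columns.append([encoding, uni])
    
    print(desired_columns)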
    

    Result:

    [['WinAnsi', 'no'], ['WinAnsi', 'no'], ['Identity-H', 'yes'], ['WinAnsi', 'no'], ['WinAnsi', 'no']]
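
A variation on the approach above, offered as a minimal sketch rather than a definitive implementation: instead of looking up each column name in the header, it reads the column boundaries from the dashed separator row, so a column name that happens to occur inside another one cannot produce a wrong offset. The function name `extract_columns` and the `SAMPLE` constant are made up for illustration; the sample text is the output shown in the question, and the sketch assumes every row is aligned to the separator row as it is there.

```python
import re

# Sample pdffonts output copied from the question (illustrative input only).
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman                        TrueType          WinAnsi          no  no  no      36  0
TimesNewRoman,Bold                   TrueType          WinAnsi          no  no  no      38  0
EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0
Arial                                TrueType          WinAnsi          no  no  no      43  0
Arial,Bold                           TrueType          WinAnsi          no  no  no      16  0
"""


def extract_columns(text, wanted):
    """Return the requested columns of fixed-width pdffonts-style text.

    Hypothetical helper: column boundaries come from the dashed second line.
    """
    lines = text.splitlines()
    header, ruler, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes gives one column's (start, end) character offsets.
    spans = [m.span() for m in re.finditer(r"-+", ruler)]
    # Name each span after the header text above it; the last span is "object ID".
    names = [header[start:end].strip() for start, end in spans]
    # names.index raises ValueError if a requested column does not exist.
    picked = [spans[names.index(name)] for name in wanted]
    return [
        [row[start:end].strip() for start, end in picked]
        for row in rows
        if row.strip()  # ignore blank lines
    ]


print(extract_columns(SAMPLE, ["encoding", "uni"]))
```

For live output, the text could come from `subprocess.run(["pdffonts", "some.pdf"], capture_output=True, text=True).stdout` instead of `SAMPLE`. Because values are sliced by position, an entry such as "CID TrueType" stays in its own column.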