I am using Poppler's pdffonts to list the fonts in a PDF document. Below is sample output:
$ pdffonts "some.pdf"
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman                        TrueType          WinAnsi          no  no  no      36  0
TimesNewRoman,Bold                   TrueType          WinAnsi          no  no  no      38  0
EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0
Arial                                TrueType          WinAnsi          no  no  no      43  0
Arial,Bold                           TrueType          WinAnsi          no  no  no      16  0
Now I want to extract only the "encoding" and "uni" column values from the output above, but I am unable to because the spacing differs from row to row.
Methods tried (Python):
1) Split each line on whitespace, join with single spaces, then split again, so that indices 2 and 5 of the resulting list give the required values for each line. This approach fails because some values (such as "Type 1C" and "CID TrueType") contain spaces, which shifts the indices.
Code sample:
import os

for line in os.popen("pdffonts some.pdf").readlines():
    print(' '.join(line.split()).split())
output:
['name', 'type', 'encoding', 'emb', 'sub', 'uni', 'object', 'ID']
['------------------------------------', '-----------------', '----------------', '---', '---', '---', '---------']
['FMGLMO+MyriadPro-Bold', 'Type', '1C', 'Custom', 'yes', 'yes', 'yes', '127', '0']
['FMGMMM+MyriadPro-Semibold', 'Type', '1C', 'Custom', 'yes', 'yes', 'yes', '88', '0']
['Arial-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '90', '0']
['TimesNewRomanPSMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '92', '0']
['FMGMHL+TimesNewRomanPSMT', 'CID', 'TrueType', 'Identity-H', 'yes', 'yes', 'no', '95', '0']
['FMHBEE+Arial-BoldMT', 'CID', 'TrueType', 'Identity-H', 'yes', 'yes', 'no', '100', '0']
['TimesNewRomanPS-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '103', '0']
2) Use a regex to split each line on runs of at least two spaces. This approach fails because the "yes" values in the emb, sub and uni columns are separated by only one space, so they are clubbed into a single element and index 5 no longer holds the "uni" value.
Code Sample:
import os
import re

for line in os.popen("pdffonts some.pdf").readlines():
    print(re.split(r'\s{2,}', line.strip()))
Output:
['name', 'type', 'encoding', 'emb sub uni object ID']
['------------------------------------ ----------------- ---------------- --- --- --- ---------']
['FMGLMO+MyriadPro-Bold', 'Type 1C', 'Custom', 'yes yes yes', '127', '0']
['FMGMMM+MyriadPro-Semibold', 'Type 1C', 'Custom', 'yes yes yes', '88', '0']
['Arial-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '90', '0']
['TimesNewRomanPSMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '92', '0']
['FMGMHL+TimesNewRomanPSMT', 'CID TrueType', 'Identity-H', 'yes yes no', '95', '0']
['FMHBEE+Arial-BoldMT', 'CID TrueType', 'Identity-H', 'yes yes no', '100', '0']
['TimesNewRomanPS-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '103', '0']
AWK: This also fails because of the spaces inside values. Compare with the original output above to see the difference: the third row prints TrueType instead of Identity-H.
$ pdffonts "some.pdf"|awk '{print $3}'
encoding
----------------
WinAnsi
WinAnsi
TrueType
WinAnsi
WinAnsi
You can collect string positions for every desired column:
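# Assumes the pdffonts output was saved first, e.g. pdffonts some.pdf > pdffonts.txt
with open('pdffonts.txt') as f:
    header = f.readline()
    read_data = f.read()

header_values = header.split()
positions = {}
for name in header_values:
    # A column starts where its name starts in the header line
    positions[name] = header.index(name)
print(positions)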
For the sample output this gives the following dictionary:
{'name': 0, 'type': 37, 'encoding': 55, 'emb': 72, 'sub': 76, 'uni': 80, 'object': 84, 'ID': 91}
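One caveat with `header.index(name)` is that it returns the first place the text occurs, so it would report the wrong offset if one column name happened to be contained in an earlier one. That does not happen with this header, but a minimal sketch that reads the offsets directly from regex matches avoids the issue. Here the header line is rebuilt from the column widths of the dashed rule in the sample (36, 17 and 16 characters), and `column_offsets` is a hypothetical helper name.

```python
import re

# Header line rebuilt from the sample output; the widths 36, 17 and 16
# are taken from the dashed rule under the header (an assumption).
header = "{:<36} {:<17} {:<16} emb sub uni object ID".format(
    "name", "type", "encoding"
)

def column_offsets(header_line):
    """Map each whitespace-separated header word to its start offset."""
    return {m.group(): m.start() for m in re.finditer(r"\S+", header_line)}

positions = column_offsets(header)
print(positions)
```

For the sample header this yields the same offsets as the dictionary shown above. Note that the "object ID" heading is two words, so it still produces two entries.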
After that you can specify the substring range to extract:
desired_columns = []
# The header was already consumed by readline(), so [1:] skips the dashed rule
for line in read_data.splitlines()[1:]:
    encoding = line[positions['encoding']:positions['emb']].strip()
    uni = line[positions['uni']:positions['object']].strip()
    desired_columns.append([encoding, uni])
print(desired_columns)
result:
[['WinAnsi', 'no'], ['WinAnsi', 'no'], ['Identity-H', 'yes'], ['WinAnsi', 'no'], ['WinAnsi', 'no']]
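As a variation on the same idea (not the method shown above), the column boundaries can also be taken from the dashed rule under the header, since each run of dashes spans one column. The sketch below assumes the output has already been captured as text, that its second line is that rule, and that no value is wider than its column. `parse_fixed_width`, `ROW` and `SAMPLE` are hypothetical names, and the sample rows are rebuilt from two rows of the question's output.

```python
import re

# Sample text rebuilt from the question's output; widths follow the dashed rule.
ROW = "{:<36} {:<17} {:<16} {:<3} {:<3} {:<3} {:>6} {:>2}"
SAMPLE = "\n".join([
    "{:<36} {:<17} {:<16} emb sub uni object ID".format("name", "type", "encoding"),
    " ".join("-" * n for n in (36, 17, 16, 3, 3, 3, 9)),
    ROW.format("TimesNewRoman", "TrueType", "WinAnsi", "no", "no", "no", 36, 0),
    ROW.format("EDMFMD+Symbol", "CID TrueType", "Identity-H", "yes", "yes", "yes", 41, 0),
])

def parse_fixed_width(text):
    """Return one dict per data row, keyed by the header text of each column."""
    lines = text.splitlines()
    header, rule, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes in the rule marks the span of one column.
    spans = [m.span() for m in re.finditer(r"-+", rule)]
    names = [header[start:end].strip() for start, end in spans]
    return [
        {name: row[start:end].strip() for name, (start, end) in zip(names, spans)}
        for row in rows
        if row.strip()  # ignore blank lines
    ]

records = parse_fixed_width(SAMPLE)
for record in records:
    print(record["encoding"], record["uni"])
```

Because the rule has a single run of dashes under "object ID", that heading stays one column here, and values containing spaces such as "CID TrueType" are kept whole.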