In my Python task, I have to read a PDF paper and list all of its references together with how many times each one is cited in the text. The example PDF has 18 references; say Ref#1 is cited 3 times in the paper and Ref#2 once, then this is the output I want:
Ref# Count Reference
1 3 Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358.
2 1 Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Workshop on Robust Parsing, pages 54-69, Prague
...
I already have the Ref# values and the references in a list, and I managed to get the sentences that contain a citation by using this regex:
regex = re.compile(r'[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}; [A-Za-z \u0000-\u007F,;]*\)|[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4},[A-Za-z0-9\u0000-\u007F ]*\)|[A-Z]{1}[a-z\u0000-\u007F ]+ [a-z]{2} [a-z]{2}. \([0-9]{4}\)')
So when I traverse the list of strings (the text split into sentences) and search each one with the regex above using this code:
for i in range(0, len(lstString)):
    refLine = re.findall(regex, lstString[i])
    if(refLine != [] and refLine[0] != []):
        print(refLine)
I get some output like this:
(Karls- son et al., 1995)
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson
(1990)
(Tapanainen, 1996)
(Tapanainen, 1996) is dif- ferent from the former (Karlsson et al., 1995)
Hurskainen (1996)
In essence, the same formalism is used in the syn- tactic analysis in J~rvinen (1994) and Anttila (1995)
Our notation follows the classical model of depen- dency theory (Heringer, 1993) introduced by Lucien Tesni~re (1959) and later
advocated by Igor Mel'~uk (1987)
Hudson (1991)
(Hays, 1964)
(McCord, 1990; Sleator and Tem- perley, 1991; Eisner, 1996)
(Hudson, 1991)
(J~irvinen, 1994)
The CG-2 program (Tapanainen, 1996) runs a mod- ified disambiguation grammar of Voutilainen (1995)
(J~rvinen, 1994; Tapanainen and J/~rvinen, 1994)
(Eisner, 1996)
Dekang Lin (1996)
Acknowledgments We are using Atro Voutilainen's (1995)
It returns all the strings that contain citations, but I have some issues:
- It does not capture citations like Karlsson et al. (1995)
- Some of the matches contain two citations
- How can I update the count for each reference in the reference list?
I tried this code to get the count for each reference, but it always returns the whole list:
matching = [s for s in lstRef if any(xs in s for xs in refLine)]
Any kind of help will be appreciated.
I was wondering whether you could take the names (and years) from the References section at the end of the document and use them to search for citations in the document.
In the previous question you got code which extracts the References section at the end of the document.
Using the regex '((.*)\. (\d{4})\.)' I can get the names as one string and the year as one string (and also both together in one string):
authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
text, authors, year = authors_and_year.groups()
For example:
text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
year: 1996
Using the next regex, ',[ ]*and |,[ ]*| and ', I can split the string with names into a list of names:
names = re.split(',[ ]*and |,[ ]*| and ', authors)
and using a normal split(' ') I can get the surnames (last names), which can be more useful than the full names:
names = [(name, name.split(' ')[-1]) for name in names]
For example:
names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
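The two steps above could be wrapped into one helper. This is only a sketch: the name `parse_reference` is hypothetical, the non-greedy `(.*?)` is my assumption so that the first "year." in the line is used, and dropping the word "editors" is an extra guess based on the Karlsson et al. entry, where it would otherwise end up as a surname.

```python
import re

# "<authors>. <4-digit year>." at the start of a reference line
REF_HEAD = re.compile(r'(.*?)\. (\d{4})\.')
# separators in "A, B, and C" or "A and B"
NAME_SEP = re.compile(r',\s*and |,\s*| and ')


def parse_reference(line):
    """Return (surnames, year) for one reference line, or None if it does not match."""
    match = REF_HEAD.match(line)
    if match is None:
        return None
    authors, year = match.groups()
    # split into full names and drop the role word that follows editor lists
    names = [n for n in NAME_SEP.split(authors)
             if n and n not in ('editor', 'editors')]
    # the last word of every full name is treated as the surname
    surnames = [n.split(' ')[-1] for n in names]
    return surnames, year


print(parse_reference('Daniel Sleator and Davy Temperley. 1991. Parsing English.'))
```

Returning None instead of raising makes it easy to skip lines (page numbers, broken entries) that do not start with authors and a year.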
Now I can use these names (or rather surnames) and years to generate strings like 'surname (year)' and 'surname, year' and search for them in the document.
If there are several surnames, I can take the first one and generate 'surname et al. (year)', etc.
Using these strings and the standard string method text.count(generated_string), I can count them.
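As a rough sketch of this idea, the generated strings can be built in one function and summed in another. The names `citation_variants` and `count_citations` are hypothetical, and the set of variants is an assumption about author-year citation style, not a complete list.

```python
def citation_variants(surnames, year):
    """Build in-text strings under which a reference may be cited."""
    first = surnames[0]
    if len(surnames) == 1:
        heads = [first]
    elif len(surnames) == 2:
        # two authors are usually written out, but allow "et al." as well
        heads = [first + ' and ' + surnames[1], first + ' et al.']
    else:
        heads = [first + ' et al.']
    variants = []
    for head in heads:
        variants.append(head + ' (' + year + ')')  # Karlsson et al. (1995)
        variants.append(head + ', ' + year)        # (Karlsson et al., 1995)
    return variants


def count_citations(doc, surnames, year):
    """Sum the plain substring counts of all variants in the document text."""
    return sum(doc.count(variant) for variant in citation_variants(surnames, year))


doc = 'as in Karlsson et al. (1995) and (Karlsson et al., 1995); see Karlsson (1990).'
print(count_citations(doc, ['Karlsson', 'Voutilainen', 'Anttila'], '1995'))
print(count_citations(doc, ['Karlsson'], '1990'))
```

Because `str.count` matches substrings, a short surname can also match inside a longer one, so the totals should still be checked against a manual count.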
This is all I have at the moment, and it is still not ideal.
You could find all citations in the document manually and use them to test the code. Then you would see which ones are counted correctly and which need more changes.
For example, there is a citation with 's in the text: We are using Atro Voutilainen's (1995). Maybe the document should be cleaned first, as is done in NLP (Natural Language Processing), for example with nltk.
Some non-ASCII characters also cause problems: the name Järvinen is extracted as J~rvinen in one place and as J/irvinen in another.
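A minimal cleaning step could look like the sketch below. It uses only `re` rather than nltk, the function name `clean_text` is hypothetical, and it assumes the text being cleaned is the document body (not the reference list, which is later split on newlines). It joins words hyphenated at a line end and drops a possessive 's in front of a year; it does not repair the damaged non-ASCII characters.

```python
import re


def clean_text(text):
    # collapse runs of whitespace (including newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    # join words split at a line end: "Tem- perley" -> "Temperley"
    text = re.sub(r'(?<=[a-z])- (?=[a-z])', '', text)
    # drop possessive 's before a year: "Voutilainen's (1995)" -> "Voutilainen (1995)"
    text = re.sub(r"'s(?= \(\d{4}\))", '', text)
    return text


sample = "(Sleator and Tem-\nperley, 1991) and Atro Voutilainen's (1995)"
print(clean_text(sample))
```

The hyphen rule is a heuristic: a genuine compound that happens to break at a line end would also lose its hyphen.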
import re
import PyPDF2
# written against PyPDF2 1.x, which still has the `PyPDF2.pdf` module
from PyPDF2.pdf import *  # to import functions used in the original `extractText`

# --- functions ---
def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645
    text = u_("")
    content = self["/Contents"].getObject()
    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    prev_x = 0
    prev_y = 0
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)
        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
        if operator == b_("Tm"):
            if distance is True:
                text += '\n'
            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]
                diff_x = prev_x - x
                diff_y = prev_y - y
                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'
                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n'  # to add empty line between elements
                prev_x = x
                prev_y = y
    return text

# --- main ---
pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
text = ''
for page in pdfReader.pages:
    #text += page.extractText()  # original function
    #text += myExtractText(page)  # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)
# find the position of the word `References`
pos = text.lower().find('references')
# only references as text
references = text[pos+len('references '):]
# doc without references
doc = text[:pos]
# references as list
references = references.split('\n')
# remove empty lines and lines which have at most 2 chars (i.e. page number)
references = [item.strip() for item in references if len(item.strip()) > 2]
print('\n--- names ---\n')
data = []
for number, line in enumerate(references, 1):
    line = line.strip()
    if line:  # skip empty line
        authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
        if authors_and_year is None:  # skip lines without "authors. year."
            continue
        # `head` instead of `text`, so the extracted document text is not overwritten
        head, authors, year = authors_and_year.groups()
        #print(head, '|', authors, '|', year)
        names = re.split(',[ ]*and |,[ ]*| and ', authors)
        #print(names)
        # [(name, last_name), ...]
        names = [(name, name.split(' ')[-1]) for name in names]
        #print(names)
        #print(' line:', line)
        print(' text:', head)
        print('authors:', authors)
        print(' year:', year)
        print(' names:', names)
        print('---')
        data.append((authors, names, year))
print('\n--- counting ---\n')
# https://guides.lib.monash.edu/citing-referencing/APA-In-text
# Tapanainen and J/~rvine,
for authors, names, year in data:
    print('authors:', authors)
    print(' year:', year)
    print(' names:', names)
    print(' et al.:', len(names) > 1)
    print(' and :', len(names) == 2)
    print('---')
    first_lastname = names[0][-1]
    print(doc.count(first_lastname), first_lastname)
    print(doc.count(first_lastname + ', ' + year), first_lastname + ', ' + year)
    print(doc.count(first_lastname + ' (' + year + ')'), first_lastname + ' (' + year + ')')
    if len(names) > 1:
        first_lastname_et_al = first_lastname + ' et al.'
        print(doc.count(first_lastname_et_al), first_lastname_et_al)
        print(doc.count(first_lastname_et_al + ', ' + year), first_lastname_et_al + ', ' + year)
        print(doc.count(first_lastname_et_al + ' (' + year + ')'), first_lastname_et_al + ' (' + year + ')')
    if len(names) == 2:
        all_lastnames = ' and '.join(item[-1] for item in names)
        print(doc.count(all_lastnames), all_lastnames)
        print(doc.count(all_lastnames + ', ' + year), all_lastnames + ', ' + year)
        print(doc.count(all_lastnames + ' (' + year + ')'), all_lastnames + ' (' + year + ')')
    print('----------')
Result of extracting the names:
--- names ---
text: Arto Anttila. 1995.
authors: Arto Anttila
year: 1995
names: [('Arto Anttila', 'Anttila')]
---
text: Dekang Lin. 1996.
authors: Dekang Lin
year: 1996
names: [('Dekang Lin', 'Lin')]
---
text: Jason M. Eisner. 1996.
authors: Jason M. Eisner
year: 1996
names: [('Jason M. Eisner', 'Eisner')]
---
text: David G. Hays. 1964.
authors: David G. Hays
year: 1964
names: [('David G. Hays', 'Hays')]
---
text: Hans Jiirgen Heringer. 1993.
authors: Hans Jiirgen Heringer
year: 1993
names: [('Hans Jiirgen Heringer', 'Heringer')]
---
text: Richard Hudson. 1991.
authors: Richard Hudson
year: 1991
names: [('Richard Hudson', 'Hudson')]
---
text: Arvi Hurskainen. 1996.
authors: Arvi Hurskainen
year: 1996
names: [('Arvi Hurskainen', 'Hurskainen')]
---
text: Time J~rvinen. 1994.
authors: Time J~rvinen
year: 1994
names: [('Time J~rvinen', 'J~rvinen')]
---
text: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors. 1995.
authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
year: 1995
names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
---
text: Fred Karlsson. 1990.
authors: Fred Karlsson
year: 1990
names: [('Fred Karlsson', 'Karlsson')]
---
text: Michael McCord. 1990.
authors: Michael McCord
year: 1990
names: [('Michael McCord', 'McCord')]
---
text: Igor A. Mel'~uk. 1987.
authors: Igor A. Mel'~uk
year: 1987
names: [("Igor A. Mel'~uk", "Mel'~uk")]
---
text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
year: 1996
names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
---
text: Daniel Sleator and Davy Temperley. 1991.
authors: Daniel Sleator and Davy Temperley
year: 1991
names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
---
text: Pasi Tapanainen and Time J/irvinen. 1994.
authors: Pasi Tapanainen and Time J/irvinen
year: 1994
names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
---
text: Pasi Tapanainen. 1996.
authors: Pasi Tapanainen
year: 1996
names: [('Pasi Tapanainen', 'Tapanainen')]
---
text: Lucien TesniSre. 1959.
authors: Lucien TesniSre
year: 1959
names: [('Lucien TesniSre', 'TesniSre')]
---
text: Atro Voutilainen. 1995.
authors: Atro Voutilainen
year: 1995
names: [('Atro Voutilainen', 'Voutilainen')]
---
Result of counting:
--- counting ---
authors: Arto Anttila
year: 1995
names: [('Arto Anttila', 'Anttila')]
et al.: False
and : False
---
1 Anttila
0 Anttila, 1995
1 Anttila (1995)
----------
authors: Dekang Lin
year: 1996
names: [('Dekang Lin', 'Lin')]
et al.: False
and : False
---
4 Lin
0 Lin, 1996
1 Lin (1996)
----------
authors: Jason M. Eisner
year: 1996
names: [('Jason M. Eisner', 'Eisner')]
et al.: False
and : False
---
2 Eisner
2 Eisner, 1996
0 Eisner (1996)
----------
authors: David G. Hays
year: 1964
names: [('David G. Hays', 'Hays')]
et al.: False
and : False
---
1 Hays
1 Hays, 1964
0 Hays (1964)
----------
authors: Hans Jiirgen Heringer
year: 1993
names: [('Hans Jiirgen Heringer', 'Heringer')]
et al.: False
and : False
---
1 Heringer
1 Heringer, 1993
0 Heringer (1993)
----------
authors: Richard Hudson
year: 1991
names: [('Richard Hudson', 'Hudson')]
et al.: False
and : False
---
2 Hudson
1 Hudson, 1991
1 Hudson (1991)
----------
authors: Arvi Hurskainen
year: 1996
names: [('Arvi Hurskainen', 'Hurskainen')]
et al.: False
and : False
---
1 Hurskainen
0 Hurskainen, 1996
1 Hurskainen (1996)
----------
authors: Time J~rvinen
year: 1994
names: [('Time J~rvinen', 'J~rvinen')]
et al.: False
and : False
---
2 J~rvinen
1 J~rvinen, 1994
1 J~rvinen (1994)
----------
authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
year: 1995
names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
et al.: True
and : False
---
3 Karlsson
0 Karlsson, 1995
0 Karlsson (1995)
2 Karlsson et al.
1 Karlsson et al., 1995
1 Karlsson et al. (1995)
----------
authors: Fred Karlsson
year: 1990
names: [('Fred Karlsson', 'Karlsson')]
et al.: False
and : False
---
3 Karlsson
0 Karlsson, 1990
1 Karlsson (1990)
----------
authors: Michael McCord
year: 1990
names: [('Michael McCord', 'McCord')]
et al.: False
and : False
---
1 McCord
1 McCord, 1990
0 McCord (1990)
----------
authors: Igor A. Mel'~uk
year: 1987
names: [("Igor A. Mel'~uk", "Mel'~uk")]
et al.: False
and : False
---
1 Mel'~uk
0 Mel'~uk, 1987
1 Mel'~uk (1987)
----------
authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
year: 1996
names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
et al.: True
and : False
---
1 Samuelsson
0 Samuelsson, 1996
0 Samuelsson (1996)
1 Samuelsson et al.
0 Samuelsson et al., 1996
1 Samuelsson et al. (1996)
----------
authors: Daniel Sleator and Davy Temperley
year: 1991
names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
et al.: True
and : True
---
1 Sleator
0 Sleator, 1991
0 Sleator (1991)
0 Sleator et al.
0 Sleator et al., 1991
0 Sleator et al. (1991)
0 Sleator and Temperley
0 Sleator and Temperley, 1991
0 Sleator and Temperley (1991)
----------
authors: Pasi Tapanainen and Time J/irvinen
year: 1994
names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
et al.: True
and : True
---
6 Tapanainen
0 Tapanainen, 1994
0 Tapanainen (1994)
0 Tapanainen et al.
0 Tapanainen et al., 1994
0 Tapanainen et al. (1994)
0 Tapanainen and J/irvinen
0 Tapanainen and J/irvinen, 1994
0 Tapanainen and J/irvinen (1994)
----------
authors: Pasi Tapanainen
year: 1996
names: [('Pasi Tapanainen', 'Tapanainen')]
et al.: False
and : False
---
6 Tapanainen
3 Tapanainen, 1996
0 Tapanainen (1996)
----------
authors: Lucien TesniSre
year: 1959
names: [('Lucien TesniSre', 'TesniSre')]
et al.: False
and : False
---
0 TesniSre
0 TesniSre, 1959
0 TesniSre (1959)
----------
authors: Atro Voutilainen
year: 1995
names: [('Atro Voutilainen', 'Voutilainen')]
et al.: False
and : False
---
3 Voutilainen
0 Voutilainen, 1995
1 Voutilainen (1995)
----------