
Get bibliography list and its count from text - Python


In my Python task, I have to read a PDF paper and list all of its references together with the number of times each one is cited in the paper. The example PDF has 18 references; say Ref#1 is cited 3 times in the paper and Ref#2 is cited once. This is the output I want:

Ref#  Count  Reference
1     3      Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358.
2     1      Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Workshop on Robust Parsing, pages 54-69, Prague
...

I already have the Ref# and the references in a list, and I managed to get the lines of text that contain a citation by using this regex:

regex = re.compile(r'[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}\)|\([A-Z]{1}[a-z\u0000-\u007F]+, [0-9]{4}; [A-Za-z \u0000-\u007F,;]*\)|[A-Z]{1}[a-z\u0000-\u007F]+ \([0-9]{4},[A-Za-z0-9\u0000-\u007F ]*\)|[A-Z]{1}[a-z\u0000-\u007F ]+ [a-z]{2} [a-z]{2}. \([0-9]{4}\)')

So when I traverse the list of strings (the text split into sentences) and search each one with the regex above, using this code:

for i in range(0, len(lstString)):
    refLine = re.findall(regex, lstString[i])
    if(refLine != [] and refLine [0] != []):
        print(refLine)

I get some output like this:

    (Karls- son et al., 1995)
    Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson
(1990)
    (Tapanainen, 1996)
    (Tapanainen, 1996) is dif- ferent from the former (Karlsson et al., 1995)
    Hurskainen (1996)
    In essence, the same formalism is used in the syn- tactic analysis in J~rvinen (1994) and     Anttila (1995)
    Our notation follows the classical model of depen- dency theory (Heringer, 1993) introduced by Lucien Tesni~re (1959) and later
advocated by Igor Mel'~uk (1987)
    Hudson (1991)
    (Hays, 1964)
    (McCord, 1990; Sleator and Tem- perley, 1991; Eisner, 1996)
    (Hudson, 1991)
    (J~irvinen, 1994)
    The CG-2 program (Tapanainen, 1996) runs a mod- ified disambiguation grammar of Voutilainen (1995)
    (J~rvinen, 1994; Tapanainen and J/~rvinen, 1994)
    (Eisner, 1996)
    Dekang Lin (1996)
    Acknowledgments We are using Atro Voutilainen's (1995)

It returns all the strings that contain citations, but I have some issues:

  1. It does not capture citations like Karlsson et al. (1995).
  2. Some of the matches contain two citations.
  3. How can I update the count for each reference in the reference list?

I tried this code to get the count for each reference, but it always returns the whole list:

matching = [s for s in lstRef if any(xs in s for xs in refLine)]

Any kind of help will be appreciated.


Solution

  • I was wondering whether I could get the names (and years) from the References section at the end of the document and use them to search for citations in the document.

    In your previous question you got code that extracts the References section at the end of the document.

    Using the regex '((.*)\. (\d{4})\.)' I can get the names as one string, the year as one string (and also both together in one string)

        authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
        text, authors, year = authors_and_year.groups()
    

    For example:

       text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
    authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
       year: 1996
    

    Using the next regex, ',[ ]*and |,[ ]*| and ', I can split the string of names into a list of names

        names = re.split(',[ ]*and |,[ ]*| and ', authors)
    

    and using a plain split(" ") I can get the surnames (last names), which can be more useful than the full names

        names = [(name, name.split(' ')[-1]) for name in names]
    

    For example:

    names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
    

    Now I can use these names (or rather surnames) and years to generate strings like surname (year) or surname, year and search for them in the document.

    If there are several surnames, I can take the first surname and generate surname et al. (year), etc.

    Using these strings and the standard string method text.count(generated_string) I can count them.
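
    The idea above can be sketched as two small functions. This is only a minimal illustration: citation_variants and count_citations are hypothetical names, the sample sentence is made up for the demo, and using "et al." only for three or more authors is an assumption about the citation style (the full script below tries it for any multi-author entry).

```python
def citation_variants(surnames, year):
    """Return the in-text citation strings to search for (hypothetical helper)."""
    first = surnames[0]
    heads = [first]                              # "Tapanainen"
    if len(surnames) == 2:
        heads.append(' and '.join(surnames))     # "Sleator and Temperley"
    if len(surnames) > 2:
        heads.append(first + ' et al.')          # "Karlsson et al."

    variants = []
    for head in heads:
        variants.append(f'{head}, {year}')       # form used inside parentheses
        variants.append(f'{head} ({year})')      # form used in running text
    return variants


def count_citations(doc, surnames, year):
    """Sum str.count() over every generated variant."""
    return sum(doc.count(v) for v in citation_variants(surnames, year))


# made-up sample text, only for demonstration
sample = ('The parser (Tapanainen, 1996) follows Karlsson et al. (1995); '
          'see also (Karlsson et al., 1995) and (Sleator and Temperley, 1991).')

print(count_citations(sample, ['Tapanainen'], '1996'))                           # 1
print(count_citations(sample, ['Karlsson', 'Voutilainen', 'Anttila'], '1995'))   # 2
print(count_citations(sample, ['Sleator', 'Temperley'], '1991'))                 # 1
```

    None of the generated variants is a substring of another, so adding up the individual counts does not count the same citation twice.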

    This is all I have at the moment, and it is still not ideal.

    You could find all the citations in the document manually and use them to test the code. Then you would see which ones are counted correctly and which need more changes.

    For example, the text contains a citation with 's: We are using Atro Voutilainen's (1995). Maybe the document should be cleaned first, as is done in NLP (Natural Language Processing), for example with nltk.

    Some non-ASCII characters also cause problems: the name Järvinen is extracted as J~rvinen in one place and as J/irvinen in another.
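
    A light cleaning step might look like the sketch below. It uses plain re rather than nltk, clean_text is a hypothetical helper, and the three rules are assumptions based on the extracted text shown in the question (possessive 's before a year, words split as "Karls- son", stray line breaks).

```python
import re


def clean_text(text):
    """Light normalisation before counting citations (hypothetical helper)."""
    # "Voutilainen's (1995)" -> "Voutilainen (1995)"
    text = re.sub(r"'s(?= \(\d{4}\))", '', text)
    # "Karls- son" -> "Karlsson": rejoin words that were split at a line end
    text = re.sub(r'(?<=[a-z])- (?=[a-z])', '', text)
    # collapse runs of whitespace, including newlines, into one space
    text = re.sub(r'\s+', ' ', text)
    return text


raw = "Acknowledgments We are using Atro Voutilainen's (1995)\n(Karls- son et al., 1995)"
print(clean_text(raw))
```

    The de-hyphenation rule also joins genuine compounds that happened to break at a line end, so it may need a word list or manual checks for other documents.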
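
    One possible way around the garbled names is approximate matching instead of exact matching. The sketch below uses difflib.SequenceMatcher from the standard library; is_same_name, name_variants and the 0.75 threshold are assumptions chosen for this example and have not been tuned on the real paper.

```python
import re
from difflib import SequenceMatcher


def is_same_name(a, b, threshold=0.75):
    """Treat two surnames as the same if they are similar enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def name_variants(surname, text, threshold=0.75):
    """Return the distinct tokens in `text` that look like `surname`."""
    tokens = set(re.findall(r'[^\s(),;.]+', text))
    return sorted(t for t in tokens if is_same_name(surname, t, threshold))


# fragments taken from the extracted text shown in the question
sample = ('used in J~rvinen (1994) and (Tapanainen and J/~rvinen, 1994) '
          'and Hurskainen (1996)')

print(name_variants('J/irvinen', sample))  # ['J/~rvinen', 'J~rvinen']
```

    The variants found this way could then be fed into the same text.count() search in place of the single surname taken from the reference list.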

    import re
    import PyPDF2
    from PyPDF2.pdf import *  # to import functions used in the original `extractText`
    
    # --- functions ---
    
    def myExtractText(self, distance=None):
        # original code from `page.extractText()`
        # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645
        
        text = u_("")
    
        content = self["/Contents"].getObject()
    
        if not isinstance(content, ContentStream):
            content = ContentStream(content, self.pdf)
        
        prev_x = 0
        prev_y = 0
        
        for operands, operator in content.operations:
            # used only for test to see values in variables
            #print('>>>', operator, operands)
    
            if operator == b_("Tj"):
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == b_("T*"):
                text += "\n"
            elif operator == b_("'"):
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == b_('"'):
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == b_("TJ"):
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i
                text += "\n"
                
            if operator == b_("Tm"):
            
                if distance is True: 
                    text += '\n'
                    
                elif isinstance(distance, int):
                    x = operands[-2]
                    y = operands[-1]
    
                    diff_x = prev_x - x
                    diff_y = prev_y - y
    
                    #print('>>>', diff_x, diff_y - y)
                    #text += f'| {diff_x}, {diff_y - y} |'
                    
                    if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                        text += '\n'
                        #text += '\n' # to add empty line between elements
                        
                    prev_x = x
                    prev_y = y
                
        return text
            
    # --- main ---
            
    pdfFileObj = open('A97-1011.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    
    text = ''
    
    for page in pdfReader.pages:
        #text += page.extractText()  # original function
        #text += myExtractText(page)        # modified function (works like original version)
        #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
        text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)
    
    # get only text after word `References`
    pos = text.lower().find('references')
    
    # only the references, as text
    references = text[pos+len('references '):]
    
    # doc without references
    doc = text[:pos]
    
    # references as a list
    references = references.split('\n')
    
    # remove empty lines and lines which have 2 chars (ie. page number)
    references = [item.strip() for item in references if len(item.strip()) > 2]
    
    print('\n--- names ---\n')
    
    data = []
    
    for number, line in enumerate(references, 1):
        line = line.strip()
        if line:  # skip empty line
        
            # group 1: "authors. year.", group 2: authors, group 3: year
            authors_and_year = re.match(r'((.*)\. (\d{4})\.)', line)
            if authors_and_year is None:  # line does not start with "authors. year."
                continue
            text, authors, year = authors_and_year.groups()
            #print(text, '|', authors, '|', year)
            
            names = re.split(',[ ]*and |,[ ]*| and ', authors)
            #print(names)
            
            # [(name, last_name), ...]
            names = [(name, name.split(' ')[-1]) for name in names]
            #print(names)
            
            #print(' line:', line)
            print('   text:', text)
            print('authors:', authors)
            print('   year:', year)
            print('  names:', names)
            print('---')
            data.append((authors, names, year))
    
    print('\n--- counting ---\n')
    
    # https://guides.lib.monash.edu/citing-referencing/APA-In-text
    # Tapanainen and J/~rvine, 
    
    for authors, names, year in data:
        print('authors:', authors)
        print('   year:', year)
        print('  names:', names)
        print(' et al.:', len(names) > 1)
        print('   and :', len(names) == 2)
        print('---')
        first_lastname = names[0][-1]
        print(doc.count(first_lastname), first_lastname)
        print(doc.count(first_lastname + ', ' + year), first_lastname + ', ' + year)
        print(doc.count(first_lastname + ' (' + year + ')'), first_lastname + ' (' + year + ')')
        
        if len(names) > 1:
            first_lastname_et_al = first_lastname + ' et al.'
            print(doc.count(first_lastname_et_al), first_lastname_et_al)
            print(doc.count(first_lastname_et_al + ', ' + year), first_lastname_et_al + ', ' + year)
            print(doc.count(first_lastname_et_al + ' (' + year + ')'), first_lastname_et_al + ' (' + year + ')')
    
        if len(names) == 2:
            all_lastnames = ' and '.join(item[-1] for item in names)
            print(doc.count(all_lastnames), all_lastnames)
            print(doc.count(all_lastnames + ', ' + year), all_lastnames + ', ' + year)
            print(doc.count(all_lastnames + ' (' + year + ')'), all_lastnames + ' (' + year + ')')
    
        print('----------')
    

    Result of name extraction:

    --- names ---
    
       text: Arto Anttila. 1995.
    authors: Arto Anttila
       year: 1995
      names: [('Arto Anttila', 'Anttila')]
    ---
       text: Dekang Lin. 1996.
    authors: Dekang Lin
       year: 1996
      names: [('Dekang Lin', 'Lin')]
    ---
       text: Jason M. Eisner. 1996.
    authors: Jason M. Eisner
       year: 1996
      names: [('Jason M. Eisner', 'Eisner')]
    ---
       text: David G. Hays. 1964.
    authors: David G. Hays
       year: 1964
      names: [('David G. Hays', 'Hays')]
    ---
       text: Hans Jiirgen Heringer. 1993.
    authors: Hans Jiirgen Heringer
       year: 1993
      names: [('Hans Jiirgen Heringer', 'Heringer')]
    ---
       text: Richard Hudson. 1991.
    authors: Richard Hudson
       year: 1991
      names: [('Richard Hudson', 'Hudson')]
    ---
       text: Arvi Hurskainen. 1996.
    authors: Arvi Hurskainen
       year: 1996
      names: [('Arvi Hurskainen', 'Hurskainen')]
    ---
       text: Time J~rvinen. 1994.
    authors: Time J~rvinen
       year: 1994
      names: [('Time J~rvinen', 'J~rvinen')]
    ---
       text: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors. 1995.
    authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
       year: 1995
      names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
    ---
       text: Fred Karlsson. 1990.
    authors: Fred Karlsson
       year: 1990
      names: [('Fred Karlsson', 'Karlsson')]
    ---
       text: Michael McCord. 1990.
    authors: Michael McCord
       year: 1990
      names: [('Michael McCord', 'McCord')]
    ---
       text: Igor A. Mel'~uk. 1987.
    authors: Igor A. Mel'~uk
       year: 1987
      names: [("Igor A. Mel'~uk", "Mel'~uk")]
    ---
       text: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996.
    authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
       year: 1996
      names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
    ---
       text: Daniel Sleator and Davy Temperley. 1991.
    authors: Daniel Sleator and Davy Temperley
       year: 1991
      names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
    ---
       text: Pasi Tapanainen and Time J/irvinen. 1994.
    authors: Pasi Tapanainen and Time J/irvinen
       year: 1994
      names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
    ---
       text: Pasi Tapanainen. 1996.
    authors: Pasi Tapanainen
       year: 1996
      names: [('Pasi Tapanainen', 'Tapanainen')]
    ---
       text: Lucien TesniSre. 1959.
    authors: Lucien TesniSre
       year: 1959
      names: [('Lucien TesniSre', 'TesniSre')]
    ---
       text: Atro Voutilainen. 1995.
    authors: Atro Voutilainen
       year: 1995
      names: [('Atro Voutilainen', 'Voutilainen')]
    ---
    

    Result of counting:

    --- counting ---
    
    authors: Arto Anttila
       year: 1995
      names: [('Arto Anttila', 'Anttila')]
     et al.: False
       and : False
    ---
    1 Anttila
    0 Anttila, 1995
    1 Anttila (1995)
    ----------
    authors: Dekang Lin
       year: 1996
      names: [('Dekang Lin', 'Lin')]
     et al.: False
       and : False
    ---
    4 Lin
    0 Lin, 1996
    1 Lin (1996)
    ----------
    authors: Jason M. Eisner
       year: 1996
      names: [('Jason M. Eisner', 'Eisner')]
     et al.: False
       and : False
    ---
    2 Eisner
    2 Eisner, 1996
    0 Eisner (1996)
    ----------
    authors: David G. Hays
       year: 1964
      names: [('David G. Hays', 'Hays')]
     et al.: False
       and : False
    ---
    1 Hays
    1 Hays, 1964
    0 Hays (1964)
    ----------
    authors: Hans Jiirgen Heringer
       year: 1993
      names: [('Hans Jiirgen Heringer', 'Heringer')]
     et al.: False
       and : False
    ---
    1 Heringer
    1 Heringer, 1993
    0 Heringer (1993)
    ----------
    authors: Richard Hudson
       year: 1991
      names: [('Richard Hudson', 'Hudson')]
     et al.: False
       and : False
    ---
    2 Hudson
    1 Hudson, 1991
    1 Hudson (1991)
    ----------
    authors: Arvi Hurskainen
       year: 1996
      names: [('Arvi Hurskainen', 'Hurskainen')]
     et al.: False
       and : False
    ---
    1 Hurskainen
    0 Hurskainen, 1996
    1 Hurskainen (1996)
    ----------
    authors: Time J~rvinen
       year: 1994
      names: [('Time J~rvinen', 'J~rvinen')]
     et al.: False
       and : False
    ---
    2 J~rvinen
    1 J~rvinen, 1994
    1 J~rvinen (1994)
    ----------
    authors: Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors
       year: 1995
      names: [('Fred Karlsson', 'Karlsson'), ('Atro Voutilainen', 'Voutilainen'), ('Juha Heikkil~', 'Heikkil~'), ('Arto Anttila', 'Anttila'), ('editors', 'editors')]
     et al.: True
       and : False
    ---
    3 Karlsson
    0 Karlsson, 1995
    0 Karlsson (1995)
    2 Karlsson et al.
    1 Karlsson et al., 1995
    1 Karlsson et al. (1995)
    ----------
    authors: Fred Karlsson
       year: 1990
      names: [('Fred Karlsson', 'Karlsson')]
     et al.: False
       and : False
    ---
    3 Karlsson
    0 Karlsson, 1990
    1 Karlsson (1990)
    ----------
    authors: Michael McCord
       year: 1990
      names: [('Michael McCord', 'McCord')]
     et al.: False
       and : False
    ---
    1 McCord
    1 McCord, 1990
    0 McCord (1990)
    ----------
    authors: Igor A. Mel'~uk
       year: 1987
      names: [("Igor A. Mel'~uk", "Mel'~uk")]
     et al.: False
       and : False
    ---
    1 Mel'~uk
    0 Mel'~uk, 1987
    1 Mel'~uk (1987)
    ----------
    authors: Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen
       year: 1996
      names: [('Christer Samuelsson', 'Samuelsson'), ('Pasi Tapanainen', 'Tapanainen'), ('Atro Voutilainen', 'Voutilainen')]
     et al.: True
       and : False
    ---
    1 Samuelsson
    0 Samuelsson, 1996
    0 Samuelsson (1996)
    1 Samuelsson et al.
    0 Samuelsson et al., 1996
    1 Samuelsson et al. (1996)
    ----------
    authors: Daniel Sleator and Davy Temperley
       year: 1991
      names: [('Daniel Sleator', 'Sleator'), ('Davy Temperley', 'Temperley')]
     et al.: True
       and : True
    ---
    1 Sleator
    0 Sleator, 1991
    0 Sleator (1991)
    0 Sleator et al.
    0 Sleator et al., 1991
    0 Sleator et al. (1991)
    0 Sleator and Temperley
    0 Sleator and Temperley, 1991
    0 Sleator and Temperley (1991)
    ----------
    authors: Pasi Tapanainen and Time J/irvinen
       year: 1994
      names: [('Pasi Tapanainen', 'Tapanainen'), ('Time J/irvinen', 'J/irvinen')]
     et al.: True
       and : True
    ---
    6 Tapanainen
    0 Tapanainen, 1994
    0 Tapanainen (1994)
    0 Tapanainen et al.
    0 Tapanainen et al., 1994
    0 Tapanainen et al. (1994)
    0 Tapanainen and J/irvinen
    0 Tapanainen and J/irvinen, 1994
    0 Tapanainen and J/irvinen (1994)
    ----------
    authors: Pasi Tapanainen
       year: 1996
      names: [('Pasi Tapanainen', 'Tapanainen')]
     et al.: False
       and : False
    ---
    6 Tapanainen
    3 Tapanainen, 1996
    0 Tapanainen (1996)
    ----------
    authors: Lucien TesniSre
       year: 1959
      names: [('Lucien TesniSre', 'TesniSre')]
     et al.: False
       and : False
    ---
    0 TesniSre
    0 TesniSre, 1959
    0 TesniSre (1959)
    ----------
    authors: Atro Voutilainen
       year: 1995
      names: [('Atro Voutilainen', 'Voutilainen')]
     et al.: False
       and : False
    ---
    3 Voutilainen
    0 Voutilainen, 1995
    1 Voutilainen (1995)
    ----------
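
    To get from these per-variant counts to the Ref# / Count / Reference table asked for in the question, one option is to extract every (first surname, year) pair with a single regex, which also handles several citations inside one pair of parentheses, and tally the pairs with collections.Counter. This is only a sketch: CITATION, count_by_key and build_table are hypothetical names, the pattern covers only the citation forms seen above, and two references sharing the same first surname and year would be merged.

```python
import re
from collections import Counter

CITATION = re.compile(
    r'([A-Z][^\s,;()]+)'                    # first surname
    r'(?: et al\.| and [A-Z][^\s,;()]+)?'   # optional "et al." or "and Surname"
    r'(?:, | \()(\d{4})'                    # ", 1996" or " (1996"
)


def count_by_key(doc):
    """Count citations keyed by (first surname, year)."""
    return Counter(CITATION.findall(doc))


def build_table(references, doc):
    """references: list of (first surname, year, full reference text)."""
    counts = count_by_key(doc)
    rows = ['Ref#  Count  Reference']
    for number, (surname, year, reference) in enumerate(references, 1):
        rows.append(f'{number:<4}  {counts[(surname, year)]:<5}  {reference}')
    return '\n'.join(rows)


# short made-up document built from citation forms that occur in the paper
doc = ('the same formalism is used in J~rvinen (1994) and Anttila (1995); '
       'see also Dekang Lin (1996) and (McCord, 1990; Eisner, 1996)')

references = [
    ('Anttila', '1995', 'Arto Anttila. 1995. How to recognise subjects in English.'),
    ('Lin', '1996', 'Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus.'),
]

print(build_table(references, doc))
```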