
Extract Number of pages from a text column


I have a text column which contains comments like:

  1. 6 pages, LaTeX, no figures
  2. 19 pages, latex, 4 figures as uuencoded postscript files
  3. Invited Talk at the ``VII Marcel Grossman Meeting on General Relativity'' - Stanford, July 1994. 14 pages, latex, five figures, which will be available upon request.
  4. 15 pp. Phyzzx

I want to extract the number of pages from each comment. Some rows have no comment at all, or have a comment that does not mention pages; those should probably be NA.


Solution

  • This works as long as each comment contains at most one page count. Comments without a page count yield "NA".

    import re

    comments = [
        "6 pages, LaTeX, no figures",
        "112 cucumber",
        "19 pages, latex, 4 figures as uuencoded postscript files",
        # Adjacent string literals are joined, so the long comment can span lines
        "Invited Talk at the ``VII Marcel Grossman Meeting on General "
        "Relativity'' - Stanford, July 1994. 14 pages, latex, five figures, "
        "which will be available upon request.",
        "15 pp. Phyzzx",
    ]

    def page_num_extract(text: list) -> list:
        out = []
        for line in text:
            # One or more digits followed by " pages" or " pp."
            match = re.search(r"(\d+) (?:pages|pp\.)", line)
            # Keep only the digits; rows without a page count become "NA"
            out.append(match.group(1) if match else "NA")
        return out
    

    page_num_extract(comments)

    ['6', 'NA', '19', '14', '15']
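
If a comment can mention more than one page count, or the rows include missing values, a variant along these lines may help. It is only a sketch: the function name `all_page_counts` and the looser pattern (optional space, "page"/"pages"/"pp"/"pp.", any letter case) are assumptions, not part of the original answer.

```python
import re

# Assumed pattern: digits, optional whitespace, then "page", "pages", "pp" or "pp.".
# The negative lookahead rejects longer words such as "pageant".
PAGE_COUNT_RE = re.compile(r"(\d+)\s*(?:pages?|pp\.?)(?![A-Za-z])", re.IGNORECASE)

def all_page_counts(comment):
    """Return every page count found in a comment as a list of ints."""
    if not comment:  # None or empty string: nothing to extract
        return []
    # findall returns the captured digit groups, one per match
    return [int(n) for n in PAGE_COUNT_RE.findall(comment)]

print(all_page_counts("12 pages plus 3 pages of appendix"))
print(all_page_counts("112 cucumber"))
```

An empty list plays the role of NA here. If the comments live in a pandas column, the same pattern string could probably be passed to `Series.str.extract`, which leaves non-matching rows as NaN.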