I have a text column which contains comments like:
I want to extract the number of pages from each comment. Some rows have no comment at all, or have a comment that doesn't mention pages; those should probably be NA.
This works as long as there is only one number of pages per comment.
import re
comments = [
    "6 pages, LaTeX, no figures",
    "112 cucumber",
    "19 pages, latex, 4 figures as uuencoded postscript files",
    # Adjacent string literals are joined, so the long comment can span lines
    "Invited Talk at the ``VII Marcel Grossman Meeting on General "
    "Relativity'' - Stanford, July 1994. 14 pages, latex, five figures, "
    "which will be available upon request.",
    "15 pp. Phyzzx",
]
def page_num_extract(text: list) -> list:
    out = []
    for line in text:
        # Require at least one digit followed by " pages" or " pp."
        match = re.search(r"(\d+) (?:pages|pp\.)", line)
        # Fall back to "NA" when the comment mentions no page count
        out.append(match.group(1) if match else "NA")
    return out
print(page_num_extract(comments))
# ['6', 'NA', '19', '14', '15']
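If a comment can mention more than one page count, or a row can be empty, something like the following might help. This is only a sketch: `extract_all_page_counts` and `PAGE_PATTERN` are names I made up, the sample comments are invented for illustration, and I am assuming that missing rows arrive as `None` (or any other non-string value, such as NaN).

```python
import re

# Assumed pattern: digits, optional whitespace, then "page", "pages", or "pp."
PAGE_PATTERN = re.compile(r"(\d+)\s*(?:pages?\b|pp\.)", re.IGNORECASE)


def extract_all_page_counts(comments):
    """Return one list of page counts per comment (hypothetical helper)."""
    results = []
    for comment in comments:
        # Missing rows (None, NaN, or any non-string) yield an empty list
        if not isinstance(comment, str):
            results.append([])
            continue
        # findall returns the captured digit group for every match
        results.append([int(n) for n in PAGE_PATTERN.findall(comment)])
    return results


# Invented sample data, not taken from the real column
sample = [
    "6 pages, LaTeX, no figures",
    None,
    "12 pages main text, 3 pages appendix",
    "112 cucumber",
]
print(extract_all_page_counts(sample))
# [[6], [], [12, 3], []]
```

Returning a list per comment leaves it up to you whether to keep the first count, the sum, or the maximum; an empty list can be mapped to NA afterwards.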