
Parsing titles from Entrez search result using biopython


I am trying to search for papers with specific words in the title using Biopython. More precisely, I want papers published between 2010 and 2015 whose titles contain the word "viral" or "virus". Here is the code I have:

import re
from Bio import Entrez, Medline

handle = Entrez.esearch(db="pubmed",  # database to search
                    term="2010[Date - Publication]:2015[Date - Publication]"
                    )
record = Entrez.read(handle)
handle.close()

pmid_list = record["IdList"] #list of records

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

titles = [] # start with empty list of titles
for record in records:
    ti_list = record['TI'] #titles
    for title in ti_list:
        if title == "virus" and title not in titles: #searching viral/virus
            titles.append(title)

print('Publications with viral or virus in the title:')
for record in records:
    print(" ", title)

If I simply print(record['TI']), I get all the titles in my search query. However, I'm not able to search for the specific word. I think my mistake may be in the if title == "virus" test (because obviously no paper will be titled with that word alone).

I am pretty stuck. Is there a better way to search for this word in the titles of the papers I've queried?

Thanks.

Edit: Updated code with re.search (and still no luck)

r = re.compile(r"\bvir(al|us)\b")
titles = set()  # start with empty list of titles
for record in records:
    ti_list = record['TI']  # titles
    for title in ti_list:
        if r.search(title):  #
            titles.add(title)

print('Publications with viral or virus in the title:')
for record in records:
     print(" ", title)

New code:

import re
from Bio import Medline

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text", 
                       term="2010[Date - Publication]:2015[Date - Publication]")
titles = []
for record in Medline.parse(handle):
    for title in record['TI']:
        titles.append(title)
handle.close()
for title in titles:
    print(title)

Solution

  • If you want to match substrings, use in to see whether any of the words is contained in the title:

    words  = ("viral","virus")
    if any(w in title for w in words) and title not in titles: #
    

    But you seem to want to filter the records, keeping any record whose title contains viral or virus:

    st  = {"viral","virus"}
    
    filtered_records = [ record for record in records if any(w in st for w in record['TI'] )]
    

    If you want to match substrings using a pattern, you actually need to compile it into a regex; "vir(al|us)" on its own is just a string in your code:

    import re
    
    r = re.compile("vir(al|us)")
    filtered_records = [record for record in records if any(r.search(w) for w in record['TI'])]
    

    The regex in your own loop would go where your if is:

    import re
    
    r = re.compile(r"vir(al|us)")
    if r.search(title) and title not in titles: 
          .......
    

    If you don't want "viruses" and similar words to match, add word boundaries to your regex:

    r = re.compile(r"\bvir(al|us)\b")
    

    You should also make titles a set, which cannot hold duplicates. A working example using your own code:

    r = re.compile(r"\bvir(al|us)\b")
    titles = set()  # start with an empty set of titles
    for record in records:
        ti_list = record['TI']  # titles
        for title in ti_list:
            if r.search(title):  # whole-word match on viral/virus
                titles.add(title)
    

    Which can become a set comprehension:

    r = re.compile(r"\bvir(al|us)\b")
    
    titles = {title for record in records for title in record['TI']  if r.search(title)} # titles
    

    Since record['TI'] actually returns a single string and not a list, drop the inner loop:

    r = re.compile(r"\bvir(al|us)\b")
    titles = set() 
    for record in records:
        title = record['TI']  # title is a str not a list
        if r.search(title):  # whole-word match on viral/virus
            titles.add(title)
    

    Apply the same change to the set comprehension and the other examples above: use record['TI'] directly instead of iterating over it.
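
To make the final version easy to try without a network call, here is a minimal sketch that runs the same filter over a few made-up records. The sample titles, the matching_titles helper, and the re.IGNORECASE flag are assumptions added for illustration and are not part of the original code. The flag is there because titles often start with a capitalised "Viral" or "Virus", and the "TI" in rec check guards against records that have no title field.

```python
import re

# Hypothetical stand-ins for the dict-like records that Medline.parse yields.
# Only the 'TI' (title) key is used here, and one record deliberately lacks it.
sample_records = [
    {"TI": "Viral load dynamics in acute infection"},
    {"TI": "A survey of computer viruses"},
    {"TI": "Influenza virus entry into host cells"},
    {"TI": "Influenza virus entry into host cells"},  # duplicate title
    {"AU": ["Doe J"]},  # no title field
]

# \b keeps "viruses" from matching. IGNORECASE is an added assumption so that
# a capitalised "Viral" or "Virus" is also caught.
pattern = re.compile(r"\bvir(al|us)\b", re.IGNORECASE)


def matching_titles(records):
    """Return the set of titles containing the whole word 'viral' or 'virus'."""
    return {
        rec["TI"]
        for rec in records
        if "TI" in rec and pattern.search(rec["TI"])
    }


titles = matching_titles(sample_records)

print("Publications with viral or virus in the title:")
for title in sorted(titles):
    print(" ", title)
```

With real data you would pass Medline.parse(handle) in place of sample_records. Note that the printing loop iterates over titles, not over records: Medline.parse returns an iterator that is exhausted after the first pass, so a second loop over records would print nothing.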