Pubmed ID to author list + citation, Python?

I have a list of PubMed IDs, and I want to extract a citation with a full author list for each one. There are online tools that do this, but they abbreviate the author list to "et al."

I am trying to do this with the Entrez package in Biopython, using xml.etree.ElementTree to parse the returned XML.

Here is what I have:

from Bio.Entrez import efetch
import xml.etree.ElementTree as ET

def fetch_abstract(pmid):
    handle = efetch(db='pubmed', id=pmid, retmode='xml')
    xml_data = handle.read()
    print xml_data  # this prints the XML data structure correctly

    article = ET.XML(xml_data)

    #problem starts here. I want to create a citation, so I start by trying to
    #get the names of the authors, but I am not sure why this is not working.
    for author_name in article.findall('AuthorValidYN'):
        print author_name



XML looks like this:

<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2014//EN"      "">
    <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">22864638</PMID>
    <Article PubModel="Print">
            <ISSN IssnType="Electronic">1573-7292</ISSN>
            <JournalIssue CitedMedium="Internet">
            <Title>Familial cancer</Title>
            <ISOAbbreviation>Fam. Cancer</ISOAbbreviation>
        <ArticleTitle>No evidence for breast cancer susceptibility associated with variants of BRD7, a component of p53 and BRCA1 pathways.</ArticleTitle>
        <ELocationID EIdType="doi" ValidYN="Y">10.1007/s10689-012-9556-0</ELocationID>
            <AbstractText>BRD7 (bromodomain 7), a subunit of poly-bromo-associated BRG1-associated factor (PBAF)-specific Swi/Snf chromatin remodeling complexes, has been proposed as a tumour suppressor protein following its identification as an important component of both functional p53 and BRCA1 (breast cancer 1, early onset) pathways. As low BRD7 expression levels have been linked to p53-wild-type breast tumour cells, we hypothesized an implication of BRD7 germline alterations in the pathogenesis of hereditary breast cancer similar to that of TP53 in Li-Fraumeni syndrome. We performed sequence analysis of the BRD7 gene on 61 high-risk individuals with hereditary or very-early-onset breast cancer and 100 healthy controls. Four potentially disease-causing single-nucleotide alterations were detected within the cohort of breast cancer patients (one listed as a rare single-nucleotide polymorphism (SNP) in the NCBI (National Center for Biotechnology Information) SNP database). Two of the detected variants were also each found once within the control collective. Segregation analysis on both families of those carrying the remaining two variants revealed segregation of these BRD7 alterations independent of breast cancer. In conclusion, it seems that the BRD7 variants we detected represent rare polymorphisms and mainly rule out BRD7 as a frequent high-penetrance breast cancer susceptibility gene. However, further analyses in larger cohorts of women with hereditary breast cancer should clarify the role of BRD7 in breast cancer predisposition.</AbstractText>
        <AuthorList CompleteYN="Y">
            <Author ValidYN="Y">
                <Affiliation>Institute of Cell and Molecular Pathology, Hannover Medical School, Carl-Neuberg-Strasse 1, Hannover, Germany.</Affiliation>
            <Author ValidYN="Y">
            <Author ValidYN="Y">
            <Author ValidYN="Y">
            <PublicationType>Comparative Study</PublicationType>
            <PublicationType>Journal Article</PublicationType>
            <PublicationType>Research Support, Non-U.S. Gov't</PublicationType>
        <MedlineTA>Fam Cancer</MedlineTA>
            <NameOfSubstance>BRCA1 Protein</NameOfSubstance>
            <NameOfSubstance>BRCA1 protein, human</NameOfSubstance>
            <NameOfSubstance>BRD7 protein, human</NameOfSubstance>
            <NameOfSubstance>Chromosomal Proteins, Non-Histone</NameOfSubstance>
            <NameOfSubstance>TP53 protein, human</NameOfSubstance>
            <NameOfSubstance>Tumor Suppressor Protein p53</NameOfSubstance>
            <DescriptorName MajorTopicYN="N">Adult</DescriptorName>
            <DescriptorName MajorTopicYN="N">Aged</DescriptorName>
            <DescriptorName MajorTopicYN="N">BRCA1 Protein</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
            <DescriptorName MajorTopicYN="N">Breast Neoplasms</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
            <DescriptorName MajorTopicYN="N">Case-Control Studies</DescriptorName>
            <DescriptorName MajorTopicYN="N">Chromosomal Proteins, Non-Histone</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
            <DescriptorName MajorTopicYN="N">Female</DescriptorName>
            <DescriptorName MajorTopicYN="Y">Genetic Predisposition to Disease</DescriptorName>
            <DescriptorName MajorTopicYN="N">Humans</DescriptorName>
            <DescriptorName MajorTopicYN="N">Male</DescriptorName>
            <DescriptorName MajorTopicYN="N">Middle Aged</DescriptorName>
            <DescriptorName MajorTopicYN="N">Mutation</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
            <DescriptorName MajorTopicYN="N">Pedigree</DescriptorName>
            <DescriptorName MajorTopicYN="N">Polymorphism, Single Nucleotide</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
            <DescriptorName MajorTopicYN="N">Prognosis</DescriptorName>
            <DescriptorName MajorTopicYN="N">Tumor Suppressor Protein p53</DescriptorName>
            <QualifierName MajorTopicYN="Y">genetics</QualifierName>
            <DescriptorName MajorTopicYN="N">Young Adult</DescriptorName>
        <PubMedPubDate PubStatus="entrez">
        <PubMedPubDate PubStatus="pubmed">
        <PubMedPubDate PubStatus="medline">
        <ArticleId IdType="doi">10.1007/s10689-012-9556-0</ArticleId>
        <ArticleId IdType="pubmed">22864638</ArticleId>


  • Here is what I used to do the same thing (using BeautifulSoup instead).

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(xml_data)
    a_recs = []
    for tag in soup.findAll("pubmedarticle"): # I'm working with multiple articles in one file
        for a_tag in tag.findAll("author"):
            a_rec = {}
            a_rec['pmid'] = int(tag.pmid.text)
            a_rec['lastname'] = a_tag.lastname.text
            a_rec['forename'] = a_tag.forename.text
            a_rec['suffix'] = a_tag.suffix.text
            a_rec['initials'] = a_tag.initials.text
            a_rec['affiliation'] = a_tag.affiliation.text
            a_recs.append(a_rec)  # collect the record; otherwise a_recs stays empty

    Parts of an author's name are often missing, and accessing the text attribute of a missing element raises an error, so check that the element exists before reading its text. (I wrote a short function that returns None when a tag has no text attribute.)
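
The same "default to None" idea can also be sketched with xml.etree.ElementTree, which the question started from. Two things are worth noting about the XML shown in the question: Author is the element name and ValidYN is one of its attributes, so there is no 'AuthorValidYN' tag to find; and findall with a bare tag name only matches direct children, so nested elements need './/Author' or iter('Author'). The snippet below is a minimal sketch in Python 3, not a definitive implementation. SAMPLE_XML is a hypothetical, trimmed-down record with placeholder names and a placeholder PMID, and extract_authors and AUTHOR_FIELDS are names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down record for illustration only. The PMID and the
# author names are placeholders, not data from a real PubMed entry.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
      <PMID Version="1">12345</PMID>
      <Article PubModel="Print">
        <AuthorList CompleteYN="Y">
          <Author ValidYN="Y">
            <LastName>Doe</LastName>
            <ForeName>Jane</ForeName>
            <Initials>J</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Roe</LastName>
            <Initials>R</Initials>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Child elements of <Author> that we try to read for every author.
AUTHOR_FIELDS = ("LastName", "ForeName", "Suffix", "Initials")


def extract_authors(xml_text):
    """Return one dict per author; name parts that are absent become None."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() walks all descendants, so nesting depth does not matter.
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        for author in article.iter("Author"):
            record = {"pmid": pmid, "valid": author.get("ValidYN")}
            for field in AUTHOR_FIELDS:
                # findtext() returns None when the child element is missing,
                # so no explicit existence check is needed.
                record[field] = author.findtext(field)
            records.append(record)
    return records


authors = extract_authors(SAMPLE_XML)
for author in authors:
    print(author)
```

With real data, the string returned by handle.read() would take the place of SAMPLE_XML. Because findtext already returns None for an absent child, the second author above simply gets ForeName set to None instead of raising an error.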