python, regex, doi

Extract DOI (digital object identifier) from text


I have thousands of text blocks that contain references to studies. One sample looks like this:

txt = '<div>1. <em>Nationella riktlinjer för rörelseorganens sjukdomar</em> (Swedish National Guidelines). 2012, The National Board of Health and Welfare. doi:10.1097/BRS.0b013e31829ff095 https://www.socialstyrelsen.se/publikationer2012/2012-5-1</a></div><div>2. Jevsevar, D.S., et al., <em>The American Academy of Orthopaedic Surgeons evidence-based guideline on: treatment of osteoarthritis of the knee, 2nd edition.</em> J Bone Joint Surg Am, 2013. <strong>95</strong>(20): p. 1885-6. <a href="http://www.ncbi.nlm.nih.gov/pubmed/24288804" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/24288804</a></div><div>3. Namba, R.S., et al., <em>Obesity and perioperative morbidity in total hip and total knee arthroplasty patients.</em> J Arthroplasty, 2005. <strong>20</strong>(7 Suppl 3): p. 46-50. <a href="https://dx.doi.org/10.1016/j.arth.2005.04.023" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1016/j.arth.2005.04.023</a></div><div>4. Peter, W.F., et al., <em>Physiotherapy in hip and knee osteoarthritis: development of a practice guideline concerning initial assessment, treatment and evaluation.</em> Acta Reumatol Port, 2011. <strong>36</strong>(3): p. 268-81. <a href="http://www.ncbi.nlm.nih.gov/pubmed/22113602" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">http://www.ncbi.nlm.nih.gov/pubmed/22113602</a></div><div>5. Santoso, M.B. and L. Wu, <em>Unicompartmental knee arthroplasty, is it superior to high tibial osteotomy in treating unicompartmental osteoarthritis? A meta-analysis and systemic review.</em>&nbsp;J Orthop Surg Res, 2017. <strong>12</strong>(1): p. 50.&nbsp;<a href="https://dx.doi.org/10.1186/s13018-017-0552-9" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://dx.doi.org/10.1186/s13018-017-0552-9</a></div><div>6. Management of osteoarthritis. NICE guidelines. 
NICE Pathway last updated: 22 January 2019. <a href="https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf" rel="noopener noreferrer" target="_blank" style="color: rgb(220, 161, 13);">https://pathways.nice.org.uk/pathways/osteoarthritis/management-of-osteoarthritis.pdf</a></div><div>&nbsp;</div>'

The text contains several DOI links and "doi:" identifiers. How can I extract all of them, ideally into a list such as:

['doi:10.1097/BRS.0b013e31829ff095',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1016/j.arth.2005.04.023',
'https://dx.doi.org/10.1186/s13018-017-0552-9',
]

I have tried several regular expressions for this, but to no avail. For example:

import re
exp = "10.\\d{4,9}/[-._;()/:a-z0-9A-Z]+"
pattern = re.compile(exp)

pattern.findall(txt)

This returns an empty list.


Solution

  • Thanks to @wiktor-stribiżew, I got it working.

    import re

    # Raw string; the dot after "10" is escaped so it matches a literal "." only
    exp = r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+"
    pattern = re.compile(exp)

    print(pattern.findall(txt))
    
    ['10.1097/BRS.0b013e31829ff095', '10.1016/j.arth.2005.04.023', '10.1016/j.arth.2005.04.023', '10.1186/s13018-017-0552-9', '10.1186/s13018-017-0552-9']
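
The result above contains bare DOIs, with duplicates wherever a DOI appears both in an href and in the link text. If you want the prefixed forms from the question and each reference only once, here is a minimal sketch. The names DOI_RE, extract_dois and sample are made up for illustration, the prefix alternatives are an assumption based on the sample text, and dropping duplicates is my choice rather than something the question requires.

```python
import re

# Optional "doi:" or "http(s)://(dx.)doi.org/" prefix, then the DOI itself.
# The prefix alternatives are an assumption drawn from the sample text.
DOI_RE = re.compile(
    r"(?:doi:|https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+"
)


def extract_dois(text):
    """Return unique DOI references in order of first appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(DOI_RE.findall(text)))


# Shortened stand-in for the txt variable from the question
sample = (
    'doi:10.1097/BRS.0b013e31829ff095 '
    '<a href="https://dx.doi.org/10.1016/j.arth.2005.04.023">'
    'https://dx.doi.org/10.1016/j.arth.2005.04.023</a>'
)

print(extract_dois(sample))
# ['doi:10.1097/BRS.0b013e31829ff095', 'https://dx.doi.org/10.1016/j.arth.2005.04.023']
```

One caveat: the character class includes "." and ";", so a DOI that is directly followed by sentence punctuation will pick up that trailing character. If your texts contain such cases, stripping trailing punctuation from each match is one way to handle it.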