python regex python-3.x findall

Python's regex findall does not return all matches of Unicode text


I have a Unicode text that contains a list of journals with some details about each. I would like to retrieve only the names of the journals.

My text is very large and looks like this:

6) 6. ACROSS LANGUAGES AND CULTURES Semiannual ISSN: 1585-1923 AKADEMIAI KIADO ZRT, BUDAFOKI UT 187-189-A-3, BUDAPEST, HUNGARY, H-1117 Social Sciences Citation Index Arts & Humanities Citation Index 7) 7. ACTA ANALYTICA-INTERNATIONAL PERIODICAL FOR PHILOSOPHY IN THE ANALYTICAL TR ADITION Quarterly ISSN: 0353-5150 SPRINGER, 233 SPRING ST, NEW YORK, USA, NY, 10013 Arts & Humanities Citation Index 8) 8. ACTA ARCHAEOLOGICA Annual ISSN: 0065-101X WILEY, 111 RIVER ST, HOBOKEN, USA, NJ, 07030-5774 Arts & Humanities Citation Index 9) 9. ACTA BOREALIA Semiannual ISSN: 0800-3831 ROUTLEDGE JOURNALS, TAYLOR & FRANCIS LTD, 2-4 PARK SQUARE, MILTON PARK, ABINGDON, ENGLAND, OXON, OX14 4RN Arts & Humanities Citation Index 10) 10. ACTA CLASSICA Annual ISSN: 0065-1141 UNIV FREE STATE, DEPT ENG CLASSICAL LANG, PO BOX 339, BLOEMFONTEIN, SOUTH AFRICA, 9300 Arts & Humanities Citation Index 11) 11. ACTA HISTORICA TALLINNENSIA Annual ISSN: 1406-2925 ESTONIAN ACADEMY PUBLISHERS, 6 KOHTU, TALLINN, ESTONIA, 10130 Arts & Humanities Citation Index 12) 12. ACTA HISTRIAE Tri-annual ISSN: 1318-0185 4 تاریخ انتشار: 89/2/62 پژوهشگاه و شبکه آزمایشگاهی 98/3 :Code UNIV PRIMORSKA, SCI RES CENTRE KOPER, GARIBALDIJEVA 1, KOPER, SLOVENIA, CAPODISTRIA, SI-6000 Social Sciences Citation Index Arts & Humanities Citation Index 13) 13. ACTA KOREANA Semiannual ISSN: 1520-7412 ACADEMIA KOREANA KEIMYUNG UNIV, 1095 DALGUBEOLDAERO, DALSEO-GU, DAEGU, SOUTH KOREA, 704-701 Arts & Humanities Citation Index Current Contents - Arts & Humanities 14) 14. ACTA LINGUISTICA HUNGARICA Quarterly ISSN: 1216-8076 AKADEMIAI KIADO ZRT, BUDAFOKI UT 187-189-A-3, BUDAPEST, HUNGARY, H-1117 Social Sciences Citation Index Arts & Humanities Citation Index 15)15. ACTA LITERARIA Semiannual ISSN: 0717-6848 UNIV CONCEPCION, FAC HUMANIDADES ARTE, CASILLA 160-C, CORREO 3, CONCEPCION, CHILE, 00000 Arts & Humanities Citation Index 16) 16. 
ACTA MUSICOLOGICA Semiannual ISSN: 0001-6241 INT MUSICOLOGICAL SOC, BOX 561, BASEL, SWITZERLAND, CH-4001 Arts & Humanities Citation Index Current Contents - Arts & Humanities 17) 17. ACTA ORIENTALIA ACADEMIAE SCIENTIARUM HUNGARICAE Quarterly ISSN: 1588-2667 AKADEMIAI KIADO ZRT, BUDAFOKI UT 187-189-A-3, BUDAPEST, HUNGARY, H-1117 Arts & Humanities Citation Index 5 تاریخ انتشار: 89/2/62 پژوهشگاه و شبکه آزمایشگاهی 98/3 :Code Current Contents - Arts & Humanities 18) 18. ACTA PHILOSOPHICA Semiannual ISSN: 1121-2179 FABRIZIO SERRA EDITORE, PO BOX NO,1, SUCC NO. 8, PISA, ITALY, I-56123 Arts & Humanities Citation Index Current Contents - Arts & Humanities

I want the match to return:

ACROSS LANGUAGES AND CULTURES Semiannual

ACTA ANALYTICA-INTERNATIONAL PERIODICAL FOR PHILOSOPHY IN THE ANALYTICAL TR ADITION Quarterly

ACTA ARCHAEOLOGICA Annual

etc.

I have already tried this pattern on regex101 (https://regex101.com/r/eyafNd/1), where it seems to work.

import re

# txt holds the full journal listing shown above
regex = r"^(\d+\)\s*\d+\.\s+)(.*?) ISSN"
l = re.findall(regex, txt, re.IGNORECASE)
print(len(l))
print(l)

What it returns is a list with only one result, as follows:

[('6) 6. ', 'ACROSS LANGUAGES AND CULTURES Semiannual')]

Any help would be appreciated.

CS


Solution

  • Maybe take a look at this regex:

    (?<=\d\.\s).+?(?=\sISSN)
    


    regex = r"(?<=\d\.\s).+?(?=\sISSN)"
    l = re.findall(regex, txt, re.I)
    print(len(l))
    print(l)
    

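    This starts matching right after a digit, a dot, and a whitespace character, and stops just before a whitespace character followed by ISSN. Unlike your original pattern, it has no ^ anchor; without re.MULTILINE, ^ matches only at the very start of the string, which is why findall returned a single result. When I run this on your text, I get the following list: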

    ['ACROSS LANGUAGES AND CULTURES Semiannual', 'ACTA ANALYTICA-INTERNATIONAL PERIODICAL FOR PHILOSOPHY IN THE ANALYTICAL TR ADITION Quarterly', 'ACTA ARCHAEOLOGICA Annual'...]
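
To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened copy of the sample text. The abbreviated `txt` and the variable names `anchored` and `titles` are assumptions made for illustration; they are not part of the original question.

```python
import re

# Shortened sample based on entries 6, 8 and 9 from the question
# (addresses abbreviated; the real text is much longer).
txt = (
    "6) 6. ACROSS LANGUAGES AND CULTURES Semiannual ISSN: 1585-1923 "
    "AKADEMIAI KIADO ZRT, BUDAPEST, HUNGARY, H-1117 "
    "8) 8. ACTA ARCHAEOLOGICA Annual ISSN: 0065-101X "
    "WILEY, 111 RIVER ST, HOBOKEN, USA, NJ, 07030-5774 "
    "9) 9. ACTA BOREALIA Semiannual ISSN: 0800-3831"
)

# Pattern from the question: "^" matches only at the start of the string
# (re.MULTILINE is not set), so at most one match is possible.
anchored = re.findall(r"^(\d+\)\s*\d+\.\s+)(.*?) ISSN", txt, re.IGNORECASE)

# Pattern from the answer: no anchor, so every entry can match.
titles = re.findall(r"(?<=\d\.\s).+?(?=\sISSN)", txt, re.I)

print(len(anchored))  # 1
print(titles)
```

Note that `.` does not match a newline by default, so if an entry number and its title are split across lines in the real text, that entry may be skipped unless the text is normalized first or `re.DOTALL` is considered.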