Parsing specific region of a txt, comparing to list of strings, then generating new list composed of matches


I am trying to do the following:

  1. Read through a specific portion of a text file (there is a known starting point and ending point)
  2. While reading through these lines, check to see if a word matches a word that I have included in a list
  3. If a match is detected, then add that specific word to a new list

I have been able to read through the text and grab other data that I need from it, but so far I have been unable to do the steps listed above.

I have tried to implement the following example: "Python - Search Text File For Any String In a List", but I could not get it to read the file correctly.

I have also tried to adapt the following: https://www.geeksforgeeks.org/python-finding-strings-with-given-substring-in-list/, but I was equally unsuccessful.

Here is some of my code:

import re
from itertools import islice
import os

# list of all countries
oneCountries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica,, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, 
Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
countries = oneCountries.split(",")

path = "C:/Users/me/Desktop/read.txt"
thefile = open(path, errors='ignore')

countryParsing = False
for line in thefile:
    line = line.strip()
#    if line.startswith("Submitting Author:"):
#    if re.match(r"Submitting Author:", line):
#        print("blahblah1")
#        countryParsing = True
#        if countryParsing == True:
#            print("blahblah2")
#            
#            res = [x for x in line if re.search(countries, x)]
#            print("blah blah3: " + str(res))
#    elif re.match(r"Running Head:", line):
#        countryParsing = False
#    if countryParsing == True:
#        res = [x for x in line if re.search(countries, x)]
#        print("blah blah4: " + str(res))


#        for x in countries:
#            if x in thefile:
#                print("a country is: " + x)
#        if any(s in line for s in countries):
#            listOfAuthorCountries = listOfAuthorCountries + s + ", "
#    if re.match(f"Submitting Author:, line"):

The commented-out lines (those starting with #) are versions of the code that I have tried and could not get to work properly.

As requested, this is an example of the text file that I'm trying to grab the data from. I've modified it to remove sensitive information, but in this particular case, the "new list" should end up containing several "France" entries:

    txt above....
Submitting Author:

    asdf, asdf  (proxy)
    France
    asdfasdf
    blah blah
    asdfasdf

    asdf, Provence-Alpes-Côte d'Azu 13354
    France

    blah blah
    France
    asdf
Running Head:
    ...more text below

Solution

  • Based on the three goals you listed and on my reading of your code (which may not match what you intended), I propose:

    # list of all countries
    countries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
    countries = countries.split(",")
    countries = [c.strip() for c in countries]
    
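    filename = "read.txt"
    my_other_list = []
    toParse = False
    # The with-block closes the file even if an error occurs while reading.
    with open(filename, errors='ignore') as filehandle:
        for line in filehandle:
            line = line.strip()
            if line.startswith("Submitting Author:"):
                # Start marker: begin collecting from the next line on.
                toParse = True
            elif line.startswith("Running Head:"):
                # End marker: stop collecting.
                toParse = False
            elif toParse:
                # Record every country name that occurs in this line.
                for c in countries:
                    if c in line:
                        my_other_list.append(c)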
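
To try the marker-driven loop without touching a file on disk, it can be wrapped in a function and fed an in-memory copy of the sample text. This is only a sketch: the function name `collect_matches`, its parameters, and the two-entry word list are made up for illustration and are not part of the original code.

```python
import io

def collect_matches(lines, words, start="Submitting Author:", stop="Running Head:"):
    """Return every entry of `words` found on lines between the two markers."""
    matches = []
    parsing = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(start):
            parsing = True          # region begins after this line
        elif line.startswith(stop):
            parsing = False         # region has ended
        elif parsing:
            matches.extend(w for w in words if w in line)
    return matches

# In-memory stand-in for read.txt, based on the sample in the question.
sample = io.StringIO(
    "txt above....\n"
    "Submitting Author:\n"
    "\n"
    "    asdf, asdf  (proxy)\n"
    "    France\n"
    "    asdfasdf\n"
    "\n"
    "    asdf, Provence-Alpes-Côte d'Azu 13354\n"
    "    France\n"
    "\n"
    "    blah blah\n"
    "    France\n"
    "    asdf\n"
    "Running Head:\n"
    "    France (outside the region, so not collected)\n"
)

found = collect_matches(sample, ["France", "Germany"])
print(found)  # ['France', 'France', 'France']
```

Any iterable of lines works as the first argument, so an open file handle can be passed in place of the `io.StringIO` object.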
    

    EDIT SUMMARY

    1. Adapted code to work on the text sample provided.

    2. Fixed the list of countries (originally there were two commas after Costa Rica).
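
    3. One caveat: the test `c in line` is a substring test, so a line containing "Nigeria" also matches "Niger", and "Guinea-Bissau" also matches "Guinea".

If that matters for your data, one possible alternative is a single regular expression that only accepts whole names. The sketch below is not part of the code above; the helper name `build_country_pattern` and the short country list are assumptions made for illustration.

```python
import re

def build_country_pattern(countries):
    # Put longer names first so "Guinea-Bissau" is tried before "Guinea".
    ordered = sorted(countries, key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in ordered)
    # (?<!\w) and (?!\w) reject a name that is embedded in a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")

countries = ["Niger", "Nigeria", "Guinea", "Guinea-Bissau"]
pattern = build_country_pattern(countries)

# Substring test: reports both names for a line that only says "Nigeria".
substring_hits = [c for c in countries if c in "Nigeria"]
print(substring_hits)                        # ['Niger', 'Nigeria']

# Whole-name test: reports only the name that is actually present.
print(pattern.findall("Nigeria"))            # ['Nigeria']
print(pattern.findall("Guinea-Bissau"))      # ['Guinea-Bissau']
print(pattern.findall("asdf, Niger 13354"))  # ['Niger']
```

Inside the parsing loop, `my_other_list.extend(pattern.findall(line))` would then take the place of the inner `for c in countries` loop.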