CSV Wildcard within string


I am using a CSV file to read in terms that I am looking for in a text file.

I would like to use a wildcard or a 'Like'-style match in a term. I am not looking for a wildcard on the text document names, but on the search terms stored in the CSV file.

For example, these are the terms in the CSV file to search for in the text file; each term is in its own row:

test* #search for test, tests, testing, etc
project
list
table
chair

Is there a wildcard that can be used in the CSV file so that all variants of that word are returned? I want to place the wildcard in the CSV file.

Below is my code. The file that holds the terms I'm searching for is contract_search_terms.csv.

import csv
import glob
import os
import time

import pandas as pd


def main():
    txt_filepaths = glob.glob("**/*.txt", recursive=True)
    start = time.time()
    results = {}
    new_results = [] #place dictionary values organized per key = value instead of key = tuple of values 
    term_filename = open('contract_search_terms.csv', 'r') #file where terms to be searched is found
    term_file = csv.DictReader(term_filename)
    search_terms =[] #append terms into a list, this means that we can use several columns and append them all to one list.
    
    #############search for the terms imported into the list############################################################
    for col in term_file:
                
        search_terms.append(col['Contract Terms']) #indicate what columns you want read in

    print(search_terms) #this is just a check to show what terms are in the list
    
    for filepath in txt_filepaths:
        print(f"Searching document {filepath}")   #print what file the code is reading
        
        search_terms = search_terms #place terms list into search_terms so that the code below can read it when looping through the contracts.
                    
        filename = os.path.basename(filepath)
        found_terms = {} #dictionary of the terms found
        
        line_number={}
        
        
        for term in search_terms:
            if term in found_terms.keys():
                continue
                
            with open(filepath, "r", encoding="utf-8") as fp:
                lines = str(fp.readlines()).split('.') #turns contract file lines as a list
                
                for line in lines:
                    if line.find(term) != -1: #line by line is '-1', paragraph '\n'
                        line_number = lines.index(line)
                        new_results.append(f"'{term}' New_Column '{filename}' New_Column '{line}' New_Column '{line_number}'") #placing the results from the print statement below into a list
                        print(f"Found '{term}' in document '{filename}' in line '{line_number}'") 

                    if term in results.keys():
                        results[term].append([filename, line, line_number])

                    else:
                        results[term] = [filename]

                #Place results into dataframe and create a csv file to use as a check if results_reports is not correct
                d2=pd.DataFrame(new_results, columns=['Results']) #passing the list to a dataframe and giving it a column title
                d2.to_csv('results.csv', index=True) 

Solution

  • To keep the code flow simple, rather than testing each term for a wildcard and then branching to a different kind of search, I suggest using the regular expression module (part of the Python standard library) for every term:

    import re
    

    and treat each term as a regular expression pattern for the search:

    lst_found_terms = re.findall(term, line)
    if lst_found_terms:  # an empty list means no match in this line
        for found_term in lst_found_terms:
            ...  # record found_term here, as the original code did for term
    

    instead of:

    if line.find(term) != -1:
    

    If you are looking for exactly 'test', the regex pattern is the same string you would pass to find() (i.e. 'test'), provided the term contains no regex metacharacters such as . or (. If you want to find all words beginning with 'test', the pattern is r'\btest\w*'.

    In other words, the 'wildcard' for any TERM ending is formed by putting r'\b' in front of the term and r'\w*' after it (stored in the CSV as: \bTERM\w*).

    Regular expressions also support case-insensitive search: pass flags=re.I to re.findall().
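
As a minimal sketch of the pattern and the flag described above (the sample sentence is made up for illustration and is not from the original contracts):

```python
import re

# Hypothetical sample line, used only to show the matching behaviour.
line = "Testing began after the tests; one test failed."

# What the CSV cell would contain: \btest\w*
pattern = r"\btest\w*"

# Case-sensitive search: the capitalized 'Testing' is not matched.
case_sensitive = re.findall(pattern, line)

# flags=re.I makes the same pattern match regardless of case.
case_insensitive = re.findall(pattern, line, flags=re.I)

print(case_sensitive)    # ['tests', 'test']
print(case_insensitive)  # ['Testing', 'tests', 'test']
```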

    The simple condition proposed in another answer, if 'test' in line:, also evaluates to True for 'attestation' or 'contest'. To avoid this, the regex 'wildcard' sets a word boundary at the beginning of the term (r'\b').
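
The difference between the two checks can be seen in a small comparison. This is only an illustrative sketch; the word list is an assumption chosen to include the words mentioned above:

```python
import re

# Hypothetical word list covering the cases discussed above.
words = ["test", "tests", "testing", "attestation", "contest"]

# Plain substring check: true for every word containing 'test' anywhere.
substring_hits = [w for w in words if "test" in w]

# Word-boundary pattern: only words that start with 'test'.
boundary_hits = [w for w in words if re.search(r"\btest\w*", w)]

print(substring_hits)  # all five words
print(boundary_hits)   # ['test', 'tests', 'testing']
```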

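    Note that if you do not limit the number of extension characters, the wildcard for 'test' will also find 'testosterone'. You can limit them by replacing * with {0,3} for at most 3 additional characters and closing the pattern with another r'\b', i.e. r'\btest\w{0,3}\b' (covering tests and testing but not testosterone). Without the closing r'\b', the pattern would still match the first seven characters, 'testost', of 'testosterone'.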
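
If you would rather keep a plain test* in the CSV, as the question asks, one option is to translate the trailing * into the regex above before searching. The helper below is a hypothetical sketch, not part of any library: the name term_to_pattern and the max_extra parameter are assumptions made for illustration. Terms without * are escaped and matched as plain substrings, like find() does.

```python
import re


def term_to_pattern(term, max_extra=None):
    """Translate a CSV term with an optional trailing '*' into a regex.

    Hypothetical helper: 'test*' becomes a word-start match, while terms
    without '*' stay literal substring matches (same behaviour as find()).
    """
    wildcard = term.endswith("*")
    # Escape regex metacharacters so the term itself is matched literally.
    stem = re.escape(term.rstrip("*"))
    if not wildcard:
        return stem
    # Unbounded: any number of word characters; bounded: at most max_extra.
    tail = r"\w*" if max_extra is None else rf"\w{{0,{max_extra}}}"
    # The closing \b keeps a bounded pattern from matching part of a longer word.
    return rf"\b{stem}{tail}\b"


# Made-up sample text for illustration.
text = "test tests testing testosterone contest"

unbounded = re.findall(term_to_pattern("test*"), text)
bounded = re.findall(term_to_pattern("test*", max_extra=3), text)

print(unbounded)  # ['test', 'tests', 'testing', 'testosterone']
print(bounded)    # ['test', 'tests', 'testing']
```

In the original loop, the result of term_to_pattern(term) would be passed to re.findall() in place of the raw term.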