Tags: string, search, pandas, row, partial

Best solution on partial string search with Pandas


I work with very large data sets (1.5 GB+) and do partial string searches on them.

I was able to write a script for my work, but it takes too long:

fhand = open('C:/Users/promotor/Documents/tce-sagres/TCE-PB-SAGRES-Empenhos_Esfera_Municipal.txt','r')
pergunta = raw_input('Pesquisa: ')
fresult = open('resultado.csv','w')
for line in fhand :
    #linha = linha + 0.001 
    #update_progress(int(linha)*1000)
    if pergunta in line : 
        print line
        fresult.write(line)  
print "terminado."

I was wondering if there is a faster way to do this with Pandas. I tried str.contains, but I could only search one column at a time.

Best regards.


Solution

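  • You are iterating over the file line by line in a Python for loop, and that is probably where most of the time goes. I recommend reading the whole file into a single string and then using a regex to pull out the matching lines. Note that this needs enough memory to hold the entire file.

    Try the following code: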

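    import re

    # your_file_name is a placeholder for the path to your data file
    with open(your_file_name, 'r') as f:
        lines = f.read()
    name = input('pattern: ')
    # re.escape treats the input as literal text; with re.MULTILINE, ^ and $
    # anchor at every line, so the first and last lines can match too
    pattern_to_match = r'^.*%s.*$' % re.escape(name)
    matched_pattern = re.findall(pattern_to_match, lines, re.IGNORECASE | re.MULTILINE)
    print(matched_pattern)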
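
    Since the question asks about Pandas specifically: str.contains works on one column (a Series) at a time, but it can be applied to every column and the results combined with any(axis=1). Below is a minimal sketch of that idea, reading the file in chunks so it does not have to fit in memory. The helper names rows_containing and search_file, the comma separator, the chunk size, and the sample data are all made up for illustration. Whether this is actually faster than the plain loop depends on the data, so it is worth timing both.

```python
import io
import pandas as pd

def rows_containing(df, needle):
    """Return the rows of df in which at least one column contains needle."""
    # Cast every column to str so .str.contains also works on numeric columns.
    as_text = df.astype(str)
    # regex=False makes this a plain (case-sensitive) substring test.
    hits = as_text.apply(lambda col: col.str.contains(needle, regex=False, na=False))
    # Ignore cells that were missing, so text like "nan" or "None" never matches.
    hits = hits & df.notna()
    return df[hits.any(axis=1)]

def search_file(source, needle, sep=",", chunksize=100_000):
    """Search a delimited file chunk by chunk instead of loading it all at once."""
    matches = [
        rows_containing(chunk, needle)
        for chunk in pd.read_csv(source, sep=sep, dtype=str, chunksize=chunksize)
    ]
    return pd.concat(matches, ignore_index=True)

# Tiny made-up sample standing in for the real file (hypothetical data).
sample = io.StringIO("city,item\nJoao Pessoa,paper\nPatos,printer ink\nSousa,fuel\n")
result = search_file(sample, "P", chunksize=2)
print(result)
```

    To write the hits to disk as in the original script, result.to_csv('resultado.csv', index=False) should do. For a file where only a few rows match, the concatenated result stays small even though the input is large.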