I work with very large data sets (1.5 GB+) and do partial string searches on them.
I was able to write a script for my work, but it takes too long:
fhand = open('C:/Users/promotor/Documents/tce-sagres/TCE-PB-SAGRES-Empenhos_Esfera_Municipal.txt', 'r')
pergunta = raw_input('Pesquisa: ')
fresult = open('resultado.csv', 'w')
for line in fhand:
    #linha = linha + 0.001
    #update_progress(int(linha)*1000)
    if pergunta in line:
        print line
        fresult.write(line)
fresult.close()
fhand.close()
print "terminado."
I was wondering if there is a faster way to do this with Pandas. I tried str.contains, but it only lets me search one column at a time.
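For reference, this is roughly the single-column search I got working (a sketch: the separator, the encoding, and the column name 'credor' are placeholders for my actual file):

import pandas as pd

# Assumptions: the file is semicolon-separated with a header row,
# and 'credor' is a hypothetical column name standing in for a real one.
df = pd.read_csv('TCE-PB-SAGRES-Empenhos_Esfera_Municipal.txt',
                 sep=';', encoding='latin-1')

pergunta = raw_input('Pesquisa: ')

# str.contains builds a boolean mask, but only over a single column.
mask = df['credor'].str.contains(pergunta, case=False, na=False, regex=False)
df[mask].to_csv('resultado.csv', index=False)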
Best regards.
You are iterating over the file line by line in a Python for loop, and that is probably what is taking so long. I recommend reading the whole file into a single string and then using a regex to find your pattern.
Try the following code:
import re

with open(your_file_name, 'r') as f:
    lines = f.read()

name = input('pattern: ')
# re.escape keeps special characters in the search term from being
# treated as regex syntax; '.' does not match newlines, so each
# match stays within a single line.
pattern_to_match = r'.*%s.*' % re.escape(name)
matched_pattern = re.findall(pattern_to_match, lines, re.IGNORECASE)
print(matched_pattern)
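Note that f.read() pulls the whole 1.5 GB file into memory at once, so make sure you have enough RAM for that. If you also want the matches written to resultado.csv like in your original script, a minimal follow-up would be:

# Write each matched line to the output file, one per line,
# mirroring what the original script did with fresult.
with open('resultado.csv', 'w') as out:
    for match in matched_pattern:
        out.write(match + '\n')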