Title is not clear but here's what I want to do.
I have a genomic chain:
corpus_2 = ['TCAATCAC', 'GGGGGGGGGGG', 'AAAA']
I want to extract all substrings of a fixed size from each string. Let's say I want substrings of size 4.
Example of the result I am looking for: [['TCAA', 'CAAT', 'AATC', 'ATCA', 'TCAC'], ['GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG'], ['AAAA']]
We take the substring from index 0 up to index 3, then move one character to the right and take the next one, and so on.
Here is my code :
ngram_size = 4
corpus = ['TCAATCAC', 'GGGGGGGGGGG', 'AAAA']
decoliste = []  # output list
listemp = []  # one sub-list per input string, appended to decoliste
for element in corpus:
    # print(element)
    decoliste.append(listemp)
    listemp = []
    for i in range(len(element)):
        try:
            if len(element[i:i + ngram_size]) == ngram_size:
                listemp.append(element[i:i + ngram_size])
        except:
            pass
decoliste.append(listemp)
del decoliste[0]
print(decoliste)
I would like some hints on how to drastically improve this code (it's really long, and my teacher is not going to like it).
For each string, you can go over every index from 0 up to its length minus ngram_size (inclusive)
and take the substring of length ngram_size starting at that index. Putting this all together using list comprehensions actually makes it pretty elegant:
result = [[e[i:i + ngram_size] for i in range(len(e) + 1 - ngram_size)] for e in corpus_2]
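If you want something reusable, here is a minimal sketch that wraps the same comprehension in a function. The name extract_ngrams and the check on ngram_size are my own additions for illustration, not something from your code:

```python
def extract_ngrams(sequences, ngram_size):
    """Return, for each sequence, every contiguous substring of length ngram_size."""
    # Hypothetical guard: a non-positive window size has no sensible meaning here.
    if ngram_size <= 0:
        raise ValueError("ngram_size must be positive")
    return [
        # range() is empty when the sequence is shorter than ngram_size,
        # so such sequences simply produce an empty list.
        [seq[i:i + ngram_size] for i in range(len(seq) - ngram_size + 1)]
        for seq in sequences
    ]


corpus_2 = ['TCAATCAC', 'GGGGGGGGGGG', 'AAAA']
result = extract_ngrams(corpus_2, 4)
print(result)
```

Because the range stops at the last index where a full window still fits, there is no need for the length check or the try/except from the original loop.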