python, python-3.x, list, sublist

Get sublists of fixed size


Title is not clear but here's what I want to do.

I have a genomic chain:

corpus_2 = ['TCAATCAC', 'GGGGGGGGGGG', 'AAAA']

I want to extract, from each string, all substrings of a fixed size. Let's say I want substrings of size 4.

Example of the result I am looking for: [['TCAA', 'CAAT', 'AATC', 'ATCA', 'TCAC'], ['GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG', 'GGGG'], ['AAAA']]

For each string, we take the substring from index 0 up to index 3, then shift the window by one character and take the next substring, and so on until the end of the string.

Here is my code :

ngram_size = 4
corpus = ['TCAATCAC', 'GGGGGGGGGGG', 'AAAA']
decoliste = []  # output list
listemp = []  # one temporary list per string of the input list
for element in corpus:
    # store the list built for the previous string (empty on the first pass)
    decoliste.append(listemp)
    listemp = []

    for i in range(len(element)):
        try:
            # keep only full-size windows; slices near the end are shorter
            if len(element[i:i + ngram_size]) == ngram_size:
                listemp.append(element[i:i + ngram_size])
        except:
            pass
decoliste.append(listemp)  # store the list built for the last string

del decoliste[0]  # drop the empty list appended on the first pass
print(decoliste)

I wanted to know if you could give me hints on how to drastically improve this code (it's really long and my teacher is not going to like it).


Solution

  • For each string, you can go over every index from 0 up to its length minus ngram_size (inclusive) and take the substring of length ngram_size starting at that index. Stopping there guarantees that every slice is full-size, so no length check is needed. Putting this all together using list comprehensions actually makes it pretty elegant:

    result = [[e[i:i + ngram_size] for i in range(len(e) + 1 - ngram_size)] for e in corpus_2]
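For reference, here is the same comprehension wrapped in a small function and run on the data from the question. This is only a sketch: the function name sliding_substrings and the check on size are my own additions, not part of the original answer.

```python
def sliding_substrings(sequences, size):
    """Return, for each string, every contiguous substring of length `size`."""
    # Assumption (not in the original answer): reject non-positive sizes.
    if size < 1:
        raise ValueError("size must be a positive integer")
    # range() is empty when a string is shorter than `size`,
    # so such a string simply yields an empty list.
    return [
        [seq[i:i + size] for i in range(len(seq) - size + 1)]
        for seq in sequences
    ]


corpus_2 = ['TCAATCAC', 'GGGGGGGGGGG', 'AAAA']
result = sliding_substrings(corpus_2, 4)
print(result)
```

A string shorter than the requested size gives an empty inner list rather than an error, which matches what the loop in the question would produce.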