Search code examples
pythonstringletter

Python - removing repeated letters in a string


Say I have a string in alphabetical order, based on the amount of times that a letter repeats.

  • Example: "BBBAADDC".

There are 3 B's, so they go at the start, 2 A's and 2 D's, so the A's go in front of the D's because they are in alphabetical order, and 1 C. Another example would be CCCCAAABBDDAB.

Note that there can be 4 letters in the middle somewhere (i.e. CCCC), as there could be 2 pairs of 2 letters.

However, let's say I can only have n letters in a row. For example, if n = 3 in the second example, then I would have to omit one "C" from the first substring of 4 C's, because there can only be a maximum of 3 of the same letters in a row.

Another example would be the string "CCCDDDAABC"; if n = 2, I would have to remove one C and one D to get the string CCDDAABC

Example input/output:

  1. n=2: Input: AAABBCCCCDE, Output: AABBCCDE
  2. n=4: Input: EEEEEFFFFGGG, Output: EEEEFFFFGGG
  3. n=1: Input: XXYYZZ, Output: XYZ

How can I do this with Python? Thanks in advance!

This is what I have right now, although I'm not sure if it's on the right track. Here, z is the length of the string.

for k in range(z+1):
        if final_string[k] == final_string[k+1] == final_string[k+2] == final_string[k+3]: 
            final_string = final_string.translate({ord(final_string[k]): None})
return final_string

Solution

  • Ok, based on your comment, you're either pre-sorting the string or it doesn't need to be sorted by the function you're trying to create. You can do this more easily with itertools.groupby():

    import itertools
    
    def max_seq(text, n=1):
        result = []
        for k, g in itertools.groupby(text):
            result.extend(list(g)[:n])
        return ''.join(result)
    
    
    max_seq('AAABBCCCCDE', 2)
    # 'AABBCCDE'
    max_seq('EEEEEFFFFGGG', 4)
    # 'EEEEFFFFGGG'
    max_seq('XXYYZZ')
    # 'XYZ'
    max_seq('CCCDDDAABC', 2)
    # 'CCDDAABC'
    

    In each group g, it's expanded and then sliced until n elements (the [:n] part) so you get each letter at most n times in a row. If the same letter appears elsewhere, it's treated as an independent sequence when counting n in a row.


    Edit: Here's a shorter version, which may also perform better for very long strings. And while we're using itertools, this one additionally utilises itertools.chain.from_iterable() to create the flattened list of letters. And since each of these is a generator, it's only evaluated/expanded at the last line:

    import itertools
    
    def max_seq(text, n=1):
        sequences = (list(g)[:n] for _, g in itertools.groupby(text))
        letters = itertools.chain.from_iterable(sequences)
        return ''.join(letters)