python, substring, longest-substring

DNA sequencing using Python


Using loops, how can I write a function in Python that finds the longest valid chain in a DNA sequence? The function should return the longest substring that consists only of the characters 'A', 'C', 'G', and 'T' when they are mixed up with other characters. For example, for the sequence 'ACCGXXCXXGTTACTGGGCXTTGT' it should return 'GTTACTGGGC'.


Solution

  • If the data is provided as a string and 'X' is the only invalid character, as in your example, you could simply split it on 'X' to get a list of the valid substrings.

    startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
    array = startstring.split('X')
    

    Then loop over the list, keeping track of the longest element seen so far:

    # Initialize placeholders for comparison
    temp_max_string = ''
    temp_max_length = 0
    
    # Loop over each string in the list
    for i in array:
        # Check if the current substring is longer than the longest found so far
        if len(i) > temp_max_length:
            # Replace the placeholders if it is longer
            temp_max_length = len(i)
            temp_max_string = i
    
    print(temp_max_string) # or 'print temp_max_string' if you are using python2.
    

    You could also use Python's built-ins to get the same result more concisely:

    Sorting by descending length (list.sort()):

    startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
    array = startstring.split('X')
    array.sort(key=len, reverse=True)
    print(array[0])  # prints the longest substring, since we sorted by descending length
    print(len(array[0]))  # prints the length of the longest substring
    

    Getting only the longest substring (max()):

    startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
    array = startstring.split('X')
    longest = max(array, key=len)
    print(longest) # gives the longest substring
    print(len(longest)) # gives you the length of the longest substring
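
The split-based snippets assume 'X' is the only invalid character. If other characters can appear, one option is to scan the string one character at a time with a plain loop, as the question asks. This is a minimal sketch, not a definitive implementation: the function name longest_valid_run and the constant VALID_BASES are made up for illustration, and the valid alphabet is assumed to be exactly 'A', 'C', 'G', and 'T'.

```python
# Assumed alphabet, taken from the question; adjust if lowercase or other bases are valid.
VALID_BASES = set("ACGT")


def longest_valid_run(sequence):
    """Return the longest run of consecutive characters that are all in VALID_BASES."""
    best = ""     # longest valid run found so far
    current = ""  # valid run currently being built
    for char in sequence:
        if char in VALID_BASES:
            current += char
            # Strict comparison: on a tie, the earliest run is kept
            if len(current) > len(best):
                best = current
        else:
            # Any other character ends the current run
            current = ""
    return best


print(longest_valid_run("ACCGXXCXXGTTACTGGGCXTTGT"))  # GTTACTGGGC
```

Because the comparison is strict, the first of several equally long runs is returned, and a string with no valid characters yields an empty string.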