Search code examples
pythonpython-imaging-librarybioinformaticsdna-sequence

How to create a barcode from a DNA sequence using Python and PIL library?


i am a beginner and want to make a barcode out of this DNA sequence by using pyhton code. it's supposed to read each 1024 nucleotide and checks for mers (a combination of 4 nucleotides i.g. AAAA, AAAC, AAAG, AAAT ..... TTTT). each mer holds an index in an array (size = 256) if it found AAAA within the first 1024 it stores its count in it's index and so on, then to the next 1024 until it's done with the whole sequence. that will create a 2D array which will be turned into a png in gray scale.

my problem is that it took only the first 1024 and displayed it on the entire 1024X256 image.

DNA: https://1drv.ms/f/s!AuXxv7yqjA_FlS_ujYOMUvikWg8E

#read the DNA sequence
fasta_file = open(r'C:path\Escherichia_coli_ATCC_10798.fasta','r')
SE =fasta_file.read()
fasta_file.close()
seq = SE[177:]
dna_sequence = seq.replace("\n","")

# Sample size and mer length
#sample is the window that will go thorugh the whole sequance
sample_size = 1024
mer_length = 4

# Array to store the counts of each mer
barcode = [0] * 256

# Generate all possible 4-mers
mers = []
for i in range(256):
    mer = ""
    for j in range(4):
        mer += "ACGT"[i % 4]
        i //= 4
    mers.append(mer)


# Loop through the sample and count the occurrences of each mer
for w in range(sample_size):
    mer = dna_sequence[w:w+mer_length]
    barcode[mers.index(mer)] += 2

# Print the counts of each mer
    #print(mers[i], ":", barcode[i])
print(barcode)



# image
    # Python program to convert numpy array to image
    # import pillow library
from PIL import Image
import numpy as np

    # define a main function
    # Create the barcode array with the same shape as the desired image
code = np.array(barcode, dtype=np.uint8)

    # Create an Image object from the barcode array
image = Image.fromarray(code)

    # Reshape the image to the desired size (1024x4000)
image = image.resize((1024, 4000))

    # Save the image
image.save('Escherichia_coli.png')

    # Display the image (optional)
image.show()

the dark image is what i got the other one is what i was supposed to getenter image description here my output

i don't know how to attach the DNA sequence. some info: genome_id="531534bd23a542ae" atcc_catalog_number="ATCC 10798" species="Escherichia coli"

link to similar genome:

https://www.ncbi.nlm.nih.gov/assembly/GCA_028644645.1


Solution

  • I'm seeing two questions here:

    my problem is that it took only the first 1024 and displayed it on the entire 1024X256 image.

    You'll need nested for-loops to cover the full sequence for each 4-mer. The for-loop in your code covers the first window (1024 nucleotides) only.

    the dark image is what i got the other one is what i was supposed to get

    The solution is to scale your count array by some large value. This will increase the brightness. I'm using 20 as an example here.

    I'd also suggest using Biopython's SeqIO interface to handle fasta sequences as SeqRecord objects. Here's the updated code:

    from Bio import SeqIO
    from PIL import Image
    import numpy as np
    
    #read the DNA sequence
    record = SeqIO.read(r'C:path\Escherichia_coli_ATCC_10798.fasta', "fasta")
    
    # Sample size and mer length
    #sample is the window that will go thorugh the whole sequance
    window      = 1024
    mer_length  = 4
    n_windows   = int(np.ceil(len(record.seq)/window))
       
    mers = []
    for i in range(256):
        mer = ""
        for j in range(4):
            mer += "ACGT"[i % 4]
            i //= 4
        mers.append(mer)
    
    # Array to store the counts of each mer 
    # Each row represents 1024 nucleotides, last row counts are handled by SeqIO 
    # even if the remaining window length is less than 1024
    # 256 cols (each col represents a 4-mer)
    image_array = np.zeros((n_windows, len(mers)), dtype=int)
    
    # Loop through the sample and count the occurrences of each mer
    for w in range(0,n_windows-1):
        for i in range(len(mers)):
            image_array[w,i] = 20*record.seq[1024*w:1024*(w+1)].count(mers[i])
    
    # image
    # Create an Image object from the image_array
    # image = Image.fromarray(image_array)
    
    image = Image.fromarray(image_array.astype(np.uint8))  # modded by me because of TypeError: Cannot handle this data type: (1, 1), <i8
    
    
    # Reshape the image to the desired size (1024x4000)
    image = image.resize((1024, 4000))
    # Save the image
    image.save('Escherichia_coli.png')
    # Display the image (optional)
    image.show()
    

    The above code should give you an output similar to this:

    E. coli barcode image from genome