
Python: How to split txt file data into csv according to user input


I am reading in a .txt file that looks like this: TTACGATATACGA etc., but contains thousands of characters. I can read in the file and output it as a CSV according to user input that sets the characters per column and the number of columns; however, it writes a new file each time.

Ideally, I would like a format such as this per file:

User enters 4 and 3.

Output: TCAG, TGCT, TACG,

My current output is this:

TCAGTGCTTACG

I have tried looking at string splitting but I don't seem to be able to get it to work.

Here is what I've written thus far; apologies if it's poor:

#user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int((character_per_column * columns_per_entry))
print("Total characters: " + str(characters_to_read))

#counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count =1

#open the file to be read
lines = []
test_file = open("dna.txt", "r")
for line in test_file:
        line = line.strip()
        if not line:
            continue

lines.append(',')

#read the file and take note of its size for index purposes
read_file = test_file.read()
file_size = read_file.__len__()
print((file_size))
i = 1
index = 0
#use loop to make more than one file output
while(index < 50):

#print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ',index)

#intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)

    index_start += characters_to_read
    index_finish +=characters_to_read

#output a txt file with the 4 letters from intake as an individually numbered txt file
    text_file_output = open("Output%i.csv"%i,'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
#define path to print to console for file saving
    path = os.path.abspath("Output%i")
    directory = os.path.dirname(path)
    print(path)

test_file.close()

Solution

  • Here's a simple way to split your DNA data into rows with a specified number of columns, where each column holds a chunk of a specified size. It assumes that the DNA data is in a single string with no whitespace characters (spaces, tabs, newlines, etc.).

    To test this code, I create some fake data using the random module.

    from random import seed, choice
    seed(42)
    
    # Make some random DNA data
    num = 66
    data = ''.join([choice('ACGT') for _ in range(num)])
    print(data, '\n')
    
    # Split the data into chunks, columns and rows
    chunksize, cols = 4, 3
    
    row = []
    for i in range(0, len(data), chunksize):
        chunk = data[i:i+chunksize]
        row.append(chunk)
        if len(row) == cols:
            print(' '.join(row))
            row = []
    if row:
        print(' '.join(row))
    

    output

    AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG
    
    AAGC CCAA TAAA
    CCAC TCTG ACTG
    GCCG AATA GGGA
    TATA GGCA ACGA
    CATG TGCG GCGA
    CCCT TG
    

    On my old 2GHz 32 bit machine, running Python 3.6.0, this code can process and save to disk around 100000 chars per second (that includes the time taken to generate the random data).
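
    The row-building loop above can also be wrapped in a small generator so that the chunk size and column count come from parameters, for example values read with input(). This is only a sketch: the name chunk_rows is hypothetical and not part of the code above, and it makes the same assumption that the data contains no whitespace.

```python
# Hypothetical helper (not from the original answer): yields one row at a time.
def chunk_rows(data, chunksize, cols):
    """Yield rows, each a list of up to `cols` chunks of `chunksize` characters."""
    row = []
    for i in range(0, len(data), chunksize):
        row.append(data[i:i + chunksize])
        if len(row) == cols:
            yield row
            row = []
    if row:
        # Final row, which may hold fewer than `cols` chunks
        yield row


# The example from the question: 4 characters per column, 3 columns
for row in chunk_rows('TCAGTGCTTACG', 4, 3):
    print(', '.join(row))
# prints: TCAG, TGCT, TACG
```

    Because it is a generator, the rows can be printed, joined, or written to a file without building the full list first.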


    Here's a version of the above code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.

    Firstly, here's the code I used to create some fake test data, which I saved to "dnatest.txt".

    from random import seed, choice, randrange
    seed(123)
    
    # Make some random DNA data containing spaces
    pool = 'ACGT' * 5 + ' '
    for _ in range(15):
        # Choose a random line length
        size = randrange(50, 70)
        data = ''.join([choice(pool) for _ in range(size)])
        print(data)
        # Randomly add a blank line
        if randrange(5) < 2:
            print()
    

    Here's the file it created:

    dnatest.txt

    AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG
    
     ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
    ATTGATGGCACAGC AAGATCCGCTA  CCGATTG CAACCA CATACGAT CGACCAGATGG
    ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
    TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
    TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
    TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
    GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG
    
    GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC
    
    AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC
    
    TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
    CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
    GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
     GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
    TTAGTAATGAGTTCGGGC  GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT
    

    Here's the code that processes that data:

    # Input & output file names
    iname = 'dnatest.txt'
    oname = 'dnatest.csv'
    
    # Read the data and eliminate all whitespace
    with open(iname) as f:
        data = ''.join(f.read().split())
    
    # Split the data into chunks, columns and rows
    chunksize, cols = 4, 3
    
    with open(oname, 'w') as f:
        row = []
        for i in range(0, len(data), chunksize):
            chunk = data[i:i+chunksize]
            row.append(chunk)
            if len(row) == cols:
                f.write(', '.join(row) + '\n')
                row = []
        if row:
            f.write(', '.join(row) + '\n')
    

    And here's the file it creates:

    dnatest.csv

    AGCA, TCAC, CGGC
    CAGC, GTCA, CGTA
    GAGG, TCGA, AACC
    GTAT, CCGA, TGTA
    GGAC, CTTA, CTAC
    CGTA, CGGC, AGGA
    GGAG, GGTA, TTAC
    ACCT, TCTC, ACGA
    GCAA, GGAA, TAAT
    TGAT, GGCA, CAGC
    AAGA, TCCG, CTAC
    CGAT, TGCA, ACCA
    CATA, CGAT, CGAC
    CAGA, TGGA, CAGA
    ACAG, ATCT, TGGG
    AATG, GAAC, AGGA
    GAGA, GTGT, GGGC
    CACA, TTAA, AGTG
    ATAA, TATT, TTCT
    GTCG, TGGG, GCAC
    CAAA, CCAT, GCTA
    ATGC, ACGA, CTGG
    GTGA, GGGT, TGAG
    AGCC, TACT, ATCC
    TCAG, TCGA, TCGA
    GATG, ACCC, TCCT
    ATCG, CAAC, AGCT
    GTCA, GTGT, CCAG
    AGAC, GTCG, CCAT
    AGGT, CTGG, AAAC
    GCAC, TCCC, CTCG
    GAAT, AGTC, TACA
    CGAG, TCCA, TTAT
    GTCG, ATCT, GACT
    ATGG, GGAC, CATA
    ACGG, CTAT, GCGA
    CCAT, GGAC, TGGT
    TCGA, GGAT, TCCC
    GTTC, TACA, TCAC
    CTTA, CCTC, TGAT
    AACG, ACTG, GTTC
    GAGG, GTCT, CCCA
    AACG, TCTA, TTAT
    GTCA, TAAC, GTAA
    CTCT, GCCG, TAGT
    TTGA, TCAA, ACGT
    ACAG, CCAC, CACT
    GAAG, CCGC, CTCG
    AACC, GCGT, CCGA
    CCCT, GGGG, AGCC
    TGGG, GCCC, AGCA
    CCTT, AGCA, CTGC
    GAAG, CTAC, ACCC
    CACG, AGTA, ATTT
    GTCT, ATCG, TCCG
    GCCT, CGTT, TCCT
    TGTG, AAAT, TATA
    TGGT, CAGT, CTTC
    AATC, AACA, CCTA
    CTAA, TAAG, TGCT
    AGCC, CGGG, GATC
    TTGT, CCTG, GTCC
    AGGT, CATA, ATCC
    GTGC, TCAA, ATTA
    CATG, GCTT, TTAG
    TAAT, GAGT, TCGG
    GCGC, GCCC, TCAA
    AGTT, GGTC, TAGA
    AGCG, CGCA, GTTT
    TCCT, TAGG, T
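
    As a variation on the file-processing code above, the same splitting could be done with the standard csv module, which takes care of the separators and line endings. This is a sketch under a few assumptions: the function name write_chunks_csv and its parameters are hypothetical, and csv.writer separates fields with a bare comma, so the output has "AGCA,TCAC,CGGC" rather than the ", " separator shown above.

```python
import csv


# Hypothetical helper (not from the original answer).
def write_chunks_csv(in_path, out_path, chunksize, cols):
    """Split the text in in_path into chunks and write them as CSV rows."""
    # Read everything and drop all whitespace (spaces, tabs, newlines)
    with open(in_path) as f:
        data = ''.join(f.read().split())

    # Cut the data into fixed-size chunks; the last one may be shorter
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]

    # The csv docs recommend newline='' for files passed to csv.writer
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for j in range(0, len(chunks), cols):
            writer.writerow(chunks[j:j + cols])

    return len(chunks)
```

    It could be called as write_chunks_csv('dnatest.txt', 'dnatest.csv', 4, 3), with the last two arguments taken from int(input(...)) prompts like those in the question. All rows go into a single output file, rather than one file per row.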