Python: How to split txt file data into csv according to user input

I am reading in a .txt file that looks like this: TTACGATATACGA etc., but contains thousands of characters. I can read the file and output it as a CSV according to user input that sets the number of characters per column and the number of columns. However, it writes a new file on each pass through the loop.

Ideally I would like each file to have a format like this:

User enters 4 and 3.

Output: TCAG, TGCT, TACG,

My current output is this:

TCAGTGCTTACG

I have tried looking at string splitting but I don't seem to be able to get it to work.

Here is what I've written thus far, apologies if it's poor:

import os

#user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int((character_per_column * columns_per_entry))
print("Total characters: " + str(characters_to_read))

#counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count =1

#open the file to be read
lines = []
test_file = open("dna.txt", "r")
for line in test_file:
        line = line.strip()
        if not line:
            continue

lines.append(',')

#read the file and take note of its size for index purposes
read_file = test_file.read()
file_size = read_file.__len__()
print((file_size))
i = 1
index = 0
#use loop to make more than one file output
while(index < 50):

#print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ',index)

#intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)

    index_start += characters_to_read
    index_finish +=characters_to_read

#output the letters from intake as an individually numbered csv file
    text_file_output = open("Output%i.csv"%i,'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
#define path to print to console for file saving
    path = os.path.abspath("Output%i")
    directory = os.path.dirname(path)
    print(path)

test_file.close()

Solution

Here's a simple way to split your DNA data into rows, where each row holds a fixed number of columns and each column holds a chunk of a fixed size. It assumes that the DNA data is in a single string with no whitespace characters (spaces, tabs, newlines, etc).

To test this code, I create some fake data using the random module.

from random import seed, choice
seed(42)

# Make some random DNA data
num = 66
data = ''.join([choice('ACGT') for _ in range(num)])
print(data, '\n')

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3

row = []
for i in range(0, len(data), chunksize):
    chunk = data[i:i+chunksize]
    row.append(chunk)
    if len(row) == cols:
        print(' '.join(row))
        row = []
if row:
    print(' '.join(row))

output

AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG

AAGC CCAA TAAA
CCAC TCTG ACTG
GCCG AATA GGGA
TATA GGCA ACGA
CATG TGCG GCGA
CCCT TG
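The same splitting logic can be wrapped in a function so the chunk size and column count come from the user, as the question asks. This is only a minimal sketch: the name split_into_rows is made up for illustration, and the input string is the example from the question.

```python
def split_into_rows(data, chunksize, cols):
    """Split data into chunks of chunksize characters, grouped cols per row.

    split_into_rows is a hypothetical helper name, not part of the code above.
    """
    # First cut the string into fixed-size chunks (the last one may be short)
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    # Then group the chunks into rows of at most cols chunks each
    return [chunks[i:i + cols] for i in range(0, len(chunks), cols)]

# The example from the question: 4 characters per column, 3 columns
rows = split_into_rows("TCAGTGCTTACG", 4, 3)
for row in rows:
    print(", ".join(row))
# prints: TCAG, TGCT, TACG
```

In a real script the two numbers could come from int(input(...)) calls, as in the question, instead of being hard-coded.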

On my old 2GHz 32-bit machine, running Python 3.6.0, this code can process and save to disk around 100,000 characters per second (that includes the time taken to generate the random data).
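If you want to check the speed on your own machine, something like the following rough sketch could be used. It is not the measurement quoted above: it times only the splitting loop, collects the rows in a list instead of saving them to disk, and the data size of 100000 is an arbitrary choice.

```python
from random import seed, choice
from time import perf_counter

seed(42)

# Make some random DNA data (size chosen arbitrarily for this sketch)
num = 100000
data = ''.join([choice('ACGT') for _ in range(num)])

chunksize, cols = 4, 3

# Time only the splitting loop, not the data generation
start = perf_counter()
lines = []
row = []
for i in range(0, len(data), chunksize):
    row.append(data[i:i+chunksize])
    if len(row) == cols:
        lines.append(' '.join(row))
        row = []
if row:
    lines.append(' '.join(row))
elapsed = perf_counter() - start

print('Split', num, 'chars into', len(lines), 'rows in', round(elapsed, 4), 'seconds')
```

The timing will vary from run to run and from machine to machine, so treat any single number as a rough guide only.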

Here's a version of the above code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.

Firstly, here's the code I used to create some fake test data, which I saved to "dnatest.txt".

from random import seed, choice, randrange
seed(123)

# Make some random DNA data containing spaces
pool = 'ACGT' * 5 + ' '
for _ in range(15):
    # Choose a random line length
    size = randrange(50, 70)
    data = ''.join([choice(pool) for _ in range(size)])
    print(data)
    # Randomly add a blank line
    if randrange(5) < 2:
        print()

Here's the file it created:

dnatest.txt

AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG

 ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
ATTGATGGCACAGC AAGATCCGCTA  CCGATTG CAACCA CATACGAT CGACCAGATGG
ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG

GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC

AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC

TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
 GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
TTAGTAATGAGTTCGGGC  GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT

Here's the code that processes that data:

# Input & output file names
iname = 'dnatest.txt'
oname = 'dnatest.csv'

# Read the data and eliminate all whitespace
with open(iname) as f:
    data = ''.join(f.read().split())

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3

with open(oname, 'w') as f:
    row = []
    for i in range(0, len(data), chunksize):
        chunk = data[i:i+chunksize]
        row.append(chunk)
        if len(row) == cols:
            f.write(', '.join(row) + '\n')
            row = []
    if row:
        f.write(', '.join(row) + '\n')

And here's the file it creates:

dnatest.csv

AGCA, TCAC, CGGC
CAGC, GTCA, CGTA
GAGG, TCGA, AACC
GTAT, CCGA, TGTA
GGAC, CTTA, CTAC
CGTA, CGGC, AGGA
GGAG, GGTA, TTAC
ACCT, TCTC, ACGA
GCAA, GGAA, TAAT
TGAT, GGCA, CAGC
AAGA, TCCG, CTAC
CGAT, TGCA, ACCA
CATA, CGAT, CGAC
CAGA, TGGA, CAGA
ACAG, ATCT, TGGG
AATG, GAAC, AGGA
GAGA, GTGT, GGGC
CACA, TTAA, AGTG
ATAA, TATT, TTCT
GTCG, TGGG, GCAC
CAAA, CCAT, GCTA
ATGC, ACGA, CTGG
GTGA, GGGT, TGAG
AGCC, TACT, ATCC
TCAG, TCGA, TCGA
GATG, ACCC, TCCT
ATCG, CAAC, AGCT
GTCA, GTGT, CCAG
AGAC, GTCG, CCAT
AGGT, CTGG, AAAC
GCAC, TCCC, CTCG
GAAT, AGTC, TACA
CGAG, TCCA, TTAT
GTCG, ATCT, GACT
ATGG, GGAC, CATA
ACGG, CTAT, GCGA
CCAT, GGAC, TGGT
TCGA, GGAT, TCCC
GTTC, TACA, TCAC
CTTA, CCTC, TGAT
AACG, ACTG, GTTC
GAGG, GTCT, CCCA
AACG, TCTA, TTAT
GTCA, TAAC, GTAA
CTCT, GCCG, TAGT
TTGA, TCAA, ACGT
ACAG, CCAC, CACT
GAAG, CCGC, CTCG
AACC, GCGT, CCGA
CCCT, GGGG, AGCC
TGGG, GCCC, AGCA
CCTT, AGCA, CTGC
GAAG, CTAC, ACCC
CACG, AGTA, ATTT
GTCT, ATCG, TCCG
GCCT, CGTT, TCCT
TGTG, AAAT, TATA
TGGT, CAGT, CTTC
AATC, AACA, CCTA
CTAA, TAAG, TGCT
AGCC, CGGG, GATC
TTGT, CCTG, GTCC
AGGT, CATA, ATCC
GTGC, TCAA, ATTA
CATG, GCTT, TTAG
TAAT, GAGT, TCGG
GCGC, GCCC, TCAA
AGTT, GGTC, TAGA
AGCG, CGCA, GTTT
TCCT, TAGG, T
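As an alternative to joining the chunks by hand, the standard csv module can write the rows. This is a sketch under a few assumptions: write_chunks_csv is a made-up name, the demo writes to an in-memory buffer rather than to dnatest.csv, and the output differs slightly from the file above because csv.writer puts no space after each comma.

```python
import csv
import io

def write_chunks_csv(data, chunksize, cols, outfile):
    """Write data to outfile as CSV rows of cols chunks, chunksize chars each.

    write_chunks_csv is a hypothetical helper; outfile is any writable
    text file object.
    """
    writer = csv.writer(outfile)
    # Cut the string into fixed-size chunks (the last one may be short)
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    # Write cols chunks per row
    for i in range(0, len(chunks), cols):
        writer.writerow(chunks[i:i + cols])

# Demo with a short string and an in-memory buffer
buffer = io.StringIO()
write_chunks_csv("AGCATCACCGGCCAGCGTCACGTAG", 4, 3, buffer)
print(buffer.getvalue())
```

To write a real file, pass a file opened with open(oname, 'w', newline='') instead of the buffer; the csv documentation recommends newline='' so that line endings are not translated twice.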