I am reading in a .txt file that looks like this: TTACGATATACGA etc., but contains thousands of characters. I can read the file and output it as a CSV, with user input deciding the number of characters per column and the number of columns; however, it writes a new file each time.
Ideally I would like each file to have a format like this:
User enters 4 and 3.
Output: TCAG, TGCT, TACG,
My current output is this:
TCAGTGCTTACG
I have tried looking at string splitting but I don't seem to be able to get it to work.
Here is what I've written thus far, apologies if it's poor:
import os

#user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int((character_per_column * columns_per_entry))
print("Total characters: " + str(characters_to_read))
#counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count =1
#open the file to be read
lines = []
test_file = open("dna.txt", "r")
for line in test_file:
    line = line.strip()
    if not line:
        continue
    lines.append(',')
#read the file and take note of its size for index purposes
read_file = test_file.read()
file_size = read_file.__len__()
print((file_size))
i = 1
index = 0
#use loop to make more than one file output
while(index < 50):
    #print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ',index)
    #intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)
    index_start += characters_to_read
    index_finish +=characters_to_read
    #output a txt file with the 4 letters from intake as a individually numbered txt file
    text_file_output = open("Output%i.csv"%i,'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
    #define path to print to console for file saving
    path = os.path.abspath("Output%i")
    directory = os.path.dirname(path)
    print(path)
test_file.close()
Here's a simple way to split your DNA data into rows consisting of columns and chunks of specified sizes. It assumes that the DNA data is in a single string with no white space characters (spaces, tabs, newlines, etc).
To test this code, I create some fake data using the random module.
from random import seed, choice

seed(42)

# Make some random DNA data
num = 66
data = ''.join([choice('ACGT') for _ in range(num)])
print(data, '\n')

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
row = []
for i in range(0, len(data), chunksize):
    chunk = data[i:i+chunksize]
    row.append(chunk)
    if len(row) == cols:
        print(' '.join(row))
        row = []
if row:
    print(' '.join(row))
output
AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG
AAGC CCAA TAAA
CCAC TCTG ACTG
GCCG AATA GGGA
TATA GGCA ACGA
CATG TGCG GCGA
CCCT TG
On my old 2GHz 32-bit machine, running Python 3.6.0, this code can process and save to disk around 100,000 characters per second (that includes the time taken to generate the random data).
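If you'd rather have the rows as data than print them straight away, the same chunk-and-group idea can be wrapped in a small function. This is only a sketch: the name `split_rows` is my own invention, and it assumes the whole sequence fits in memory as a single string with no whitespace.

```python
def split_rows(data, chunksize, cols):
    """Split data into rows of up to `cols` chunks of `chunksize` characters.

    `split_rows` is a hypothetical helper name, not part of any library.
    The last chunk and the last row may be shorter than requested.
    """
    # First cut the string into fixed-size chunks...
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    # ...then group the chunks into rows of `cols` chunks each.
    return [chunks[i:i + cols] for i in range(0, len(chunks), cols)]


# The example from the question: 4 characters per column, 3 columns.
for row in split_rows("TCAGTGCTTACG", 4, 3):
    print(", ".join(row))
```

With the question's example input this prints `TCAG, TGCT, TACG`. Returning a list of rows makes it easy to decide later whether to print them, write them all to one file, or write each row to its own file.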
Here's a version of the above code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.
Firstly, here's the code I used to create some fake test data, which I saved to "dnatest.txt".
from random import seed, choice, randrange

seed(123)

# Make some random DNA data containing spaces
pool = 'ACGT' * 5 + ' '
for _ in range(15):
    # Choose a random line length
    size = randrange(50, 70)
    data = ''.join([choice(pool) for _ in range(size)])
    print(data)
    # Randomly add a blank line
    if randrange(5) < 2:
        print()
Here's the file it created:
AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG
ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
ATTGATGGCACAGC AAGATCCGCTA CCGATTG CAACCA CATACGAT CGACCAGATGG
ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG
GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC
AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC
TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
TTAGTAATGAGTTCGGGC GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT
Here's the code that processes that data:
# Input & output file names
iname = 'dnatest.txt'
oname = 'dnatest.csv'

# Read the data and eliminate all whitespace
with open(iname) as f:
    data = ''.join(f.read().split())

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
with open(oname, 'w') as f:
    row = []
    for i in range(0, len(data), chunksize):
        chunk = data[i:i+chunksize]
        row.append(chunk)
        if len(row) == cols:
            f.write(', '.join(row) + '\n')
            row = []
    if row:
        f.write(', '.join(row) + '\n')
And here's the file it creates:
AGCA, TCAC, CGGC
CAGC, GTCA, CGTA
GAGG, TCGA, AACC
GTAT, CCGA, TGTA
GGAC, CTTA, CTAC
CGTA, CGGC, AGGA
GGAG, GGTA, TTAC
ACCT, TCTC, ACGA
GCAA, GGAA, TAAT
TGAT, GGCA, CAGC
AAGA, TCCG, CTAC
CGAT, TGCA, ACCA
CATA, CGAT, CGAC
CAGA, TGGA, CAGA
ACAG, ATCT, TGGG
AATG, GAAC, AGGA
GAGA, GTGT, GGGC
CACA, TTAA, AGTG
ATAA, TATT, TTCT
GTCG, TGGG, GCAC
CAAA, CCAT, GCTA
ATGC, ACGA, CTGG
GTGA, GGGT, TGAG
AGCC, TACT, ATCC
TCAG, TCGA, TCGA
GATG, ACCC, TCCT
ATCG, CAAC, AGCT
GTCA, GTGT, CCAG
AGAC, GTCG, CCAT
AGGT, CTGG, AAAC
GCAC, TCCC, CTCG
GAAT, AGTC, TACA
CGAG, TCCA, TTAT
GTCG, ATCT, GACT
ATGG, GGAC, CATA
ACGG, CTAT, GCGA
CCAT, GGAC, TGGT
TCGA, GGAT, TCCC
GTTC, TACA, TCAC
CTTA, CCTC, TGAT
AACG, ACTG, GTTC
GAGG, GTCT, CCCA
AACG, TCTA, TTAT
GTCA, TAAC, GTAA
CTCT, GCCG, TAGT
TTGA, TCAA, ACGT
ACAG, CCAC, CACT
GAAG, CCGC, CTCG
AACC, GCGT, CCGA
CCCT, GGGG, AGCC
TGGG, GCCC, AGCA
CCTT, AGCA, CTGC
GAAG, CTAC, ACCC
CACG, AGTA, ATTT
GTCT, ATCG, TCCG
GCCT, CGTT, TCCT
TGTG, AAAT, TATA
TGGT, CAGT, CTTC
AATC, AACA, CCTA
CTAA, TAAG, TGCT
AGCC, CGGG, GATC
TTGT, CCTG, GTCC
AGGT, CATA, ATCC
GTGC, TCAA, ATTA
CATG, GCTT, TTAG
TAAT, GAGT, TCGG
GCGC, GCCC, TCAA
AGTT, GGTC, TAGA
AGCG, CGCA, GTTT
TCCT, TAGG, T
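The code above builds each CSV line by hand with `', '.join(row)`, which puts a space after every comma. If the file is going to be loaded by a spreadsheet or another program, it may be safer to let the standard `csv` module write the rows. The following is a rough sketch of that variation, not a drop-in replacement: `write_rows` is a made-up name, and I write to an in-memory `io.StringIO` buffer here so the example runs without any files.

```python
import csv
import io


def write_rows(data, chunksize, cols, out):
    """Write `data` to the file-like object `out` as CSV rows.

    `write_rows` is a hypothetical helper. Each row holds up to `cols`
    fields of `chunksize` characters; the last row may be shorter.
    """
    writer = csv.writer(out)
    row = []
    for i in range(0, len(data), chunksize):
        row.append(data[i:i + chunksize])
        if len(row) == cols:
            writer.writerow(row)
            row = []
    if row:
        # Write any leftover chunks as a final, shorter row
        writer.writerow(row)


# Demonstrate with an in-memory buffer instead of a real file
buf = io.StringIO()
write_rows("TCAGTGCTTACGAT", 4, 3, buf)
print(buf.getvalue())
```

To write to a real file, pass something like `open(oname, 'w', newline='')` instead of the buffer; the `newline=''` argument is what the `csv` documentation recommends so that the writer's own line endings are not translated a second time. Note that `csv.writer` separates fields with a bare comma, so the rows come out as `TCAG,TGCT,TACG` rather than `TCAG, TGCT, TACG`.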