
Reformat text with arbitrary newlines into rows of equal length


Sorry that this is probably a really basic question - it seems like it ought to be easy, but I just can't figure it out.

I'm trying to write a function that, given a stream of alphabetical characters with arbitrarily spaced newlines (in practice, a raw or FASTA-formatted nucleotide sequence), produces a stream with identical content but with a newline after exactly every fifty characters. That is, if I were to provide it either the following input (the 16S rRNA gene sequence of E. coli U 5/41 with FASTA-compatible row lengths):

AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAG
CAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGA
TAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGCACAAAGAGGGGGACCTTAGGGCCTCTT
GCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAG
CTGGTCTGAGAGGATGACCAGCAACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTG
GGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCNGCGTGTATGAAGAAGGCCTTCGGGTTGT
AAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAA
GCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGC
GTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTG
ATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCT
GGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGC
AAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCG
TGGCTTCCGGANNTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAA
TTGACGGGGGCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTC
TTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCT
GTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCA
GCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC
ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAG
AGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAA
TCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACA
CCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACTTCGGGAGGGCG

or the following input (unbroken sequence):

AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGCACAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCAACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGATAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACTTCGGGAGGGCG

it should output the following (the same sequence, with a newline every 50 characters/bases):

AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCA
AGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGA
CGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGG
AAACGGTAGCTAATACCGCATAACGTCGCAAGCACAAAGAGGGGGACCTT
AGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGG
GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCA
GCAACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTG
GGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCGCGTGTATG
AAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAA
AGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAA
CTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAA
TTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAAT
CCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC
GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTG
GAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAG
GTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGC
CGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGA
TAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTC
AAATGAATTGACGGGGGCCGCACAAGCGGTGGAGCATGTGGTTTAATTCG
ATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAG
AGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTC
GTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACC
CTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCA
GTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTA
CGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACC
TCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGT
CTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAA
TGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCA
TGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACTTCGGGAGGGCG

The code I've written to do this is:

void
output(char *seq, int len){
    for(int i = 0, j = 0; i < len; i++){
        if(seq[i] != '\n'){
            putc(seq[i], stdout);
            j++;
        }
        if(j % 50 == 0)
            putc('\n', stdout);
    }
    putc('\n', stdout);
}

This produces the correct, expected results if the input contains no newlines (the second example above). However, if the input contains newlines (the first example above), there are extraneous ones included in the output:

AGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCA
AGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGA
CGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGG
AAACGGTAGCTAATACCGCATAACGTCGCAAGCACAAAGAGGGGGACCTT
AGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGG
GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCA
GCAACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTG

GGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCGCGTGTATG
AAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAA
AGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAA
CTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAA
TTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAAT
CCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC
GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTG
GAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAG
GTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGC
CGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGA


TAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTC
AAATGAATTGACGGGGGCCGCACAAGCGGTGGAGCATGTGGTTTAATTCG
ATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAG
AGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTC
GTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACC
CTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCA
GTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTA
CGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACC
TCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGT
CTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAA
TGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCA
TGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACTTCGGGAGGGCG

I've tried playing with the numbers around the modulo, to no avail. I'm not sure why this would fail otherwise. How can I make this code behave as expected?


Solution

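  • Move the line-length check inside the block that prints a character, so that it only runs right after j has been incremented.

    In your version the check runs on every iteration, including those where seq[i] is a newline and j has not changed. If j is already a multiple of 50 at that point, a second newline is printed. With 70-character input rows this first happens after 350 characters (5 input rows, 7 output rows), which is where the first blank line appears in your output.

    You can also restrict the output to allowed nucleotides with a local look-up table.

    I noticed three Ns in the first input. According to Wikipedia's nucleotide abbreviation codes, N stands for any base (not a gap), but neither your expected output nor your erroneous output contains them, so the table below leaves N out.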

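    #include <stdio.h>

    #define LINE_BREAK_LEN  50

    void
    output (const char *seq, const int len) {
        // look-up table: non-zero marks characters that are copied to the output;
        // everything else (newlines, N, ...) is skipped
        char nlt [256] = {0};
        nlt ['A'] = nlt['G'] = nlt['C'] = nlt['T'] = nlt['U'] = 1;
        //nlt ['a'] = nlt['g'] = nlt['c'] = nlt['t'] = nlt['u'] = 1; // in case you want to allow lower-case too
        int j = 0; // number of characters printed so far

        for (int i = 0; i < len; i++) {
            if (nlt [(unsigned char) seq[i]]) {
                putc (seq[i], stdout);
                j++;
                // checked only right after j changes, so a skipped character
                // can never trigger a second newline for the same count
                if (0 == (j % LINE_BREAK_LEN))
                    putc ('\n', stdout);
            }
        }
        // end a partial last row; a full last row already got its newline
        if (0 != (j % LINE_BREAK_LEN))
            putc ('\n', stdout);
    }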
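
If the Ns should survive, one alternative is to skip only whitespace instead of whitelisting bases. The following is a minimal sketch under that assumption, not a replacement for the function above. `rewrap` is a hypothetical name, and it writes into a caller-supplied buffer rather than to stdout so that the result is easy to inspect. It ends a partial last row with a single newline and adds no blank line when the number of kept characters is an exact multiple of the width.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of the original answer).
 * Copies seq into dst, dropping every whitespace character and inserting
 * '\n' after each run of `width` kept characters.
 * dst must have room for at least len + len / width + 2 bytes.
 * Returns the number of bytes written, not counting the terminating NUL. */
size_t
rewrap (const char *seq, size_t len, size_t width, char *dst) {
    size_t kept = 0; /* characters copied so far */
    size_t n = 0;    /* bytes written to dst so far */

    assert (width > 0);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char) seq[i];
        if (isspace (c))
            continue;            /* skipped characters never reach the check */
        dst[n++] = (char) c;
        kept++;
        if (kept % width == 0)   /* a row has just been completed */
            dst[n++] = '\n';
    }
    if (kept % width != 0)       /* terminate a partial last row */
        dst[n++] = '\n';
    dst[n] = '\0';
    return n;
}
```

For example, with a 64-byte `buf`, `rewrap ("ACG\nTN\n\nACGT", 12, 5, buf)` leaves `"ACGTN\nACGT\n"` in `buf` and returns 11. The result can then be printed with `fputs (buf, stdout)`.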