python, nlp, nltk, tokenize

How to tokenize a block of text as one token in Python?


I have recently been working on a genome dataset that consists of many blocks of genome sequences. In previous natural language processing work, I used sent_tokenize and word_tokenize from nltk to tokenize sentences and words. But when I use these functions on the genome dataset, they are not able to tokenize the genomes correctly. The text below shows part of the genome dataset.

>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1
aatgttttatataaattgcagtatgtgtcacccaaaatagcaaaccccat
aaccaaccagattattatgatacataatgcttatatgaaactaagacatt
tcgcaacatttattttaggtatataaatacatttattgaaggaattgata
tatgccagtaaaatggtgtatttttaatttctttcaataaaaacataatt
gacattatataaaaatgaattataaaactctaagcggtggatcactcggc
tcatgggtcgatgaagaacgcagcaaactgtgcgtcatcgtgtgaactgc
aggacacatgaacatcgacattttgaacgcatatcgcagtccatgctgtt
atgtactttaattaattttatagtgctgcttggactacatatggttgagg
gttgtaagactatgctaattaagttgcttataaatttttataagcatatg
gtatattattggataaatataataatttttattcataatattaaaaaata
aatgaaaaacattatctcacatttgaatgt
>NR_004047 1
atattcaggttcatcgggcttaacctctaagcagtttcacgtactgttta
actctctattcagagttcttttcaactttccctcacggtacttgtttact
atcggtctcatggttatatttagtgtttagatggagtttaccacccactt
agtgctgcactatcaagcaacactgactctttggaaacatcatctagtaa
tcattaacgttatacgggcctggcaccctctatgggtaaatggcctcatt
taagaaggacttaaatcgctaatttctcatactagaatattgacgctcca
tacactgcatctcacatttgccatatagacaaagtgacttagtgctgaac
tgtcttctttacggtcgccgctactaagaaaatccttggtagttactttt
cctcccctaattaatatgcttaaattcagggggtagtcccatatgagttg
>NR_004052 1

When the nltk tokenizer is applied to this dataset, each line of text (for example tattattatacacaatcccggggcgttctatatagttatgtataatgtat) becomes one token, which is not correct; a whole block of sequences should be considered as one token. For example, in this case the contents between >NR_004049 1 and >NR_004048 1 should be considered as one token:

>NR_004049 1
tattattatacacaatcccggggcgttctatatagttatgtataatgtat
atttatattatttatgcctctaactggaacgtaccttgagcatatatgct
gtgacccgaaagatggtgaactatacttgatcaggttgaagtcaggggaa
accctgatggaagaccgaaacagttctgacgtgcaaatcgattgtcagaa
ttgagtataggggcgaaagaccaatcgaaccatctagtagctggttcctt
ccgaagtttccctcaggatagctggtgcattttaatattatataaaataa
tcttatctggtaaagcgaatgattagaggccttagggtcgaaacgatctt
aacctattctcaaactttaaatgggtaagaaccttaactttcttgatatg
aagttcaaggttatgatataatgtgcccagtgggccacttttggtaagca
gaactggcgctgtgggatgaaccaaacgtaatgttacggtgcccaaataa
caact
>NR_004048 1 

So each block, starting with a header such as >NR_004049 1 and running until the next such header, should be considered as one token. The problem is how to tokenize this kind of dataset, and I have no idea how to do it correctly. I would really appreciate answers that help me solve this.

Update: One way to solve this problem is to append all lines within each block and then use the nltk tokenizer. This means appending all lines between >NR_004049 1 and >NR_004048 1 to make one string out of several lines, so that the nltk tokenizer will treat it as one token. Can anyone help me with how to append the lines within each block?


Solution

  • You just need to concatenate the lines between two ids, apparently. There should be no need for nltk or any tokenizer, just a bit of programming ;)

    
    patterns = {}
    with open('data', "r") as f:
        seq_id = None      # id of the block currently being read
        current = ""       # sequence accumulated so far for that block
        for raw_line in f:
            line = raw_line.rstrip()
            if not line:               # skip blank lines
                continue
            if line[0] == '>':         # header line: a new block starts here
                if current:            # store the block we just finished
                    patterns[seq_id] = current
                    current = ""
                # the id is the first whitespace-separated token, without '>'
                seq_id = line.split(" ")[0][1:]
            else:                      # sequence line: append it to the current block
                current += line
        if current:                    # don't forget the last block
            patterns[seq_id] = current
    
    
    # do whatever with the patterns:
    for seq_id, pattern in patterns.items():
        print(f"{seq_id}\t{pattern}")