Search code examples
pythonbioinformaticsgenomevcf-variant-call-format

Extracting lines of text depending on the len() of a particular column


I'm trying to write a simple script to extract particular data from a VCF file, which displays variants in genome sequences.

The script needs to extract the header from the file, as well as SNVs while omitting any indels. Variants are displayed in 2 columns, the ALT and REF. Each column is separated by white space. Indels will have 2 characters in the ALT or REF, SNVs will always have 1.

What I have so far extracts the headers (which always begin with ##), but not any of the variant data.

original_file = open('/home/user/Documents/NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')

for line in original_file:
   if '##' in line:
       extracted_file.write(line)

# Extract SNVs while omitting indels 
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively

for line in original_file:
    ref = str.split()[3]
    alt = str.split()[4]
    if len(ref) == 1 and len(alt) == 1:
        extracted_file.write(line)

original_file.close()
extracted_file.close()

Solution

  • original_file = open('NA12878.vcf', 'r')
    extracted_file = open('NA12878_SNV.txt', 'w+')
    i=0
    
    for line in original_file:
        if '##' in line:
            extracted_file.write(line)
        else:
            ref = line.split('  ')[3]
            alt = line.split('  ')[4]
            if len(ref) == 1 and len(alt) == 1:
                extracted_file.write(line)
    
    # Extract SNVs while omitting indels 
    # Indels will have multiple entries in the REF or ALT column
    # The ALT and REF columns appear at position 4 & 5 respectively
    
    original_file.close()
    extracted_file.close()
    

    There Was two issues:

    1. The second loop never get executed because you already get to the end of the VCF file in the first loop. You can see here how to start over new loop on the same text file.
    2. You didn't separate the line correctly it is tab delimited.

    So i set the code to execute with only one loop and placed tab as the split parameter.