
Key Error in Bioinformatics Program Using Pandas


I'll try to keep this as short as possible. I'm trying to create a bioinformatics program for our patient 'reporting' team. To preface this, the examples I give are illustrative only and contain no actual patient information.

The script I'm writing takes the results of a patient's genetic test and pulls their nucleotide results for the specific SNPs we test for (organized by rsID from NCBI). This patient information is merged with a reference library I've made and compared against it. The goal is to: 1) merge these files; 2) compare the patient's nucleotide results to the nucleotides from the reference library; 3) create a "Flag" if the patient's nucleotide is rare, i.e. has a low population frequency.
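
To make the three steps concrete, here is a minimal sketch of the intended merge-and-flag logic. The column names come from the excerpts in this post; the sample rows, the `RARE_THRESHOLD` cutoff, the `CallFreq` column, and the assumption that 'Call' holds a single nucleotide are all hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sample data: column names follow the post, values are made up.
patient = pd.DataFrame({
    "NCBI SNP Reference": ["rs0001", "rs0002"],
    "Call": ["A", "T"],  # assumed to be a single nucleotide per SNP
})
population = pd.DataFrame({
    "rsID": ["rs0001", "rs0002"],
    "DomNucl": ["A", "C"],
    "RecNucl": ["G", "T"],
    "DomAllele": [0.97, 0.98],
    "RecessiveAllele": [0.03, 0.02],
})

RARE_THRESHOLD = 0.05  # assumed cutoff for "rare"; not from the post

# Step 1: merge patient calls with the reference library on the SNP identifier.
merged = patient.merge(
    population, left_on="NCBI SNP Reference", right_on="rsID", how="left"
)

# Step 2: look up the population frequency of the nucleotide the patient has.
def call_frequency(row):
    if row["Call"] == row["DomNucl"]:
        return row["DomAllele"]
    if row["Call"] == row["RecNucl"]:
        return row["RecessiveAllele"]
    return float("nan")  # call matches neither reference nucleotide

merged["CallFreq"] = merged.apply(call_frequency, axis=1)

# Step 3: flag calls whose frequency is below the cutoff (NaN is never flagged).
merged["Flag"] = merged["CallFreq"] < RARE_THRESHOLD

print(merged[["NCBI SNP Reference", "Call", "CallFreq", "Flag"]])
```

A left merge keeps every patient row even when the rsID is absent from the reference library; those rows end up with a missing frequency and are not flagged.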

The issue I'm having is that when I run the script, after uploading the patient file and the population data, I get a KeyError because it can't find the rsID column in the patient .csv.

I'll add 2 photos of what each .csv file looks like

[Image: population data]

[Image: patient data]

Here is a short excerpt of the code

import pandas as pd

# onClick() and ask_path() are helpers from the full script (not shown in this excerpt)
onClick('Upload Patient Files First')
patient_data = pd.read_csv(ask_path())

# Not using:
# patient_genotype = patient_data.loc[patient_data['rsID'] == rsID]['NCBI SNP Reference']

onClick('Upload Population Frequency Data Next')
pop_ref_data = pd.read_csv(ask_path())


# Creating a dictionary of the population reference data
def pop_dict(pop_ref_data):
    pop_ref_dict = {}
    for _, row in pop_ref_data.iterrows():
        variant_data = {}
        rsID = row['rsID']
        dominant_nucleotide = row['DomNucl']
        recessive_nucleotide = row['RecNucl']
        dominant_freq = row['DomAllele']
        recessive_freq = row['RecessiveAllele']

        # Map each nucleotide to its population frequency
        variant_data[dominant_nucleotide] = dominant_freq
        variant_data[recessive_nucleotide] = recessive_freq

        pop_ref_dict[rsID] = variant_data
    return pop_ref_dict

The population data is pretty straightforward. I'm getting stuck on the first check, though: the "rsID" column is where I'm getting the KeyError.

The patient data is further down in its respective CSV. I'm trying to get the script to find the information under the columns 'NCBI SNP Reference' and 'Call'.

Quick edit: here is my traceback. Also, to answer another question: yes, I'm trying to bypass all of the header info in the CSV so that I can use just the bulk information I actually need once the genotyping run is finished.
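
One way to bypass the header info is to locate the line that holds the real column names and tell `pd.read_csv` to skip everything before it. The sketch below assumes the export is comma-separated with a few metadata lines above the table; the sample text in `raw` and the helper `find_header_row` are hypothetical, and only the column names 'NCBI SNP Reference' and 'Call' come from this post.

```python
import io
import pandas as pd

# Hypothetical export: a few metadata lines, a blank line, then the real table.
raw = (
    "Run Name,Example Run\n"
    "Operator,Example\n"
    "\n"
    "NCBI SNP Reference,Call\n"
    "rs0001,A\n"
    "rs0002,T\n"
)

def find_header_row(lines, required):
    """Return the index of the first line that contains every required column name."""
    for i, line in enumerate(lines):
        cells = [cell.strip() for cell in line.split(",")]
        if all(name in cells for name in required):
            return i
    raise ValueError(f"No header row containing {required}")

required = ["NCBI SNP Reference", "Call"]
header_row = find_header_row(raw.splitlines(), required)

# skiprows drops the metadata lines so the located line becomes the header.
patient_data = pd.read_csv(io.StringIO(raw), skiprows=header_row)
print(patient_data)
```

With a real file, the same idea applies by reading the file's lines first and then passing the path plus `skiprows=header_row` to `pd.read_csv`.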

Traceback (most recent call last):
  File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\indexes\base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'rsID'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\rcthu\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\Flag Process 2.12.py", line 61, in <module>
    pop_ref_row = pop_dict(pop_ref_data)
  File "C:\Users\rcthu\AppData\Roaming\JetBrains\PyCharmCE2022.2\scratches\Flag Process 2.12.py", line 41, in pop_dict
    rsID = row['rsID']
  File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\series.py", line 981, in __getitem__
    return self._get_value(key)
  File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\series.py", line 1089, in _get_value
    loc = self.index.get_loc(label)
  File "C:\Users\rcthu\PycharmProjects\WorkStuff\venv\lib\site-packages\pandas\core\indexes\base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'rsID'

Process finished with exit code 1


Solution

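  • The first thing to notice is that 'rsID' is the first key you look up, so the loop fails on the very first row if that label is missing. The traceback also shows the error is raised at row['rsID'] inside pop_dict, which is called with pop_ref_data, so the population file is the one to inspect first. Looking at your data, rsID may not be the column label you expect, since it appears to sit over the index.

    You should be able to set a breakpoint just before the failing line and run your code in debug mode. Once you're at the breakpoint, you can see what 'row' really is and which keys it has.

    You could also add print(row) followed by return inside the loop, so that only the first row is printed.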

    Hope this helps.
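
If you would rather check this without a debugger, a small sketch along these lines prints the labels pandas actually parsed and verifies the ones pop_dict needs. The sample text in `raw` is hypothetical and deliberately has stray spaces around rsID; whether your real file has the same problem is an assumption you would need to confirm.

```python
import io
import pandas as pd

# Hypothetical population file whose first header cell has stray whitespace.
raw = " rsID ,DomNucl,RecNucl,DomAllele,RecessiveAllele\nrs0001,A,G,0.97,0.03\n"
pop_ref_data = pd.read_csv(io.StringIO(raw))

# repr() makes stray spaces or invisible characters in a label visible.
print([repr(col) for col in pop_ref_data.columns])

# This is the lookup that fails in the question: the label is ' rsID ', not 'rsID'.
found_before = "rsID" in pop_ref_data.columns

# Normalize the labels, then verify everything pop_dict() needs is present.
pop_ref_data.columns = pop_ref_data.columns.str.strip()
required = ["rsID", "DomNucl", "RecNucl", "DomAllele", "RecessiveAllele"]
missing = [col for col in required if col not in pop_ref_data.columns]
if missing:
    raise KeyError(
        f"Missing columns: {missing}; found: {pop_ref_data.columns.tolist()}"
    )
```

If the printed label starts with '\ufeff', the file probably begins with a byte-order mark, and reading it with `encoding="utf-8-sig"` may be worth trying.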