python, data-preprocessing

Check for invalid observations


I need to find and remove invalid observations: those containing any non-specific amino acid letters (namely B, J, X or Z) in the epitope sequence (a column in the DataFrame).

The epitope sequence is a column in the data frame with values like the samples given below. I need to check whether each sequence contains any of the letters B, J, X or Z and, if so, drop the corresponding records.

Epitope seq:

ACIIERKNRGELEYT
CDLNENQTWVDNGC
CASQEFDYEFDDVNE
DDDSYTTKRKF

The code I have checks for each letter individually, which means writing four lines of code. Is there a better way of doing this, i.e. combining all four lines into one using an OR operator? If so, how?

Current code:

final_df.drop(final_df[final_df['epit_seq'].str.contains('B')].index, inplace=True)
final_df.drop(final_df[final_df['epit_seq'].str.contains('J')].index, inplace=True)
final_df.drop(final_df[final_df['epit_seq'].str.contains('X')].index, inplace=True)
final_df.drop(final_df[final_df['epit_seq'].str.contains('Z')].index, inplace=True)

Solution

  • Since str.contains treats its pattern as a regular expression by default, you can shorten this to one drop call by joining the letters with the regex OR operator (|), as follows.

    ignore = '|'.join(['B', 'J', 'X', 'Z']) # builds the regex 'B|J|X|Z', which matches any of the four letters
    final_df.drop(final_df[final_df['epit_seq'].str.contains(ignore)].index, inplace=True)
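An alternative is to skip drop entirely and filter with a boolean mask. The following is a minimal sketch, not the only way to do it: it uses the regex character class [BJXZ] (equivalent to B|J|X|Z) and passes na=False so that missing sequences count as "no match" instead of producing NaN in the mask. The last three rows of the sample data are made up for illustration, and keeping rows with a missing sequence is an assumption; drop them separately if they should count as invalid.

```python
import pandas as pd

# Sample data: the first four sequences come from the question;
# the last three rows are hypothetical, added to exercise the filter.
final_df = pd.DataFrame({
    "epit_seq": [
        "ACIIERKNRGELEYT",
        "CDLNENQTWVDNGC",
        "CASQEFDYEFDDVNE",
        "DDDSYTTKRKF",
        "ACDXEFG",   # contains X
        "BZKLMN",    # contains B and Z
        None,        # missing sequence
    ]
})

# True where the sequence contains any of B, J, X or Z.
# na=False makes missing values evaluate to False (no match).
has_ambiguous = final_df["epit_seq"].str.contains("[BJXZ]", na=False)

# Keep only the rows without ambiguous letters.
clean_df = final_df[~has_ambiguous].reset_index(drop=True)
print(clean_df)
```

Note that the match is case-sensitive by default, so lowercase b, j, x or z would not be caught unless case=False is passed to str.contains.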