I'm working with Prokka annotation files that give me the protein product of each gene as found in the UniProt database. Unfortunately, many genes are linked to multiple, very similar product names, e.g.
1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A
whereas the following variants are actually different products:
1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
1%2C2-phenylacetyl-CoA epoxidase%2C subunit C
1%2C2-phenylacetyl-CoA epoxidase%2C subunit E
To avoid trouble when mapping my genes to their respective products, I decided to replace all ambiguous or problematic characters such as "-", " " and "/" with "@" and to convert all strings to lower case.
But would there be a way to search e.g. for
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A
including closely related entries, using standard Unix tools such as grep? I could not find an answer so far.
If you want true fuzzy search, defined by string distance metrics, check out tre-agrep. For your application, I would use grep with case-insensitive matching and periods as single-character wildcards.
grep -i "1.2C2.phenylacetyl.CoA.epoxidase.2C subunit A" drugNames.txt
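The periods match any single character in those positions, and -i makes the match case-insensitive, which is what you want. Note that each period matches exactly one character, so a variant that drops a character entirely (rather than swapping it for another) will not match.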
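If you have many names to look up, you probably don't want to write the periods by hand. Here is a rough sketch of how the pattern could be built automatically from a raw product name. The sample file contents, the variable names, and the choice to turn every non-alphanumeric character into a period are my assumptions for illustration, so adapt them to your real data.

```shell
#!/bin/sh
# Hypothetical sample file standing in for the real list of product names.
products=$(mktemp)
cat > "$products" <<'EOF'
1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
EOF

# The name to search for, exactly as it appears in one annotation file.
query='1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A'

# Replace every non-alphanumeric character ("%", "-", " ", "/", ...) with ".",
# so each of them becomes a single-character wildcard in the grep pattern.
pattern=$(printf '%s\n' "$query" | sed 's/[^[:alnum:]]/./g')

# -i ignores case; "--" protects against patterns that start with "-".
matches=$(grep -i -- "$pattern" "$products")
printf '%s\n' "$matches"

rm -f "$products"
```

With this sample, the four "subunit A" spellings are printed and the "subunit B" line is not. The pattern is unanchored, so it would also match a longer name that merely contains the query. If your file has exactly one name per line, adding -x to grep restricts it to whole-line matches.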