I'm using str.contains to search for rows where a column contains a particular string as a substring:
df[df['col_name'].str.contains('find_this')]
This returns all the rows where 'find_this' is somewhere within the string. However, in the rare but important case where the string in df['col_name'] STARTS with 'find_this', this row is not returned by the above query.
str.contains() returns false where it should return true.
Any help would be greatly appreciated, thanks!
EDIT: I've added some example data as requested (image of dataframe). I want to update the 'Eqvnt_id' column so that, for example, the rows where 'Course_ID' contains 'AAS 102' all have the same 'Eqvnt_id' value.
To do this I need to be able to search the strings in 'Course_ID' for 'AAS 102' in order to locate the appropriate rows. However, when I do this:
df[df['Course_ID'].str.contains('AAS 102')]
The row that has 'AAS 102 (ENGL 102, JST 102, REL 102)' does not appear in the query!
The columns all have dtype object. I've tried converting them to string type with map and apply, but it has had no effect on the success of the query.
The data from the image can be found at https://github.com/isaachowen/stackoverflowquestionfiles
TLDR: Experiment with pandas.Series.str.normalize(), trying different Unicode forms until the issue is solved. 'NFKC' worked for me.
The problem had to do with the format of the data in the column that I was doing the...
df['column'].str.contains('substring')
...operation on. Calling pandas.Series.str.normalize() on the column fixed it (see the pandas documentation for that method). Under circumstances that I can't deliberately reproduce, some of the strings had '\xa0' (a non-breaking space) or '\n' at the beginning or the end. Following another post about dealing with those characters, I looped over every string column and tried each Unicode normalization form until one worked: 'NFKC'.
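For anyone hitting the same thing, here is a minimal sketch of the fix. The data is made up to reproduce the symptom and is not the real file: I'm assuming the stray '\xa0' sits where a normal space should be, plus a trailing '\n'. NFKC replaces the non-breaking space with an ordinary space, so the plain-space search string matches afterwards. The extra .str.strip() and regex=False are my additions, not something the original fix needed.

```python
import pandas as pd

# Hypothetical data: row 0 has a non-breaking space ('\xa0') where a
# regular space is expected, and a trailing newline.
df = pd.DataFrame({
    "Course_ID": [
        "AAS\xa0102 (ENGL 102, JST 102, REL 102)\n",
        "ENGL 102 (AAS 102)",
        "MATH 101",
    ],
    "Eqvnt_id": [1, 2, 3],
})

# Before normalizing, row 0 is missed because '\xa0' != ' '.
before = df["Course_ID"].str.contains("AAS 102", regex=False)
print(before.tolist())  # [False, True, False]

# Normalize every string column with NFKC, then trim surrounding whitespace.
for col in df.columns:
    if pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].str.normalize("NFKC").str.strip()

after = df["Course_ID"].str.contains("AAS 102", regex=False)
print(after.tolist())  # [True, True, False]
```

To check whether a column has this problem in the first place, printing repr() of a suspicious value (for example repr(df['Course_ID'].iloc[0])) shows hidden characters such as '\xa0' that look like ordinary spaces when displayed.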