I recently downloaded a public dataset about movies on IMDb (https://datasets.imdbws.com/title.basics.tsv.gz).
However, when I inspect it after loading it into a dataframe, some rows are not parsed correctly: in those rows some tabs are not recognized as the delimiter, although nearly all other rows are fine.
(Screenshot: tabs are not recognized in some rows.)
Does anyone know what's going on? Why are most rows parsed correctly but not these? Am I doing something wrong, or does it look like a problem with the dataset?
As a programming newbie, at first I thought it had something to do with encoding, but according to https://developer.imdb.com/non-commercial-datasets/, UTF-8 is the one I should use. It doesn't look like a problem caused by quotes or other special characters, either. Now I'm stuck.
P.S. Another thing that confuses me in this picture is that some rows are still selected where the primary title and the original title look the same, despite adding the condition (primaryTitle!=originalTitle). Would it have something to do with dtype? It'd be appreciated if you could enlighten me on this, too!
You're not doing anything wrong; there are just issues with the source data. I was able to read the data successfully with just:
df = pd.read_csv('title.basics.tsv', sep='\t', encoding='utf-8')
I still see the rows you've identified, with the tabs included in the titles. This is due to improper quoting in the data. You can see this for movie id tt10233364, where the tab character is contained in the quotes:
tt10233364 tvEpisode "Rolling in the Deep Dish "Rolling in the Deep Dish 0 2019 \N \N Reality-TV
You will need to go back and clean these rows manually (or you can just drop them).
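
If you'd rather not clean by hand, one option that may be worth trying is to tell pandas not to treat double quotes as quoting characters at all, using `quoting=csv.QUOTE_NONE`. The sketch below is not run against the real file: `sample` is a made-up two-row stand-in modeled on the tt10233364 row above, and it assumes the stray quotes in the dataset are literal characters rather than real CSV quoting.

```python
import csv
import io

import pandas as pd

# Made-up stand-in for title.basics.tsv (assumption: same 9 columns as the real file).
# The second data row has an unbalanced double quote at the start of both title fields.
sample = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt10233364\ttvEpisode\t\"Rolling in the Deep Dish\t\"Rolling in the Deep Dish\t"
    "0\t2019\t\\N\t\\N\tReality-TV\n"
)

# Default parsing: the opening quote swallows the tab that follows the title,
# so two fields are merged into one and the rest of the row shifts left.
default_df = pd.read_csv(io.StringIO(sample), sep="\t")

# With quoting disabled, every tab is a delimiter and the quote stays in the text.
fixed_df = pd.read_csv(io.StringIO(sample), sep="\t", quoting=csv.QUOTE_NONE)

print(repr(default_df.loc[1, "primaryTitle"]))
print(repr(fixed_df.loc[1, "primaryTitle"]))
```

With the real file, the equivalent call would be the `read_csv` line from above with `quoting=csv.QUOTE_NONE` added. The leading quote then remains part of the title, so you may still want to strip it afterwards.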