UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 125: invalid start byte in R with Reticulate


Good morning, everyone. I was writing a small script to manage data in R, but for reasons I don't understand, importing a huge CSV file (3.5 GB) into R doesn't work. As a quick workaround, I decided to read the file with pandas through reticulate.

library(reticulate)
# Import pandas from Python
pd <- import("pandas", as = "pd")
# Read the CSV file with pandas
pd$read_csv("C:\\Users\\Befrancesco\\Desktop\\X_dataset\\x_file_name.csv", error_bad_lines = FALSE, encoding = "utf-8")

R returns this error:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 105: invalid start byte 

Where did I go wrong?

Thank you in advance for your answer.

Francesco


Solution

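  • Your file is probably not encoded as UTF-8. The byte 0xf6 is never valid in UTF-8, but it is "ö" in ISO-8859-1 (Latin-1) and Windows-1252, which suggests the file uses one of those encodings. Try passing a different encoding, such as ISO-8859-1, in your read_csv call, e.g.:

    pd$read_csv("C:\\Users\\Befrancesco\\Desktop\\X_dataset\\x_file_name.csv", error_bad_lines = FALSE, encoding = "ISO-8859-1")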
    

    See this answer for more on different encodings: https://stackoverflow.com/a/18172249/5269252
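
To see why the encoding matters, here is a minimal Python sketch that uses only the standard library. It reproduces the failure on the byte 0xf6 and then tries a few candidate encodings in order. The sample bytes and the helper name `try_decode` are made up for illustration; they are not taken from the original file or from pandas.

```python
# Hypothetical sample: the word "Köln" as it would be stored in ISO-8859-1.
# The byte 0xf6 is the one reported in the traceback.
sample = b"K\xf6ln"

# Decoding as UTF-8 fails, just like in the error message.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)  # invalid start byte


def try_decode(data, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return (encoding, text) for the first candidate that decodes data.

    The candidate list is an assumption; adjust it to the encodings
    you expect for your data.
    """
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


print(try_decode(sample))  # ('cp1252', 'Köln')
```

Keep in mind that ISO-8859-1 maps every possible byte to a character, so it never raises a decode error. A successful read therefore does not prove the encoding is right; it is worth checking a few rows with accented characters after loading the file.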