Change encoding type when parsing RData file into Python using Rdata package


For an assignment I am trying to import an RData file into Python that contains textual content and categories for that content. I looked around the web and found the rdata package for Python, which allows me to do this. However, the package assumes the text is encoded as ASCII when it is in fact UTF-8. I have looked through the documentation and cannot find a way to change this default encoding.

Here is the code I am trying to do this with:

import pandas as pd
import rdata

parsed = rdata.parser.parse_file("news_dataset.rda")
converted = rdata.conversion.convert(parsed)
converted_df = pd.DataFrame(converted.get("df_final"))

Running this generates the following warning (I have left out the irrelevant part of the path):

conversion\_conversion.py:266: UserWarning: Unknown encoding. Assumed ASCII.
warnings.warn("Unknown encoding. Assumed ASCII.") 

Because of this incorrect conversion, the text comes back as undecoded byte strings such as:

b'Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (\xc2\xa3600m) for the three months to December, from $639m year-earlier.

I am assuming this method should be able to handle this, but I have no clue how to use it in the conversion method.

Could anybody here help me? Thanks in advance!

[edit] I have also tried pyreadr, but when importing the module I get the following error:

ImportError                               Traceback (most recent call last)
<ipython-input-1-007a02a03c5d> in <module>
----> 1 import pyreadr
      2 
      3 result = pyreadr.read_r("news_dataset.rda")

~\AppData\Roaming\Python\Python37\site-packages\pyreadr\__init__.py in <module>
----> 1 from .pyreadr import read_r, list_objects, write_rds, write_rdata, download_file
      2 from .custom_errors import PyreadrError, LibrdataError
      3 
      4 __version__ = "0.4.4"
      5 

~\AppData\Roaming\Python\Python37\site-packages\pyreadr\pyreadr.py in <module>
      8 import pandas as pd
      9 
---> 10 from ._pyreadr_parser import PyreadrParser, ListObjectsParser
     11 from ._pyreadr_writer import PyreadrWriter
     12 from .custom_errors import PyreadrError

~\AppData\Roaming\Python\Python37\site-packages\pyreadr\_pyreadr_parser.py in <module>
     15     pass
     16 
---> 17 from .librdata import Parser
     18 from .custom_errors import PyreadrError
     19 

ImportError: DLL load failed: the specified module could not be found

Apparently this is because pyreadr was written for Python 2 rather than Python 3 (the version I am using). 2to3 is supposed to handle this conversion, but it hasn't worked for me yet.


Solution

  • I am the author of the rdata package.

    The convert function accepts the keyword parameter default_encoding, which specifies the encoding to use when a string does not explicitly declare one.

    You can also use the force_default_encoding parameter if the encoding is explicitly declared but wrong.

    Your code would then be:

    import pandas as pd
    import rdata

    parsed = rdata.parser.parse_file("news_dataset.rda")
    converted = rdata.conversion.convert(parsed, default_encoding="utf8")
    converted_df = pd.DataFrame(converted.get("df_final"))
    

    If you have further questions about the package, feel free to open a discussion in the GitHub repo. I am notified of those and can usually answer on the same day.
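
If some values still come back as bytes (for example in data that was already converted without default_encoding), they can be decoded after the fact. The following is only a minimal sketch, assuming the text really is UTF-8: the sequence \xc2\xa3 in the question's output is the UTF-8 encoding of the pound sign. The helper name decode_bytes is hypothetical and not part of rdata.

```python
def decode_bytes(value, encoding="utf-8"):
    """Return str for bytes input; pass any other value through unchanged.

    Hypothetical helper for illustration, not part of the rdata package.
    """
    if isinstance(value, (bytes, bytearray)):
        # errors="replace" substitutes U+FFFD for malformed bytes instead of raising
        return bytes(value).decode(encoding, errors="replace")
    return value


# Fragment of the byte string shown in the question
raw = b"profit jumped 76% to $1.13bn (\xc2\xa3600m)"
text = decode_bytes(raw)
print(text)
```

On a DataFrame, the same helper could be applied to a single column with Series.map, e.g. converted_df["text"].map(decode_bytes), where "text" is an assumed column name. Passing default_encoding to convert remains the cleaner fix, since it decodes everything in one place.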