For an assignment I am trying to import an RData file into Python. The file contains textual content and categories for that content. I have looked around the web and found the rdata package for Python, which allows me to do this. However, the package assumes the encoding of the text is ASCII, while it is in fact UTF-8. I have looked through the documentation and cannot find a way to change this default assumed encoding.
Here is the code I am trying to do this with:
import pandas as pd
import rdata
parsed = rdata.parser.parse_file("news_dataset.rda")
converted = rdata.conversion.convert(parsed)
converted_df = pd.DataFrame(converted.get("df_final"))
While running this, the following warning is generated (I have left the irrelevant part of the path out):
conversion\_conversion.py:266: UserWarning: Unknown encoding. Assumed ASCII.
warnings.warn("Unknown encoding. Assumed ASCII.")
Because of this incorrect conversion, I get undecoded byte strings such as:
b'Ad sales boost Time Warner profit\n\nQuarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (\xc2\xa3600m) for the three months to December, from $639m year-earlier.
I am assuming this method should be able to handle this, but I have no clue how to use it in the conversion method.
Could anybody here help me? Thanks in advance!
[edit] I have also tried pyreadr, but when importing the module I get the following error:
ImportError Traceback (most recent call last)
<ipython-input-1-007a02a03c5d> in <module>
----> 1 import pyreadr
2
3 result = pyreadr.read_r("news_dataset.rda")
~\AppData\Roaming\Python\Python37\site-packages\pyreadr\__init__.py in <module>
----> 1 from .pyreadr import read_r, list_objects, write_rds, write_rdata, download_file
2 from .custom_errors import PyreadrError, LibrdataError
3
4 __version__ = "0.4.4"
5
~\AppData\Roaming\Python\Python37\site-packages\pyreadr\pyreadr.py in <module>
8 import pandas as pd
9
---> 10 from ._pyreadr_parser import PyreadrParser, ListObjectsParser
11 from ._pyreadr_writer import PyreadrWriter
12 from .custom_errors import PyreadrError
~\AppData\Roaming\Python\Python37\site-packages\pyreadr\_pyreadr_parser.py in <module>
15 pass
16
---> 17 from .librdata import Parser
18 from .custom_errors import PyreadrError
19
ImportError: DLL load failed: the specified module could not be found
Apparently this is caused by the fact that pyreadr was written for Python 2 and not Python 3 (which is the version I am using). The 2to3 tool is supposed to be able to do this conversion, but it hasn't worked for me yet.
I am the author of the rdata package.
The convert function accepts the keyword parameter default_encoding, which you can use to specify the encoding used when it is not explicitly declared in the string.
You can also use the force_default_encoding parameter if the encoding is explicitly declared but wrong.
Your code would then be:
import pandas as pd
import rdata
parsed = rdata.parser.parse_file("news_dataset.rda")
converted = rdata.conversion.convert(parsed, default_encoding="utf8")
converted_df = pd.DataFrame(converted.get("df_final"))
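If you are stuck with values that have already come out as bytes (like the b'...' string in the question), they can also be decoded after the fact. The following is only a minimal sketch, assuming the bytes really are UTF-8; decode_utf8 is a hypothetical helper written for this illustration and is not part of rdata.

```python
from typing import Any


def decode_utf8(value: Any, encoding: str = "utf-8") -> Any:
    """Recursively decode bytes found inside nested dicts, lists, and tuples.

    Hypothetical helper for illustration only; not part of the rdata package.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, dict):
        return {key: decode_utf8(item, encoding) for key, item in value.items()}
    if isinstance(value, list):
        return [decode_utf8(item, encoding) for item in value]
    if isinstance(value, tuple):
        return tuple(decode_utf8(item, encoding) for item in value)
    # Anything else (str, int, float, None, ...) is returned unchanged.
    return value


# Fragment of the byte string from the question: \xc2\xa3 is the UTF-8
# encoding of the pound sign.
raw = b"jumped 76% to $1.13bn (\xc2\xa3600m)"
decoded = decode_utf8(raw)
```

Passing default_encoding to convert is still the cleaner option, because the strings are then decoded correctly at conversion time and no second pass over the data is needed.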
If you have further doubts about the package, feel free to open a discussion in the GitHub repo. I am notified of those and can usually answer on the same day.