python statistics statsmodels

How to use statsmodels get_rdataset?


Python's statsmodels library has a get_rdataset() function that can fetch various datasets. Where is the list of datasets that can be fetched, and how do I use the function to load one?

The documentation does not say which datasets are available. It only states that dataname ("The name of the dataset you want to download") is a required parameter, without listing the valid datanames anywhere.


Solution

  • A CSV file containing metadata about all available datasets can be found at https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/datasets.csv. statsmodels uses the same file: the URL is assigned to the variable index_url in the _get_dataset_meta() function of the statsmodels.datasets.utils module.

    The file can be loaded with pandas, for example, to inspect its first 5 rows:

    import pandas as pd
    datasets = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/datasets.csv")
    datasets.head()
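
To find a dataname without scrolling through the whole index, the metadata frame can be filtered by keyword. The following is a minimal sketch, not part of statsmodels: find_datasets is a hypothetical helper, and it assumes the index has Package, Item and Title columns. The small frame below is only an offline stand-in for the real index; in practice, pass the frame returned by pd.read_csv.

```python
import pandas as pd


def find_datasets(meta, keyword):
    """Return rows whose Item or Title contains `keyword` (case-insensitive).

    Hypothetical helper; assumes `meta` has Package, Item and Title columns.
    """
    hit = (
        meta["Item"].str.contains(keyword, case=False, regex=False, na=False)
        | meta["Title"].str.contains(keyword, case=False, regex=False, na=False)
    )
    return meta.loc[hit, ["Package", "Item", "Title"]]


# Offline stand-in for the real index (a few illustrative rows only).
meta = pd.DataFrame(
    {
        "Package": ["AER", "datasets", "datasets"],
        "Item": ["Affairs", "iris", "mtcars"],
        "Title": [
            "Fair's Extramarital Affairs Data",
            "Edgar Anderson's Iris Data",
            "Motor Trend Car Road Tests",
        ],
    }
)

matches = find_datasets(meta, "affairs")
print(matches)
```

The Item and Package values of a matching row are what get_rdataset() expects as its two arguments.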
    


    As the documentation shows, the first argument of get_rdataset() is the dataname (recorded as Item in the metadata CSV) and the second argument is the name of the package the dataset belongs to. For example, the first row of the CSV has the dataname Affairs in the AER package, so the following retrieves that dataset.

    import statsmodels.api as sm
    df = sm.datasets.get_rdataset('Affairs', 'AER', cache=True).data
    df.head()
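
Because get_rdataset() needs both the dataname and its package, the lookup can be wrapped in a small function. This is a rough sketch under stated assumptions: load_by_name is a hypothetical helper (not a statsmodels function), the index rows are assumed to carry Package and Item keys, and the loader parameter exists only so the lookup can be tried offline with a fake loader.

```python
def load_by_name(records, dataname, loader=None):
    """Find the package that ships `dataname`, then fetch the dataset.

    Hypothetical helper; `records` is a list of dicts with Package and Item keys.
    """
    packages = [r["Package"] for r in records if r["Item"] == dataname]
    if not packages:
        raise KeyError(f"{dataname!r} is not in the index")
    if len(packages) > 1:
        # The same name may appear in more than one package; ask the caller to choose.
        raise ValueError(f"{dataname!r} exists in several packages: {packages}")
    if loader is None:
        # Imported lazily so the lookup logic can be exercised without statsmodels.
        import statsmodels.api as sm
        loader = sm.datasets.get_rdataset
    return loader(dataname, packages[0])


# Offline stand-in rows for the real index.
records = [
    {"Package": "AER", "Item": "Affairs"},
    {"Package": "datasets", "Item": "iris"},
]

# A fake loader that simply echoes the arguments it would pass on.
calls = load_by_name(records, "Affairs", loader=lambda name, pkg: (name, pkg))
print(calls)  # ('Affairs', 'AER')
```

With the real index, records could be built with datasets.to_dict("records"); leaving loader unset then calls sm.datasets.get_rdataset, which needs network access.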
    



    The list of all available datasets can also be browsed in the Rdatasets repository that hosts the CSV above (vincentarelbundock/Rdatasets on GitHub). It is also referenced in the Using Datasets from R section of the Datasets package documentation.

    Thanks @Vitalizzare for pointing me to this repo.