
How to load a dataset using scikits


I am working on recommender systems and I am trying to use the scikits.crab package for the basic recommender algorithms. However, the examples in every tutorial just use scikits' own datasets, and I couldn't find anything about how to load an external dataset (for example, from my computer). This is what you see in every tutorial for scikits.crab:

from scikits.crab import datasets
from scikits.crab.models import MatrixPreferenceDataModel

movies = datasets.load_sample_movies()
model = MatrixPreferenceDataModel(movies.data)

However, as I said, I need to load a dataset from my own machine in a form that the scikits methods can use.


Solution

  • Here is a relevant section of the crab tutorial.

    In your above example, you're only using the movies.data field for your model. movies.data looks like the following:

    >>> print movies.data
    {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
     2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
     3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
     4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
     5: {2: 4.5, 3: 1.0, 4: 4.0},
     6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
     7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}}
    

    This is just a dictionary where the key is the user (represented here by 1, 2, 3, 4, 5, 6, and 7) and the value is another dictionary, where the key is the movie ID and the value is the rating. So you just need to construct a nested dictionary.
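    For instance, if you already have your ratings in memory, you can build that structure directly (the IDs and ratings below are made up for illustration):

```python
# Nested dict in the shape crab expects: {user_id: {item_id: rating}}.
my_data = {
    1: {1: 4.0, 2: 3.5},   # user 1 rated items 1 and 2
    2: {1: 2.0, 3: 5.0},   # user 2 rated items 1 and 3
}
# model = MatrixPreferenceDataModel(my_data)  # then hand it to crab
```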

    From the source, the authors load the data from .csv files with the following code:

    # (In the crab source, np is numpy, join/dirname come from os.path,
    # and Bunch is a small dict-like container defined in the same module.)
    def load_sample_movies():
    
        base_dir = join(dirname(__file__), 'data/')
    
        #Read data
        data_m = np.loadtxt(base_dir + 'sample_movies.csv',
                delimiter=';', dtype=str)
        item_ids = []
        user_ids = []
        data_songs = {}
        for user_id, item_id, rating in data_m:
            if user_id not in user_ids:
                user_ids.append(user_id)
            if item_id not in item_ids:
                item_ids.append(item_id)
            u_ix = user_ids.index(user_id) + 1
            i_ix = item_ids.index(item_id) + 1
            data_songs.setdefault(u_ix, {})
            data_songs[u_ix][i_ix] = float(rating)
    
        data_t = []
        for no, item_id in enumerate(item_ids):
            data_t.append((no + 1, item_id))
        data_titles = dict(data_t)
    
        data_u = []
        for no, user_id in enumerate(user_ids):
            data_u.append((no + 1, user_id))
        data_users = dict(data_u)
    
        fdescr = open(dirname(__file__) + '/descr/sample_movies.rst')
    
        return Bunch(data=data_songs, item_ids=data_titles,
                     user_ids=data_users, DESCR=fdescr.read())
    

    And the .csv file that this data is located in is in the form of:

    Jack Matthews;Lady in the Water;3.0
    Jack Matthews;Snakes on a Planet;4.0
    Jack Matthews;You, Me and Dupree;3.5
    Jack Matthews;Superman Returns;5.0
    Jack Matthews;The Night Listener;3.0
    Mick LaSalle;Lady in the Water;3.0
    Mick LaSalle;Snakes on a Planet;4.0
    Mick LaSalle;Just My Luck;2.0
    Mick LaSalle;Superman Returns;3.0
    Mick LaSalle;You, Me and Dupree;2.0
    Mick LaSalle;The Night Listener;3.0
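    Given that file, the loader assigns 1-based indices in order of first appearance, so the item_ids and user_ids fields of the returned Bunch map indices back to names. A quick sketch of that reverse lookup (the dicts below are hand-built to mirror the sample file, not pulled from crab):

```python
# Users and items numbered by order of first appearance in the CSV.
data_users = {1: 'Jack Matthews', 2: 'Mick LaSalle'}
data_titles = {1: 'Lady in the Water', 2: 'Snakes on a Planet'}

# A rating stored at data[1][2] then belongs to:
who = data_users[1]      # Jack Matthews
what = data_titles[2]    # Snakes on a Planet
```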
    

    Therefore, if you want to use your own dataset, you have two options: either format it yourself into the nested-dictionary form the recommender needs, or write a loader modeled on the function above that does the formatting for you.

    From what I could find, the project doesn't ship a general "import from CSV" method - I may just be missing one, having only browsed the source.

    Luckily, since the recommender only wants the dictionary, you don't need the extra description file and the rest of the Bunch; formatting your data correctly is enough.
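    To illustrate the second option, here is a minimal loader along the same lines, stripped of the Bunch and description-file machinery. The function name and the semicolon-delimited user;item;rating layout are my own assumptions mirroring the sample file, not part of crab:

```python
import csv

def load_ratings_csv(path, delimiter=';'):
    """Read user;item;rating rows into {user_ix: {item_ix: rating}}."""
    user_ids, item_ids = [], []
    data = {}
    with open(path) as f:
        for user_id, item_id, rating in csv.reader(f, delimiter=delimiter):
            # Assign 1-based indices in order of first appearance,
            # as the original loader does.
            if user_id not in user_ids:
                user_ids.append(user_id)
            if item_id not in item_ids:
                item_ids.append(item_id)
            u_ix = user_ids.index(user_id) + 1
            i_ix = item_ids.index(item_id) + 1
            data.setdefault(u_ix, {})[i_ix] = float(rating)
    return data
```

    The returned dictionary can then be passed straight to MatrixPreferenceDataModel.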