python search random machine-learning indices

How to get the indices for randomly selected rows in a list (Python)

Okay, I don't know if I phrased it badly or something, but I can't seem to find anything similar here for my problem.

So I have a 2D list, with each row representing a case and each column representing a feature (for machine learning). In addition, I have a separate list (one column) holding the labels.

I want to randomly select rows from the 2D list to train a classifier and use the rest to test for accuracy. So I need to know the indices of all the rows I used for training, so that none of them are reused for testing.

I think there are two parts to the question: 1) how to randomly select rows, and 2) how to get their indices.

Again, I have no idea why I can't find good info here by searching (maybe I just suck).

Sorry, I'm still new to the community, so I might have made a lot of formatting mistakes. If you have any suggestions, please let me know.

Here's the part of the code I'm using to build the 2D list:

#273 = number of cases
#build each row separately; [[0]*n]*273 would repeat one shared row 273 times
feature_list=[[0]*len(mega_list) for _ in range(273)]
#create counters to use for index later
link_count=0
feature_count=0
#print len(mega_list)
for link in url_list[:-1]:

    #setup the url
    samp_url='http://www.mtsamples.com'+link
    samp_url = "%20".join( samp_url.split() )

    #soup it for keywords
    samp_soup=BeautifulSoup(urllib2.urlopen(samp_url).read())
    keywords=samp_soup.find('meta')['content']
    keywords=keywords.split(',')

    for keys in keywords:
        #print 'megalist: '+ str(mega_list.index(keys))
        if keys in mega_list:
            feature_list[link_count][mega_list.index(keys)]=1

    #advance to the next case's row
    link_count+=1

mega_list: a list with all keywords

feature_list: the 2D list; for each case, a cell is set to 1 if the corresponding word in mega_list is one of that case's keywords, otherwise 0
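
One thing to watch when building a 2D list like feature_list: multiplying the outer list (the `[[0]*n]*m` pattern) repeats references to a single inner list, so setting one cell changes every row. The sketch below uses small made-up sizes (`n_features`, `n_cases` are illustrative names, not from the code above) to show the difference from a list comprehension:

```python
n_features = 3  # hypothetical sizes, chosen only for illustration
n_cases = 2

# Outer multiplication copies references: both rows are the same list object.
aliased = [[0] * n_features] * n_cases
aliased[0][1] = 1

# A list comprehension builds a fresh inner list for every row.
independent = [[0] * n_features for _ in range(n_cases)]
independent[0][1] = 1

print(aliased)      # [[0, 1, 0], [0, 1, 0]]
print(independent)  # [[0, 1, 0], [0, 0, 0]]
```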

Solution

I would store the data in a pandas DataFrame instead of a 2D list. If I understand your data right, you could do that like this:

import pandas as pd

df = pd.DataFrame(feature_list, columns = mega_list)

I don't see any mention of a dependent variable, but I'm assuming you have one because you mentioned a classifier algorithm. If your dependent variable is called "Y" and is in a list format with indices that align with your features, then this code will work for you:

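# train_test_split lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# test_size is the fraction of rows held out for testing
x_train, x_test, y_train, y_test = train_test_split(
    df, Y, test_size=0.8, random_state=0)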
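
Since the split keeps the DataFrame's row labels, x_train.index and x_test.index should give you the indices of the rows that ended up in each set. If you'd rather stay with plain lists, here is a minimal sketch of the same idea using only the standard library. The helper name `split_indices` and the toy `features`/`labels` data are my own placeholders, not part of your code:

```python
import random

def split_indices(n_rows, train_fraction, seed=None):
    """Return (train_idx, test_idx): two disjoint sorted lists covering range(n_rows)."""
    rng = random.Random(seed)
    n_train = int(round(n_rows * train_fraction))
    # sample() draws without replacement, so no index is picked twice
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    test_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, test_idx

# Toy stand-ins for feature_list and the label list (made-up values).
features = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
labels = ['a', 'b', 'a', 'b', 'a']

train_idx, test_idx = split_indices(len(features), 0.6, seed=0)
x_train = [features[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
x_test = [features[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]
```

Because the indices are chosen first and the rows are looked up afterwards, the same train_idx can be applied to both the feature rows and the labels, which keeps them aligned.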