Okay, I don't know if I phrased it badly or something, but I can't seem to find anything similar here for my problem.
So I have a 2D list, where each row represents a case and each column represents a feature (for machine learning). In addition, I have a separate list (a single column) of labels.
I want to randomly select rows from the 2D list to train a classifier and use the rest to test for accuracy. So I need to know the indices of all the rows I used for training, to avoid reusing them for testing.
I think there are 2 parts to the question: 1) how to randomly select rows, and 2) how to get their indices.
Again, I have no idea why I can't find good info here by searching (maybe I just suck).
Sorry, I'm still new to the community, so I might have made a lot of formatting mistakes. If you have any suggestions, please let me know.
Here's the part of the code I'm using to build the 2D list:

    #273 = number of cases
    #one independent row per case ([[0]*n]*273 would repeat the same row object)
    feature_list=[[0]*len(mega_list) for _ in range(273)]
    #create counters to use for index later
    link_count=0
    feature_count=0
    #print len(mega_list)
    for link in url_list[:-1]:
        #setup the url
        samp_url='http://www.mtsamples.com'+link
        samp_url = "%20".join( samp_url.split() )
        #soup it for keywords
        samp_soup=BeautifulSoup(urllib2.urlopen(samp_url).read())
        keywords=samp_soup.find('meta')['content']
        keywords=keywords.split(',')
        for keys in keywords:
            #print 'megalist: '+ str(mega_list.index(keys))
            if keys in mega_list:
                feature_list[link_count][mega_list.index(keys)]=1
        #advance to the next row (one row per link)
        link_count+=1

mega_list: a list of all keywords
feature_list: the 2D list; a cell is set to 1 if that case's page has the corresponding keyword from mega_list, otherwise it stays 0
I would store the data in a pandas DataFrame instead of a 2D list. If I understand your data right, you could do that like this:

    import pandas as pd
    df = pd.DataFrame(feature_list, columns=mega_list)

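One caveat about the input before relying on `df`: a grid built as `[[0]*n]*m` holds m references to a single inner list, so setting one cell changes that column in every row. A small sketch of the difference, with made-up sizes purely for illustration:

```python
n_cases, n_features = 4, 3  # made-up sizes, just for illustration

# Outer multiplication copies references: all 4 rows are the same list object.
shared = [[0] * n_features] * n_cases
shared[0][1] = 1

# A comprehension builds a fresh inner list for every row.
independent = [[0] * n_features for _ in range(n_cases)]
independent[0][1] = 1

print(shared)       # the 1 shows up in column 1 of every row
print(independent)  # the 1 shows up only in row 0
```

If every row of your feature matrix looks identical, this is a likely cause.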
I don't see any mention of a dependent variable, but I'm assuming you have one because you mentioned a classifier algorithm. If your dependent variable is called "Y" and is in a list format with indices that align with your features, then this code will work for you:
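
    # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
    from sklearn.model_selection import train_test_split
    # test_size=0.8 holds out 80% of the rows for testing
    x_train, x_test, y_train, y_test = train_test_split(
        df, Y, test_size=0.8, random_state=0)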
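
On the index part of the question: because the features are in a DataFrame, the row labels survive the split, so `x_train.index` and `x_test.index` should tell you which rows landed where. If you would rather not depend on pandas or scikit-learn, here is a minimal standard-library sketch of the same idea. The helper name `split_indices`, the toy `features` and `labels` data, and the 60/40 fraction are all made up for illustration:

```python
import random

def split_indices(n_rows, train_fraction=0.2, seed=0):
    """Return (train_idx, test_idx): disjoint, sorted row indices covering range(n_rows)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    n_train = int(round(n_rows * train_fraction))
    # sample() draws without replacement, so no row index is picked twice
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    test_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, test_idx

# Toy stand-ins for feature_list and the label list (values are made up).
features = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 0]]
labels = ["a", "b", "a", "b", "a"]

train_idx, test_idx = split_indices(len(features), train_fraction=0.6)

# Use the same indices for features and labels so they stay aligned.
x_train = [features[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
x_test = [features[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]
```

Keeping `train_idx` and `test_idx` around covers both parts of the question: the selection is random, and you know exactly which rows were used for training.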