Here is an example from my training data (I have 1000 films for training). I need to predict the 'budget' of each film:
film_1 = {
'title': 'The Hobbit: An Unexpected Journey',
'article_size': 25000,
'producer': ['Peter Jackson', 'Fran Walsh', 'Zane Weiner'],
'release_date': some_date(2013, 11, 28),
'running_time': 169,
'country': ['New Zealand', 'UK', 'USA'],
'budget': dec('200000000')
}
Keys such as 'title', 'producer', and 'country' can be viewed as features in machine learning, while values such as 'The Hobbit: An Unexpected Journey', 25000, etc. can be viewed as the values used in the learning process. However, training input is usually expected to be real numbers rather than strings. Do I need to convert string fields like 'title', 'producer', and 'country' to int (should something like classification or serialization take place?), or apply some other manipulation, so that I can use this data as a training set for my network?
I was wondering whether this is what you need:
film_list = ['title', 'article_size', 'producer', 'release_date', 'running_time', 'country', 'budget']
# Pair each field name with an integer index
flist = list(enumerate(film_list))
label = [seq[0] for seq in flist]  # integer labels
name = [seq[1] for seq in flist]   # field names
print(label)
print(name)
>>[0, 1, 2, 3, 4, 5, 6]
['title', 'article_size', 'producer', 'release_date', 'running_time', 'country', 'budget']
Or you can use your dictionary directly:
labels = list(film_1.keys())
print(labels)
# Dictionary keys are not sorted: before Python 3.7 their order is arbitrary, so labels[0] may give you 'producer' instead of 'title':
>>['producer', 'title', 'country', 'release_date', 'budget', 'article_size', 'running_time']
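Note that indexing the keys only numbers the field names; it does not turn the string values themselves into numbers. For list-valued fields such as 'country' or 'producer', one common option is a multi-hot (one-hot style) encoding, where each distinct value gets its own 0/1 column. Below is a minimal sketch of that idea, not necessarily the best encoding for this data. The helper names `build_vocabulary` and `multi_hot` are made up for illustration, and the second film is invented sample data.

```python
# Sketch: multi-hot encoding of a list-valued string field.
# Only the field names come from the question; the helpers and the
# second film are hypothetical, for illustration only.
films = [
    {'title': 'The Hobbit: An Unexpected Journey',
     'running_time': 169,
     'country': ['New Zealand', 'UK', 'USA']},
    {'title': 'Some Other Film',  # invented example row
     'running_time': 100,
     'country': ['USA']},
]

def build_vocabulary(films, field):
    """Collect every distinct value of a list-valued field, sorted."""
    values = set()
    for film in films:
        values.update(film[field])
    return sorted(values)

def multi_hot(film, field, vocabulary):
    """Return a 0/1 list with a 1 for each vocabulary entry the film has."""
    present = set(film[field])
    return [1 if value in present else 0 for value in vocabulary]

countries = build_vocabulary(films, 'country')

# Numeric feature row: running_time followed by one column per country.
vectors = [[film['running_time']] + multi_hot(film, 'country', countries)
           for film in films]

print(countries)  # ['New Zealand', 'UK', 'USA']
print(vectors)    # [[169, 1, 1, 1], [100, 0, 0, 1]]
```

The same approach would work for 'producer'. A free-text field like 'title' usually needs a different treatment (or can simply be dropped), since nearly every title is unique.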