The CSV looks like this ('|' separates the columns):
2014-09-01 | I love chicken
2014-09-01 | I eat chicken
2014-09-02 | She loves chicken
2014-09-02 | Ha ha ha I love chicken
2014-09-03 | Blah Blah Blah
I want to transform the data so that it looks like this:
2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |
2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'I', 1 |
2014-09-03 | 'blah', 3 |
DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...
What approach should I use here? Ultimately I want to plot a graph with the date on the x-axis and the word counts (frequencies) on the y-axis.
Below is my best approach yet.
import csv
import MeCab

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"

with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        if str(row[0:1])[2:12] == TestStartDate:
            # str(row[1:2])[2:str(row[1:2]).find('"')-1] is the second column
            tagger = MeCab.Tagger()
            rose = tagger.parse(str(row[1:2])[2:str(row[1:2]).find('"')-1])
            # print rose
            wordCount = {}
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print('"%s, %i"' % (word, count))
I plan to add word and count into Data.
This works for me. Do you really need the trailing '|', though? If you split each row on '|' again later, for example before passing it to matplotlib or something else, you will get an empty string '' at the end of the result.
The function d below does not append a trailing '|' to each row of the result. If you think it is necessary, change the return statement of d like this:
    return '%s| %s|'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))
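To see the empty string mentioned above, here is a quick check. The two row strings are only illustrative, built from the sample data in the question:

```python
# One result row with a trailing '|' and one without (illustrative values).
row_with_bar = "2014-09-03 | 'blah',3|"
row_without_bar = "2014-09-03 | 'blah',3"

print(row_with_bar.split('|'))     # ['2014-09-03 ', " 'blah',3", '']
print(row_without_bar.split('|'))  # ['2014-09-03 ', " 'blah',3"]
```

With the trailing '|' you would have to filter out the last element before using the row.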
===========
def d(s):
    # Split "date | text" and count each distinct lower-cased word.
    tokens = s.split('|')
    words = tokens[-1].strip().lower().split(' ')
    return '%s| %s'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))

def wordcount():
    lines=[
        '2014-09-01 | I love chicken',
        '2014-09-01 | I eat chicken',
        '2014-09-02 | She loves chicken',
        '2014-09-02 | Ha ha ha I love chicken',
        '2014-09-03 | Blah Blah Blah'
    ]
    # Concatenate the text of all rows that share the same date.
    rows={}
    for line in lines:
        t_line = line.split(' | ')
        if t_line[0] not in rows:
            rows[t_line[0]]=''
        rows[t_line[0]]+=(' '+t_line[-1])
    # Build one output row per date.
    newrows=[]
    for k,v in rows.items():
        newrows.append(d('%s | %s'%(k,v)))
    print('\n'.join(newrows))

wordcount()
>>2014-09-02 | 'love',1|'i',1|'she',1|'loves',1|'chicken',2|'ha',3
>>2014-09-03 | 'blah',3
>>2014-09-01 | 'i',2|'chicken',2|'love',1|'eat',1
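If the end goal is a plot, it may be easier to keep the counts in a dictionary instead of formatting them back into '|'-separated strings. The following is only a sketch of that idea using collections.Counter from the standard library. The helper names count_words_by_date and series_for are made up for this example, and it assumes every line has the form 'date | text' as in the sample data:

```python
from collections import Counter, defaultdict

lines = [
    '2014-09-01 | I love chicken',
    '2014-09-01 | I eat chicken',
    '2014-09-02 | She loves chicken',
    '2014-09-02 | Ha ha ha I love chicken',
    '2014-09-03 | Blah Blah Blah',
]

def count_words_by_date(lines):
    """Return {date: Counter of lower-cased words} for 'date | text' lines."""
    counts = defaultdict(Counter)
    for line in lines:
        # Split only on the first '|' so the text itself may contain '|'.
        date, text = line.split('|', 1)
        counts[date.strip()].update(text.lower().split())
    return counts

def series_for(word, counts):
    """Return the sorted dates and the count of `word` on each date (0 if absent)."""
    dates = sorted(counts)  # ISO dates sort correctly as plain strings
    return dates, [counts[d][word] for d in dates]

counts = count_words_by_date(lines)
dates, chicken = series_for('chicken', counts)
print(dates)    # ['2014-09-01', '2014-09-02', '2014-09-03']
print(chicken)  # [2, 2, 0]
```

The two lists line up one to one, so they can be handed to a plotting call such as matplotlib's plt.plot(dates, chicken), one call per word you want to show. Note that this counts 'love' and 'loves' separately, just like the code above. Merging them would need a stemmer or a tagger such as MeCab.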