Tags: python, csv, python-ggplot

How can I read and group this CSV data?


The CSV looks like this ('|' marks the boundary between columns):

2014-09-01 | I love chicken

2014-09-01 | I eat chicken

2014-09-02 | She loves chicken

2014-09-02 | Ha ha ha I love chicken

2014-09-03 | Blah Blah Blah

I want to transform the data so that it looks like this:

2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |

2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'I', 1 |

2014-09-03 | 'blah', 3 |

In general, each row has the form: DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...

What approach should I use here? Ultimately, I want to plot a graph that shows the date on the x-axis and the word counts (frequencies) on the y-axis.

Below is my best attempt so far.

import csv
import MeCab

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"

# Create the tagger once instead of once per row.
tagger = MeCab.Tagger()

with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # row[0] is the date column, row[1] is the text column.
        if len(row) >= 2 and row[0][:10] == TestStartDate:
            rose = tagger.parse(row[1])
            #print(rose)
            wordCount = {}
            # Assumes MeCab's default output: surface forms and feature
            # strings alternate, and the final token is 'EOS'.
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print('"%s, %i"' % (word, count))

I plan to add each word and its count to Data.


Solution

  • This works for me. Do you really need the trailing '|'? If you split a row on '|' again later (for example, before passing it to matplotlib or something else), the trailing delimiter produces an empty string '' as the last element of the result.

    The code below does not append a '|' to each row of the result. If you think it is necessary, append a '|' to the string returned by the function d, like this:

    return '%s| %s|'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))
    

    ===========

    def d(s):
        # s looks like 'DATE | word word ...'; returns "DATE | 'word',count|..."
        tokens = s.split('|')
        words = tokens[-1].strip().lower().split(' ')
        return '%s| %s'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))
    
    def wordcount():
        lines=[
            '2014-09-01 | I love chicken',
            '2014-09-01 | I eat chicken',
            '2014-09-02 | She loves chicken',
            '2014-09-02 | Ha ha ha I love chicken',
            '2014-09-03 | Blah Blah Blah'
        ]
        # Concatenate the text of all lines that share the same date.
        rows={}
        for line in lines:
            t_line = line.split(' | ')
            if t_line[0] not in rows:
                rows[t_line[0]]=''
            rows[t_line[0]]+=(' '+t_line[-1])
        # Count the words for each date and format one output row per date.
        newrows=[]
        for k,v in rows.items():
            newrows.append(d('%s | %s'%(k,v)))
        print('\n'.join(newrows))
    
    wordcount()
    
    
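    Output (the order of the dates and of the words within a row is arbitrary and may differ between runs):

    >>2014-09-02 | 'love',1|'i',1|'she',1|'loves',1|'chicken',2|'ha',3
    >>2014-09-03 | 'blah',3
    >>2014-09-01 | 'i',2|'chicken',2|'love',1|'eat',1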