r plot gnuplot

How do I visualize the rise or decline in occurrences of a word in a data set with timestamps?


I'm trying to make a graph like this one (but with a single line, for simplicity's sake): [image]

Given an input word like "M4M" and a data set file (CSV) like this:

1529972216.0,Seeking Black M4M
1529972047.0,Looking for car fun 
1529971885.0,armenian M4M

How can I visualize the trend of the given word? I want to chart the occurrence of the word over the time span, to be able to tell if the word/topic is declining or increasing in popularity.

(The data set is a CSV file: field 1 contains the Unix epoch timestamp of each Craigslist post, and field 2 contains the post's title.)

I have R and gnuplot installed on my system, if that helps.
There can be hundreds of Craigslist posts on any given day.


Solution

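  • gnuplot can do that. This is essentially a histogram, and gnuplot has the option smooth frequency for it: points with the same x value are replaced by a single point with the summed y values. In the code below, each timestamp is rounded down to the start of its bin (one week wide), and the y value is 1 if Keyword appears in the second column and 0 otherwise, so the result is the number of matching titles per bin. Adapt the code to your needs.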

    The code:

    ### count occurrence of a word
    reset session
    
    $Data <<EOD
    1300000000.0,Seeking Green M4M
    1300000000.0,Seeking Blue M4M
    1310000000.0,Seeking Green M4M
    1320000000.0,Seeking Red M4M
    1330000000.0,Seeking Black M4M
    1340000000.0,Looking for car fun 
    1350000000.0,armenian M4M
    1360000000.0,english M4M
    1370000000.0,german M4M
    1380000000.0,french M4M
    1390000000.0,italian M4M
    1390200000.0,greek M4M
    1400000000.0,swiss M4M
    1500000000.0,spanish M4M
    EOD
    
    set datafile separator ","
    set xdata time
    set timefmt "%s"
    set format x "%Y"
    
    Keyword = "M4M"
    Binwidth = 3600.*24*7   # one week
    
    plot $Data u (floor($1/Binwidth)*Binwidth):(strstrt(strcol(2),Keyword)>0) \
        smooth freq w lp pt 7 lc rgb "red" title Keyword
    ### end of code
    

    The result:

    [image: resulting plot]

    Edit:

    Actually, plotting the result with lines or linespoints (as above) can be misleading, because the connecting line suggests that the count between 2015 and 2017 is 1, which is not true. Plotting with boxes would suggest the same thing. These plot styles are only appropriate if every bin (here: every week) has a value. You could achieve that by setting the value of all other weeks to zero. Otherwise, the "correct" plot style in any case is with impulses.

    [image: the same data plotted with impulses]
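
    The impulses variant described in the edit could look roughly like the sketch below. It is the same plot command as above with only the plot style swapped; the shortened datablock, the line width, and the yrange setting are choices made for this illustration, not part of the original code.

    ```gnuplot
    ### count occurrences of a word, drawn with impulses (sketch)
    reset session

    # shortened sample data, taken from the datablock above
    $Data <<EOD
    1300000000.0,Seeking Green M4M
    1300000000.0,Seeking Blue M4M
    1340000000.0,Looking for car fun
    1400000000.0,swiss M4M
    1500000000.0,spanish M4M
    EOD

    set datafile separator ","
    set xdata time
    set timefmt "%s"
    set format x "%Y"
    set yrange [0:*]          # assumption: start the y axis at zero so impulse heights read as counts

    Keyword  = "M4M"
    Binwidth = 3600.*24*7     # one week, in seconds

    # x: start of the week the timestamp falls into; y: 1 if Keyword is in the title, else 0
    # smooth freq sums the y values of all rows that share the same x
    plot $Data u (floor($1/Binwidth)*Binwidth):(strstrt(strcol(2),Keyword)>0) \
        smooth freq w impulses lw 2 lc rgb "red" title Keyword
    ### end of code
    ```

    Each impulse is a vertical line from zero up to the count of its own bin, so nothing is drawn for weeks without data and no value is implied between bins. To plot the real file instead of the datablock, replace $Data with the quoted file name.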