Search code examples
pythonpandastime-seriesclipboardcopy-paste

How can I copy DataFrames with datetimes from Stack Overflow into Python?


I see frequent pandas examples on SO using time series that have spaces within the timestamps:

                     A
2020-01-01 09:20:00  0
2020-01-01 09:21:00  1
2020-01-01 09:22:00  2
2020-01-01 09:23:00  3
2020-01-01 09:24:00  4

Or this where the times aren't part of the index:

                dates    values cat
0 2020-01-01 09:20:00  0.758513   a
1 2020-01-01 09:21:00  0.337325   b
2 2020-01-01 09:22:00  0.618372   b
3 2020-01-01 09:23:00  0.878714   b
4 2020-01-01 09:24:00  0.311069   b

Is there a good way to copy these (or similar) data back into Python to work with? I have found posts like this and this which were lifesavers for getting many examples out of SO, yet I usually fail to find a copy/paste approach (using pd.read_clipboard() or pd.read_table()) that works for such data. This often then discourages me from attempting an answer1.

The examples above were created this way:

#one
import pandas as pd
import numpy

dr = pd.date_range('01-01-2020 9:20', '01-01-2020 9:24', freq='1T')
df1 = pd.DataFrame(index=dr, data=range(len(dr)), columns=['A'])

#two
df2 = pd.DataFrame({'dates':dr,
                    'values':np.random.rand(len(dr)),
                    'cat':np.random.choice(['a','b'],len(dr))})

1. And for the record, I reckon the onus should be on posters to post their data in a more copy-able format, else not get answered. For time series information, rather than pasting the string representation of the DataFrame I always try to post constructing code (using pd.date_range() or w/e). And I imagine using something like df.to_dict() would be better if there are specific (irregular interval) dates that need to be copied for the example.


Solution

  • I usually copy the entire string and then parse it. It's not perfect and you usually have to edit both the string and the dataframe to make it usable. Here is one example. This solution was already provided in this answer. I've only added the bit about parsing date/time.

    import pandas as pd
    from io import StringIO
    from dateutil.parser import parse
    
    # I added two more column names `date` and `time`.
    # An advantage of having the string in your python code is that
    # you can edit it in your text editor/jupyter notebook quickly and directly.
    s = """date time A
    2020-01-01 09:20:00  0
    2020-01-01 09:21:00  1
    2020-01-01 09:22:00  2
    2020-01-01 09:23:00  3
    2020-01-01 09:24:00  4"""
    
    # Parse using whitespace separator. This will still not be perfect as we can
    # see below.
    df = pd.read_csv(StringIO(s), sep="\s+", index_col=False)
    df
    #          date      time  A
    # 0  2020-01-01  09:20:00  0
    # 1  2020-01-01  09:21:00  1
    # 2  2020-01-01  09:22:00  2
    # 3  2020-01-01  09:23:00  3
    # 4  2020-01-01  09:24:00  4
    
    # Combine date and time column together and drop the individual columns.
    df['datetime'] = df['date'] + " " + df['time']
    df = df.drop(['date', 'time'], axis=1)
    
    # Use a somewhat universal parser in dateutil.parser.parse to parse the
    # dates into proper dateime object.
    df['datetime'] = df['datetime'].apply(parse)
    df
    #    A            datetime
    # 0  0 2020-01-01 09:20:00
    # 1  1 2020-01-01 09:21:00
    # 2  2 2020-01-01 09:22:00
    # 3  3 2020-01-01 09:23:00
    # 4  4 2020-01-01 09:24:00
    
    df.index
    # RangeIndex(start=0, stop=5, step=1)
    
    df.dtypes
    # A                    int64
    # datetime    datetime64[ns]
    # dtype: object
    
    df.columns
    # Index(['A', 'datetime'], dtype='object')
    

    One method to provide formatted and parseable dataframe on StackOverflow is by outputting a csv-formatted string.

    # Continued from above
    print(df.to_csv(index=False))
    # A,datetime
    # 0,2020-01-01 09:20:00
    # 1,2020-01-01 09:21:00
    # 2,2020-01-01 09:22:00
    # 3,2020-01-01 09:23:00
    # 4,2020-01-01 09:24:00
    
    # We can indeed parse nicely from the csv-formatted string 
    s_redux = df.to_csv(index=False)
    pd.read_csv(StringIO(s_redux))
    #    A             datetime
    # 0  0  2020-01-01 09:20:00
    # 1  1  2020-01-01 09:21:00
    # 2  2  2020-01-01 09:22:00
    # 3  3  2020-01-01 09:23:00
    # 4  4  2020-01-01 09:24:00
    

    Here is one attempt at parsing the second example dataframe. As before, we do need to make some "edits" to the dataframe to make it usable.

    import pandas as pd
    from io import StringIO
    from dateutil.parser import parse
    
    s="""                dates    values cat
    0 2020-01-01 09:20:00  0.758513   a
    1 2020-01-01 09:21:00  0.337325   b
    2 2020-01-01 09:22:00  0.618372   b
    3 2020-01-01 09:23:00  0.878714   b
    4 2020-01-01 09:24:00  0.311069   b"""
    
    df = pd.read_csv(StringIO(s), sep="\s+").reset_index()
    df
    #    level_0     level_1     dates    values cat
    # 0        0  2020-01-01  09:20:00  0.758513   a
    # 1        1  2020-01-01  09:21:00  0.337325   b
    # 2        2  2020-01-01  09:22:00  0.618372   b
    # 3        3  2020-01-01  09:23:00  0.878714   b
    # 4        4  2020-01-01  09:24:00  0.311069   b
    
    df['dates'] = df['level_1'] + " " + df['dates']
    df = df.drop(['level_0', 'level_1'], axis=1)
    df['dates'] = df['dates'].apply(parse)
    
    df
    #                 dates    values cat
    # 0 2020-01-01 09:20:00  0.758513   a
    # 1 2020-01-01 09:21:00  0.337325   b
    # 2 2020-01-01 09:22:00  0.618372   b
    # 3 2020-01-01 09:23:00  0.878714   b
    # 4 2020-01-01 09:24:00  0.311069   b
    
    df.dtypes
    # dates     datetime64[ns]
    # values           float64
    # cat               object
    # dtype: object