Tags: python, csv, numpy, pandas, python-itertools

Split timestamp column into two new columns in CSV using python and pandas


I have a large CSV file with over 210000 rows. I am new to Python and pandas. I would like to efficiently loop through the timestamp column, split it into two new columns (date and time), format the new date column as %Y%m%d, and delete the new time column, i.e. write only the newly formatted date column back to the CSV file. How do you do this?

input file sample:

   minit,timestamp,open,high,low,close
   0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
   1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
   2,2009-02-23 17:34:00,1.2706,1.2707,1.2702,1.2702
   3,2009-02-23 17:35:00,1.2704,1.2706,1.27,1.27
   4,2009-02-23 17:36:00,1.2701,1.2706,1.2698,1.2703
   5,2009-02-23 17:37:00,1.2703,1.2703,1.27,1.2702
   6,2009-02-23 17:38:00,1.2701,1.2701,1.2696,1.2697

output file sample:

   minit,date,open,high,low,close
   0,20090223,1.2708,1.2708,1.2706,1.2706
   1,20090223,1.2708,1.2708,1.2705,1.2706
   2,20090223,1.2706,1.2707,1.2702,1.2702
   3,20090223,1.2704,1.2706,1.27,1.27
   4,20090223,1.2701,1.2706,1.2698,1.2703
   5,20090223,1.2703,1.2703,1.27,1.2702
   6,20090223,1.2701,1.2701,1.2696,1.2697

After some googling, I started writing sample code to accomplish this:

     import csv
     import itertools
     import operator
     import time
     import datetime
     import pandas as pd
     from pandas import DataFrame, Timestamp
     from numpy import *

     def datestring_to_timestamp(str):
         return time.mktime(time.strptime(str, "%Y-%m-%d %H:%M:%S"))

     def timestamp_to_datestring(timestamp):
         return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))

     def timestamp_to_float(str):
         return float(datetime.datetime.strptime(str, '%Y-%m-%d %H:%M:%S').strftime("%s"))

     def timestamp_to_intstring(str):
         return datetime.datetime.strptime(str, '%Y-%m-%d %H:%M:%S').strftime("%s")

     def timestamp_to_int(str):
         return int(datetime.datetime.strptime(str, '%Y-%m-%d %H:%M:%S').strftime("%s"))

     with open("inputfile.csv", 'rb') as input, open('outputfile.csv', 'wb') as output:
         reader = csv.reader(input, delimiter = ',')
         writer = csv.writer(output, delimiter = ',')

     # Need to process loop or process the timestamp column

Solution

  • You can pass a date format string to to_csv through its date_format parameter, and the dates will be written in that format. There is no need to extract, convert, or add new columns.

    So load data using read_csv:

    df = pd.read_csv('mydata.csv', parse_dates=['timestamp'])
    
    In [15]:
    
    df
    Out[15]:
       minit           timestamp    open    high     low   close
    0      0 2009-02-23 17:32:00  1.2708  1.2708  1.2706  1.2706
    1      1 2009-02-23 17:33:00  1.2708  1.2708  1.2705  1.2706
    2      2 2009-02-23 17:34:00  1.2706  1.2707  1.2702  1.2702
    3      3 2009-02-23 17:35:00  1.2704  1.2706  1.2700  1.2700
    4      4 2009-02-23 17:36:00  1.2701  1.2706  1.2698  1.2703
    5      5 2009-02-23 17:37:00  1.2703  1.2703  1.2700  1.2702
    6      6 2009-02-23 17:38:00  1.2701  1.2701  1.2696  1.2697
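
    To check that parse_dates really produced datetimes rather than strings, you can inspect the column dtype. A minimal sketch, assuming the sample rows from the question are read from an in-memory io.StringIO buffer instead of a file on disk (the names sample, is_datetime and first_date are only illustrative):

```python
import io

import pandas as pd

# Sample rows from the question, held in memory instead of a file on disk.
sample = io.StringIO(
    "minit,timestamp,open,high,low,close\n"
    "0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706\n"
    "1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706\n"
)

# parse_dates converts the named column to a datetime dtype while reading.
df = pd.read_csv(sample, parse_dates=["timestamp"])

# True when the column holds datetimes, False when it was left as strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(df["timestamp"])

# Each element is a pandas Timestamp, so strftime works on it directly.
first_date = df["timestamp"].iloc[0].strftime("%Y%m%d")

print(is_datetime, first_date)
```

    If the check reports False, the column was probably left as plain strings, and date_format would then have no effect on it when writing.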
    

    You can rename the column at this stage if you want. Passing date_format='%Y%m%d' to to_csv then writes only the date portion to the CSV. Reloading the file shows what was saved:

    In [19]:
    
    df.rename(columns={'timestamp':'date'},inplace=True)
    df.to_csv(r'c:\data\date.csv', date_format='%Y%m%d')
    df1 = pd.read_csv(r'C:\data\date.csv', index_col=[0])
    df1
    Out[19]:
       minit      date    open    high     low   close
    0      0  20090223  1.2708  1.2708  1.2706  1.2706
    1      1  20090223  1.2708  1.2708  1.2705  1.2706
    2      2  20090223  1.2706  1.2707  1.2702  1.2702
    3      3  20090223  1.2704  1.2706  1.2700  1.2700
    4      4  20090223  1.2701  1.2706  1.2698  1.2703
    5      5  20090223  1.2703  1.2703  1.2700  1.2702
    6      6  20090223  1.2701  1.2701  1.2696  1.2697
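
    Note that to_csv also writes the DataFrame index by default, which is why the reload above uses index_col=[0]. To get a file laid out like the requested output sample, passing index=False should drop that extra column. A sketch of the whole round trip, assuming in-memory buffers in place of the real files (src, out and lines are illustrative names):

```python
import io

import pandas as pd

# Stand-in for the input file; swap in a real path when running on disk.
src = io.StringIO(
    "minit,timestamp,open,high,low,close\n"
    "0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706\n"
    "1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706\n"
)

df = pd.read_csv(src, parse_dates=["timestamp"])
df = df.rename(columns={"timestamp": "date"})

# Stand-in for the output file. index=False omits the row index column,
# date_format controls how datetime values are rendered.
out = io.StringIO()
df.to_csv(out, date_format="%Y%m%d", index=False)

lines = out.getvalue().splitlines()
print(lines[0])  # header row
print(lines[1])  # first data row
```

    With around 210000 rows this should still be a single read and a single write, so no explicit loop over the timestamp column is needed.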