Search code examples
vbaexceldatetimelarge-data

How to get the closest date after every half an hour


I have a very large data set which looks like this in

    Column A
       Date
2016-02-29 15:59:59.674
2016-02-29 15:59:59.695
2016-02-29 15:59:59.716
2016-02-29 15:59:59.752
2016-02-29 15:59:59.804
2016-02-29 15:59:59.869
2016-02-29 15:59:59.888
2016-02-29 15:59:59.941
2016-02-29 16:00:00.081 <-- get closest date since .081 < .941
2016-02-29 16:00:00.168
2016-02-29 16:00:00.189
2016-02-29 16:00:00.198
2016-02-29 16:00:00.247
2016-02-29 16:00:00.311
2016-02-29 16:00:00.345
2016-02-29 16:00:00.357

and for the other half an hour

2016-02-29 16:29:58.628
2016-02-29 16:29:58.639
2016-02-29 16:29:58.689
2016-02-29 16:29:58.706
2016-02-29 16:29:58.761
2016-02-29 16:29:58.865
2016-02-29 16:29:59.142
2016-02-29 16:29:59.542
2016-02-29 16:29:59.578
2016-02-29 16:30:00.171 <-- Get this date since .171 < .578
2016-02-29 16:30:00.209
2016-02-29 16:30:00.217
2016-02-29 16:30:00.245
2016-02-29 16:30:00.254
2016-02-29 16:30:00.347
2016-02-29 16:30:00.422
2016-02-29 16:30:00.457
2016-02-29 16:30:00.491
2016-02-29 16:30:00.555
2016-02-29 16:30:00.557
2016-02-29 16:30:00.645

Now total rows in the data set is about 5468389 which is very large for excel to import everything in one column so I am trying to process data in parts.

Is there any other approach to this ? By which I can process all the data ? I tried to read and write to text directly but whenever I tried to read it as date it gave me a Type Mismatch error because of the format. For the same reason I didn't go with python for this problem also because I am not proficient with python as well so I thought of doing this in Excel VBA.

Also I am not quite sure with this logic so I need some help there.

Option Explicit

Sub Get_Closest_Dates()

Application.ScreenUpdating = False

Dim WI As Worksheet, WO As Worksheet
Dim i As Long, ct As Long
Dim num1 As Integer, num2 As Integer, num3 As Integer
Dim df1, df2


Set WI = Sheet1 'INPUT SHEET
Set WO = Sheet2 'OUTPUT SHEET

WI.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"
WO.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"

WI.Range("B1") = "HOUR"
WI.Range("C1") = "MINUTE"

With WI

    .Range("B2").Formula = "=HOUR(A2)"
    .Range("B2:B" & Rows.Count).FillDown

    .Range("C2").Formula = "=MINUTE(A2)"
    .Range("C2:C" & Rows.Count).FillDown

ct = WO.Range("A" & Rows.Count).End(xlUp).Row + 1

For i = 2 To 10000

    num1 = .Range("C" & i).Value    'get Minutes
    num2 = .Range("C" & i + 1).Value

    If (num1 = 29 And num2 = 30) Then

        df1 = 0.5 - TimeValue(.Range("A" & i))
        df2 = TimeValue(.Range("A" & i + 1)) - 0.5

        If df1 < df2 Then
            WO.Range("A" & ct) = .Range("A" & i)
            ct = ct + 1
        Else
            WO.Range("A" & ct) = .Range("A" & i + 1)
            ct = ct + 1
        End If

    End If


    If (num1 = 59 And num2 = 0) Then
        df1 = 1 - TimeValue(.Range("A" & i))
        df2 = TimeValue(.Range("A" & i + 1)) - 1

        If df1 < df2 Then
            WO.Range("A" & ct) = .Range("A" & i)
            ct = ct + 1
        Else
            WO.Range("A" & ct) = .Range("A" & i + 1)
            ct = ct + 1
        End If
    End If

Next i

End With

Application.ScreenUpdating = True
MsgBox "Process Completed"

End Sub

Also I am not sure how I can get the millisecond part from the date that would avoid calculating the difference of the two dates

like 15:59:59.674 how can I get 674 from the time ?


Solution

  • Seems like your first problem is getting the data into Excel. Understanding that Excel may not be the best program for processing such large amounts of data (a DB program such as Access might be better), you need to either split the data amongst multiple columns or worksheets; or take a sample of the data.

    You have opted to take a sample, so I would do the sampling and testing as you read in the data.

    You also have to deal the an Excel/VBA limitation in processing date/time stamps that include milliseconds.

    But for the purpose of testing the data, there is no need to be concerned with milliseconds. So long as your data is in ascending order, then the first line that has a date/time stamp that is at or above your 30 minute increment will be the earliest one.

    The code below should read only the lines of your huge file that meet that criteria. Please read the comments for extra information.

    The lines are collected into a collection; and then a results array is declared, filled, and the results written to a worksheet.

    If each line consists of multiple fields, and not just the single line you show, then, at the time of writing the results, you would declare the results array to hold all the columns, fill it at that time.

    Using the Collection / Array / write to the worksheet sequence will be much faster than writing each line, one at a time, to the worksheet as you process it.

    There are methods to speed up the code, and also methods to deal with possible "out of memory" errors, but that depends on your real data and how things go with this simple code.

    So far as converting the date/time stamps, which we need, for now, to have Excel interpret as strings, into "real" date/times, that depends on what you want to do with the subsequent data.

    ==========================================

    Option Explicit
    'Set Reference to Microsoft Scripting Runtime
    Sub GetBigData()
        Dim FSO As FileSystemObject
        Dim TS As TextStream
        Dim vFileName As Variant
        Dim sLine As String
        Dim dtLineTime As Date
        Dim dtNextTime As Date
        Dim colLines As Collection
    
    vFileName = Application.GetOpenFilename("Text Files(*.txt), *.txt")
    If vFileName = False Then Exit Sub
    
    Set FSO = New FileSystemObject
    Set TS = FSO.OpenTextFile(vFileName, ForReading, False, TristateFalse)
    Set colLines = New Collection
    
    With TS
        'Assumes date/time stamps are contiguous
        'skip any header lines
        Do
            sLine = .ReadLine
        Loop Until InStr(sLine, ".") > 0
    
    'Compute first "NextTime"
    '  note that it might be the first entry
    '  comment line 3 below if want first entry
    '  but would need to add logic if using other time increments
    dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
    dtNextTime = Int(dtLineTime) + TimeSerial(Hour(dtLineTime), Int(Minute(dtLineTime) / 30) * 30, 0)
    If Not (Minute(dtLineTime) = 30 Or Minute(dtLineTime) = 60) Then dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
    
    Do
        'Due to IEEE rounding problems, need to test equality as a very small value
        'Could use a value less than 1 second = 1/86400 or smaller
        If Abs(dtLineTime - dtNextTime) < 0.00000001 Or _
            dtLineTime > dtNextTime Then
                colLines.Add sLine
                dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
        End If
        If Not .AtEndOfStream Then
            sLine = .ReadLine
            dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
        End If
    Loop Until .AtEndOfStream
    
    .Close
    End With
    
    'Write the collection to the worksheet
    Dim V As Variant
    Dim wsResults As Worksheet, rResults As Range
    Dim I As Long
    
    Set wsResults = Worksheets("sheet1")
    Set rResults = wsResults.Cells(1, 1)
    
    ReDim V(1 To colLines.Count, 1 To 1)
    Set rResults = rResults.Resize(UBound(V, 1), UBound(V, 2))
    
    For I = 1 To UBound(V, 1)
         V(I, 1) = CStr(colLines(I))
    Next I
    
    With rResults
        .EntireColumn.Clear
        .NumberFormat = "@"
        .Value = V
        .EntireColumn.AutoFit
    End With
    
    End Sub
    

    ==========================================

    EDIT Time stamp conversion function added. This could be implemented at the point where the data is copied from the collection object to the variant array. EG:

    V(I, 1) = ConvertTimeStamp(colLines(I))
    

    Since the value received is a Double data type, you'll need to also format appropriately that column on the worksheet, instead of having it as Text:

    .NumberFormat = "yyyy-mm-dd hh:mm:ss.000"
    

    We have to return the value as a Double since the VBA Date type data does not support milliseconds.

    ==============================

    Private Function ConvertTimeStamp(sTmStmp As String) As Double
        Dim dtPart As Date
        Dim dMS As Double 'milliseconds
        Dim V As Variant
    
    'Convert the date and time
    V = Split(sTmStmp, ".")
    dtPart = CDate(V(0))
    dMS = V(1)
    
    ConvertTimeStamp = dtPart + dMS / 86400 / 1000
    
    End Function
    

    ==============================