I have a very large data set which looks like this in
Column A
Date
2016-02-29 15:59:59.674
2016-02-29 15:59:59.695
2016-02-29 15:59:59.716
2016-02-29 15:59:59.752
2016-02-29 15:59:59.804
2016-02-29 15:59:59.869
2016-02-29 15:59:59.888
2016-02-29 15:59:59.941
2016-02-29 16:00:00.081 <-- get closest date since .081 < .941
2016-02-29 16:00:00.168
2016-02-29 16:00:00.189
2016-02-29 16:00:00.198
2016-02-29 16:00:00.247
2016-02-29 16:00:00.311
2016-02-29 16:00:00.345
2016-02-29 16:00:00.357
and for the other half an hour
2016-02-29 16:29:58.628
2016-02-29 16:29:58.639
2016-02-29 16:29:58.689
2016-02-29 16:29:58.706
2016-02-29 16:29:58.761
2016-02-29 16:29:58.865
2016-02-29 16:29:59.142
2016-02-29 16:29:59.542
2016-02-29 16:29:59.578
2016-02-29 16:30:00.171 <-- Get this date since .171 < .578
2016-02-29 16:30:00.209
2016-02-29 16:30:00.217
2016-02-29 16:30:00.245
2016-02-29 16:30:00.254
2016-02-29 16:30:00.347
2016-02-29 16:30:00.422
2016-02-29 16:30:00.457
2016-02-29 16:30:00.491
2016-02-29 16:30:00.555
2016-02-29 16:30:00.557
2016-02-29 16:30:00.645
Now total rows in the data set is about 5468389 which is very large for excel to import everything in one column so I am trying to process data in parts.
Is there any other approach to this ? By which I can process all the data ?
I tried to read and write to text directly but whenever I tried to read it as date it gave me a Type Mismatch
error because of the format. For the same reason I didn't go with python for this problem also because I am not proficient with python as well so I thought of doing this in Excel VBA.
Also I am not quite sure with this logic so I need some help there.
Option Explicit
Sub Get_Closest_Dates()
Application.ScreenUpdating = False
Dim WI As Worksheet, WO As Worksheet
Dim i As Long, ct As Long
Dim num1 As Integer, num2 As Integer, num3 As Integer
Dim df1, df2
Set WI = Sheet1 'INPUT SHEET
Set WO = Sheet2 'OUTPUT SHEET
WI.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"
WO.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS"
WI.Range("B1") = "HOUR"
WI.Range("C1") = "MINUTE"
With WI
.Range("B2").Formula = "=HOUR(A2)"
.Range("B2:B" & Rows.Count).FillDown
.Range("C2").Formula = "=MINUTE(A2)"
.Range("C2:C" & Rows.Count).FillDown
ct = WO.Range("A" & Rows.Count).End(xlUp).Row + 1
For i = 2 To 10000
num1 = .Range("C" & i).Value 'get Minutes
num2 = .Range("C" & i + 1).Value
If (num1 = 29 And num2 = 30) Then
df1 = 0.5 - TimeValue(.Range("A" & i))
df2 = TimeValue(.Range("A" & i + 1)) - 0.5
If df1 < df2 Then
WO.Range("A" & ct) = .Range("A" & i)
ct = ct + 1
Else
WO.Range("A" & ct) = .Range("A" & i + 1)
ct = ct + 1
End If
End If
If (num1 = 59 And num2 = 0) Then
df1 = 1 - TimeValue(.Range("A" & i))
df2 = TimeValue(.Range("A" & i + 1)) - 1
If df1 < df2 Then
WO.Range("A" & ct) = .Range("A" & i)
ct = ct + 1
Else
WO.Range("A" & ct) = .Range("A" & i + 1)
ct = ct + 1
End If
End If
Next i
End With
Application.ScreenUpdating = True
MsgBox "Process Completed"
End Sub
Also I am not sure how I can get the millisecond part from the date that would avoid calculating the difference of the two dates
like 15:59:59.674 how can I get 674
from the time ?
Seems like your first problem is getting the data into Excel. Understanding that Excel may not be the best program for processing such large amounts of data (a DB program such as Access might be better), you need to either split the data amongst multiple columns or worksheets; or take a sample of the data.
You have opted to take a sample, so I would do the sampling and testing as you read in the data.
You also have to deal the an Excel/VBA limitation in processing date/time stamps that include milliseconds.
But for the purpose of testing the data, there is no need to be concerned with milliseconds. So long as your data is in ascending order, then the first line that has a date/time stamp that is at or above your 30 minute increment will be the earliest one.
The code below should read only the lines of your huge file that meet that criteria. Please read the comments for extra information.
The lines are collected into a collection; and then a results array is declared, filled, and the results written to a worksheet.
If each line consists of multiple fields, and not just the single line you show, then, at the time of writing the results, you would declare the results array to hold all the columns, fill it at that time.
Using the Collection / Array / write to the worksheet sequence will be much faster than writing each line, one at a time, to the worksheet as you process it.
There are methods to speed up the code, and also methods to deal with possible "out of memory" errors, but that depends on your real data and how things go with this simple code.
So far as converting the date/time stamps, which we need, for now, to have Excel interpret as strings, into "real" date/times, that depends on what you want to do with the subsequent data.
==========================================
Option Explicit
'Set Reference to Microsoft Scripting Runtime
Sub GetBigData()
Dim FSO As FileSystemObject
Dim TS As TextStream
Dim vFileName As Variant
Dim sLine As String
Dim dtLineTime As Date
Dim dtNextTime As Date
Dim colLines As Collection
vFileName = Application.GetOpenFilename("Text Files(*.txt), *.txt")
If vFileName = False Then Exit Sub
Set FSO = New FileSystemObject
Set TS = FSO.OpenTextFile(vFileName, ForReading, False, TristateFalse)
Set colLines = New Collection
With TS
'Assumes date/time stamps are contiguous
'skip any header lines
Do
sLine = .ReadLine
Loop Until InStr(sLine, ".") > 0
'Compute first "NextTime"
' note that it might be the first entry
' comment line 3 below if want first entry
' but would need to add logic if using other time increments
dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
dtNextTime = Int(dtLineTime) + TimeSerial(Hour(dtLineTime), Int(Minute(dtLineTime) / 30) * 30, 0)
If Not (Minute(dtLineTime) = 30 Or Minute(dtLineTime) = 60) Then dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
Do
'Due to IEEE rounding problems, need to test equality as a very small value
'Could use a value less than 1 second = 1/86400 or smaller
If Abs(dtLineTime - dtNextTime) < 0.00000001 Or _
dtLineTime > dtNextTime Then
colLines.Add sLine
dtNextTime = dtNextTime + TimeSerial(0, 30, 0)
End If
If Not .AtEndOfStream Then
sLine = .ReadLine
dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1))
End If
Loop Until .AtEndOfStream
.Close
End With
'Write the collection to the worksheet
Dim V As Variant
Dim wsResults As Worksheet, rResults As Range
Dim I As Long
Set wsResults = Worksheets("sheet1")
Set rResults = wsResults.Cells(1, 1)
ReDim V(1 To colLines.Count, 1 To 1)
Set rResults = rResults.Resize(UBound(V, 1), UBound(V, 2))
For I = 1 To UBound(V, 1)
V(I, 1) = CStr(colLines(I))
Next I
With rResults
.EntireColumn.Clear
.NumberFormat = "@"
.Value = V
.EntireColumn.AutoFit
End With
End Sub
==========================================
EDIT Time stamp conversion function added. This could be implemented at the point where the data is copied from the collection object to the variant array. EG:
V(I, 1) = ConvertTimeStamp(colLines(I))
Since the value received is a Double data type, you'll need to also format appropriately that column on the worksheet, instead of having it as Text:
.NumberFormat = "yyyy-mm-dd hh:mm:ss.000"
We have to return the value as a Double since the VBA Date type data does not support milliseconds.
==============================
Private Function ConvertTimeStamp(sTmStmp As String) As Double
Dim dtPart As Date
Dim dMS As Double 'milliseconds
Dim V As Variant
'Convert the date and time
V = Split(sTmStmp, ".")
dtPart = CDate(V(0))
dMS = V(1)
ConvertTimeStamp = dtPart + dMS / 86400 / 1000
End Function
==============================