I'm currently using a Python script to collect signal data from an outside source (a digitiser) for about 10 seconds. This data is recorded as an array and is then saved as a text file on the PC's hard drive using numpy.savetxt. This is an excerpt of the script as it currently stands:
#vs Pia
import visa
import time
import re
import datetime
from PyDAQmx import *
from ctypes import *
import nidaqmx
import numpy as np
##############DATA COLLECTION WITH DIGITISER###############
# initialize variables
N = 2**14
schrate = 1600 #samples per second per channel
taskHandle = TaskHandle(0)
read = int32()
data = np.zeros((N,), dtype=np.float64)
DAQmxCreateTask("", byref(taskHandle))
DAQmxCreateAIVoltageChan(taskHandle, "Dev1/ai4:5", "", DAQmx_Val_RSE, -10.0,
10.0, DAQmx_Val_Volts, None)
DAQmxCfgSampClkTiming(taskHandle, "", schrate, DAQmx_Val_Rising,
DAQmx_Val_FiniteSamps, N)
# begin data collection
DAQmxStartTask(taskHandle)
DAQmxReadAnalogF64(taskHandle, -1, 30, DAQmx_Val_GroupByScanNumber, data, N,
byref(read), None)
DAQmxStopTask(taskHandle)
DAQmxClearTask(taskHandle)
#############SAVING DATA##############
dataX = data[::2]
time = np.linspace(0, (N // 2) / schrate, N // 2)  # time axis for one channel
filename = "Xquad"
print("Saving X-quadrature to file: "+filename)
np.savetxt(filename, dataX[None,:], delimiter=',',newline='\n')
filename = "recorded_time"
print("Saving recorded time to file: "+filename)
np.savetxt(filename, time[None,:], delimiter=',', newline='\n')
The first part of the code simply extracts the data from the digitiser and records it in an array named "data". The second part saves the data I actually need, "dataX", as well as the time axis the data was recorded over, each to a separate text file.
Running this script to collect data for 10 seconds works fine. However, the long-term goal is to collect data continuously for long periods of time (up to months at a time). Unfortunately, the finite amount of RAM in the PC means the script can't simply be run indefinitely, as performance and memory issues would eventually become a factor.
The only solution I have come up with so far is to periodically save the data array to a text file on the hard drive and check whether that file has reached a specified size. Once it has, new incoming data would be written to a new text file, and the whole process would repeat until I terminate the script (roughly sketched below). However, this solution is less than ideal, as each save takes time (especially once the text file gets very large), and these 'hiccups' in time could create inconsistencies in the timing of the data collection.
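Roughly what I have in mind is something like this (the size threshold, file names and the get_chunk() placeholder are purely illustrative, not real code from my setup):
import os
import numpy as np

MAX_BYTES = 100 * 1024 * 1024  # size limit per file (placeholder value)

def get_chunk():
    # placeholder for one finite read from the digitiser (returns a 1-D array)
    return np.random.rand(2**14)

file_index = 0
filename = "Xquad_%d" % file_index
while True:  # runs until I terminate the script
    chunk = get_chunk()
    with open(filename, "ab") as f:
        np.savetxt(f, chunk[None, :], delimiter=',')
    # once the current file is big enough, switch to a new one
    if os.path.getsize(filename) >= MAX_BYTES:
        file_index += 1
        filename = "Xquad_%d" % file_index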
Has anyone had any experience collecting data for indefinite periods of time using Python? Are there any better ways to keep large amounts of data from filling up the RAM?
In general, numpy already uses close to the minimum amount of memory for an array of N numbers of a particular type (which must be known beforehand), so the script is already doing the right thing on that front.
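For a rough sense of scale, using the sample rate and channel count from the script:
import numpy as np
# one hour of two-channel data at 1600 samples per second per channel, stored as float64
samples_per_hour = 1600 * 2 * 3600
print(np.zeros(samples_per_hour, dtype=np.float64).nbytes)  # 92160000 bytes, about 92 MB
That is only about 92 MB per hour, but over a month it grows to the order of 65 GB, so the data cannot simply be left to accumulate in RAM.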
If that is still not sufficient for the application's memory requirements, consider changing the architecture so that the collection device does not store any data itself.
Instead, it only collects data and immediately sends it over the network to an external service for storage and presentation.
There are standard services for this, such as Kafka, and, depending on the use case, time-series databases like InfluxDB, which can be used as data sources for visualisation dashboards like Grafana.
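As a rough illustration rather than a drop-in solution, pushing each chunk to an InfluxDB 1.x instance with the influxdb client library could look like the sketch below; the host, database name and measurement name are placeholders, and at 1600 samples per second per channel the writes would likely need batching:
from influxdb import InfluxDBClient  # pip install influxdb (InfluxDB 1.x client)
import datetime

client = InfluxDBClient(host='localhost', port=8086, database='daq')  # placeholder connection details

def push_chunk(samples, t0, dt):
    # build one point per sample and send the whole chunk in a single request
    points = [
        {
            "measurement": "x_quadrature",  # placeholder measurement name
            "time": (t0 + datetime.timedelta(seconds=i * dt)).isoformat() + "Z",
            "fields": {"value": float(v)},
        }
        for i, v in enumerate(samples)
    ]
    client.write_points(points)
Each chunk read from the digitiser is sent off as soon as it is available, so nothing accumulates on the acquisition PC, and Grafana can read directly from the database.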
To keep the current architecture, it would be best to run the collection in a separate thread so that acquisition can continue while the already collected data is being saved to disk. Numpy is thread safe and releases the GIL, so this is not an issue.
In that case the PC only needs to buffer enough data to cover one poll interval plus the time it takes to save: while the old data is being written to disk, polling continues and stores the incoming data in a fresh buffer, and each chunk can be dropped from memory as soon as it has been saved.
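A minimal sketch of that pattern using only the standard library's threading and queue modules plus numpy; read_chunk() stands in for the finite DAQmx acquisition from the question and the file name is illustrative:
import queue
import threading
import numpy as np

chunks = queue.Queue()  # holds chunks that are waiting to be written to disk

def read_chunk():
    # placeholder for one finite DAQmx acquisition (returns a 1-D array)
    return np.random.rand(2**14)

def collector():
    # acquisition thread: keeps reading and never waits on disk I/O
    while True:
        chunks.put(read_chunk())

def writer():
    # writer: drains the queue and appends each chunk to disk
    with open("Xquad", "ab") as f:
        while True:
            chunk = chunks.get()
            np.savetxt(f, chunk[None, :], delimiter=',')

threading.Thread(target=collector, daemon=True).start()
writer()  # runs in the main thread until the script is terminated
The queue is exactly the temporary buffer described above: it only has to hold whatever arrives while the previous chunk is being written, and each chunk leaves memory as soon as it has been saved.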