I have a text file containing values that need to be extracted, and each value has a variable length. The length of each value is stored in a `List<int>` (this can change if there is a more efficient way).
The Problem: What is the fastest way to get the variable-length substrings into a `DataTable`, given a `List<int>` of lengths?
Example text file contents:
Field1ValueField2ValueIsLongerField3Field4IsExtremelyLongField5IsProbablyTheLongestFieldOfThemAll
A1201605172B160349150816431572C16584D31601346427946121346E674306102966595346438476174959205395664
Example `List<int>`:
11, 19, 6, 21, 40
Example output `DataTable`:
| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 |
|---|---|---|---|---|
| Field1Value | Field2ValueIsLonger | Field3 | Field4IsExtremelyLong | Field5IsProbablyTheLongestFieldOfThemAll |
| A1201605172 | B160349150816431572 | C16584 | D31601346427946121346 | E674306102966595346438476174959205395664 |
There is no pattern to the field values (they could be any alphanumeric string), so the fields can only be recovered via the length list.
My approach was as follows:
```csharp
List<int> lengths = new() { 11, 19, 6, 21, 40 };
DataTable dataTable = new();

// Add a column for each field
foreach (int _ in lengths)
{
    dataTable.Columns.Add();
}

// Read the file and extract the fields
using (StreamReader streamReader = new(fileName))
{
    string line; // temp
    while ((line = streamReader.ReadLine()) != null)
    {
        // Create a new row for each line in the text file
        DataRow dataRow = dataTable.NewRow();
        // Starting index of the current substring
        int start = 0;
        // Enumerate the field lengths by index
        // (lengths.IndexOf(i) would break on duplicate lengths and is O(n) per lookup)
        for (int i = 0; i < lengths.Count; i++)
        {
            // Set the value for that cell
            dataRow[i] = line.Substring(start, lengths[i]);
            // Advance past the current field
            start += lengths[i];
        }
        // Add the row to the DataTable
        dataTable.Rows.Add(dataRow);
    }
}
```
Is there a more efficient (time and/or memory) way of completing this task?
Are you producing that input string or that length list yourself?

If not: doing both the extracting and the parsing in the same loop lowers the extracting throughput. You should offload the parsing to other threads, perhaps handing off N fields at a time to tolerate the multi-threading synchronization latency.
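That split could be sketched like this: one producer keeps reading lines while a pool of workers slices fields. This is a hedged sketch, not a drop-in replacement; the `Pipeline`/`Run` names, the bounded capacity of 1024, and the worker count are my assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class Pipeline
{
    // Illustrative sketch: reader thread hands lines to parsing workers.
    public static List<string[]> Run(IEnumerable<string> lines, IReadOnlyList<int> lengths)
    {
        // Bounded queue so a fast reader cannot outrun the parsers unboundedly.
        var queue = new BlockingCollection<string>(boundedCapacity: 1024);
        var rows = new ConcurrentBag<string[]>();

        // Parsing workers: slice each line into fields by the length list.
        var workers = Task.WhenAll(
            Enumerable.Range(0, Environment.ProcessorCount).Select(_ => Task.Run(() =>
            {
                foreach (string line in queue.GetConsumingEnumerable())
                {
                    var fields = new string[lengths.Count];
                    int start = 0;
                    for (int i = 0; i < lengths.Count; i++)
                    {
                        fields[i] = line.Substring(start, lengths[i]);
                        start += lengths[i];
                    }
                    rows.Add(fields);
                }
            })));

        // Producer: keep the reader hot; hand lines off instead of parsing inline.
        foreach (string line in lines) queue.Add(line);
        queue.CompleteAdding();
        workers.Wait();
        return new List<string[]>(rows);
    }
}
```

Note that `ConcurrentBag` does not preserve row order; if the `DataTable` must match the file order, carry a line index with each handed-off line and sort (or write into a pre-sized array) before loading the rows.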
If single-threaded extracting is still too slow compared to the multi-threaded parsing, you can try to vectorize the extracting: launch 128 character samplers at once, check whether each one finds a prefix code, and do a reduction across them to find the first prefix (in case several of them find one).
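Roughly, the sampler-plus-reduction idea looks like this, using `Vector<T>` from `System.Numerics` as a portable stand-in for the wide samplers. This is an illustrative sketch assuming ASCII input viewed as bytes and a single marker byte standing in for the prefix-code check; `SimdScan.FirstIndexOf` is a name I made up.

```csharp
using System;
using System.Numerics;

static class SimdScan
{
    // Returns the index of the first occurrence of `marker` in `data`, or -1.
    public static int FirstIndexOf(ReadOnlySpan<byte> data, byte marker)
    {
        int i = 0;
        int width = Vector<byte>.Count;            // e.g. 16 or 32 lanes
        var needle = new Vector<byte>(marker);
        for (; i + width <= data.Length; i += width)
        {
            var chunk = new Vector<byte>(data.Slice(i, width));
            var eq = Vector.Equals(chunk, needle); // lane-wise compare, all samplers at once
            if (eq != Vector<byte>.Zero)
            {
                // Reduction: several lanes may match; take the first one.
                for (int lane = 0; lane < width; lane++)
                    if (eq[lane] != 0) return i + lane;
            }
        }
        for (; i < data.Length; i++)               // scalar tail
            if (data[i] == marker) return i;
        return -1;
    }
}
```

A real prefix-code check would replace the single-byte compare with whatever test identifies the start of a code in your encoding, but the launch-wide-then-reduce shape stays the same.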