
Fastest way to get variable substring lengths from file (C#)


I have a text file containing values that need to be extracted, and each value has a different length. The length of each field is stored in a List<int>; this can change if there is a more efficient structure.

The Problem: What is the fastest way to get the variable length substrings into a DataTable given a List<int> of lengths?

Example text file contents:

Field1ValueField2ValueIsLongerField3Field4IsExtremelyLongField5IsProbablyTheLongestFieldOfThemAll
A1201605172B160349150816431572C16584D31601346427946121346E674306102966595346438476174959205395664

Example List<int>:

11, 19, 6, 21, 40

Example output DataTable:

Field 1      Field 2              Field 3  Field 4                Field 5
Field1Value  Field2ValueIsLonger  Field3   Field4IsExtremelyLong  Field5IsProbablyTheLongestFieldOfThemAll
A1201605172  B160349150816431572  C16584   D31601346427946121346  E674306102966595346438476174959205395664

There is no pattern to the field values: each can be any alphanumeric string, so the only way to split a line into fields is by using the list of lengths.

My approach was as follows:

using System.Collections.Generic;
using System.Data;
using System.IO;

List<int> lengths = new() { 11, 19, 6, 21, 40 };

DataTable dataTable = new();

//Add Columns for each field
foreach (int i in lengths)
{
    dataTable.Columns.Add();
}

//Read file and get fields
using (StreamReader streamReader = new(fileName))
{
    string line; //temp
    while ((line = streamReader.ReadLine()) != null)
    {
        //Create new row each time we see a new line in the text file
        DataRow dataRow = dataTable.NewRow();

        //Temp counter for starting index of substring
        int tempCounter = 0;

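        //Loop over the lengths by index so that duplicate lengths map to the right column
        //(lengths.IndexOf(i) returns the first match and rescans the list every time)
        for (int col = 0; col < lengths.Count; col++)
        {
            //Set the value for that cell
            dataRow[col] = line.Substring(tempCounter, lengths[col]);

            //Add the length of the current field
            tempCounter += lengths[col];
        }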

        //Add Row to DataTable
        dataTable.Rows.Add(dataRow);
    }
}

Is there a more efficient (time and/or memory) way of completing this task?


Solution

  • Are you producing that input string or that length array?

    If yes:

    • save the index of the starting character of every Nth field (if you already have the length array, you can build a start array from it too)
    • then, when decoding, use multiple threads to parse from several of those index points at once and write the results into a shared target list or array (IMO an array should be faster, since you know the total number of fields)
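The "start array plus multiple threads" idea can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: `FixedWidthParser`, `BuildStarts` and `ParseLines` are hypothetical names, it assumes the whole file fits in memory and that every line is at least as long as the sum of the lengths, and it parallelizes per line rather than per Nth field.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class FixedWidthParser
{
    // Prefix sums of the lengths: starts[i] is where field i begins in a line.
    public static int[] BuildStarts(IReadOnlyList<int> lengths)
    {
        var starts = new int[lengths.Count];
        int offset = 0;
        for (int i = 0; i < lengths.Count; i++)
        {
            starts[i] = offset;
            offset += lengths[i];
        }
        return starts;
    }

    // Slices every line into fields in parallel; rows[r][c] is field c of line r.
    public static string[][] ParseLines(string[] lines, IReadOnlyList<int> lengths)
    {
        int[] starts = BuildStarts(lengths);
        var rows = new string[lines.Length][];

        // Each iteration writes only to its own slot of rows, so no locking is needed.
        Parallel.For(0, lines.Length, r =>
        {
            var fields = new string[lengths.Count];
            for (int c = 0; c < fields.Length; c++)
            {
                fields[c] = lines[r].Substring(starts[c], lengths[c]);
            }
            rows[r] = fields;
        });
        return rows;
    }
}
```

With the question's variables this could be called as `FixedWidthParser.ParseLines(File.ReadAllLines(fileName), lengths)`, followed by `dataTable.Rows.Add(row)` for each row on a single thread, since DataTable writes are not thread-safe. Whether this beats the plain loop depends on file size; for small files the threading overhead will likely dominate.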

    If no:

    • push the start of every field you encounter into a queue (along with its field index) and jump directly to the next field
    • have other threads pop elements from the queue asynchronously and place them into the list at their index (an array could be better if the total length is known)

    The reason is that when you do both extracting and parsing in the same loop, the extracting throughput drops. So you should offload the parsing work to other threads, perhaps N fields at a time, so that the synchronization latency between threads is amortized.
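One way to sketch this split between extracting and parsing is a producer/consumer queue. Note the substitution: the sketch below queues whole lines rather than individual fields, which keeps the synchronization cost per item lower. `PipelinedParser`, `Parse` and the `workers` parameter are hypothetical names, and there is no error handling for lines that are too short.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

static class PipelinedParser
{
    public static List<string[]> Parse(TextReader reader, IReadOnlyList<int> lengths, int workers = 4)
    {
        // Unbounded queue of (line number, line text) pairs.
        using var queue = new BlockingCollection<(int Index, string Text)>();
        var parsed = new ConcurrentDictionary<int, string[]>();

        // Consumers: take lines off the queue and slice them into fields.
        Task[] consumers = Enumerable.Range(0, workers).Select(_ => Task.Run(() =>
        {
            foreach (var (index, text) in queue.GetConsumingEnumerable())
            {
                var fields = new string[lengths.Count];
                int start = 0;
                for (int c = 0; c < fields.Length; c++)
                {
                    fields[c] = text.Substring(start, lengths[c]);
                    start += lengths[c];
                }
                parsed[index] = fields;
            }
        })).ToArray();

        // Producer: only reads lines and leaves the slicing to the consumers.
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            queue.Add((count++, line));
        }
        queue.CompleteAdding();
        Task.WaitAll(consumers);

        // Restore the original line order.
        var rows = new List<string[]>(count);
        for (int i = 0; i < count; i++)
        {
            rows.Add(parsed[i]);
        }
        return rows;
    }
}
```

Passing a bounded capacity to the `BlockingCollection` constructor would cap memory use if the reader is much faster than the parsers. Batching several lines per queue item is the natural next step if queue overhead shows up in profiling.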

    If single-threaded extracting is too slow compared to multi-threaded parsing, you can try to vectorize the extraction: launch 128 char samplers at once, check whether any of them finds a prefix code, and, if several do, do a reduction across them to find the first prefix.
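This last idea only applies if each field begins with a recognizable marker, which the data in the question does not have. Assuming "prefix code" means a single marker character, a rough sketch is to lean on the span search in .NET, which on recent runtimes is typically SIMD-accelerated, instead of hand-writing the samplers and the reduction. `MarkerScanner`, `FindFieldStarts` and the `'#'` marker are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class MarkerScanner
{
    // Returns the index of every occurrence of marker in text.
    public static List<int> FindFieldStarts(ReadOnlySpan<char> text, char marker)
    {
        var starts = new List<int>();
        int offset = 0;
        while (offset < text.Length)
        {
            // IndexOf on a span returns the first match in the remaining slice.
            int hit = text.Slice(offset).IndexOf(marker);
            if (hit < 0)
            {
                break;
            }
            starts.Add(offset + hit);
            offset += hit + 1;
        }
        return starts;
    }
}
```

For example, `MarkerScanner.FindFieldStarts("#A12#B1603#C16584", '#')` yields the start positions 0, 4 and 10.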