
Loading large matrix from text file into Java arrays


My data is stored as large matrices in text files, each with millions of rows and 4 columns of comma-separated values. (Each column holds a different variable, and each row holds one millisecond's data for all four variables.) The first dozen or so lines contain irrelevant header data. I need to write Java code to load this data into four arrays, one for each column of the text matrix.

The Java code also needs to detect where the header ends, so that the first data row can be split into entries for the 4 arrays. It then needs to iterate through the millions of data rows, splitting each row into four numbers and storing each number in the array for the column it came from.

How can I alter the code below to accomplish this? I want the fastest way to process these millions of rows.

Here is my code:

MainClass2.java

package packages;

public class MainClass2 {
    public static void main(String[] args) {
        readfile2 r = new readfile2();
        r.openFile();
        int x1Count = r.readFile();
        r.populateArray(x1Count);
        r.closeFile();
    }
}

readfile2.java

package packages;

import java.io.*;
import java.util.*;

public class readfile2 {
private Scanner scan1;
private Scanner scan2;

public void openFile(){
    try{
        scan1 = new Scanner(new File("C:\\test\\samedatafile.txt"));
        scan2 = new Scanner(new File("C:\\test\\samedatafile.txt"));
    }
    catch(Exception e){
        System.out.println("could not find file");
    }
}
public int readFile(){
    int scan1Count = 0;
    while(scan1.hasNext()){
        scan1.next();
        scan1Count += 1;
    }
    return scan1Count;
}
public double[] populateArray(int scan1Count){
    double[] outputArray1 = new double[scan1Count];
    double[] outputArray2 = new double[scan1Count];
    double[] outputArray3 = new double[scan1Count];
    double[] outputArray4 = new double[scan1Count];
    int i = 0;
    while(scan2.hasNext()){
        //what code do I write here to:
        //  1.) identify the start of my time series rows after the end of the header rows (e.g. row starts with a number AT LEAST 4 digits in length.)
        //  2.) split each time series row's data into a separate new entry for each of the 4 output arrays
        i++;
    }
    return outputArray1, outputArray2, outputArray3, outputArray4;
}
public void closeFile(){
    scan1.close();
    scan2.close();
}
}

Here are the first 19 lines of a typical data file:

text and numbers on first line
1 msec/sample
3 channels
ECG
Volts
Z_Hamming_0_05_LPF
Ohms
dz/dt
Volts
min,CH2,CH4,CH41,
,3087747,3087747,3087747,
0,-0.0518799,17.0624,0,
1.66667E-05,-0.0509644,17.0624,-0.00288295,
3.33333E-05,-0.0497437,17.0624,-0.00983428,
5E-05,-0.0482178,17.0624,-0.0161573,
6.66667E-05,-0.0466919,17.0624,-0.0204402,
8.33333E-05,-0.0448608,17.0624,-0.0213986,
0.0001,-0.0427246,17.0624,-0.0207532,
0.000116667,-0.0405884,17.0624,-0.0229672,

Edit

I tested Shilaghae's code suggestion. It seems to work. However, the length of all the resulting arrays is the same as x1Count, so that zeros remain in the places where Shilaghae's pattern matching code is not able to place a number. (This is a result of how I wrote the code originally.)

I was having trouble finding the indices where zeros remain, but there seemed to be a lot more zeros besides the ones expected where the header was. When I graphed the derivative of the temp[1] output, I saw a number of sharp spikes where false zeros in temp[1] might be. If I can tell where the zeros in temp[1], temp[2], and temp[3] are, I might be able to modify the pattern matching to better retain all the data.

Also, it would be nice to simply shorten the output array to no longer include the rows where the header was in the input file. However, the tutorials I have found regarding variable length arrays only show oversimplified examples like:

int[] anArray = {100, 200, 300, 400};

The code might run faster if it no longer uses scan1 to produce scan1Count. I do not want to slow the code down by using an inefficient method to produce a variable-length array. And I also do not want to skip data in my time series in the cases where the pattern matching is not able to split the input row into 4 numbers. I would rather keep the in-time-series zeros so that I can find them and use them to debug the pattern matching.

Can these things be done in fast-running code?
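
One common way to avoid the counting pass is to let the array grow as rows arrive and trim it once at the end. The following is a minimal sketch, not a definitive implementation: the class name GrowableDoubleArray and the initial capacity of 1024 are hypothetical choices made for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper: an append-only double array that grows by doubling. */
class GrowableDoubleArray {
    // The initial capacity is an arbitrary assumption; any positive value works.
    private double[] data = new double[1024];
    private int size = 0;

    /** Appends one value, doubling the backing array when it is full. */
    void add(double value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    /** Number of values added so far. */
    int size() {
        return size;
    }

    /** Returns a copy trimmed to exactly the number of values added. */
    double[] toArray() {
        return Arrays.copyOf(data, size);
    }

    public static void main(String[] args) {
        GrowableDoubleArray column = new GrowableDoubleArray();
        for (int i = 0; i < 3000; i++) {
            column.add(i * 0.5);
        }
        double[] result = column.toArray();
        System.out.println(result.length);
    }
}
```

Because the capacity doubles each time, the total copying work stays proportional to the number of values, so one such array per column should cost far less than reading the file a second time. If the count in the header row of the sample file (3087747) turns out to be the number of data rows, it could be used as the initial capacity instead, but that is a guess about the file format.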


Second edit

So

"-{0,1}\\d+.\\d+,"  

repeats four times in the expression:

"-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,"  

Does

"-{0,1}\\d+.\\d+,"  

decompose into the following three statements:

"-{0,1}" means that a minus sign occurs zero or one times, while  

"\\d+." means that the minus sign(or lack of minus sign) is followed by several digits of any value followed by a decimal point, so that finally  

"\\d+," means that the decimal point is followed by several digits of any value?  

If so, what about numbers in my data like "1.66667E-05," or "-8.06131E-05," ? I just scanned one of the input files, and (out of 3+ million 4-column rows) it contains 638 numbers that contain E, of which 5 were in the first column, and 633 were in the last column.
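
To make the behavior of the pattern concrete, here is a small sketch that runs it against rows from the sample file, next to a relaxed pattern that also accepts integers and exponents. The names RowPatternDemo, ORIGINAL, NUMBER and RELAXED are hypothetical, and the relaxed pattern is only one possible choice: it assumes every number has at least one digit before any decimal point and no leading plus sign.

```java
import java.util.regex.Pattern;

class RowPatternDemo {
    // The pattern from the answer. The unescaped '.' matches ANY character,
    // not just a decimal point.
    static final Pattern ORIGINAL = Pattern.compile(
        "-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,");

    // Hypothetical alternative: optional minus, digits, optional fraction
    // with an escaped dot, optional exponent such as E-05.
    static final String NUMBER = "-?\\d+(?:\\.\\d+)?(?:[eE][-+]?\\d+)?";

    // Four numbers, each followed by a comma.
    static final Pattern RELAXED = Pattern.compile("(?:" + NUMBER + ",){4}");

    public static void main(String[] args) {
        String[] rows = {
            "0,-0.0518799,17.0624,0,",
            "1.66667E-05,-0.0509644,17.0624,-0.00288295,",
            "0.0001,-0.0427246,17.0624,-0.0207532,",
            "min,CH2,CH4,CH41,",
            ",3087747,3087747,3087747,"
        };
        for (String row : rows) {
            // Prints: <matches original> <matches relaxed> <row>
            System.out.println(ORIGINAL.matcher(row).matches() + " "
                + RELAXED.matcher(row).matches() + " " + row);
        }
    }
}
```

With these inputs the original pattern rejects the first two data rows (one contains the bare value 0, the other an exponent) and accepts only the third, while the relaxed pattern accepts all three data rows. Both patterns reject the two header rows.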


Solution

  • You could read the file line by line and check each line against a regular expression (http://www.vogella.de/articles/JavaRegularExpressions/article.html) to see whether it consists of exactly 4 numbers, each followed by a comma. If it does, you can split the line with String.split and fill the 4 arrays; otherwise, you move on to the next line.

            // Requires: import java.io.*; import java.util.regex.Matcher; import java.util.regex.Pattern;
            public double[][] populateArray(int scan1Count){
                double[] outputArray1 = new double[scan1Count];
                double[] outputArray2 = new double[scan1Count];
                double[] outputArray3 = new double[scan1Count];
                double[] outputArray4 = new double[scan1Count];

                // Compile the pattern once, outside the loop. Note that the unescaped '.'
                // matches any character, and that values such as "0" or "1.66667E-05"
                // do not match, so rows containing them are left as zeros.
                Pattern pattern = Pattern.compile("-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,");

                // Read the file line by line; try-with-resources closes the reader.
                try (BufferedReader br = new BufferedReader(new FileReader("samedatafile.txt"))) {
                    String strLine;
                    int i = 0;
                    while ((strLine = br.readLine()) != null) {
                        Matcher matcher = pattern.matcher(strLine);
                        if (matcher.matches()) {
                            String[] split = strLine.split(",");
                            outputArray1[i] = Double.parseDouble(split[0]);
                            outputArray2[i] = Double.parseDouble(split[1]);
                            outputArray3[i] = Double.parseDouble(split[2]);
                            outputArray4[i] = Double.parseDouble(split[3]);
                        }
                        // The index advances on every line, so header lines and
                        // non-matching rows leave zeros in the arrays.
                        i++;
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }

                // Return the four columns together as one array of arrays.
                double[][] temp = new double[4][];
                temp[0] = outputArray1;
                temp[1] = outputArray2;
                temp[2] = outputArray3;
                temp[3] = outputArray4;
                return temp;
            }
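
As an alternative to pattern matching, the header can be detected by simply trying to parse each line. The sketch below is one possible approach under stated assumptions, not a definitive implementation: it assumes the header ends at the first line whose four fields all parse as numbers, and it stores NaN (rather than zero) for any later field that fails to parse, so that bad values stay visible inside the time series. The names SinglePassLoader, parseField, parseRow, isComplete and load are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SinglePassLoader {
    /** Parses one field; returns NaN if it is empty or not a number. */
    static double parseField(String field) {
        try {
            return Double.parseDouble(field.trim());
        } catch (NumberFormatException e) {
            return Double.NaN;
        }
    }

    /** Returns four values for a four-field line, or null for any other line. */
    static double[] parseRow(String line) {
        String[] fields = line.split(","); // trailing empty field is dropped
        if (fields.length != 4) {
            return null;
        }
        double[] row = new double[4];
        for (int c = 0; c < 4; c++) {
            row[c] = parseField(fields[c]);
        }
        return row;
    }

    /** True if none of the values is NaN. */
    static boolean isComplete(double[] row) {
        for (double v : row) {
            if (Double.isNaN(v)) {
                return false;
            }
        }
        return true;
    }

    /** Reads all lines once and returns the data as columns[column][row]. */
    static double[][] load(BufferedReader reader) {
        List<double[]> rows = new ArrayList<>();
        boolean inData = false; // becomes true at the first fully numeric row
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // ignore blank lines
                }
                double[] row = parseRow(line);
                if (!inData) {
                    if (row == null || !isComplete(row)) {
                        continue; // still in the header
                    }
                    inData = true;
                }
                if (row == null) {
                    // Malformed line inside the data: keep its place with NaNs.
                    row = new double[] {Double.NaN, Double.NaN, Double.NaN, Double.NaN};
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Transpose the row list into four column arrays.
        double[][] columns = new double[4][rows.size()];
        for (int r = 0; r < rows.size(); r++) {
            for (int c = 0; c < 4; c++) {
                columns[c][r] = rows.get(r)[c];
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        String sample = "3 channels\nmin,CH2,CH4,CH41,\n,3087747,3087747,3087747,\n"
            + "0,-0.0518799,17.0624,0,\n1.66667E-05,-0.0509644,17.0624,-0.00288295,\n";
        double[][] columns = load(new BufferedReader(new StringReader(sample)));
        System.out.println(columns[0].length + " data rows loaded");
    }
}
```

Double.parseDouble accepts both plain decimals and exponent notation such as 1.66667E-05, so no special case is needed for those values. The list of row arrays is used only to keep the sketch short; for millions of rows, appending directly to primitive column arrays would use less memory.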