Search code examples
javacsvsortingexport-to-csv

I need to take a CSV file and split it into separate files based on Column Header [JAVA]


I am fairly new to Java and am struggling to read > sort > export a csv. I have a csv with [X, Y, Z, Scalar 1, Scalar 2, Scalar 3, Scalar 4] as headers that need to be separated into 4 csv's. The actual file is thousands of lines long so short example:

[X,Y,Z, Sc1, Sc2, Sc3, Sc4]
[1,0,0,   5,   7,   9,  10]
[0,1,1,   6,   8,   4,   0]
[0,0,1,   3,   3,   8,   2]

I need split the source csv into 4 separate csv's with one scalar value and the x,y,z data.

File 1       | File 2       | File 3       | File 4
----------------------------------------------------------
[Sc1, X,Y,Z] | [Sc2, X,Y,Z] | [Sc3, X,Y,Z] | [Sc4,  X,Y,Z]
[5,   1,0,0] | [7,   1,0,0] | [9,   1,0,0] | [10,   1,0,0]
[6,   0,1,1] | [8,   0,1,1] | [4,   0,1,1] | [ 0,   0,1,1]
[3,   0,0,1] | [3,   0,0,1] | [8,   0,0,1] | [ 2,   0,0,1]

I am currently reading the data in with BufferedReader, but I am not sure how to organize the data once its read or if this is even a good approach.

 ArrayList<String> readFileFast (String expDir,String filename) {
        String path = expDir + filename;
        ArrayList<String> fileContents = new ArrayList<>();
        try {
            BufferedReader br = new BufferedReader(new FileReader(path));
            String line;
            while ((line = br.readLine()) != null) {
                fileContents.add(line);
            }
        } catch (Exception e) {
            SuperStackPrint(e);
        }
        return fileContents;
      }

println(readFileFast(expDir, "/DELETEME.csv"));

Any insight on how to do this properly would be appreciated.


Solution

  • You will benefit from using a library which specializes in reading and writing CSV files. There are a few to choose from, but here I will use OpenCSV.

    If you don't end up using this library, it may at least give you some ideas for your own approach.

    Also, when using libraries, I recommend using a tool such as Maven or Gradle to help manage this, as these tools take care of "dependencies of dependencies" for you - for example, where the OpenCSV library itself needs access to other libraries which it uses.

    For Maven, here is OpenCSV dependency for my POM file:

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>5.2</version>
    </dependency>
    

    The approach:

    1. Create a Java class (a "bean") to hold the data that will be loaded from the source CSV file. This will be called SplitBean in my example.

    2. Create a collection of objects using this class, where each object contains the data for one row of the CSV file.

    3. Iterate across this collection of objects, writing the relevant parts to 4 output files.

    You can choose to follow the above approach without using OpenCSV or a similar library. But you will have to write more of your own code relating to basic CSV operations. In your case, the data is not complicated, so that would not be unreasonable.

    Either way, I recommend creating a class to represent a row of input data, and then processing a list of such objects when writing to your output files. This splits the process into 2 distinct steps, and makes use of Java objects to simplify the process.

    Here is the SplitBean class:

    import com.opencsv.bean.CsvBindByName;
            
    public class SplitBean {
        @CsvBindByName(column = "X")
        private int x;
    
        @CsvBindByName(column = "Y")
        private int y;
    
        @CsvBindByName(column = "Z")
        private int z;
        
        @CsvBindByName(column = "Sc1")
        private int  sc1;
    
        @CsvBindByName(column = "Sc2")
        private int  sc2;
    
        @CsvBindByName(column = "Sc3")
        private int  sc3;
    
        @CsvBindByName(column = "Sc4")
        private int  sc4;
    
        public static String[] getHeadingsOne() {
            String[] s = { "Sc1", "X", "Y", "Z" };
            return s;
        }
        
        public static String[] getHeadingsTwo() {
            String[] s = { "Sc2", "X", "Y", "Z" };
            return s;
        }
        
        public static String[] getHeadingsThree() {
            String[] s = { "Sc3", "X", "Y", "Z" };
            return s;
        }
        
        public static String[] getHeadingsFour() {
            String[] s = { "Sc4", "X", "Y", "Z" };
            return s;
        }
        
        public String[] getDataOne() {
            String[] i = { String.valueOf(sc1), String.valueOf(x), 
                String.valueOf(y), String.valueOf(z) };
            return i;
        }
        
        public String[] getDataTwo() {
            String[] i = { String.valueOf(sc2), String.valueOf(x), 
                String.valueOf(y), String.valueOf(z) };
            return i;
        }
        
        public String[] getDataThree() {
            String[] i = { String.valueOf(sc3), String.valueOf(x), 
                String.valueOf(y), String.valueOf(z) };
            return i;
        }
        
        public String[] getDataFour() {
            String[] i = { String.valueOf(sc4), String.valueOf(x), 
                String.valueOf(y), String.valueOf(z) };
            return i;
        }
        
        public int getX() {
            return x;
        }
    
        public void setX(int x) {
            this.x = x;
        }
    
        public int getY() {
            return y;
        }
    
        public void setY(int y) {
            this.y = y;
        }
    
        public int getZ() {
            return z;
        }
    
        public void setZ(int z) {
            this.z = z;
        }
    
        public int getSc1() {
            return sc1;
        }
    
        public void setSc1(int sc1) {
            this.sc1 = sc1;
        }
    
        public int getSc2() {
            return sc2;
        }
    
        public void setSc2(int sc2) {
            this.sc2 = sc2;
        }
    
        public int getSc3() {
            return sc3;
        }
    
        public void setSc3(int sc3) {
            this.sc3 = sc3;
        }
    
        public int getSc4() {
            return sc4;
        }
    
        public void setSc4(int sc4) {
            this.sc4 = sc4;
        }
        
    }
    

    This class uses @CsvBindByName annotations to map from column heading names in the source CSV file to field names in the class itself. You do not need to do things this way, but it's a convenient feature provided by OpenCSV.

    The class also contains methods which handle the 4 different output files (which are subsets of the input file's data).

    Now we can write a separate doTheSplit() method, to use this class:

    import com.opencsv.bean.CsvToBean;
    import com.opencsv.bean.CsvToBeanBuilder;
    import com.opencsv.bean.HeaderColumnNameMappingStrategy;
    import com.opencsv.exceptions.CsvDataTypeMismatchException;
    import com.opencsv.exceptions.CsvRequiredFieldEmptyException;
    import com.opencsv.CSVWriter;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.FileWriter;
    import java.net.URISyntaxException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    
    public class SplitData {
    
        public void doTheSplit() throws URISyntaxException, IOException,
                CsvDataTypeMismatchException, CsvRequiredFieldEmptyException {
            HeaderColumnNameMappingStrategy msIn = new HeaderColumnNameMappingStrategy();
            msIn.setType(SplitBean.class);
    
            Path path = Paths.get("C:/tmp/csvsplit/input.csv");
            List<SplitBean> list;
    
            // read the data from the input CSV file into our SplitBean list:
            try ( Reader reader = Files.newBufferedReader(path)) {
                CsvToBean cb = new CsvToBeanBuilder(reader)
                        .withMappingStrategy(msIn)
                        .build();
                list = cb.parse();
                int i = 1;
            }
    
            // set up 4 file writers:
            try ( CSVWriter writer1 = new CSVWriter(new FileWriter("C:/tmp/csvsplit/output1.csv"));
                    CSVWriter writer2 = new CSVWriter(new FileWriter("C:/tmp/csvsplit/output2.csv"));
                    CSVWriter writer3 = new CSVWriter(new FileWriter("C:/tmp/csvsplit/output3.csv"));
                    CSVWriter writer4 = new CSVWriter(new FileWriter("C:/tmp/csvsplit/output4.csv"))) {
    
                // first write the headers to each file (false = no quotes):
                writer1.writeNext(SplitBean.getHeadingsOne(), false);
                writer2.writeNext(SplitBean.getHeadingsTwo(), false);
                writer3.writeNext(SplitBean.getHeadingsThree(), false);
                writer4.writeNext(SplitBean.getHeadingsFour(), false);
                
                // then write each row of data (false = no quotes):
                for (SplitBean item : list) {
                    writer1.writeNext(item.getDataOne(), false);
                    writer2.writeNext(item.getDataTwo(), false);
                    writer3.writeNext(item.getDataThree(), false);
                    writer4.writeNext(item.getDataFour(), false);
                }
            }
        }
    
    }
    

    The first part of this code populates a List<SplitBean> list. There is one splitBean object for each row of data from the input spreadsheet. OpenCSV takes care of most of the work for you, behind the scenes.

    Then, the code creates 4 file writers, which use the OpenCSV CSVWriter object, to help handle the formatting of our data into valid CSV rows.

    With this code, we write column headers into each of the 4 files. Finally we iterate across our collection of SplitBean items, and write the relevant data subsets to each file.

    So, for a CSV input file such as this:

    X,Y,Z,Sc1,Sc2,Sc3,Sc4
    1,0,0,5,7,9,10
    0,1,1,6,8,4,0
    0,0,1,3,3,8,2
    

    We end up with 4 different output files. One example:

    Sc1,X,Y,Z
    5,1,0,0
    6,0,1,1
    3,0,0,1
    

    Additional note: One big advantage of using the SplitBean class in this way is that you have a lot more flexibility if you decide you need to perform more data transformations - for example, filtering out rows of data, or sorting data in different ways.