
Aggregate document in MongoDB from two CSV files


I am writing a Java program that loads two CSV files into a single MongoDB document containing an array of subdocuments, but I do not know how to do it. To explain: I have a SNP file containing the fields rsid, chr, has_sig and a LOCUS file containing the fields rsid, mrna_acc, gene, class, sap_id. In the LOCUS file, one rsid can correspond to several mrna_acc values, so there can be several rows with the same rsid.

I would like a Mongo document like this:

{ _id: ObjectId("7264958211f41a0c647c47b1"),
  rsid: rs530,
  chr: 21,
  has_sig: false,
  locus: [
  { mrna_acc: NM_00125,
    gene: ETS2,
    class: utr_variant
  }, 
  { mrna_acc: NM_00126,
    gene: ETS2,
    class: utr_variant
  }, 
  ... ]
}

I tried to read the two CSV files with a BufferedReader and insert them into the document like this:

Document d = new Document();
Document d1 = new Document();

FileSnp fs = new FileSnp("/Users/valentinafratini/Documents/Progetto Tesi/FactoryMethodDb/snp.csv");
    fs.readFile();
    long startTime = System.currentTimeMillis();
    while (fs.line!=null) {
        fs.line = fs.reader.readLine();

        if (fs.line!=null && fs.line.length()>0) {
            fs.obj = fs.line.split("\\s+");
            fs.readSingleObj();

            d.append("rsid", fs.rsid);
            d.append("chr", fs.chr);
            d.append("has_sig", fs.has_sig);
        }
    }

FileLocus fl = new FileLocus("/Users/valentinafratini/Documents/Progetto Tesi/FactoryMethodDb/locus.csv");
    fl.readFile();
    while (fl.line!=null) {
        fl.line = fl.reader.readLine();

        if (fl.line!=null && fl.line.length()>0) {
            fl.obj = fl.line.split("\\s+");
            fl.readSingleObj();

            d1.append("mrna_acc", fl.mrna_acc);
            d1.append("gene", fl.gene);
            d1.append("class", fl.classe);
        }
    }

d.put("locus", d1);
list.add(d);
coll.insertMany(list);

But the result is a single inserted document that contains all the fields of both the SNP file and the LOCUS file.

Can you help me? I really do not know how to do it. Thank you very much.


Solution

  • In your target document structure, the locus attribute contains an array of subdocuments:

    locus: [
      { mrna_acc: NM_00125,
        gene: ETS2,
        class: utr_variant
      }, 
      { mrna_acc: NM_00126,
        gene: ETS2,
        class: utr_variant
      } 
    ]
    

    This suggests that the FileLocus reader should produce one Document instance for each line in locus.csv, and that each of these documents should be added to a list stored in the outer document d, which is populated from the FileSnp reader.

    If so, then you should replace the FileLocus block with the following:

    // this will contain the collection of documents, one for each line in `locus.csv`
    List<Document> locusDocuments = new ArrayList<>();
    
    FileLocus fl = new FileLocus("/Users/valentinafratini/Documents/Progetto Tesi/FactoryMethodDb/locus.csv");
    fl.readFile();
    while (fl.line!=null) {
        fl.line = fl.reader.readLine();
    
        if (fl.line!=null && fl.line.length()>0) {
            fl.obj = fl.line.split("\\s+");
            fl.readSingleObj();
    
            // create and populate a sub document for the current line
            Document locusDocument = new Document();
            locusDocument.append("mrna_acc", fl.mrna_acc);
            locusDocument.append("gene", fl.gene);
            locusDocument.append("class", fl.classe);
    
            // assign the current sub document to the collection of locus documents
            locusDocuments.add(locusDocument);
        }
    }
    
    // add the collection of locus documents to the outer document
    d.append("locus", locusDocuments);
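
Note that the block above attaches every locus row to the single outer document d. Since the question says one rsid can appear on several LOCUS rows, you may also want each SNP to receive only the rows that share its rsid. The following is a minimal sketch of that grouping step, not a drop-in replacement. It uses plain Java maps so it runs without the MongoDB driver. The class name SnpLocusMerge, the merge method, the whitespace-separated column order and the sample rows (including rs531 and the sap_id values) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SnpLocusMerge {

    // Builds one map per SNP row, each with a "locus" list holding the locus rows for that rsid.
    // Assumed column order: SNP = rsid chr has_sig; LOCUS = rsid mrna_acc gene class sap_id.
    static List<Map<String, Object>> merge(List<String> snpLines, List<String> locusLines) {
        // Step 1: group the locus sub documents by rsid.
        Map<String, List<Map<String, Object>>> locusByRsid = new LinkedHashMap<>();
        for (String line : locusLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.trim().split("\\s+");
            Map<String, Object> locus = new LinkedHashMap<>();
            locus.put("mrna_acc", fields[1]);
            locus.put("gene", fields[2]);
            locus.put("class", fields[3]);
            locusByRsid.computeIfAbsent(fields[0], key -> new ArrayList<>()).add(locus);
        }

        // Step 2: create a NEW outer map for every SNP row, so rows do not overwrite each other.
        List<Map<String, Object>> result = new ArrayList<>();
        for (String line : snpLines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.trim().split("\\s+");
            Map<String, Object> snp = new LinkedHashMap<>();
            snp.put("rsid", fields[0]);
            snp.put("chr", fields[1]); // kept as text; convert if your data is always numeric
            snp.put("has_sig", Boolean.parseBoolean(fields[2])); // assumes "true"/"false" text
            snp.put("locus", locusByRsid.getOrDefault(fields[0], new ArrayList<>()));
            result.add(snp);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String> snp = Arrays.asList("rs530 21 false", "rs531 21 true");
        List<String> locus = Arrays.asList(
                "rs530 NM_00125 ETS2 utr_variant 1",
                "rs530 NM_00126 ETS2 utr_variant 2",
                "rs531 NM_00200 ETS2 intron_variant 3");
        for (Map<String, Object> doc : merge(snp, locus)) {
            System.out.println(doc);
        }
    }
}
```

The same two-pass shape should carry over to your code by using Document in place of the inner maps and List<Document> for the grouped rows, then passing the resulting list to coll.insertMany. Creating the outer document inside the SNP loop is the important part: appending to one shared d on every iteration overwrites the previous row's fields, which is why only one document was inserted.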