csv, marklogic, mlcp

Why does MLCP not overwrite the previous document when I import a CSV file into MarkLogic?


I am sending the following CSV file to MarkLogic:

id,first_name,last_name,email,country,ip_address
5,Shawn,Grant,[email protected],Liberia,37.194.161.124
5,Joshua,Fields,[email protected],Colombia,54.224.238.176
5,Johnny,Bell,[email protected],Finland,159.38.61.122

through MLCP, using the following command:

C:\mlcp-9.0.3\bin>mlcp.bat import -host localhost -port 9636 -username admin -password admin -input_file_path D:\test.csv -input_file_type delimited_text -document_type json

What happened?

When I looked in Query Console, I found one JSON document with the following information:

 id,first_name,last_name,email,country,ip_address
 5,Shawn,Grant,[email protected],Liberia,37.194.161.124

What am I expecting?

By default, the first column of the CSV is used as the document URI when the JSON/XML document is created. Since I am sending three rows with the same id, the document should end up holding the latest information (i.e. the third row), right?

My assumption

Since I am sending all three rows at once through MLCP, we can't say which one reaches the MarkLogic database first.

Please let me know whether my assumption is right or wrong.

Thanks


Solution

  • MLCP wants to be as fast as possible. For CSV files it processes the rows using many threads (and even shards the file if you pass the split option). Because of this, there is no guarantee that the rows will be processed in any particular order, so when several rows share the same ID, whichever one happens to be written last is the one that remains. You may be able to tune some of the MLCP settings to use one thread and not shard the file to get the result you want, but in that case you are losing some of the power of MLCP.

    Second, an observation: as I interpret your problem statement, you are adding quite a bit of overhead by inserting and then overwriting unneeded documents. Why not sort and filter your initial CSV file down to one record per ID first, and save your computer from doing the extra work?
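
The pre-filtering suggested above could be sketched roughly as follows. This is a minimal Python sketch, not part of MLCP. It assumes that "latest" means the last row for a given id in file order and that the key column is named id, as in the sample file. The function names and the output path in the usage note are hypothetical.

```python
import csv


def keep_last_per_id(rows, key="id"):
    """Return one row per key value, keeping the last occurrence in input order.

    Assumption: the "latest" record is simply the one that appears last.
    """
    latest = {}
    for row in rows:
        # A later row with the same id replaces the earlier one.
        latest[row[key]] = row
    return list(latest.values())


def dedupe_csv(src_path, dst_path, key="id"):
    """Write a copy of src_path to dst_path that has one row per key value."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = keep_last_per_id(reader, key)

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, calling dedupe_csv(r"D:\test.csv", r"D:\test_dedup.csv") and then pointing -input_file_path at the deduplicated file (a hypothetical path) would give MLCP only one row per id, so the order in which its threads process rows no longer matters.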