scala · large-data · bigdata

Updating a line in a large text file using Scala


I have a large text file, around 43 GB, in .ttl format, containing triples of the form:

<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .
<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://lad.dbpedia.org/resource/Mohandas_Gandhi> .

I want to find the fastest way to update a specific line inside the file without rewriting all the text that follows it: either by updating the line in place, or by deleting it and appending the replacement to the end of the file.

To access the specific line I use this code:

val lines = io.Source.fromFile("text.txt").getLines()
val targetLine = lines.drop(10000000).next() // skip the first 10,000,000 lines, read the next one

Solution

  • If you want to keep using text files, consider a fixed length (record size) for each line.

    This way you can use a RandomAccessFile to seek to the exact position of each line by number: you just seek to lineNumber * recordSize and overwrite the record there (see the sketches after this list).

    It will not really help if you have to insert a new line. Other limitations: the file size will grow (because of the fixed record length), and sooner or later some record will be too big for whatever size you choose.

    As for the initial conversion (a sketch follows below):

    • Get the maximum line length of the current file, then add some headroom (10%, for example).
    • Now you have to convert the file once: Read a line from the text file, and convert it into a fixed-size record.
    • You could use a special character like | to separate the fields. If possible, use something like ;, so you get a .csv file.
    • I suggest padding the remaining space with spaces, so it still looks like a text file which you can parse with shell utilities.
    • You could use a \n to terminate each record.

    For example:

    http://x.com|http://x.com|http://x.com|...\n
    

    or

    http://x.com;http://x.com;http://x.com;...\n
    

    where each . at the end represents a padding space character. So it's still somewhat compatible with a "normal" text file.
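
    A one-time conversion could look like the sketch below; the file names and the exact record-size choice are illustrative, not part of your original setup.

        import java.nio.charset.StandardCharsets
        import java.io.{BufferedOutputStream, FileOutputStream}

        // Pick the record size: the longest line in the source file
        // plus ~10% headroom, plus one byte for the terminating '\n'.
        val maxLen = io.Source.fromFile("text.ttl").getLines()
          .map(_.getBytes(StandardCharsets.UTF_8).length).max
        val recordSize = (maxLen * 1.1).toInt + 1

        // Rewrite every line as a fixed-size record: content, space padding, '\n'.
        def convertToFixedRecords(src: String, dst: String, recordSize: Int): Unit = {
          val out = new BufferedOutputStream(new FileOutputStream(dst))
          val source = io.Source.fromFile(src)
          try {
            for (line <- source.getLines()) {
              val content = line.getBytes(StandardCharsets.UTF_8)
              val record = Array.fill[Byte](recordSize)(' '.toByte)
              System.arraycopy(content, 0, record, 0, content.length)
              record(recordSize - 1) = '\n' // keep the file line-oriented
              out.write(record)
            }
          } finally { out.close(); source.close() }
        }

        convertToFixedRecords("text.ttl", "fixed.ttl", recordSize)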
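
    Updating one record in place is then a single seek and write. A minimal sketch, assuming the converted fixed.ttl and the recordSize from above:

        import java.io.RandomAccessFile
        import java.nio.charset.StandardCharsets

        def updateRecord(path: String, recordSize: Int, lineNumber: Long, newContent: String): Unit = {
          val content = newContent.getBytes(StandardCharsets.UTF_8)
          require(content.length < recordSize, "content must fit the fixed record size")
          val record = Array.fill[Byte](recordSize)(' '.toByte)
          System.arraycopy(content, 0, record, 0, content.length)
          record(recordSize - 1) = '\n'
          val raf = new RandomAccessFile(path, "rw")
          try {
            raf.seek(lineNumber * recordSize) // jump straight to the record, no scanning
            raf.write(record)
          } finally raf.close()
        }

        // Overwrite line 10,000,000 without touching the rest of the file:
        updateRecord("fixed.ttl", recordSize, 10000000L,
          "<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .")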


    On the other hand, looking at your data, consider using a key-value data store like Redis: you could use the line number or the first URL as the key (a sketch follows below).
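
    A minimal sketch of that approach using the Jedis client, keyed by line number; the connection details and key scheme are assumptions:

        import redis.clients.jedis.Jedis

        val jedis = new Jedis("localhost", 6379)

        // One-time load: store each triple under its line number.
        for ((line, idx) <- io.Source.fromFile("text.ttl").getLines().zipWithIndex)
          jedis.set(s"line:$idx", line)

        // "Updating a line" is now a single O(1) key overwrite; no file rewriting.
        jedis.set("line:10000000",
          "<http://www.wikidata.org/entity/Q1001> <http://www.w3.org/2002/07/owl#sameAs> <http://la.dbpedia.org/resource/Mahatma_Gandhi> .")

    For a 43 GB load you would want to pipeline the writes rather than issue one round trip per line, but the update pattern stays the same.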