Tags: scala, large-files, ends-with

Is there a more efficient way to check whether one line ends with another line in a large file?


I have a file with 500,000 lines and I want to check for each line L whether any other line in the same file ends with L.

I have already sorted the file by line length and written the following code, but it is too slow:

import java.io.{BufferedReader, FileReader, FileWriter}

def main(args: Array[String]): Unit = {
    val buffer = new BufferedReader(new FileReader("input.txt"))
    val fw = new FileWriter("output.txt")
    var line = buffer.readLine()
    var list = List.empty[String]
    // read all non-empty lines into memory
    while (line != null) {
      if (line.nonEmpty) {
        list ::= line // prepend; the sort below fixes the order anyway
      }
      line = buffer.readLine()
    }
    buffer.close()
    // shorter lines first, so each line is only compared against longer ones
    list = list.sortBy(_.length)
    for (i <- list.indices) {
      val endsWith = list(i) // note: indexed access on a List is O(n)
      for (j <- i + 1 until list.size) {
        val right = list(j)
        if (right.endsWith(endsWith)) {
          fw.write(right + ";" + endsWith + "\n")
          fw.flush()
        }
      }
      println(i + 1) // progress indicator
    }
    fw.close()
  }

The input file contains entries like:
abc/defg
defg
...
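
For these two sample lines the expected output is abc/defg;defg, since abc/defg ends with defg.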

Is there a more efficient way to check the lines?


Solution

  • I have found a solution to the problem: for each line L I take its length and extract a substring of that length from every longer line. I then hash L and the extracted substrings and store them in a hash map, so checking whether another line ends with L becomes a single map lookup. Since all lines of the same length can be checked against the same map, I only need one hash lookup per line, which saves a lot of overhead. I also used talex's idea of reversing each string before hashing, which turns the suffix comparison into a prefix comparison.

    Here is the code for the solution:

    import java.io.{BufferedReader, FileReader, FileWriter}

    def main(args: Array[String]): Unit = {
        val buffer = new BufferedReader(new FileReader("input.txt"))
        val fw = new FileWriter("output.txt")
        var line = buffer.readLine()
        // group the reversed lines by their length
        var map = Map.empty[Int, List[String]]
        while (line != null) {
          if (line.nonEmpty) {
            val len = line.length
            map += len -> (line.reverse :: map.getOrElse(len, List.empty[String]))
          }
          line = buffer.readLine()
        }
        buffer.close()
        val lengths = map.keySet.toList.sorted
        for (i <- lengths.indices) {
          val len = lengths(i)
          // hash the len-character prefix of every longer reversed line,
          // i.e. the len-character suffix of the original line
          var cutMap = Map.empty[Int, List[String]]
          for (j <- i + 1 until lengths.size) {
            for (right <- map(lengths(j))) {
              val took = right.take(len)
              cutMap += took.hashCode -> (right :: cutMap.getOrElse(took.hashCode, List.empty[String]))
            }
          }
          for (startsWith <- map(len)) {
            val hashCode = startsWith.hashCode
            if (cutMap.contains(hashCode)) {
              // the startsWith check guards against hash collisions
              for (right <- cutMap(hashCode) if right.startsWith(startsWith)) {
                fw.write(right.reverse + ";" + startsWith.reverse + "\n")
                fw.flush()
              }
            }
          }
          println(i + 1) // progress indicator
        }
        fw.close()
      }
    

    If the hashCode function is not precise enough, it can be replaced by a more precise one. For bigger files, the algorithm could be adapted to work on separate splits of the file.
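
    As a minimal sketch of one such option (my own assumption, not part of the original answer): keying the inner map on the extracted suffix string itself removes the dependence on hashing precision altogether, at the cost of some extra memory. The matchSuffixes helper and its byLength parameter are hypothetical names; byLength is assumed to hold the reversed lines grouped by length, exactly as built above:

    // Hypothetical sketch: key on the suffix string itself instead of its
    // hashCode, so no collision check is needed. `byLength` maps
    // line length -> reversed lines of that length, as built above.
    def matchSuffixes(byLength: Map[Int, List[String]]): Iterator[(String, String)] = {
      val lengths = byLength.keySet.toList.sorted
      lengths.iterator.zipWithIndex.flatMap { case (len, i) =>
        // the len-character prefix of a longer reversed line is the
        // len-character suffix of the original line
        val cutMap: Map[String, List[String]] =
          lengths.drop(i + 1).flatMap(byLength).groupBy(_.take(len))
        byLength(len).iterator.flatMap { short =>
          cutMap.getOrElse(short, Nil).map(long => (long.reverse, short.reverse))
        }
      }
    }

    Each emitted pair is (longer line, suffix line), matching the right;startsWith output format written to output.txt above.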