text-extraction, data-extraction

How to locate a string then get the following characters up to a certain character


Here's an example input:

<div><a class="document-subtitle category" href="/store/apps/category/GAME_ADVENTURE"> <span itemprop="genre">Adventure</span> </a>  </div> <div> </div>

The string I'm trying to locate is this:

document-subtitle category" href="/store/apps/category/

and I want to extract the characters that follow that string, up to the end of the href attribute (the closing ">).

In this case, my output should be:

GAME_ADVENTURE

My input file is guaranteed to contain exactly one occurrence of:

document-subtitle category" href="/store/apps/category/

What's the easiest way of achieving this?


Solution

  • This worked for me:

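    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ExtractData {
      // Text that immediately precedes the category name in the input.
      public static final String matcher = "document-subtitle category\" href=\"/store/apps/category/";

      public static void main(String[] args) throws IOException {
        String filePath = args[0];
        // Read the whole file; UTF-8 is assumed here rather than the platform default.
        String content = new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
        // indexOf returns -1 when the marker is missing, so check before slicing.
        int startIndex = content.indexOf(matcher);
        if (startIndex < 0) {
          System.err.println("marker not found");
          return;
        }
        // The category starts right after the marker and ends at the closing "> of the href.
        int valueStart = startIndex + matcher.length();
        int endIndex = content.indexOf("\">", valueStart);
        if (endIndex < 0) {
          System.err.println("closing \"> not found");
          return;
        }
        String category = content.substring(valueStart, endIndex);
        System.out.println("category is " + category);
      }
    }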
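
The same lookup can also be sketched with a regular expression, which may be handy if the rule for where the value ends changes later. This is only an illustrative alternative, not part of the answer above: the class name CategoryRegex and the method extractCategory are made up for the example, and it assumes the category value ends at the next double quote.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryRegex {
  // Pattern.quote treats the marker as literal text; the group captures
  // everything up to (but not including) the next double quote.
  private static final Pattern CATEGORY = Pattern.compile(
      Pattern.quote("document-subtitle category\" href=\"/store/apps/category/") + "([^\"]*)\"");

  // Returns the captured category, or an empty Optional if the marker is absent.
  public static Optional<String> extractCategory(String content) {
    Matcher m = CATEGORY.matcher(content);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    // Sample input taken from the question, shortened to the relevant part.
    String sample = "<div><a class=\"document-subtitle category\" "
        + "href=\"/store/apps/category/GAME_ADVENTURE\"> <span itemprop=\"genre\">Adventure</span> </a>  </div>";
    System.out.println(extractCategory(sample).orElse("not found"));
  }
}
```

Returning an Optional makes the "marker not found" case explicit for the caller, so a missing marker does not end in a substring exception.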