
Java extract multiline values from a file


I'm reading a file line by line, and some entries have multiline values, as shown below. Because of this, my loop breaks and returns an unexpected result.

TSNK/Metadata/tk.filename=PZSIIF-anefnsadual-rasdfepdasdort.pdf
TSNK/Metadata/tk_ISIN=LU0291600822,LU0871812862,LU0327774492,LU0291601986,LU0291605201
,LU0291595725,LU0291599800,LU0726995649,LU0726996290,LU0726995995,LU0726995136,LU0726995482,LU0726995219,LU0855227368
TSNK/Metadata/tk_GroupCode=PZSIIF
TSNK/Metadata/tk_GroupCode/PZSIIF=y
TSNK/Metadata/tk_oneTISNumber=16244,17007,16243,11520,19298,18247,20755
TSNK/Metadata/tk_oneTISNumber_TEXT=Neo Emerging Market Corporate Debt 
Neo Emerging Market Debt Opportunities II 
Neo Emerging Market Investment Grade Debt 
Neo Floating Rate II 
Neo Upper Tier Floating Rate 
Global Balanced Regulation 28 
Neo Multi-Sector Credit Income

Here TSNK/Metadata/tk_ISIN and TSNK/Metadata/tk_oneTISNumber_TEXT have multiline values. While reading the file line by line, how do I read each of these fields as a single line?

I tried the logic below, but it did not produce the expected result:

try (BufferedReader br = new BufferedReader(new FileReader(FILENAME))) {

        String sCurrentLine;
        int i = 1;
        String oneTIS = "TSNK/Metadata/tk_oneTISNumber_TEXT";
        StringBuilder builder = new StringBuilder();
        while ((sCurrentLine = br.readLine()) != null) {
            if (sCurrentLine.contains(oneTIS)) {
                System.out.println("Line number here -> " + i);
                builder.append(sCurrentLine);
                builder.append(",");
            } else {
                // Continuation lines of a multiline value do not contain the key,
                // so they end up here and are never appended.
                System.out.println("else --->");
            }
            //System.out.println("Line number"+i+" Value is---->>>> "+sCurrentLine);
            i++;
        }
        System.out.println("Line number" + i + " Value is---->>>> " + builder);
} catch (IOException e) {
        e.printStackTrace();
}

Solution

  • The solution involves Scanner and multiline regular expressions.

    The assumption here is that every key (the first line of each entry) starts with TSNK/Metadata/

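    // Needs java.io.File, java.util.Scanner, java.util.regex.Matcher and java.util.regex.Pattern.
    // Each token is the text between two occurrences of the prefix, i.e. one whole entry.
    try (Scanner scanner = new Scanner(new File("file.txt"))) {
        scanner.useDelimiter("TSNK/Metadata/");

        // DOTALL lets '.' match line breaks, so group(2) captures a multiline value.
        Pattern p = Pattern.compile("(.*)=(.*)", Pattern.DOTALL | Pattern.MULTILINE);

        // Loop on hasNext() so that the loop ends once the input is exhausted.
        while (scanner.hasNext()) {
            String s = scanner.next();
            Matcher matcher = p.matcher(s);
            if (matcher.find()) {
                System.out.println("key = '" + matcher.group(1) + "'");
                String[] values = matcher.group(2).split("[,\n]");
                int i = 0; // start at 0 to match the output shown
                for (String value : values) {
                    System.out.println(String.format(" val(%d)='%s',", (i++), value));
                }
            }
        }
    }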
    

    The above produces the following output:

    key = 'tk.filename'
     val(0)='PZSIIF-anefnsadual-rasdfepdasdort.pdf',
    key = 'tk_ISIN'
     val(0)='LU0291600822',
     val(1)='LU0871812862',
     val(2)='LU0327774492',
     val(3)='LU0291601986',
     val(4)='LU0291605201',
     val(5)='',
     val(6)='LU0291595725',
     val(7)='LU0291599800',
     val(8)='LU0726995649',
     val(9)='LU0726996290',
     val(10)='LU0726995995',
     val(11)='LU0726995136',
     val(12)='LU0726995482',
     val(13)='LU0726995219',
     val(14)='LU0855227368',
    key = 'tk_GroupCode'
     val(0)='PZSIIF',
    key = 'tk_GroupCode/PZSIIF'
     val(0)='y',
    key = 'tk_oneTISNumber'
     val(0)='16244',
     val(1)='17007',
     val(2)='16243',
     val(3)='11520',
     val(4)='19298',
     val(5)='18247',
     val(6)='20755',
    key = 'tk_oneTISNumber_TEXT'
     val(0)='Neo Emerging Market Corporate Debt ',
     val(1)='Neo Emerging Market Debt Opportunities II ',
     val(2)='Neo Emerging Market Investment Grade Debt ',
     val(3)='Neo Floating Rate II ',
     val(4)='Neo Upper Tier Floating Rate ',
     val(5)='Global Balanced Regulation 28 ',
     val(6)='Neo Multi-Sector Credit Income',
    

    Please note the empty entry (val(5) for key tk_ISIN). It is caused by a line break immediately followed by a comma in that value. This is easy to fix, either by rejecting empty strings or by adjusting the splitting pattern.
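
One way to avoid such empty entries is to split on runs of separators and skip blank pieces. This is only a minimal sketch: the names `ValueSplitter` and `splitValues` are made up for illustration, and trimming each piece is an assumption that also removes the trailing spaces seen in the tk_oneTISNumber_TEXT values.

```java
import java.util.ArrayList;
import java.util.List;

class ValueSplitter {

    // Splits a raw value on runs of commas and line breaks,
    // trims each piece and drops the pieces that end up empty.
    public static List<String> splitValues(String raw) {
        List<String> values = new ArrayList<>();
        for (String piece : raw.split("[,\\r\\n]+")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // A line break directly followed by a comma, as in the tk_ISIN value.
        String raw = "LU0291605201\n,LU0291595725,LU0291599800\n";
        System.out.println(splitValues(raw));
    }
}
```

In the Scanner loop, `splitValues(matcher.group(2))` could then stand in for the `split("[,\n]")` call. Including `\r` in the character class is meant to cover files with Windows line endings.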

    Hope this helps!
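
As a side note, the same idea can also be sketched with the line-by-line BufferedReader loop from the question: a line that starts with TSNK/Metadata/ and contains '=' opens a new entry, and any other non-blank line is treated as a continuation of the previous value. The class name `MetadataReader`, the method `read`, and the choice to join continuation lines with a comma are assumptions made for this illustration, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class MetadataReader {

    static final String PREFIX = "TSNK/Metadata/";

    // Collects key -> value pairs, folding continuation lines into the
    // value of the entry that precedes them.
    public static Map<String, String> read(BufferedReader reader) throws IOException {
        Map<String, String> entries = new LinkedHashMap<>();
        String currentKey = null;
        StringBuilder currentValue = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            int eq = line.indexOf('=');
            if (line.startsWith(PREFIX) && eq >= 0) {
                // A new entry begins: store the one collected so far.
                if (currentKey != null) {
                    entries.put(currentKey, currentValue.toString());
                }
                currentKey = line.substring(PREFIX.length(), eq);
                currentValue = new StringBuilder(line.substring(eq + 1).trim());
            } else if (currentKey != null && !line.trim().isEmpty()) {
                // Continuation line: add a comma unless the line brings its own.
                String part = line.trim();
                if (!part.startsWith(",") && currentValue.length() > 0) {
                    currentValue.append(',');
                }
                currentValue.append(part);
            }
        }
        if (currentKey != null) {
            entries.put(currentKey, currentValue.toString());
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        String sample = "TSNK/Metadata/tk_ISIN=LU0291600822,LU0871812862\n"
                + ",LU0291595725\n"
                + "TSNK/Metadata/tk_GroupCode=PZSIIF\n"
                + "TSNK/Metadata/tk_oneTISNumber_TEXT=Neo Floating Rate II \n"
                + "Neo Upper Tier Floating Rate \n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            read(reader).forEach((k, v) -> System.out.println(k + " -> " + v));
        }
    }
}
```

To read the real file, the `StringReader` would be replaced by `new FileReader(FILENAME)`. A value that itself contains a line starting with TSNK/Metadata/ and an '=' would be misread as a new entry, so this relies on the same assumption as the Scanner version.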