java regex string replaceall

Extract letters and whitespace from a text file that has no sentinel (delimiter)


I have 2 text files:

1. Extract_tweet.txt - the format of each line is user_id tweet_id tweet_text (the sample line below also ends with a timestamp)

12163922    5407952300  I think I just discovered the hour when the office thermostat changes. And it ain't a good time to be at work...brrrr   2009-11-03 19:22:54

2. locations.txt - the relevant field in the data below is the 3rd column, which acts as the search string

asciiname: name of geographical point in plain ascii characters, varchar(200)

4045431 Point Poker Point Poker     52.89508    173.29911   T   CAPE    US      AK  016         0       9   America/Adak    2013-10-26

I want to extract some data from these files. The extracted data must contain only a-z, A-Z and whitespace. I first thought of tokenizing the string, but since the files have no sentinel, I decided to use regular expressions instead. Below is the code snippet that is meant to extract the 27 allowed characters, i.e. a-z (or A-Z) and whitespace. I want the extracted text in lower case only, i.e. any upper-case character should be converted to lower case.

I open file 1 (Extract_tweet.txt) and take the text of each line as a single string. I then try to replace each non-alphabetic character with an empty string.

    public void readfromFile() throws FileNotFoundException
    {
        Scanner inputStream;
        String source=null;
        FileInputStream file = new FileInputStream("Extract_tweet.txt");    
        inputStream = new Scanner(file);
        while(inputStream.hasNextLine())    //Read from file till the last line of the file.
        {
            source = inputStream.nextLine();
            System.out.println(source);
            replaceAll(source);

        }
        inputStream.close();
    }
    public String replaceAll(String source) 
    {
        String regex = "[A-Z]*"+"["+source.toLowerCase()+"|"+"[a-z]*"+"[\\s]";
        source = source.replaceAll(regex, "");
        System.out.println(source);
        return source;
    }

    public static void main(String[] args) {

        StringProcessing sp = new StringProcessing();
        try {
            sp.readfromFile();
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

I get the error below when I run this code.

60730027    6320951896  @thediscovietnam coo.  thanks. just dropped you a line. 2009-12-03 18:41:07
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal character range near index 88
[A-Z]*[60730027 6320951896  @thediscovietnam coo.  thanks. just dropped you a line. 2009-12-03 18:41:07|[a-z]*[\s]

Solution

  • I have made some changes. However, I want to change upper case to lower case and also replace all numeric values with an empty string.

    Expand your method:

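    public String replaceAll(String source) throws FileNotFoundException {
        // Remove every character that is not a letter or whitespace
        // (digits, punctuation, '@', ...), then lower-case what remains.
        source = source.replaceAll("[^A-Za-z\\s]", "")
                       .toLowerCase();

        System.out.println(source);
        writetoFile(source);   // helper from the asker's updated code (not shown here)
        return source;
    }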
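
The exception in the question most likely comes from concatenating the input line into a character class: inside `[...]`, the hyphen in the timestamp `2009-12-03` is read as the range `9-1`, which is not a valid range. A fixed negated character class avoids building the pattern from the input at all. The following is a minimal, self-contained sketch of that idea, not the asker's final code; the names `LetterFilter`, `keepLettersAndWhitespace` and `breaksCharacterClass` are hypothetical and exist only for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, used only to illustrate the approach.
final class LetterFilter {

    // Matches every character that is NOT an ASCII letter or whitespace.
    private static final Pattern NOT_LETTER_OR_SPACE = Pattern.compile("[^A-Za-z\\s]");

    /** Drops digits, punctuation and symbols, then lower-cases what is left. */
    static String keepLettersAndWhitespace(String source) {
        return NOT_LETTER_OR_SPACE.matcher(source)
                                  .replaceAll("")
                                  .toLowerCase(Locale.ROOT);
    }

    /** Returns true if the text fails to compile as a regex once wrapped in [...]. */
    static boolean breaksCharacterClass(String text) {
        try {
            Pattern.compile("[" + text + "]");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Shortened version of the tweet line from the question.
        String line = "60730027 6320951896 @thediscovietnam coo. thanks. 2009-12-03 18:41:07";
        System.out.println(breaksCharacterClass(line));
        System.out.println(keepLettersAndWhitespace(line));
    }
}
```

Removed characters are simply deleted, so the whitespace around them is kept and a run of several spaces can remain where digits or punctuation used to be. If single spaces are wanted, a follow-up `replaceAll("\\s+", " ")` on the result would collapse them.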