I have 2 text files:
1 Extract_tweet.txt
- Each line has the format: user_id tweet_id tweet_text, followed by a timestamp
12163922 5407952300 I think I just discovered the hour when the office thermostat changes. And it ain't a good time to be at work...brrrr 2009-11-03 19:22:54
2 locations.txt
- The relevant field in the data below is the 3rd column (asciiname), which acts as the search string
asciiname: name of geographical point in plain ascii characters, varchar(200)
4045431 Point Poker Point Poker 52.89508 173.29911 T CAPE US AK 016 0 9 America/Adak 2013-10-26
I want to extract some data from these files. The data may contain only a-z, A-Z and whitespace. I was earlier thinking of tokenizing the string. However, since no sentinel is given, I decided to use regular expressions instead. Please find below the code snippet that tries to extract the 27 allowed characters, i.e. a-z (or A-Z) and whitespace. I want the extracted text in lower case only, i.e. any upper-case character should be converted to lower case.
I open file 1 (Extract_tweet.txt) and take the text as a single string. I then try to replace each non-alphabetic character with an empty string.
public void readfromFile() throws FileNotFoundException
{
    Scanner inputStream;
    String source = null;
    FileInputStream file = new FileInputStream("Extract_tweet.txt");
    inputStream = new Scanner(file);
    while (inputStream.hasNextLine()) // Read from file till the last line of the file.
    {
        source = inputStream.nextLine();
        System.out.println(source);
        replaceAll(source);
    }
    inputStream.close();
}

public String replaceAll(String source)
{
    String regex = "[A-Z]*"+"["+source.toLowerCase()+"|"+"[a-z]*"+"[\\s]";
    source = source.replaceAll(regex, "");
    System.out.println(source);
    return source;
}

public static void main(String[] args) {
    StringProcessing sp = new StringProcessing();
    try {
        sp.readfromFile();
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
I get the error below when I run this code.
60730027 6320951896 @thediscovietnam coo. thanks. just dropped you a line. 2009-12-03 18:41:07
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal character range near index 88
[A-Z]*[60730027 6320951896 @thediscovietnam coo. thanks. just dropped you a line. 2009-12-03 18:41:07|[a-z]*[\s]
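The exception appears to come from concatenating the raw line into the pattern. Everything after the unescaped [ is read as a character class, so the 9-1 in 2009-12-03 is parsed as a range whose start is greater than its end, which the regex engine rejects. A minimal sketch that reproduces this in isolation (the class name RangeDemo and the helper compiles are made up for illustration):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RangeDemo {
    // Hypothetical helper: reports whether a string compiles as a regex.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Inside [...], "9-1" is a range that runs backwards, so compilation fails.
        System.out.println(compiles("[2009-12-03]"));
        // Escaping the hyphens makes them literal characters instead of range operators.
        System.out.println(compiles("[2009\\-12\\-03]"));
    }
}
```

In other words, the input text should not be part of the pattern at all; the pattern only needs to describe which characters to remove.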
I have made some changes. However, I still want to convert upper case to lower case and replace all non-alphabetic characters (digits, punctuation) with an empty string.
Expand your method:
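public String replaceAll(String source) throws FileNotFoundException {
    // Lower-case first, then drop every character that is not a-z or whitespace
    // (this also removes digits and punctuation).
    source = source.toLowerCase()
                   .replaceAll("[^a-z\\s]", "");
    System.out.println(source);
    writetoFile(source); // writetoFile is assumed to be defined elsewhere in the class
    return source;
}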
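A self-contained sketch of the clean-up step on its own, assuming the goal is to keep only lower-case letters and whitespace. The class name TweetCleaner and the method normalize are hypothetical names, not part of the original code, and collapsing the leftover runs of whitespace is an extra step you may not want:

```java
import java.util.Locale;

public class TweetCleaner {
    // Hypothetical helper: lower-cases the line and keeps only a-z and whitespace.
    static String normalize(String line) {
        return line.toLowerCase(Locale.ROOT)
                   .replaceAll("[^a-z\\s]", "") // drop digits, punctuation, symbols
                   .replaceAll("\\s+", " ")     // optional: collapse whitespace runs
                   .trim();                     // optional: strip leading/trailing space
    }

    public static void main(String[] args) {
        String line = "60730027 6320951896 @thediscovietnam coo. thanks. "
                    + "just dropped you a line. 2009-12-03 18:41:07";
        System.out.println(normalize(line));
    }
}
```

Lower-casing before the replacement means the pattern only has to mention a-z. The IDs and the timestamp disappear along with the punctuation, so if you need those fields, split them off before normalizing.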