
How to unescape non-ASCII (non-US-English) characters using grep?


I am using grep to parse a friend list obtained via the Facebook Open Graph API. I am mostly able to do what I want with the following command, issued in Bash:

grep -aiPo '"name":"(.*?)","id":"[[:digit:]]*"' friends?blahblah-access-token-stuff

which yields a list that looks like:

"name":"John Day","id":"--id omitted--"
"name":"Andria Cast\u00f1eda","id":"--id omitted--" // let me draw your attention here
"name":"Jane Doe","id":"--id omitted--"

Names were changed above to preserve privacy

Notice that there is an escaped sequence in the middle entry, which corresponds to ñ (n with a tilde). Is there an easy way to feed such characters into a Java program (my primary intention) so that Java understands that \u00f1 is Unicode-speak for the curly n?

I would prefer not to solve this problem by parsing the string in java and manually unescaping the unicode. I would very much prefer to instruct grep to handle this situation, or another GNU or open source tool that is widely available for bash.

At that point, I would feed the entire input as a file to a Java program without having to worry about OMG, is that a Unicode escape sequence!!? Java would naturally detect the Unicode characters and map them to their corresponding internal representation.

Thanks in advance!


Solution

  • Java understands Unicode. You provide Java Unicode escapes in the following manner:

    String str = "\u00F6";
    

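    So if a string literal in your source code contains an escape, such as "Andria Cast\u00f1eda", the compiler converts it to the corresponding character and no additional handling is required. Note that this conversion happens at compile time only: text read from a file at runtime still contains the literal backslash sequence and has to be decoded by your program or by a JSON parser.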

    Here's also a very brief, but easy to understand introduction:

    Unicode in Java

    If you're still not convinced, try this class:

    public class UnicodeExample {

        public static void main(String[] args) {
            // The compiler translates this escape into the single character U+00F1.
            String escaped = "\u00f1";
            // Written directly; the source file must be compiled with the matching encoding.
            String unescaped = "ñ";
            System.out.println(escaped);
            System.out.println(unescaped);

            if (escaped.equals(unescaped)) {
                System.out.println("The strings are the same!");
            } else {
                System.out.println("The strings are different!");
            }
        }
    }
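
    The class above only shows what the compiler does with escapes in source code. For text that arrives at runtime, such as the grep output in the question, the backslash sequence has to be decoded explicitly. Here is a minimal sketch of one way to do that with a regular expression. The class name UnicodeUnescaper and the method unescape are hypothetical names chosen for illustration, and the sketch assumes every backslash-u sequence in the input is a real escape (it does not handle an escaped backslash followed by "u"). A full JSON parser would cover those cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class UnicodeUnescaper {

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each literal backslash-u sequence with the character it encodes.
    public static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits as a UTF-16 code unit.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops '$' and '\' from being read as replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash makes this a runtime string holding a literal escape,
        // just like a line read from the grep output file.
        String fromFile = "\"name\":\"Andria Cast\\u00f1eda\"";
        System.out.println(unescape(fromFile));
    }
}
```

    Because each escape is decoded as one UTF-16 code unit, characters outside the Basic Multilingual Plane, which JSON writes as two consecutive escapes, come out as a valid surrogate pair. What the console actually displays depends on its encoding.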