java jaunt-api

Jaunt Java getText() returning correct text but with lots of "?"


The title says it all: the text is there, but instead of "aldo" I get "al?do", and the "?" characters seem to appear in a random pattern.

I have tried removing them with (String).replace("?", ""), but with no success.

I have also tried the following, with combinations of UTF_8, UTF_16 and ISO-8859, again with no success.

byte[] ptext = tempName.getBytes(UTF_8); 
String tempName1 = new String(ptext, UTF_16); 

An example of what I am getting:

Studded Regular Sweatshirt          // Instead of this
S?tudde?d R?eg?ular? Sw?eats?h?irt  // I get this

Could it be the website that notices the headless browser and tries to "spoof" its content? How can I overcome this?


Solution

  • It looks very likely that the site you are scraping intentionally mixes the 3f ("?") and 64 characters into your result. So you either have to mask yourself as a normal browser when scraping, or filter the extra characters out by replacing them.

    Sample text

    Sca???rfa???ce??? E???mbr???oi�d???ered L�e???athe
    

    After filtering

    Scarface Embroidered Leathe
    
    
    
    
    //Sca???rfa???ce??? E???mbr???oi�d???ered L�e???athe
    //Scarface Embroidered Leathe
    
    String hex="5363613f3f3f7266613f3f3f63653f3f3f20453f3f3f6d62723f3f3f6f69643f3f3f65726564204c653f3f3f61746865";
    byte[] bytes= hexStringToBytes(hex);
    
    // the only line you need: drop every '?' and every U+FFFD replacement character
    String res = new String(bytes, java.nio.charset.StandardCharsets.UTF_8).replace("?", "").replace("\uFFFD", "");
    
    private static byte charToByte(char c) {
        // numeric value (0-15) of a single upper-case hex digit
        return (byte) "0123456789ABCDEF".indexOf(c);
    }
    
    
    public static byte[] hexStringToBytes(String hexString) {
        if (hexString == null || hexString.equals("")) {
            return null;
        }
        hexString = hexString.toUpperCase();
        int length = hexString.length() / 2;
        char[] hexChars = hexString.toCharArray();
        byte[] d = new byte[length];
        for (int i = 0; i < length; i++) {
            int pos = i * 2;
            d[i] = (byte) (charToByte(hexChars[pos]) << 4 | charToByte(hexChars[pos + 1]));
    
        }
        return d;
    }
    
    public static String bytesToHexString(byte[] src){
        StringBuilder stringBuilder = new StringBuilder("");
        if (src == null || src.length <= 0) {
            return null;
        }
        for (int i = 0; i < src.length; i++) {
            int v = src[i] & 0xFF;
            String hv = Integer.toHexString(v);
            if (hv.length() < 2) {
                stringBuilder.append(0);
            }
            stringBuilder.append(hv);
        }
        return stringBuilder.toString();
    }
    
    public String printHexString(byte[] b) {
        String a = "";
        for (int i = 0; i < b.length; i++) { 
            String hex = Integer.toHexString(b[i] & 0xFF); 
            if (hex.length() == 1) { 
                hex = '0' + hex; 
            }
    
            a = a+hex;
        } 
    
        return a;
    }
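
If you go the filtering route, the idea above can be packed into one small helper. This is only a sketch: `ScrapedTextCleaner` and `clean` are made-up names, not part of Jaunt. It also assumes the filler may not always be a literal "?". A console often prints invisible characters such as zero-width spaces as "?", which might be why a plain replace("?", "") appeared to do nothing. So the pattern removes "?", the U+FFFD replacement character, and everything in the Unicode "format" category (\p{Cf}).

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Jaunt): strips filler characters from scraped text. */
final class ScrapedTextCleaner {

    // '?' (0x3f), U+FFFD (replacement character), and \p{Cf}, the Unicode
    // "format" category, which covers zero-width characters such as U+200B and U+200C.
    private static final Pattern FILLER = Pattern.compile("[?\\x{FFFD}\\p{Cf}]");

    /** Returns the input with every filler character removed. */
    static String clean(String scraped) {
        return FILLER.matcher(scraped).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: Studded Regular Sweatshirt
        System.out.println(clean("S?tudde?d R?eg?ular? Sw?eats?h?irt"));
        // prints: Scarface Embroidered Leathe
        System.out.println(clean("Sca???rfa???ce??? E???mbr???oi\uFFFDd???ered L\uFFFDe???athe"));
    }
}
```

Keep in mind that this also deletes question marks that genuinely belong to the text, so it only suits fields such as product names where a "?" is not expected.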