ESAPI validate URL fails due to &language parameter

I am trying to validate URL using ESAPI validation, but my validation is failing due to &lang. If I remove the language parameter then the validation is successful. Please check my URL pattern and let me know why this pattern is failing.

String url="http://google.com:000/menu.jsp?userid=test&age=22&language=hindi";

ESAPI.validator().getValidInput("URL", url,"URL",100000,false);

Reg Expression

Validator.URL=^(ht|f)tp(s?)\\:\\/\\/[0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*(:(0-9)*)*(\\/?)([a-zA-Z0-9\\-\\.\\?\\,\\:\\'\\/\\\\\\+=&amp;%\\&amp;language\\$#_]*)?$

Solution

This is by design. What's going on here is that your input is getting caught by ESAPI's canonicalize method which is detecting that the URI is containing BOTH URI encodings, and HTML entity encodings.

The next release of ESAPI will include a convenience method for this, it took a few steps however.

Convenience method:

    /**
     * {@inheritDoc}
     */
    public boolean isValidURI(String context, String input, boolean allowNull) {
        boolean isValid = false;
        URI compliantURI = this.getRfcCompliantURI(input);

        try{
            if(null != compliantURI){
                String canonicalizedURI = getCanonicalizedURI(compliantURI);
                //if getCanonicalizedURI doesn't throw an IntrusionException, then the URI contains no mixed or 
                //double-encoding attacks.  
                logger.info(Logger.SECURITY_SUCCESS, "We did not detect any mixed or multiple encoding in the uri:[" + input + "]");
                Validator v = ESAPI.validator();
                //This part will use the regex from validation.properties.  This regex should be super-simple, and 
                //used mainly to restrict certain parts of a URL.  
                Pattern p = ESAPI.securityConfiguration().getValidationPattern( "URL" );
                //We're doing this instead of using the normal validator API, because it will canonicalize the input again
                //and if the URI has any queries that also happen to match HTML entities, like &para;
                //it will cease conforming to the regex we now specify for a URL.
                isValid = p.matcher(canonicalizedURI).matches();
            }

        }catch (IntrusionException e){
            logger.error(Logger.SECURITY_FAILURE, e.getMessage());
            isValid = false;
        }


        return isValid;
    }

    /**
     * This does alot.  This will extract each piece of a URI according to parse zone, and it will construct 
     * a canonicalized String representing a version of the URI that is safe to run regex against to it. 
     * 
     * @param dirtyUri
     * @return
     * @throws IntrusionException
     */
    public String getCanonicalizedURI(URI dirtyUri) throws IntrusionException{

//      From RFC-3986 section 3     
//        URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
//
//                hier-part   = "//" authority path-abempty
//                            / path-absolute
//                            / path-rootless
//                            / path-empty

//         The following are two example URIs and their component parts:
//
//               foo://example.com:8042/over/there?name=ferret#nose
//               \_/   \______________/\_________/ \_________/ \__/
//                |           |            |            |        |
//             scheme     authority       path        query   fragment
//                |   _____________________|__
//               / \ /                        \
//               urn:example:animal:ferret:nose
        Map<UriSegment, String> parseMap = new EnumMap<UriSegment, String>(UriSegment.class);
        parseMap.put(UriSegment.SCHEME, dirtyUri.getScheme());
        //authority   = [ userinfo "@" ] host [ ":" port ]
        parseMap.put(UriSegment.AUTHORITY, dirtyUri.getRawAuthority());
        parseMap.put(UriSegment.SCHEMSPECIFICPART, dirtyUri.getRawSchemeSpecificPart());
        parseMap.put(UriSegment.HOST, dirtyUri.getHost());
        //if port is undefined, it will return -1
        Integer port = new Integer(dirtyUri.getPort());
        parseMap.put(UriSegment.PORT, port == -1 ? "": port.toString());
        parseMap.put(UriSegment.PATH, dirtyUri.getRawPath());
        parseMap.put(UriSegment.QUERY, dirtyUri.getRawQuery());
        parseMap.put(UriSegment.FRAGMENT, dirtyUri.getRawFragment());

        //Now we canonicalize each part and build our string.  
        StringBuilder sb = new StringBuilder();

        //Replace all the items in the map with canonicalized versions.

        Set<UriSegment> set = parseMap.keySet();

        SecurityConfiguration sg = ESAPI.securityConfiguration();
//      boolean restrictMixed = sg.getBooleanProp("AllowMixedEncoding");
//      boolean restrictMultiple = sg.getBooleanProp("AllowMultipleEncoding");
        boolean allowMixed = sg.getAllowMixedEncoding();
        boolean allowMultiple = sg.getAllowMultipleEncoding();
        for(UriSegment seg: set){
            String value = encoder.canonicalize(parseMap.get(seg), allowMultiple, allowMixed);
            value = value == null ? "" : value;
            //In the case of a uri query, we need to break up and canonicalize the internal parts of the query.
            if(seg == UriSegment.QUERY && null != parseMap.get(seg)){
                StringBuilder qBuilder = new StringBuilder();
                try {
                    Map<String, List<String>> canonicalizedMap = this.splitQuery(dirtyUri);
                    Set<Entry<String, List<String>>> query = canonicalizedMap.entrySet();
                    Iterator<Entry<String, List<String>>> i = query.iterator();
                    while(i.hasNext()){
                        Entry<String, List<String>> e = i.next(); 
                        String key = (String) e.getKey();
                        String qVal = "";
                        List<String> list = (List<String>) e.getValue();
                        if(!list.isEmpty()){
                            qVal = list.get(0);
                        }
                        qBuilder.append(key)
                        .append("=")
                        .append(qVal);

                        if(i.hasNext()){
                            qBuilder.append("&");
                        }
                    }
                    value = qBuilder.toString();
                } catch (UnsupportedEncodingException e) {
                    logger.debug(Logger.EVENT_FAILURE, "decoding error when parsing [" + dirtyUri.toString() + "]");
                }
            }
            //Check if the port is -1, if it is, omit it from the output.
            if(seg == UriSegment.PORT){
                if("-1" == parseMap.get(seg)){
                    value = "";
                }
            }
            parseMap.put(seg, value );
        }

        return buildUrl(parseMap);
    }

    /**
     * The meat of this method was taken from StackOverflow:  http://stackoverflow.com/a/13592567/557153
     * It has been modified to return a canonicalized key and value pairing.  
     * 
     * @param java URI
     * @return a map of canonicalized query parameters.  
     * @throws UnsupportedEncodingException
     */
        public Map<String, List<String>> splitQuery(URI uri) throws UnsupportedEncodingException {
          final Map<String, List<String>> query_pairs = new LinkedHashMap<String, List<String>>();
          final String[] pairs = uri.getQuery().split("&");
          for (String pair : pairs) {
            final int idx = pair.indexOf("=");
            final String key = idx > 0 ? encoder.canonicalize(pair.substring(0, idx)) : pair;
            if (!query_pairs.containsKey(key)) {
              query_pairs.put(key, new LinkedList<String>());
            }
            final String value = idx > 0 && pair.length() > idx + 1 ? URLDecoder.decode(pair.substring(idx + 1), "UTF-8") : null;
            query_pairs.get(key).add(encoder.canonicalize(value));
          }
          return query_pairs;
        }

        public enum UriSegment {
            AUTHORITY, SCHEME, SCHEMSPECIFICPART, USERINFO, HOST, PORT, PATH, QUERY, FRAGMENT
        }

The reference code lives here.

Please note that parts of this will be refactored into the DefaultEncoder class, as the canonicalize method properly belongs there.

The code as written is heavily tested in the ESAPI project code, but I might have forgotten a method or two. You could clone and compile the esapi project as-is, but some organizations won't permit you to use bleeding-edge non-release binaries.