
Splitting regex for "[{'v4','v5'},{'v6','v7'}]"


TL;DR

I have a large set of data that looks like arrays of value-only JSON objects. I'm wondering if regex can concisely handle this structure:

"[]"                        --> List.of(List.of())
"[{'v1','v2','v3'}]"        --> List.of(List.of("v1","v2","v3"))
"[{'v4','v5'},{'v6','v7'}]" --> List.of(List.of("v4","v5"),List.of("v6","v7"))

The values are ordered: a POJO will be constructed from each inner list, using its values as positional arguments, and each value is a simple type (int, long, or String) as defined by that POJO.

Details

This parser is part of a Jackson CSV serializer/deserializer for POJOs that contain a container of POJOs. As far as I can tell, CsvMapper only supports POJOs with a container of primitives, hence the need for a custom parser. As an example, take a structure like this:

record Person(
    String name,
    List<Pet> pets) {
}

record Pet(
    String name,
    String type) {
}

so the following:

new Person("Jan", List.of(new Pet("Mr. Bubbles", "dog"), new Pet("Lilly", "cat")));

is serialized to CSV as two columns:

Jan,"[{'Mr. Bubbles','dog'}, {'Lilly','cat'}]"

where the second column is a container of POJOs. To deserialize this, my custom Jackson deserializer does this:

public static class PersonDeserializer extends StdDeserializer<Person> {
    private static final long serialVersionUID = 1L;

    public PersonDeserializer() {
        this(Person.class);
    }

    public PersonDeserializer(Class<Person> type) {
        super(type);
    }

    @Override
    public Person deserialize(JsonParser p, DeserializationContext ctxt)
      throws IOException, JsonProcessingException {
        JsonNode node = p.getCodec().readTree(p);

        String name = node.get("name").asText();
        List<Pet> pets = deserialize(node.get("pets").asText());

        return new Person(name, pets);
    }

    private static List<Pet> deserialize(String serializedPets) {
        List<Pet> pets = new ArrayList<>();

        // messy custom parser, ATM
        // ...

        return pets;
    }

}

where a very messy custom parser in deserialize(String) builds the POJOs. I'm hoping someone with more regex experience can help.


Solution

  • Here's a pure Java way:

    List<List<String>> result = Arrays.stream(input.replaceAll("^.\\{?|}?.$", "").split("},\\s*\\{"))
        .map(inner -> Arrays.stream(inner.replaceAll("^.'?|'?.$", "").split("','")).toList())
        .toList();
    

    which:

    • trims [{ from the front and }] from the back (the curly braces are optional)
    • splits on },{ while allowing whitespace after the comma, as in the CSV example in the question
    • trims the outer quotes from each inner group and splits it on ',' in the same way
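
The same trimming and splitting can also be sketched with capture groups and Matcher.find instead of split. This is an alternative, not the one-liner above: the class name GroupParser and the method parse are hypothetical, and the sketch assumes that values never contain a single quote or a closing curly brace. For "[]" it returns an empty outer list, whereas the split-based expression returns one inner list holding a single empty string, because split on an empty string yields a one-element array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name is illustrative and not part of any library.
final class GroupParser {
    // One {...} group; group(1) is everything between the braces.
    // Assumption: values never contain '}'.
    private static final Pattern GROUP = Pattern.compile("\\{([^}]*)\\}");
    // One single-quoted value; group(1) is the text between the quotes.
    // Assumption: values never contain a single quote.
    private static final Pattern VALUE = Pattern.compile("'([^']*)'");

    static List<List<String>> parse(String input) {
        List<List<String>> result = new ArrayList<>();
        Matcher group = GROUP.matcher(input);
        while (group.find()) {
            // Collect the quoted values of this group in order.
            List<String> values = new ArrayList<>();
            Matcher value = VALUE.matcher(group.group(1));
            while (value.find()) {
                values.add(value.group(1));
            }
            result.add(values);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("[]"));                                       // []
        System.out.println(parse("[{'v1','v2','v3'}]"));                       // [[v1, v2, v3]]
        System.out.println(parse("[{'Mr. Bubbles','dog'}, {'Lilly','cat'}]")); // [[Mr. Bubbles, dog], [Lilly, cat]]
    }
}
```

Because the patterns only look at what sits inside the braces and quotes, whitespace between groups or values is ignored without extra handling.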

    You could also use a JSON parser:

    ObjectMapper om = new ObjectMapper().configure(JsonParser.Feature.ALLOW_SINGLE_QUOTES, true);
    List<List<String>> result = om.readValue(input.replace("{", "[").replace("}","]"), new TypeReference<List<List<String>>>() {});
    

    but you must:

    • allow single quotes
    • replace curly braces with square brackets so the input becomes valid JSON (note that this also rewrites any braces inside the values)
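
Either approach yields a List<List<String>>. Below is a minimal sketch of turning those positional values into the Pet records from the question, for example inside the deserialize(String) helper. The names PetMapping and toPets are hypothetical, and the size check is an assumption about how malformed groups should be handled.

```java
import java.util.List;

// Hypothetical helper; the names are illustrative only.
final class PetMapping {
    // Same shape as the Pet record in the question.
    record Pet(String name, String type) {}

    // Maps each inner list to a Pet by position: index 0 is the name, index 1 the type.
    static List<Pet> toPets(List<List<String>> rows) {
        return rows.stream()
            .map(row -> {
                // Assumption: a group with the wrong number of values is an error.
                if (row.size() != 2) {
                    throw new IllegalArgumentException("Expected 2 values for Pet but got " + row);
                }
                return new Pet(row.get(0), row.get(1));
            })
            .toList();
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
            List.of("Mr. Bubbles", "dog"),
            List.of("Lilly", "cat"));
        System.out.println(toPets(rows));
    }
}
```

For records with int or long components, the corresponding value would additionally need Integer.parseInt or Long.parseLong before being passed to the constructor.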