Search code examples
rdfwekajenaontologydecision-tree

Is there any way to map Weka j48 decision tree output to RDF format?


I want to create an ontology using Jena based on Weka j48 decision tree output. But this output needs to be mapped to RDF format before inputting it to Jena. Is there any way to do this mapping ?

EDIT1:

Sample part of j48 decision tree output before mapping: before

Sample part of RDF corresponding to decision tree output: after

These 2 screens are from this research paper (slide 4):

Efficient Spam Email Filtering using Adaptive Ontology


Solution

  • There is likely no built-in way to do that.

    Disclaimer: I have never worked with Jena and RDF before. So this answer may be incomplete or miss the point of the intended conversion.

    But nevertheless, and first of all, a short rant:


    <rant>

    The snippets that are published in the paper (namely, the output of the Weka classifier and the RDF) are incomplete and obviously inconsistent. The process of the conversion is not described at all. Instead, they only mentioned:

    The challenge we faced was mainly to make J48 classification outputs to RDF and gave it to Jena

    (sic!)

    Now, they solved it somehow. They could have provided their conversion code on a public, open-source repository. This would allow others to provide improvements, it would increase the visibility and verifiability of their approach. But instead, they wasted their time and the time of the readers with screenshots of various websites that serve as page-fillers in a pitiful attempt to squeeze yet another publication out of their approach.

    </rant>


    The following is my best-effort approach to provide some of the building blocks that may be necessary for the conversion. It has to be taken with a grain of salt, because I'm not familiar with the underlying methods and libraries. But I hope that it can be considered as "helpful" nevertheless.

    The Weka Classifier implementations usually do not provide the structures that they are using for their internal workings. So it is not possible to access the internal tree structure directly. However, there is a method prefix() that returns a string representation of the tree.

    The code below contains a very pragmatic (and thus, somewhat brittle...) method that parses this string and builds a tree structure that contains the relevant information. This structure consists of TreeNode objects:

    static class TreeNode
    {
        String label;
        String attribute;
        String relation;
        String value;
        ...
    }
    
    • The label is the class label that was used for the classifier. This is only non-null for leaf nodes. For the example from the paper, this would be "0" or "1", indicating whether an email is spam or not.

    • The attribute is the attribute that a decision is based on. For the example from the paper, such an attribute may be word_freq_remove

    • The relation and value are the strings representing the decision criteria. These may be "<=" and "0.08" for example.

    After such a tree structure has been created, it can be converted into an Apache Jena Model instance. The code contains such a conversion method, but due to my lack of familiarity with RDF, I'm not sure whether it makes sense conceptually. Adjustments may be necessary in order to create the "desired" RDF structure out of this tree structure. But naively, the output looks like it could make sense.

    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.List;
    
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.rdf.model.Statement;
    
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;
    
    public class WekaClassifierToRdf
    {
        public static void main(String[] args) throws Exception
        {
            String fileName = "./data/iris.arff";
            ArffLoader arffLoader = new ArffLoader();
            arffLoader.setSource(new FileInputStream(fileName));
            Instances instances = arffLoader.getDataSet();
            instances.setClassIndex(4);
            //System.out.println(instances);
    
            J48 classifier = new J48();
            classifier.buildClassifier(instances);
    
            System.out.println(classifier);
    
            String prefixTreeString = classifier.prefix();
            TreeNode node = processPrefixTreeString(prefixTreeString);
    
            System.out.println("Tree:");
            System.out.println(node.createString());
    
            Model model = createModel(node);
    
            System.out.println("Model:");
            model.write(System.out, "RDF/XML-ABBREV");
        }
    
        private static TreeNode processPrefixTreeString(String inputString)
        {
            String string = inputString.replaceAll("\\n", "");
    
            //System.out.println("Input is " + string);
    
            int open = string.indexOf("[");
            int close = string.lastIndexOf("]");
            String part = string.substring(open + 1, close);
    
            //System.out.println("Part " + part);
    
            int colon = part.indexOf(":");
            if (colon == -1)
            {
                TreeNode node = new TreeNode();
    
                int openAfterLabel = part.lastIndexOf("(");
                String label = part.substring(0, openAfterLabel).trim();
                node.label = label;
                return node;
            }
    
            String attributeName = part.substring(0, colon);
    
            //System.out.println("attributeName " + attributeName);
    
            int comma = part.indexOf(",", colon);
    
            int leftOpen = part.indexOf("[", comma);
    
            String leftCondition = part.substring(colon + 1, comma).trim();
            String rightCondition = part.substring(comma + 1, leftOpen).trim();
    
            int leftSpace = leftCondition.indexOf(" ");
            String leftRelation = leftCondition.substring(0, leftSpace).trim();
            String leftValue = leftCondition.substring(leftSpace + 1).trim();
    
            int rightSpace = rightCondition.indexOf(" ");
            String rightRelation = rightCondition.substring(0, rightSpace).trim();
            String rightValue = rightCondition.substring(rightSpace + 1).trim();
    
            //System.out.println("leftCondition " + leftCondition);
            //System.out.println("rightCondition " + rightCondition);
    
            int leftClose = findClosing(part, leftOpen + 1);
            String left = part.substring(leftOpen, leftClose + 1);
    
            //System.out.println("left " + left);
    
            int rightOpen = part.indexOf("[", leftClose);
            int rightClose = findClosing(part, rightOpen + 1);
            String right = part.substring(rightOpen, rightClose + 1);
    
            //System.out.println("right " + right);
    
            TreeNode leftNode = processPrefixTreeString(left);
            leftNode.relation = leftRelation;
            leftNode.value = leftValue;
    
            TreeNode rightNode = processPrefixTreeString(right);
            rightNode.relation = rightRelation;
            rightNode.value = rightValue;
    
            TreeNode result = new TreeNode();
            result.attribute = attributeName;
            result.children.add(leftNode);
            result.children.add(rightNode);
            return result;
    
        }
    
        private static int findClosing(String string, int startIndex)
        {
            int stack = 0;
            for (int i=startIndex; i<string.length(); i++)
            {
                char c = string.charAt(i);
                if (c == '[')
                {
                    stack++;
                }
                if (c == ']')
                {
                    if (stack == 0)
                    {
                        return i;
                    }
                    stack--;
                }
            }
            return -1;
        }
    
        static class TreeNode
        {
            String label;
            String attribute;
            String relation;
            String value;
            List<TreeNode> children = new ArrayList<TreeNode>();
    
            String createString()
            {
                StringBuilder sb = new StringBuilder();
                createString("", sb);
                return sb.toString();
            }
    
            private void createString(String indent, StringBuilder sb)
            {
                if (children.isEmpty())
                {
                    sb.append(indent + label);
                }
                sb.append("\n");
                for (TreeNode child : children)
                {
                    sb.append(indent + "if " + attribute + " " + child.relation
                        + " " + child.value + ": ");
                    child.createString(indent + "  ", sb);
                }
            }
    
            @Override
            public String toString()
            {
                return "TreeNode [label=" + label + ", attribute=" + attribute
                    + ", relation=" + relation + ", value=" + value + "]";
            }
        }    
    
        private static String createPropertyString(TreeNode node)
        {
            if ("<".equals(node.relation))
            {
                return "lt_" + node.value;
            }
            if ("<=".equals(node.relation))
            {
                return "lte_" + node.value;
            }
            if (">".equals(node.relation))
            {
                return "gt_" + node.value;
            }
            if (">=".equals(node.relation))
            {
                return "gte_" + node.value;
            }
            System.err.println("Unknown relation: " + node.relation);
            return "UNKNOWN";
        }    
    
        static Model createModel(TreeNode node)
        {
            Model model = ModelFactory.createDefaultModel();
    
            String baseUri = "http://www.example.com/example#";
            model.createResource(baseUri);
            model.setNsPrefix("base", baseUri);
            populateModel(model, baseUri, node, node.attribute);
            return model;
        }
    
        private static void populateModel(Model model, String baseUri,
            TreeNode node, String resourceName)
        {
            //System.out.println("Populate with " + resourceName);
    
            for (TreeNode child : node.children)
            {
                if (child.label != null)
                {
                    Resource resource =
                        model.createResource(baseUri + resourceName);
                    String propertyString = createPropertyString(child);
                    Property property =
                        model.createProperty(baseUri, propertyString);
                    Statement statement = model.createLiteralStatement(resource,
                        property, child.label);
                    model.add(statement);
                }
                else
                {
                    Resource resource =
                        model.createResource(baseUri + resourceName);
                    String propertyString = createPropertyString(child);
                    Property property =
                        model.createProperty(baseUri, propertyString);
    
                    String nextResourceName = resourceName + "_" + child.attribute;
                    Resource childResource =
                        model.createResource(baseUri + nextResourceName);
                    Statement statement =
                        model.createStatement(resource, property, childResource);
                    model.add(statement);
                }
            }
            for (TreeNode child : node.children)
            {
                String nextResourceName = resourceName + "_" + child.attribute;
                populateModel(model, baseUri, child, nextResourceName);
            }
        }
    
    }
    

    The program parses the famous Iris dataset from an ARFF file, runs the J48 classifier, builds the tree structure and generates and prints the RDF model. The output is shown here:

    The classifier, as it is printed by Weka:

    J48 pruned tree
    ------------------
    
    petalwidth <= 0.6: Iris-setosa (50.0)
    petalwidth > 0.6
    |   petalwidth <= 1.7
    |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
    |   |   petallength > 4.9
    |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
    |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    |   petalwidth > 1.7: Iris-virginica (46.0/1.0)
    
    Number of Leaves  :     5
    
    Size of the tree :     9
    

    A string representation of the tree structure that is built internally:

    Tree:
    
    if petalwidth <= 0.6:   Iris-setosa
    if petalwidth > 0.6: 
      if petalwidth <= 1.7: 
        if petallength <= 4.9:       Iris-versicolor
        if petallength > 4.9: 
          if petalwidth <= 1.5:         Iris-virginica
          if petalwidth > 1.5:         Iris-versicolor
      if petalwidth > 1.7:     Iris-virginica
    

    The resulting RDF model:

    Model:
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:base="http://www.example.com/example#">
      <rdf:Description rdf:about="http://www.example.com/example#petalwidth">
        <base:gt_0.6>
          <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth">
            <base:gt_1.7>Iris-virginica</base:gt_1.7>
            <base:lte_1.7>
              <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength">
                <base:gt_4.9>
                  <rdf:Description rdf:about="http://www.example.com/example#petalwidth_petalwidth_petallength_petalwidth">
                    <base:gt_1.5>Iris-versicolor</base:gt_1.5>
                    <base:lte_1.5>Iris-virginica</base:lte_1.5>
                  </rdf:Description>
                </base:gt_4.9>
                <base:lte_4.9>Iris-versicolor</base:lte_4.9>
              </rdf:Description>
            </base:lte_1.7>
          </rdf:Description>
        </base:gt_0.6>
        <base:lte_0.6>Iris-setosa</base:lte_0.6>
      </rdf:Description>
    </rdf:RDF>