Search code examples
algorithmdata-miningdecision-treefuzzy-logic

How is the class center for a decision attribute calculated in class center based fuzzification algorithm?


I came across class center based fuzzification algorithm on page 16 of this research paper on TRFDT. However, I fail to understand what is happening in step 2 of this algorithm (titled in the paper as Algorithm 2: Fuzzification). If someone could explain it by giving a small example it would certainly be helpful.


Solution

  • It is not clear from your question which parts of the article you understand and IMHO the article is written in not the clearest possible way, so this is going to be a long answer.

    Let's start with some intuition behind this article. In short I'd say it is: "let's add fuzziness everywhere to decision trees".

    How a decision tree works? We have a classification problem and we say that instead of analyzing all attributes of a data point in a holistic way, we'll analyze them one by one in an order defined by the tree and will navigate the tree until we reach some leaf node. The label at that leaf node is our prediction. So the trick is how to build a good tree i.e. a good order of attributes and good splitting points. This is a well studied problem and the idea is to build a tree that encode as much information as possible by some metric. There are several metrics and this article uses entropy which is similar to widely used information gain.

    The next idea is that we can change the classification (i.e. split of the values into a classes) as fuzzy rather than exact (aka "crisp"). The idea here is that in many real life situations not all members of the class are equally representative: some a more "core" examples and some a more "edge" example. If we can catch this difference, we can provide a better classification.

    And finally there is a question of how similar the data points are (generally or by some subset of attributes) and here we can also have a fuzzy answer (see formulas 6-8).

    So the idea of the main algorithm (Algorithm 1) is the same as in the ID3 tree: recursively find the attribute a* that classifies the data in the best way and perform the best split along that attribute. The main difference is in how the information gain for the best attribute selection is measured (see heuristic in formulas 20-24) and that because of fuzziness the usual stop rule of "only one class left" doesn't work anymore and thus another entropy (Kosko fuzzy entropy in 25) is used to decide if it is time to stop.

    Given this skeleton of the algorithm 1 there are quite a few parts that you can (or should) select:

    • How do you measure μ(ai)τ(Cj)(x) used in (20) (this is a measure of how well x represents the class Cj with respect to attribute ai, note that here being not in Cj and far from the points in Cj is also good) with two obvious choices of the lower (16 and 18) and the upper bounds (17 and 19)

    • How do you measure μ(x, y) used in (16-19). Given that R is induced by ai this becomes μ(ai)τ(x, y) which is a measure of similarity between two points with respect to attribute ai. Here you can choose one of the metrics (6-8)

    • How do you measure μCi(y) used in (16-19). This is the measure of how well the point y fits in the class Ci. If you already have data as fuzzy classification, there is nothing you should do here. But if your input classification is crisp, then you should somehow produce μCi(y) from that and this is what the Algorithm 2 does.

    There is a trivial solution of μCj(xi) = "1 if xi ∈ Cj and 0 otherwise" but this is not fuzzy at all. The process of building fuzzy data is called "fuzzification". The idea behind the Algorithm 2 is that we assume that every class Cj is actually some kind of a cluster in the space of attributes. And so we can measure the degree of membership μCj(xi) as the distance from the xi to the center of the cluster cj (the closer we are, the higher the membership should be so it is really some inverse of a distance). Note that since distance is measured by attributes, you should normalize your attributes somehow or one of them might dominate the distance. And this is exactly what the Algorithm 2 does:

    1. it estimates the center of the cluster for class Cj as the center of mass of all the known points in that class i.e. just an average of all points by each coordinate (attribute).

    2. it calculates the distances from each point xi to each estimated center of class cj

    3. looking into formula at step #12 it uses inverse square of the distance as a measure of proximity and just normalizes the value because for fuzzy sets Sum[over all Cj](μCj(xi)) should be 1