python, python-3.x, regex, confusion-matrix

Deleting a row on the basis of a particular column value and deleting the instances of that in another file?


I have a ruleset, generated from iris_dataset, that looks like this:

2,3,2,3,1
*,*,*,4,2
2,*,2,2,1
*,1,*,4,2
1,*,*,3,2
*,*,*,1,0
*,*,3,*,2
3,*,*,*,2
2,1,3,2,2
1,1,2,3,2
*,*,3,4,2
3,*,3,*,2
*,*,1,1,0
2,1,3,3,1
2,*,*,3,1
2,2,2,4,1
*,*,*,3,1
*,*,1,*,0
*,*,3,2,2
*,2,2,*,1
*,*,2,2,1

The first four columns are the four attribute values, say a0, a1, a2, a3, and the fifth column is the class value. * stands for "don't care". For example, 2,*,*,3,1 means that if a0=2 and a3=3, then a1 and a2 do not matter and the class is 1.
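The wildcard semantics described above can be sketched as a small predicate. This is only an illustration, not the code used in the question; the name `rule_matches` is hypothetical, and both the rule and the row are assumed to be lists of strings with the class in the last position.

```python
def rule_matches(rule, row):
    """Return True if every non-wildcard attribute of `rule` equals the row's value.

    `rule` and `row` are lists of strings: four attributes followed by the class.
    The class column (last item) is not part of the condition.
    """
    return all(r == "*" or r == v for r, v in zip(rule[:-1], row[:-1]))


rule = "2,*,*,3,1".split(",")
print(rule_matches(rule, "2,3,2,3,1".split(",")))  # a0=2 and a3=3, so the rule applies
print(rule_matches(rule, "2,1,2,2,1".split(",")))  # a3 is 2, so the rule does not apply
```

Comparing column by column avoids building a regular expression and keeps working if an attribute value ever has more than one digit.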

I also have a dataset (given at the end of this question). I compare the ruleset against it, using a confusion matrix to get the fitness of each rule.
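As a rough sketch of that idea, the fitness of one rule can be computed as accuracy, (TP + TN) / (TP + TN + FP + FN), where a row counts as "predicted positive" when the rule's conditions cover it. The function name `fitness` and the four sample rows are assumptions made for illustration; this version checks all four attribute columns against the rule.

```python
def fitness(rule, rows):
    """Accuracy of one rule over `rows`: (TP + TN) / (TP + TN + FP + FN)."""
    *conditions, rule_class = rule
    tp = fp = fn = tn = 0
    for row in rows:
        *values, row_class = row
        covered = all(c == "*" or c == v for c, v in zip(conditions, values))
        same_class = row_class == rule_class
        if covered and same_class:
            tp += 1      # rule fires and the class is right
        elif covered:
            fp += 1      # rule fires but the class is wrong
        elif same_class:
            fn += 1      # rule misses a row of its own class
        else:
            tn += 1      # rule correctly stays silent
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Four rows taken from the dataset, used only as a small sample.
rows = [line.split(",") for line in ["1,3,1,1,0", "2,1,2,2,1", "2,1,3,4,2", "1,1,2,3,2"]]
print(fitness("*,*,*,1,0".split(","), rows))
print(fitness("1,*,*,3,2".split(","), rows))
```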

To get the fitness, I have written the following code:

import re
import sys

dataset = sys.argv[1]
ruleset = sys.argv[2]

# Each rule is a list of strings: four attribute values (or '*') and the class.
with open(ruleset, 'r') as infile:
    rules = [line.strip().split(',') for line in infile]

# Each dataset row is a list of ints: four attribute values and the class.
with open(dataset, 'r') as infile:
    rows = [list(map(int, line.split(','))) for line in infile]
new_rules = []

for rule in rules:

    class_ = rule[-1]
    tp = 0
    fn = 0
    tn = 0
    fp = 0
    pat = ''

    # Build a regex from the rule: '*' becomes "any digit".
    # NOTE: rule[1:-1] skips the first attribute (a0); rule[:-1] would cover
    # all four. It is kept as posted because the fitness values listed below
    # were computed this way.
    for i in rule[1:-1]:
        if i == '*':
            pat += r'\d'
        else:
            pat += i
    pattern = re.compile(pat)  # compile once per rule, not once per row

    for element in rows:
        element = list(map(str, element))
        mat = pattern.match(''.join(element[:-1]))

        if mat:
            if element[-1] == class_:
                tp += 1
            else:
                fp += 1
        else:
            if element[-1] == class_:
                fn += 1
            else:
                tn += 1

    print(f"True Positive: {tp}, False Negative: {fn}, True Negative: {tn}, False Positive : {fp}")
    fitness_1 = (tp + tn) / (tp + tn + fp + fn)
    fitness = "{:.2f}".format(fitness_1)
    new_rules.append(rule + [fitness])

# Write the rules with their fitness appended, one rule per line.
with open("final_output_", "w") as outfile:
    outfile.write("\n".join(','.join(map(str, rule)) for rule in new_rules))

# Print the file back to confirm what was written.
with open("final_output_", "r") as f:
    print(f.read())

Now the final_output_ file looks like:

2,3,2,3,1,0.66
*,*,*,4,2,0.67
2,*,2,2,1,0.71
*,1,*,4,2,0.67
1,*,*,3,2,0.95
*,*,*,1,0,1.00
*,*,3,*,2,0.49
3,*,*,*,2,0.33
2,1,3,2,2,0.67
1,1,2,3,2,0.67
*,*,3,4,2,0.67
3,*,3,*,2,0.49
*,*,1,1,0,0.72
2,1,3,3,1,0.67
2,*,*,3,1,0.39
2,2,2,4,1,0.67
*,*,*,3,1,0.39
*,*,1,*,0,0.22
*,*,3,2,2,0.66
*,2,2,*,1,0.64
*,*,2,2,1,0.71

The last column is the fitness of that particular rule. Now I want to sort these rules, first by class and then by fitness (highest first), so that after sorting the file looks something like:

*,*,*,1,0,1.00
*,*,1,1,0,0.72
*,*,1,*,0,0.22
*,*,2,2,1,0.71
.
.
.
.
3,*,*,*,2,0.33

Then I want to pick the first rule (the one with the highest fitness) from class 0, put it in a separate file, say ruleset_new, and look for instances of that rule in the dataset given below. All instances of that rule will be deleted from the dataset, and a new dataset, say dataset_new, will be generated from the remaining rows. The previous ruleset will now have one rule fewer. Fitness will then be recalculated for this reduced ruleset against dataset_new using the code above. Next, the first rule (highest fitness) from class 1 will be selected and added to ruleset_new, and the same steps will be repeated until all instances of the dataset are covered.
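The procedure described above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: the names `covers`, `accuracy` and `select_rules` are hypothetical, the classes are visited in round-robin order (0, 1, 2, 0, ...), and a row is treated as an "instance" of a rule only when both its attributes and its class agree with the rule. Writing the results to ruleset_new and dataset_new is left out.

```python
from itertools import cycle


def covers(rule, row):
    """True if every non-wildcard attribute of the rule equals the row's value."""
    return all(c == "*" or c == v for c, v in zip(rule[:-1], row[:-1]))


def accuracy(rule, rows):
    """(TP + TN) / total: the rule is right when 'covers' agrees with 'same class'."""
    hits = sum(covers(rule, row) == (row[-1] == rule[-1]) for row in rows)
    return hits / len(rows)


def select_rules(rules, rows):
    """Pick the fittest rule per class in turn and drop its instances each round."""
    rules, rows = list(rules), list(rows)
    selected = []
    classes = sorted({rule[-1] for rule in rules})
    for cls in cycle(classes):
        if not rows or not rules:
            break  # everything is covered, or no rules are left to try
        candidates = [r for r in rules if r[-1] == cls]
        if not candidates:
            continue  # this class has no rules left; move on to the next class
        # Fitness is recomputed against the rows that are still uncovered.
        best = max(candidates, key=lambda r: accuracy(r, rows))
        selected.append(best)
        rules.remove(best)
        # Assumption: an instance of a rule matches its attributes AND its class.
        rows = [row for row in rows
                if not (covers(best, row) and row[-1] == best[-1])]
    return selected, rows


rules = [r.split(",") for r in ["*,*,*,1,0", "*,*,2,2,1", "*,*,3,*,2", "1,*,*,3,2"]]
rows = [r.split(",") for r in ["1,3,1,1,0", "2,1,2,2,1", "2,1,3,4,2", "1,1,2,3,2"]]
ruleset_new, dataset_new = select_rules(rules, rows)
for rule in ruleset_new:
    print(",".join(rule))
print(len(dataset_new), "rows left uncovered")
```

The loop always ends because every round either removes one rule or skips a class that has none left. If covered rows should be deleted regardless of their class, drop the `row[-1] == best[-1]` condition.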

Below is the dataset

3,1,3,3,2
1,2,1,1,0
1,1,1,1,0
2,2,3,4,2
1,3,1,1,0
2,1,2,2,1
1,1,1,1,0
2,1,3,4,2
2,2,2,2,1
1,1,2,2,1
2,1,3,4,2
3,3,3,4,2
2,1,2,2,1
1,1,2,2,1
2,1,3,4,2
2,1,2,2,1
2,1,3,4,2
1,2,1,1,0
1,2,1,1,0
2,2,2,2,1
2,1,2,2,1
2,1,2,4,2
2,3,1,1,0
1,3,1,1,0
2,1,3,4,2
2,1,3,4,2
1,1,2,2,1
1,1,2,2,1
2,1,2,2,1
3,1,3,4,2
2,1,3,4,2
2,2,3,4,2
2,1,3,4,2
2,2,3,4,2
2,3,1,1,0
2,1,2,2,1
3,1,3,4,2
1,3,1,1,0
2,1,3,4,2
2,2,2,2,1
2,1,2,2,1
1,3,1,1,0
1,2,1,1,0
1,1,1,1,0
1,2,1,1,0
2,1,2,2,1
2,2,3,4,2
2,3,3,4,2
2,1,2,2,1
2,1,2,2,1
1,3,1,1,0
2,1,2,2,1
3,1,3,4,2
2,1,3,4,2
2,2,3,4,2
3,1,3,4,2
2,1,2,2,1
2,1,2,2,1
1,3,1,1,0
1,3,1,1,0
1,3,1,1,0
2,1,2,2,1
1,1,1,1,0
1,3,1,1,0
2,1,2,2,1
1,3,1,1,0
1,1,2,2,1
2,1,2,2,1
1,1,2,2,1
1,1,2,3,2
2,1,2,2,1
2,1,3,4,2
1,3,1,1,0
3,1,3,4,2
2,1,3,4,2
3,2,3,4,2
2,1,3,2,2
1,3,1,1,0
2,1,2,4,2
2,1,2,2,1
2,2,3,4,2
1,3,1,1,0
2,1,3,2,2
2,1,2,4,2
2,2,3,4,2
1,3,1,1,0
2,1,2,2,1
1,3,1,1,0
2,1,2,4,2
3,1,3,4,2
1,1,2,2,1
2,3,3,4,2
2,1,3,4,2
2,1,2,2,1
1,3,1,1,0
2,1,2,2,1
1,3,1,1,0
2,1,2,2,1
1,3,1,1,0
1,3,1,1,0
1,3,1,1,0
3,3,3,4,2
2,1,2,2,1
1,3,1,1,0
1,3,1,1,0
2,2,3,4,2
2,1,3,4,2
2,1,3,4,2
1,3,1,1,0
3,1,3,4,2
2,1,3,3,1
1,1,1,1,0
2,2,3,4,2
1,3,1,1,0
2,1,3,3,1
1,2,1,1,0
2,1,2,2,1
2,3,1,1,0
1,1,1,1,0
1,2,1,1,0
3,3,3,4,2
2,1,2,2,1
1,3,1,1,0
1,1,2,2,1
1,2,1,1,0
1,2,1,1,0
2,2,2,2,1
2,1,2,2,1
2,1,2,2,1
2,2,3,4,2
1,1,2,2,1
1,1,2,2,1
2,3,2,3,1
2,2,2,2,1
2,2,2,4,1

For example, comparing the ruleset with the dataset and finding the instances means that the rule 1,*,*,3,2 from the ruleset matches the row 1,1,2,3,2 in the dataset.

Please help me out. I am new to Python, so I am unable to figure this out.


Solution

  • I'm not exactly sure what you are asking, since there is no explicit question anywhere in your post. It seems to me that you are uncertain about how to sort your rules, which are lists of strings stored in a list, and that you want to sort them on the last two columns, with precedence given to the second-to-last column (the one you call "class"). Is this a correct interpretation of your question?

    If this interpretation is correct, then to sort any list in Python you can use the sorted() function, which is described in the official Python documentation. A possible solution would be:

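    # Fitness first (lowest precedence): compare as numbers, highest first.
    sorted_rules_on_fitness = sorted(new_rules, key=lambda rule: float(rule[-1]), reverse=True)
    # Then class (highest precedence): ties keep the fitness order from above.
    sorted_rules_on_class = sorted(sorted_rules_on_fitness, key=lambda rule: rule[-2])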
    

    By sorting on the key with the lowest precedence first, you can exploit the fact that sorted() is stable: elements with equal keys keep the order they had in the input list. I hope this helps, but if you have follow-up questions or clarifications to your original question, don't hesitate to comment.
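The same ordering can also be produced in one pass with a tuple key, as a variation on the two-step sort above. This is a minimal sketch: the five sample rows are taken from the question's output, and negating the fitness is just one way to get "highest first" inside each class.

```python
new_rules = [
    ["2", "3", "2", "3", "1", "0.66"],
    ["*", "*", "*", "1", "0", "1.00"],
    ["3", "*", "*", "*", "2", "0.33"],
    ["*", "*", "1", "1", "0", "0.72"],
    ["*", "*", "2", "2", "1", "0.71"],
]

# Primary key: class, ascending. Secondary key: fitness, descending.
ordered = sorted(new_rules, key=lambda rule: (int(rule[-2]), -float(rule[-1])))

for rule in ordered:
    print(",".join(rule))
```

Converting the class with int() and the fitness with float() makes the comparison numeric rather than alphabetical, which matters once values such as "10" or "0.9" appear.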