
Orange Data Table with specific domain


I'm trying to create an Orange data table from a CSV file. My current approach uses the following steps:

  1. Create the target domain
  2. Read the file into a temporary data table
  3. Create a new data table from the data in the temporary table and the target domain

Converting the CSV to a tab-delimited file with a three-line header (https://docs.orange.biolab.si/3/data-mining-library/reference/data.io.html) is not an option.

When translating this procedure to code I get the following:

    from Orange.data import Domain, DiscreteVariable, ContinuousVariable, Table

    # Creating specific domain. Two attributes and a Class variable used as target
    target_domain = Domain(
        [ContinuousVariable.make("Attribute 1"), ContinuousVariable.make("Attribute 2")],
        DiscreteVariable.make("Class"),
    )
    print('Target domain:', target_domain)
    # Target domain: [Attribute 1, Attribute 2 | Class]

    # Reading in the file
    test_data = Table.from_file('../data/knn_trainingset_example.csv')
    print('Domain from file:',test_data.domain)
    # Domain from file: [Attribute 1, Attribute 2, Class]

    # Using specific domain with test_data
    final_data = Table.from_table(target_domain,test_data)

    print('Domain:',final_data.domain)
    print('Data:')
    print(final_data)
    # Domain: [Attribute 1, Attribute 2 | Class]
    # Data:
    # [[0.800, 6.300 | ?],
    #  [1.400, 8.100 | ?],
    #  [2.100, 7.400 | ?],
    #  [2.600, 14.300 | ?],
    #  [6.800, 12.600 | ?],
    #  [8.800, 9.800 | ?],
    # ...

As you can see from the final print statement, the class value of every row is unknown (?) instead of the expected class (+ or -).

Can someone explain this behavior and how to fix it? Or is there a better way to create a data table with a specific domain?


Solution

  • Yep, thanks! As described in the reference (https://docs.orange.biolab.si/3/data-mining-library/reference/data.variable.html#discrete-variables), you have to supply the possible values of a discrete variable. Providing those as a tuple did the trick. For future reference, I placed the adjusted code below.

    from Orange.data import Domain, DiscreteVariable, ContinuousVariable, Table
    
    # Creating specific domain. Two attributes and a Class variable used as target
    target_domain = Domain(
        [ContinuousVariable.make("Attribute 1"), ContinuousVariable.make("Attribute 2")],
        DiscreteVariable.make("Class", values=('+', '-')),
    )
    
    print('Target domain:',target_domain)
    # Target domain: [Attribute 1, Attribute 2 | Class]
    
    # Reading in the file
    test_data = Table.from_file('../data/knn_trainingset_example.csv')
    
    print('Domain from file:',test_data.domain)
    # Domain from file: [Attribute 1, Attribute 2, Class]
    
    print('Data:')
    print(test_data)
    # [[0.800, 6.300 | -],
    #  [1.400, 8.100 | -],
    #  [2.100, 7.400 | -],
    #  [2.600, 14.300 | +],
    #  [6.800, 12.600 | -],
    #  [8.800, 9.800 | +],
    # ...
    
    # Using specific domain with test_data
    final_data = Table.from_table(target_domain,test_data)
    
    print('Domain:',final_data.domain)
    # Domain: [Attribute 1, Attribute 2 | Class]
    
    print('Data:')
    print(final_data)
    # Data:
    # [[0.800, 6.300 | -],
    #  [1.400, 8.100 | -],
    #  [2.100, 7.400 | -],
    #  [2.600, 14.300 | +],
    #  [6.800, 12.600 | -],
    #  [8.800, 9.800 | +],
    # ...
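
If the class labels are not known up front, the tuple passed as `values` could probably be collected from the CSV itself before building the domain. The following is a minimal sketch using only the standard `csv` module. It assumes a comma-separated file with a header row that contains a `Class` column. The helper name `class_values` and the in-memory sample (the six rows printed above) are hypothetical stand-ins, not part of Orange or of the original file.

```python
import csv
import io


def class_values(csv_file, column="Class"):
    """Return the distinct labels found in `column`, sorted for a stable order."""
    reader = csv.DictReader(csv_file)
    return tuple(sorted({row[column].strip() for row in reader}))


# Hypothetical stand-in for the real CSV file, built from the rows shown above.
# Assumption: comma-separated, with a header line naming the three columns.
sample = io.StringIO(
    "Attribute 1,Attribute 2,Class\n"
    "0.8,6.3,-\n"
    "1.4,8.1,-\n"
    "2.1,7.4,-\n"
    "2.6,14.3,+\n"
    "6.8,12.6,-\n"
    "8.8,9.8,+\n"
)

values = class_values(sample)
print(values)
```

For a real file, an open file object (for example from `open(path, newline="")`) would take the place of `sample`, and the resulting tuple could then be passed as `values=values` to `DiscreteVariable.make("Class", ...)` in the adjusted code above. Sorting only makes the order reproducible; whether a particular order of values matters depends on how the table is used afterwards.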