Search code examples
powerbipowerquerycategorical-datam

Convert categorical variables to numeric PowerQuery


I have a number of columns in my table which have text values that fall into categories - e.g. column "ABC" has 9000 rows but every row must have a value in the set {"A","B","C"}. Other columns like Gender have "M"/"F"/null

For each column, I'd like to convert it into an integer list in-place - so A:1, B:2, C:3 etc.

I've been trying out using List.Distinct to extract the values to a temp table, adding an index column to that and using a join to transform the initial column based on that mapping in the temp table. However this seems slow and I'm not sure how to run this over all columns in my table (or at least Table.ColumnsOfType(Source, {type nullable text}) to select the categorical columns...).

Any suggestions?

Before

Gender Fruit [...]
F Cat
F Dog
M Lemon
M Dog
M Lemon
null Cat
M Dog

After

Gender Fruit [...]
1 1
1 2
2 3
2 2
2 3
null 1
2 2

Solution

  • In PowerQuery, this seems to work for any number of columns

    Replace all nulls with something else, here +=+

    Add Index

    Unpivot

    Remove duplicates

    Group, add index to each group

    enter image description here

    Merge back into original and expand

    Repivot

    Remove extra columns

    Before and After:

    enter image description here

    Full code:

    let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Replaced Value" = Table.ReplaceValue(Source,null,"+=+",Replacer.ReplaceValue,Table.ColumnNames(Source)),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"),
    
    // derive a table of replacements
    #"Removed Duplicates" = Table.Distinct(#"Unpivoted Other Columns", {"Attribute", "Value"}),
    #"Grouped Rows" = Table.Group(#"Removed Duplicates", {"Attribute"}, {{"GRP", each Table.AddIndexColumn(_, "Index2", 1, 1), type table}}),
    #"Expanded GRP" = Table.ExpandTableColumn(#"Grouped Rows", "GRP", {"Value", "Index2"}, {"Value", "Index2"}),
    
    //replace originals
    #"Merged Queries" = Table.NestedJoin(#"Unpivoted Other Columns",{"Attribute", "Value"},#"Expanded GRP",{"Attribute", "Value"},"EG",JoinKind.LeftOuter),
    #"Expanded Table1" = Table.ExpandTableColumn(#"Merged Queries", "EG", {"Index2"}, {"Index2"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Table1",{"Value"}),
    #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Attribute]), "Attribute", "Index2", List.Sum),
    #"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"Index"})
    in #"Removed Columns1"