Search code examples
pythonpandassortingmultiple-columns

Sorting by name part of the columns in a data frame without changing the position of the other columns in python pandas?


I have a data frame with 906 columns. 160 columns are names of world languages. So, the data frame columns are more or less like this:

[c1,c2,c3,c4,c....,Italian, English, German, French, Albanian, Spanish,... c903, c904, c905, c906]

I know how to sort the columns, but considering all of the columns of the data frame, not only part of it. How can I sort alphabetically in Python only the columns with the names of the languages without changing the order of the other columns?
My desired output should look like this:

[c1,c2,c3,c4,c....,Albanian, English, French, German, Italian, Spanish,... c903, c904, c905, c906]

Many thanks in advance for the help!


Solution

  • Assuming all your non countries columns can be identified and converted to a boolean (here using a regex to match c\d+, bus this could be anything), you can use numpy.lexsort:

    # identify non-target columns
    # the regex here is just an example
    # any other method can be used
    a = df.columns.str.fullmatch('c\d+')
    # array([ True,  True,  True,  True, False, False, False,
    #        False, False, False,  True,  True,  True,  True])
    
    # compute a first sorter with a custom cumsum
    # the odd numbers are the original True
    pos = (2*a).cumsum()-a
    # array([ 1,  3,  5,  7,  8,  8,  8,  8,  8,  8,  9, 11, 13, 15])
    
    # sort by first sorter then column names
    df2 = df.iloc[:, np.lexsort((df.columns, pos))]
    

    To get the languages columns from a list of valid languages, it is possible to use langcodes:

    from langcodes import language_lists
    
    lang = language_lists.WIKT_LANGUAGE_NAMES['en']
    
    a = ~df.columns.isin(lang)
    # array([ True,  True,  True,  True, False, False, False,
    #        False, False, False,  True,  True,  True,  True])
    
    pos = (2*a).cumsum()-a
    df2 = df.iloc[:, np.lexsort((df.columns, pos))]
    

    output:

    ['c1', 'c2', 'c3', 'c4', 'Albanian', 'English', 'French', 'German',
           'Italian', 'Spanish', 'c903', 'c904', 'c905', 'c906']
    

    used input:

    cols = ['c1', 'c2', 'c3', 'c4', 'Italian', 'English', 'German',
            'French', 'Albanian', 'Spanish', 'c903', 'c904', 'c905', 'c906']
    df = pd.DataFrame(columns=cols)