I have a data frame with 906 columns. 160 columns are names of world languages. So, the data frame columns are more or less like this:
[c1,c2,c3,c4,c....,Italian, English, German, French, Albanian, Spanish,... c903, c904, c905, c906]
I know how to sort the columns, but considering all of the columns of the data frame, not only part of it. How can I sort alphabetically in Python only the columns with the names of the languages without changing the order of the other columns?
My desired output should look like this:
[c1,c2,c3,c4,c....,Albanian, English, French, German, Italian, Spanish,... c903, c904, c905, c906]
Many thanks in advance for the help!
Assuming all your non countries columns can be identified and converted to a boolean (here using a regex to match c\d+
, bus this could be anything), you can use numpy.lexsort
:
# identify non-target columns
# the regex here is just an example
# any other method can be used
a = df.columns.str.fullmatch('c\d+')
# array([ True, True, True, True, False, False, False,
# False, False, False, True, True, True, True])
# compute a first sorter with a custom cumsum
# the odd numbers are the original True
pos = (2*a).cumsum()-a
# array([ 1, 3, 5, 7, 8, 8, 8, 8, 8, 8, 9, 11, 13, 15])
# sort by first sorter then column names
df2 = df.iloc[:, np.lexsort((df.columns, pos))]
To get the languages columns from a list of valid languages, it is possible to use langcodes
:
from langcodes import language_lists
lang = language_lists.WIKT_LANGUAGE_NAMES['en']
a = ~df.columns.isin(lang)
# array([ True, True, True, True, False, False, False,
# False, False, False, True, True, True, True])
pos = (2*a).cumsum()-a
df2 = df.iloc[:, np.lexsort((df.columns, pos))]
output:
['c1', 'c2', 'c3', 'c4', 'Albanian', 'English', 'French', 'German',
'Italian', 'Spanish', 'c903', 'c904', 'c905', 'c906']
used input:
cols = ['c1', 'c2', 'c3', 'c4', 'Italian', 'English', 'German',
'French', 'Albanian', 'Spanish', 'c903', 'c904', 'c905', 'c906']
df = pd.DataFrame(columns=cols)