Search code examples
python-3.xpandasdataframedistutils

How to pandas sort a dataframe that has one of the columns as version numbers?


I am a beginner in Python. I am trying to sort a dataframe on versions column -

print(df)

        versions         memory
0         10.0.0         107.07
1       10.0.0.1         110.11
2         10.0.2         110.56
3         10.0.4         116.27
4         10.1.0         116.17
5         10.1.2         113.98
6         10.1.4         113.27
7         10.2.0         103.32
8          9.9.2         112.02
9        9.9.2.3         112.28
10         9.9.4         114.45
11       9.9.4.1         114.32

such that the resulting dataframe is -

        versions         memory
0          9.9.2         112.02
1        9.9.2.3         112.28
2          9.9.4         114.45
3        9.9.4.1         114.32
4         10.0.0         107.07
5       10.0.0.1         110.11
6         10.0.2         110.56
7         10.0.4         116.27
8         10.1.0         116.17
9         10.1.2         113.98
10        10.1.4         113.27
11        10.2.0         103.32

The datatype of the versions columns is object.

I tried doing -

df = df.sort_values('versions')

But that leaves the dataframe unchanged.

Also tried doing -

from distutils.version import StrictVersion
df = sorted(df['ghost_version'], key=StrictVersion)

but it errors out on ValueError: invalid version number '10.0.0.1'

Can someone please guide?

Thanks


Solution

  • You can create tuples filled by integers, get indices of sorted values by Series.argsort and change order by DataFrame.iloc, last for defaut index use DataFrame.reset_index:

    df1 = (df.iloc[df['versions'].apply(lambda x: tuple(map(int, x.split(".")))).argsort()]
             .reset_index(drop=True))
    

    Or use helper column with DataFrame.sort_values and last remove column:

    df['tmp'] = df['versions'].apply(lambda x: tuple(map(int, x.split("."))))
    df1 = df.sort_values('tmp').drop('tmp', axis=1).reset_index(drop=True)
    

    Also for order of sorted values is possible use LooseVersion:

    from distutils.version import LooseVersion
    
    df1 = df.iloc[df['versions'].apply(LooseVersion).argsort()].reset_index(drop=True)
    
    
        versions  memory
    0      9.9.2  112.02
    1    9.9.2.3  112.28
    2      9.9.4  114.45
    3    9.9.4.1  114.32
    4     10.0.0  107.07
    5   10.0.0.1  110.11
    6     10.0.2  110.56
    7     10.0.4  116.27
    8     10.1.0  116.17
    9     10.1.2  113.98
    10    10.1.4  113.27
    11    10.2.0  103.32