python, pandas, machine-learning, feature-selection, feature-engineering

Performing object column manipulation in Python


I have a dataset of Google Play Store data. It has twelve features (one float, the rest objects), and I would like to manipulate one of them so that I can convert it to numeric form. The feature in question is the Size column.

[Image: snapshot of the Size column]

Each entry is a string consisting of a number with the unit appended to it. Checking through the rest of the column, I discovered that aside from megabytes (M), there are also some entries in kilobytes (k), and some entries where the size is the string "Varies according to device".

So my plan to deal with this is to:

  1. Strip the last character from all the entries in Size.
  2. Convert the convertible entries to floats.
  3. Rescale the k entries by dividing them by 1000 so that every value is expressed in megabytes.
  4. Replace the "Varies according to device" entries with the mean of the feature.

I know how to do 1, 2, and 4, but 3 is giving me trouble because I'm not sure how to differentiate the k entries from the M ones and divide only those entries by 1000. If all of them were M or all k, there'd be no issue, as I've dealt with that before, but having to discriminate makes it trickier and I'm not sure what form the syntax should take (my attempts keep throwing errors).

By the way, if anyone has a smarter way of going about this, I'd love to hear it. This is a learning exercise more than anything!

Any help would be greatly appreciated. Thank you!!

------------------------Edit------------------------

A minimal reproducible example of an attempt would be

import pandas as pd

data = pd.read_csv("playstore-edited.csv",
                   index_col=("App"),
                   parse_dates=True,
                   infer_datetime_format=True)

x = data

var = [i[-1] for i in x.Size]
sar = dict(list(enumerate(var)))
ls = []
for i in sar:
    if sar[i]=="k":
        ls.append(i)
x.Size.loc[ls,"Size"]=x.Size.loc[ls,"Size"]/1000

This throws the following error:

IndexingError: Too many indexers

I know the last part of the code is off, but I'm not sure how to express what I want.


Solution

  • As noted in the comments: if you move the final letter into a new column, you can then condition on that column for the division. The example below uses an uppercase 'K'; adjust the comparison if your data uses a lowercase 'k'.

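    import pandas as pd

    df = pd.DataFrame({'APP': ['A', 'B'], 'Size': ['5M','6K']})
    # Keep the unit suffix in a helper column
    df['Scale'] = df['Size'].str[-1]
    # Use float, not int: the division below produces fractional values
    df['Size'] = df['Size'].str[:-1].astype(float)
    # Rescale only the kilobyte rows so they are expressed in megabytes
    df.loc[df['Scale'] == 'K', 'Size'] /= 1000
    df = df.drop('Scale', axis=1)
    df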
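
  • The example above only covers entries that end in a unit letter. A rough sketch of all four steps from the question, including the "Varies according to device" rows, might look like the following. The sample frame is made up for illustration, and the regular expression assumes every numeric entry is a plain number followed by a single k or M (in either case); anything else is treated as missing and filled with the mean.

```python
import pandas as pd

# Hypothetical sample data; the real values would come from the asker's CSV.
df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Size": ["5M", "500k", "Varies according to device", "1.5M"],
})

# Split each entry into its numeric part (column 0) and unit letter (column 1).
# Entries that do not match the pattern become NaN in both columns.
parts = df["Size"].str.extract(r"^([\d.]+)([kKmM])$")
number = pd.to_numeric(parts[0], errors="coerce")

# Map each unit to the divisor that converts the value to megabytes.
# Dividing kilobytes by 1000 follows the convention used in the question.
divisor = parts[1].str.upper().map({"M": 1, "K": 1000})

size_mb = number / divisor

# Replace the unparseable entries with the mean of the parsed ones.
df["Size"] = size_mb.fillna(size_mb.mean())
print(df)
```

    Mapping the unit to a divisor avoids a separate masked assignment for the k rows, and it extends to other units by adding entries to the dictionary. Whether the mean is a sensible fill value depends on the data; it is used here only because the question asks for it.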