I have a dataset on Google Playstore data. It has twelve features (one float, the rest objects) and I would like to manipulate one of them a bit so that I can convert it to numeric form. The feature column I'm talking about is the Size column, and here's a snapshot of what it looks like:
As you can see, it's a string consisting of the number with the scale appended to it. Checking through the rest of the feature, I discovered that aside from megabytes (M), there are also some entries in kilobytes (K) and some entries where the size is the string "Varies according to device".
So my ultimate plan to deal with this is to:
I know how to do 1, 2 and 4, but 3 is giving me trouble because I'm not sure how to differentiate the k entries from the M ones and divide only those entries by 1000. If all of them were M or K, there'd be no issue as I've dealt with that before, but having to discriminate makes it trickier and I'm not sure what form the syntax should take (my attempts keep throwing errors).
By the way, if anyone has a smarter way of going about this, I'd love to hear it. This is a learning exercise if anything!
Any help would be greatly appreciated. Thank you!!
------------------------Edit------------------------
A minimal reproducible example of an attempt would be:
import pandas as pd
data = pd.read_csv("playstore-edited.csv",
                   index_col=("App"),
                   parse_dates=True,
                   infer_datetime_format=True)
x = data
var = [i[-1] for i in x.Size]
sar = dict(list(enumerate(var)))
ls = []
for i in sar:
    if sar[i]=="k":
        ls.append(i)
x.Size.loc[ls,"Size"]=x.Size.loc[ls,"Size"]/1000
This throws the following error:
IndexingError: Too many indexers
I know the last part of the code is off, but I'm not sure how to express what I want.
As written in the comment: If you strip the final letter to a new column you can then condition on that column for the division.
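import pandas as pd
df = pd.DataFrame({'APP': ['A', 'B'], 'Size': ['5M','6K']})
# Keep the trailing unit letter in a helper column (upper-cased, in case the real data uses 'k')
df['Scale'] = df['Size'].str[-1].str.upper()
# Drop the unit and convert to float so decimal sizes parse and the division keeps its fraction
df['Size'] = df['Size'].str[:-1].astype(float)
# Convert only the kilobyte rows to megabytes
df.loc[df['Scale'] == 'K', 'Size'] = df.loc[df['Scale'] == 'K', 'Size'] / 1000
df = df.drop('Scale', axis=1)
df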
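If you also want to cope with decimal sizes and the "Varies according to device" rows in one pass, something along these lines might work. Treat it as a sketch rather than the definitive way to do it: the sample frame, the Size_MB column name and the regular expression are my own assumptions, so adjust them to whatever your real column contains.

```python
import pandas as pd

# Hypothetical sample data; the real column comes from the Play Store CSV.
df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Size": ["5M", "600k", "Varies according to device", "1.5M"],
})

# Split each entry into a numeric part (column 0) and a one-letter unit (column 1).
# Rows that do not match the pattern, such as the "Varies..." string,
# end up as NaN in both columns.
parts = df["Size"].str.extract(r"^([\d.]+)\s*([kKmM])$")
number = pd.to_numeric(parts[0], errors="coerce")

# Divisor per unit: kilobytes are divided by 1000, megabytes are left unchanged.
divisor = parts[1].str.upper().map({"K": 1000.0, "M": 1.0})

# Size in megabytes; unparseable rows stay NaN and can be filled or dropped later.
df["Size_MB"] = number / divisor
print(df)
```

As for the IndexingError in the question: x.Size is already a Series, so .loc on it accepts only a row indexer. Passing both ls and "Size" gives it one indexer too many. Indexing the DataFrame itself (x.loc[rows, "Size"]) avoids that.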