I am analysing a set of group members on social media. I have 4 parameters that I plan to include into the analysis: age, sex, city and education level. The problem is that I can make up to ~25% of total group members with all 4 parameters in place, because most of the profiles do not contain all 4 pieces of information.
The goal is to get a general sense of demographics among this group: sex composition, distribution by education level, city and age group. My data contains many missing values, and they occur in every one of these fields.
I am new to data analysis and struggle to choose the right approach to solving the problem of missing data. I have thought of such options:
{age: NaN, sex: female, education: NaN, city: NaN}
will be counted towards sex distribution but no other. On the contrary, record {age: 24, sex: female, education: Secondary, city: London}
will be counted towards all four indices. Considering that, sex distribution may reflect almost 99% of actual subscribers, but age distribution will be analysed based on data from only 25% of total group members count.So I hope to receive an advice on what approach should I choose in this particular situation.
I am using python library pandas for analysis, but I welcome any suggestions regarding overall directions :)
Given the background you have provided us, I think it's important to emphasize that the data you have consists of categorical features, and as far as I can tell there isn't any order in how your data set is collected.
This rules out the possibility of using the pandas .pad
method for filling missing values because this method forward fills or backfills missing values. To illustrate a common use case for the pad method, consider some random signal timeseries data where some signal value drops to 0 before becoming null.
time signal_value
0 2023-01-01 00:00:00 5.0
1 2023-01-01 00:00:01 2.0
2 2023-01-01 00:00:02 1.0
3 2023-01-01 00:00:03 0.0
4 2023-01-01 00:00:04 0.0
5 2023-01-01 00:00:05 NaN
6 2023-01-01 00:00:06 NaN
7 2023-01-01 00:00:07 NaN
8 2023-01-01 00:00:08 NaN
9 2023-01-01 00:00:09 0.0
10 2023-01-01 00:00:10 0.0
In this case, it makes sense to forward or backfill the missing values.
But if you have a bunch of unordered data from social media accounts:
age sex education city
0 24.0 male None None
1 25.0 female None None
2 NaN None None None
3 NaN None None None
4 22.0 None None None
5 NaN male None None
6 33.0 female None None
I don't think it makes sense to use forward or backfill to replace missing values in the age
or sex
categories (not to mention education
or city
, which seem impossible to fill unless you have more information, such as a relatively rare case like a single-gender university).
Therefore, I think you need to choose between dropping all rows with any missing value, or analyzing the features separately. If you analyze the features separately, you will have people that count toward the analysis of one feature but not another so you should probably first determine if there is any pattern in the missing values. For example, if no one from City A
has their age listed but this city is composed of predominantly older people with higher than average levels of education, you may be skewing your data if you analyze something like education level by age. Determining patterns in missing data is well summarized in this article, which I encourage you to read.
That being said, you can perform an analysis based on dropping rows with missing data in any column versus the same analysis based on dropping rows with missing data in only one column and see if the analysis is similar – this could a be reasonable proxy to determine if your data is missing completely at random. If your data is completely missing at random, then with a large enough data set, you can assume the distribution of each feature should stay approximately the same, and you can analyze each feature separately.