Tags: python, statistics, regression, percentage, feature-extraction

Which algorithm to use for percentage features in my DV and IV, in regression?


I am using regression to analyze server data to find feature importance.

Some of my IVs (independent variables), or Xs, are percentages (% of time, % of cores, % of resource used), while others are counts, like number of bytes.

I standardized all my Xs with (X-X_mean)/X_stddev. (Am I wrong in doing so?)

Which algorithm should I use in Python when my IVs are a mix of numeric values and percentages, and I want to predict Y in the following cases?

Case 1: Predict a continuous valued Y

a. Will using a Lasso regression suffice?

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

Case 2: Predict a %-ed valued Y, like "% resource used".

a. Should I use Beta-Regression? If so which package in Python offers this?

b. How do I interpret the X-coefficient if X is standardized and is a numeric value?

c. How do I interpret the X-coefficient if X is standardized and is a %?

If I am wrong in standardizing the Xs that are already percentages, is it fine to express them as 0.30 for 30%, so they fall within the range 0-1? That is, I would not standardize the % features, but would still standardize the other numeric IVs.

Final Aim for both Cases 1 and 2:

To find the % of impact of IVs on Y. e.g.: When X1 increases by 1 unit, Y increases by 21%

I understand from other posts that we can NEVER add up all the coefficients to a total of 100 to assess the % of impact of each IV on the DV. I hope I am correct in this regard.


Solution

  • Your question mixes up some concepts and jumbles a lot of terminology. Essentially you're asking about a) feature preprocessing for (linear) regression, b) the interpretability of linear-regression coefficients, and c) sensitivity analysis (the effect of feature X_i on Y). But be careful: you're making a big assumption that Y depends linearly on each X_i; see below.

    1. Standardization is not an "algorithm", just a technique for preprocessing data.
    2. Standardization matters for linear regression (especially regularized variants like Lasso, which penalize all coefficients on the same scale), but it is not needed for tree-based algorithms (RF/XGB/GBT); with those, you can feed in raw numeric features directly (percentages, counts, whatever).
    3. (X-X_mean)/X_stddev is in fact standardization (z-scoring), so your formula is correct.
      • The alternative is normalization (min-max scaling): (X-X_min)/(X_max-X_min), which transforms each variable into the range [0,1].
    4. Lastly, you ask about sensitivity analysis in regression: can we directly interpret the regression coefficient for X_i as the sensitivity of Y to X_i?
      • Stop and think about the linearity assumption underlying your "Final Aim for both Cases 1 and 2: to find the % of impact of IVs on Y, e.g. when X1 increases by 1 unit, Y increases by 21%".
      • You're assuming that the dependent variable has a linear relationship with each independent variable. That is often not the case; the relationship may be nonlinear. For example, the effect of Age on Salary typically rises into the 40s/50s, declines gradually after that, and drops sharply around retirement age (say 65).
      • So you would model the effect of Age on Salary as quadratic or a higher-order polynomial by adding Age^2 and perhaps Age^3 terms (or sometimes sqrt(X), log(X), log1p(X), exp(X), etc., whatever best captures the nonlinear relationship). You may also see variable-variable interaction terms, although strong correlation between predictors (multicollinearity) makes linear-regression coefficients unstable.
      • Obviously Age has a huge effect on Salary, but we would not measure the sensitivity of Salary to Age by combining the (absolute values of the) coefficients of Age, Age^2, and Age^3.
      • If we only had a linear term for Age, that single coefficient would massively understate Age's influence on Salary: it would average out the strong positive relationship for Age < 40 against the negative relationship for Age > 50.
    5. So the general answer to "Can we directly interpret the regression coefficient for X_i as the sensitivity of Y to X_i?" is: only if the relationship between Y and that X_i is linear; otherwise, no.
    6. In general, a better and easier way to do sensitivity analysis (without assuming a linear response, or needing to standardize % features) is tree-based algorithms (RF/XGB/GBT), which generate feature importances.
      • As an aside, I understand your exercise tells you to use regression, but in general you get better, faster feature-importance information from tree-based models (RF/XGB), especially shallow trees (small max_depth, large nodesize, e.g. >0.1% of the training-set size). That's why people use them, even when their final goal is a regression model.
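The Age/Salary point above can be sketched in a few lines. The data below is entirely synthetic (the curve shape and every number are made up purely for illustration): a linear-only fit averages the up-slope against the down-slope, while adding an Age^2 term captures the rise-then-fall pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
age = rng.uniform(20, 70, 400)
# Synthetic rise-then-fall salary curve (made-up shape, peaking around 45)
salary = -0.08 * (age - 45) ** 2 + 100 + rng.normal(0, 2, 400)

X = age.reshape(-1, 1)

# Linear-only fit: one coefficient averages the up-slope against the down-slope
lin = LinearRegression().fit(X, salary)

# Add an Age^2 column: the quadratic fit captures the nonlinearity
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
quad = LinearRegression().fit(X_quad, salary)

print("linear-only R^2:", round(lin.score(X, salary), 3))
print("with Age^2 R^2:", round(quad.score(X_quad, salary), 3))
```

On data like this the linear-only R^2 is close to zero while the quadratic fit explains almost all the variance, which is exactly why a single linear coefficient understates the effect.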
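Putting points 2, 3, and 6 together, here is a minimal scikit-learn sketch (the server-style feature names and coefficients are invented for illustration, not taken from your real data): the linear route standardizes everything so Lasso coefficients are on a comparable scale, while the tree route takes raw features and yields feature importances directly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Invented server-style features: two % features (stored as 0.30 for 30%)
# and one raw count feature on a much larger scale
pct_cores = rng.uniform(0, 1, n)                        # % of cores used
pct_time = rng.uniform(0, 1, n)                         # % of time busy
n_bytes = rng.integers(1_000, 1_000_000_000, n).astype(float)

X = np.column_stack([pct_cores, pct_time, n_bytes])
y = 5 * pct_cores + 2 * pct_time + 1e-9 * n_bytes + rng.normal(0, 0.1, n)

# Linear route: z-score ALL features, then each Lasso coefficient is the
# change in y per one-standard-deviation change in that feature
X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.01).fit(X_std, y)
print("Lasso coefficients:", lasso.coef_)

# Tree route: no scaling at all; raw %s and raw byte counts go in directly,
# and feature_importances_ are non-negative and sum to 1
rf = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0).fit(X, y)
print("RF feature importances:", rf.feature_importances_)
```

Note that the importances sum to 1 by construction, which is as close as you legitimately get to a "% of impact" breakdown; the standardized Lasso coefficients are comparable to each other but do not add up to anything meaningful.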

    (Your question would get better answers over at CrossValidated, but it's fine to leave it here on SO; there is some crossover.)