I am using regression to analyze server data to find feature importance.
Some of my IVs (independent variables) or Xs are in percentages like % of time, % of cores, % of resource used, while others are in numbers like number of bytes, etc.
I standardized all my Xs with (X-X_mean)/X_stddev
. (Am I wrong in doing so?)
Which algorithm should I use in Python in case my IVs are a mix of numeric and %s and I predict Y in the following cases:
Case 1: Predict a continuous valued Y
a.Will using a Lasso regression suffice?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
Case 2: Predict a %-ed valued Y, like "% resource used".
a. Should I use Beta-Regression? If so which package in Python offers this?
b. How do I interpret the X-coefficient if X is standardized and is a numeric value?
c. How do I interpret the X-coefficient if X is standardized and is a %?
If I am wrong in standardizing the Xs which are % already, is it fine to use these numbers as 0.30 for 30% so that they fall within the range 0-1? So that means I do not standardize them, I will still standardize the other numeric IVs.
Final Aim for both Cases 1 and 2:
To find the % of impact of IVs on Y. e.g.: When X1 increases by 1 unit, Y increases by 21%
I understand from other posts that we can NEVER add up all coefficients to a total of 100 to assess the % of impact of each and every IV on the DV. I hope I am correct in this regard.
Your question confuses some concepts and jumbles a lot of terminology. Essentially you're asking about a) feature preprocessing for (linear) regression, b) the interpretability of linear regression coefficients, and c) sensitivity analysis (the effect of feature X_i on Y). But be careful because you're making a huge assumption that Y is linearly dependent on each X_i, see below.
(X-X_mean)/X_stddev
is not standardization, it's normalization.
(X-X_min)/(X_max-X_min)
, which transforms each variable into the range [0,1]; or you can transform to [0,1].sqrt(X)
, log(X)
, log1p(X)
, exp(X)
etc. terms. Anything that best captures the nonlinear relationship. You may also see variable-variable interaction terms, although regression strictly assumes variables are not correlated with each other.)(Your question is would get better answers over at CrossValidated, but it's fine to leave here on SO, there is a crossover).