I'm trying to find a way to approximate the conditional expectation E(X|Y) in Python. All I have are two lists of numbers, X and Y, and I don't have any other knowledge about them.
Is there a good method for doing so? I tried all kinds of smoothing functions, but the results were not good at all.
For example,
given X = [1, 1, 1, 1, ...] and Y = [1, -1, 1, -1, ...], I would expect E(X|Y) = 1, because X is constant regardless of Y.
Your question is actually very broad, and thus there are many answers to it as asked. Taking a broad view of the term machine learning, this is a question that people have been trying to answer in machine learning for a very long time. The answer really depends on the assumptions you make about the distribution of X|Y and the relationship between X and Y.
I will suggest the scikit-learn package because it gives a number of ways to estimate E(X|Y). Let's start with your example and use simple linear regression (this makes the assumption that the relationship between X and Y is linear).
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: X is constant, Y alternates between -1 and 1
X = np.ones((100, 1))
Y = np.ones((100, 1))
Y[0:-1:2] = -1

# Regress X on Y to estimate E(X|Y) = a*Y + b
LR = LinearRegression()
LR.fit(Y, X)
a = LR.coef_
b = LR.intercept_
Now, in this case, since we assumed a linear relationship, E(X|Y) = aY + b, and in the example above the linear regression gives a = 0 and b = 1. This yields the conditional expectation you expect (all values equal to 1, regardless of the value of Y).
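If you want to evaluate the fitted E(X|Y) at particular values of Y, you can call the model's predict method. A minimal sketch continuing the snippet above (Y_new is just a few illustrative query points I made up):

# Evaluate E(X|Y) at a few values of Y using the fitted linear model
Y_new = np.array([[-1.0], [0.0], [1.0]])
print(LR.predict(Y_new))  # all approximately 1, since a = 0 and b = 1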
As I mentioned, though, there are many, many ways of trying to estimate E(X|Y). For instance, k-nearest neighbors (KNN) can be used (and is implemented in sklearn), as can Gaussian Process Regression (GPR); neither requires the assumption of a linear relationship between the variables. They work very differently, of course: GPR requires a kernel function, which (by Mercer's theorem) lets us pick varying levels of smoothness for the function fitted to the data, whereas k-nearest neighbors estimates E(X|Y) at every point by averaging over its k nearest neighbors as measured by some distance function. And there are many more methods you could explore in this wonderful package. It is a very interesting field!
Edit: Now I will add some example code for KNN and Gaussian Process Regression
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Same toy data: X is constant, Y alternates between -1 and 1
X = np.ones(100)
Y = np.ones(100)
Y[0:-1:2] = -1

# sklearn expects the feature matrix to have shape (n_samples, n_features)
Y = Y.reshape(-1, 1)

neigh = KNeighborsRegressor(n_neighbors=1)
neigh.fit(Y, X)
print(neigh.predict(Y))
This returns all ones, again as expected. Note that in the case of KNN, E(X|Y) is estimated as the average of X over the k nearest neighbors (measured in Y), and thus does not need to be linear (or even smooth, for that matter).
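To see the local averaging at work on data where E(X|Y) actually varies with Y, here is a small hypothetical sketch (the sine relationship, noise level, and number of neighbors are just made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical data where X depends on Y plus noise: X = sin(Y) + noise
Y = np.linspace(-3, 3, 200).reshape(-1, 1)
X = np.sin(Y).ravel() + rng.normal(scale=0.2, size=200)

# Averaging over 10 neighbors smooths out the noise
neigh = KNeighborsRegressor(n_neighbors=10)
neigh.fit(Y, X)
print(neigh.predict([[0.0], [1.5]]))  # roughly sin(0) = 0 and sin(1.5) ≈ 1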
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Same toy data again
X = np.ones(100)
Y = np.ones(100)
Y[0:-1:2] = -1

# Feature matrix must have shape (n_samples, n_features)
Y = Y.reshape(-1, 1)

kernel = DotProduct() + WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel).fit(Y, X)
print(gpr.predict(Y))
The above example uses Gaussian process regression, and it returns an array of numbers that are very close to (but not exactly) 1, as expected. Note that here the relationship between the variables is assumed to have some level of smoothness, depending on the kernel. In this case a constant kernel would have worked fine, but I figured it would be more useful to show an example of how to specify the kernel.
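For completeness, here is a sketch of what swapping in that constant kernel (plus a white-noise term) might look like; the kernel choice is the only thing I'm changing relative to the example above:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, WhiteKernel

# Same toy data: X constant, Y alternating between -1 and 1
X = np.ones(100)
Y = np.ones(100)
Y[0:-1:2] = -1
Y = Y.reshape(-1, 1)

# A constant kernel encodes the assumption that E(X|Y) does not vary with Y;
# the white-noise term absorbs observation noise
kernel = ConstantKernel() + WhiteKernel()
gpr_const = GaussianProcessRegressor(kernel=kernel).fit(Y, X)
print(gpr_const.predict(Y))  # again close to 1 everywhere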