Search code examples
matlabprobabilitycumulative-sumprobability-densitycdf

PDF and CDF plot for central limit theorem using Matlab


I am struggling to plot the PDF and CDF graphs of where

Sn=X1+X2+X3+....+Xn using central limit theorem where n = 1; 2; 3; 4; 5; 10; 20; 40 I am taking Xi to be a uniform continuous random variable for values between (0,3).

Here is what i have done so far - 
close all
%different sizes of input X
%N=[1 5 10 50];
N = [1 2 3 4 5 10 20 40];

%interval (1,6) for random variables
a=0;
b=3;

%to store sum of differnet sizes of input
for i=1:length(N)
    %generates uniform random numbers in the interval
    X = a + (b-a).*rand(N(i),1);
    S=zeros(1,length(X));
    S=cumsum(X);
    cd=cdf('Uniform',S,0,3);
    plot(cd);
    hold on;
end
legend('n=1','n=2','n=3','n=4','n=5','n=10','n=20','n=40');
title('CDF PLOT')
figure;

for i=1:length(N)
%generates uniform random numbers in the interval
    X = a + (b-a).*rand(N(i),1);
    S=zeros(1,length(X));
    S=cumsum(X);
    cd=pdf('Uniform',S,0,3);
    plot(cd);
    hold on;
end
legend('n=1','n=2','n=3','n=4','n=5','n=10','n=20','n=40');
title('PDF PLOT')

My output is nowhere near what I am expecting any help is much appreciated.


Solution

  • This can be done with vectorization using rand() and cumsum().

    For example, the code below generates 40 replications of 10000 samples of a Uniform(0,3) distribution and stores in X. To meet the Central Limit Theorem (CLT) assumptions, they are independent and identically distributed (i.i.d.). Then cumsum() transforms this into 10000 copies of the Sn = X1 + X2 + ... where the first row is n = 10000copies of Sn = X1, the 5th row is n copies of S_5 = X1 + X2 + X3 + X4 + X5. The last row is n copies of S_40.

    % MATLAB R2019a
    % Setup
    N = [1:5 10 20 40];    % values of n we are interested in
    LB = 0;                % lowerbound for X ~ Uniform(LB,UB)
    UB = 3;                % upperbound for X ~ Uniform(LB,UB)
    n = 10000;             % Number of copies (samples) for each random variable
    
    % Generate random variates
    X = LB + (UB - LB)*rand(max(N),n);     % X ~ Uniform(LB,UB)    (i.i.d.)
    Sn = cumsum(X); 
    

    You can see from the image that the n = 2 case, the sum is indeed a Triangular(0,3,6) distribution. For the n = 40 case, the sum is approximately Normally distributed (Gaussian) with mean 60 (40*mean(X) = 40*1.5 = 60). This shows the convergence in distribution for both the probability density function (PDF) and the cumulative distribution function (CDF).

    Note: The CLT is often stated with convergence in distribution to a Normal distribution with zero mean as it has been shifted. Shifting the results by subtracting mean(Sn) = n*mean(X) = n*0.5*(LB+UB) from Sn gets this done.

    Image showing convergence of Central Limit Theorem (CLT) as a sum converges in distribution

    Code below isn't the gold standard but it produced the image.

    figure
    s(11) = subplot(6,2,1)  % n = 1
        histogram(Sn(1,:),'Normalization','pdf')
        title(s(11),'n = 1')
    s(12) = subplot(6,2,2)
        cdfplot(Sn(1,:))
        title(s(12),'n = 1') 
    s(21) = subplot(6,2,3)   % n = 2
        histogram(Sn(2,:),'Normalization','pdf')
        title(s(21),'n = 2')
    s(22) = subplot(6,2,4)
        cdfplot(Sn(2,:))
        title(s(22),'n = 2') 
    s(31) = subplot(6,2,5)  % n = 5
        histogram(Sn(5,:),'Normalization','pdf')
        title(s(31),'n = 5')
    s(32) = subplot(6,2,6)
        cdfplot(Sn(5,:))
        title(s(32),'n = 5') 
    s(41) = subplot(6,2,7)  % n = 10
        histogram(Sn(10,:),'Normalization','pdf')
        title(s(41),'n = 10')
    s(42) = subplot(6,2,8)
        cdfplot(Sn(10,:))
        title(s(42),'n = 10') 
    s(51) = subplot(6,2,9)   % n = 20
        histogram(Sn(20,:),'Normalization','pdf')
        title(s(51),'n = 20')
    s(52) = subplot(6,2,10)
        cdfplot(Sn(20,:))
        title(s(52),'n = 20') 
    s(61) = subplot(6,2,11)   % n = 40
        histogram(Sn(40,:),'Normalization','pdf')
        title(s(61),'n = 40')
    s(62) = subplot(6,2,12)
        cdfplot(Sn(40,:))
        title(s(62),'n = 40') 
    sgtitle({'PDF (left) and CDF (right) for Sn with n \in \{1, 2, 5, 10, 20, 40\}';'note different axis scales'})
    
    for tgt = [11:10:61 12:10:62]
        xlabel(s(tgt),'Sn')
        if rem(tgt,2) == 1
            ylabel(s(tgt),'pdf')
        else                           %  rem(tgt,2) == 0
            ylabel(s(tgt),'cdf')
        end
    end
    

    Key functions used for plot: histogram() from base MATLAB and cdfplot() from the Statistics toolbox. Note this could be done manually without requiring the Statistics toolbox with a few lines to obtain the cdf and then just calling plot().


    There was some concern in comments over the variance of Sn.

    Note the variance of Sn is given by (n/12)*(UB-LB)^2 (derivation below). Monte Carlo simulation shows our samples of Sn do have the correct variance; indeed, it converges to this as n gets larger. Simply call var(Sn(40,:)).

    % with n = 10000
    var(Sn(40,:))         % var(S_40) = 30   (will vary slightly depending on random seed)
    (40/12)*((UB-LB)^2)   % 29.9505            
    

    You can see the convergence is very good by S_40:

    step = 0.01;
    Domain = 40:step:80;
    
    mu = 40*(LB+UB)/2;
    sigma = sqrt((40/12)*((UB-LB)^2));
    
    figure, hold on
    histogram(Sn(40,:),'Normalization','pdf')
    plot(Domain,normpdf(Domain,mu,sigma),'r-','LineWidth',1.4)
    ylabel('pdf')
    xlabel('S_n')
    

    Derivation of mean and variance for Sn:

    Derivation of mean and variance for Sn
    For the expectation (mean), the second equality holds by linearity of expectation. The third equality holds since X_i are identically distributed.


    The discrete version of this is posted here.