For Loop vs Vectorization in R

A brief comparison between for loop and vectorization in R

A short post to illustrate how vectorization in R is much faster than using the common for loop.

In this example I created two vectors, a and b, which hold random numbers.

I’ll compute the sum of a and b using a for loop and using vectorization, and then compare the execution time taken by the two methods.

I’ll repeat this test 10 times with a different vector size each time: in the first test the vectors a and b contain only 10 elements, but in the last test they contain 10 million elements.

Here is the code for the loop version. When i=1, n=10, so we loop 10 times; when i=10, n=10,000,000, hence we loop 10 million times.

For Loop

loop
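The original code was shown as an image, so what follows is only a minimal sketch of what the loop version might look like. The size schedule sizes and the helper time.loop.sum are assumptions made for illustration (the post only fixes n=10 for the first test and n=10,000,000 for the last), and so is the reading of "sum" as the element-wise sum of a and b. Only the name c.loop.time comes from the post.

```r
# Hypothetical size schedule: 10 tests, from 10 up to 10 million elements.
sizes <- round(10^seq(1, 7, length.out = 10))

# Time the element-wise sum of two random vectors of length n with a for loop.
time.loop.sum <- function(n) {
  a <- runif(n)
  b <- runif(n)
  s <- numeric(n)                  # preallocate the result vector
  timing <- system.time(
    for (j in seq_len(n)) {
      s[j] <- a[j] + b[j]          # one assignment per element, at the R level
    }
  )
  timing[["elapsed"]]              # elapsed wall-clock time in seconds
}

# One timing per test, stored in c.loop.time as in the post.
c.loop.time <- numeric(length(sizes))
for (i in seq_along(sizes)) {
  c.loop.time[i] <- time.loop.sum(sizes[i])
}

c.loop.time[10]                    # time for n = 10 million
```

The absolute timings depend heavily on the machine and the R version, so they will not necessarily match the figures quoted in this post.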

I’ve stored the execution time taken for each test in the vector c.loop.time and printed the last execution time, when n=10 million. It took around 11 seconds to compute the sum of 10 million values. Let’s see if we can do better with the vectorization method.

Vectorization

vectorization
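Again, the original code was an image, so this is only a sketch of the vectorized counterpart under the same assumptions as before: the size schedule sizes, the helper time.vec.sum and the result vector c.vec.time are hypothetical names, and "sum" is taken to mean the element-wise sum of a and b.

```r
# Hypothetical size schedule: 10 tests, from 10 up to 10 million elements.
sizes <- round(10^seq(1, 7, length.out = 10))

# Time the element-wise sum of two random vectors of length n, vectorized.
time.vec.sum <- function(n) {
  a <- runif(n)
  b <- runif(n)
  # A single call to `+` handles all n elements inside compiled code.
  system.time(s <- a + b)[["elapsed"]]
}

c.vec.time <- sapply(sizes, time.vec.sum)

c.vec.time[10]                     # time for n = 10 million
```

For small n the elapsed time is often reported as 0 because the operation finishes faster than the timer resolution.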

With the vectorization method it took only around 0.05 seconds, just five hundredths of a second. This is about two hundred times faster than the for loop version!

Result

time taken

This massive difference is mainly because in the for loop method we’re doing the computation at the R level, resulting in many more function calls that all need to be interpreted and compiled (especially the variable assignment, which occurs 10 million times).

In the vectorization method the computation happens within compiled code (C or Fortran, I’m not too sure), and hence R has far fewer function calls to interpret and makes far fewer calls to compiled code.

Central Limit Theorem (example using R)

The Central Limit Theorem is probably the most important theorem in statistics.

The central limit theorem (CLT) states that, given a sufficiently large sample size from a population with finite variance, the mean of the sample means drawn from that population will be approximately equal to the mean of the original population.

Furthermore, the CLT states that as you increase the number of samples and the sample size, the distribution of the sample means gets closer to a normal distribution (a “bell curve”), even if the original variables themselves are not normally distributed.

Let’s break this down with some examples using R:

Original Population with a left-skewed distribution

Let’s generate our left-skewed distribution in R.

Using the rbeta function below, I generated 10,000 random values between 0 and 1 (then multiplied them by 10, so they lie between 0 and 10), and I deliberately chose the shape parameters to get a distribution with negative skewness.
myRandomVariables<-rbeta(10000,5,2)*10

left skewed hist

The mean (µ) of the total population is 7.13 and the standard deviation (σ) is 1.61.
We can see that the distribution has a longer tail on the left, with some data up to 4 standard deviations away from the mean, whereas the data on the right don’t go beyond 2σ away from the mean.
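These summary figures can be checked with a few lines of R. This is only a sketch: the seed is an arbitrary assumption added for reproducibility (the post does not give one), so the values will be close to, but not exactly, 7.13 and 1.61. The skewness is computed by hand as the third standardised moment to avoid relying on an extra package.

```r
set.seed(42)  # arbitrary seed, an assumption; not from the original post

myRandomVariables <- rbeta(10000, 5, 2) * 10

mu    <- mean(myRandomVariables)
sigma <- sd(myRandomVariables)

# Sample skewness (third standardised moment).
# A negative value indicates a longer tail on the left.
skewness <- mean((myRandomVariables - mu)^3) / sigma^3

c(mean = mu, sd = sigma, skewness = skewness)

# Only draw the histogram in an interactive session.
if (interactive()) {
  hist(myRandomVariables, breaks = 50, main = "Original population")
  abline(v = mu, lwd = 2)          # mark the mean
}
```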

As we can see on the plot above, the standard deviation (σ) allows us to see how far away from the mean each data point is.
A small σ means that the values in a data set are close to the mean on average, and a large σ means that the values are farther away from the mean on average.
As σ is 1.61, all the data between 5.52 (7.13-1.61) and 8.74 (7.13+1.61) are close to the mean (less than 1σ away).
However, the data below 2.30 (7.13-3*1.61) are much farther from the mean, at least 3σ away.

To better illustrate this, let’s see the same plot with the data scaled so that the mean is equal to 0 and the standard deviation is equal to 1.
The formula to standardise the data is (x-µ) / σ
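As a quick sketch of this formula in R (the seed and the variable name z are assumptions for illustration), the standardised values have mean 0 and standard deviation 1 up to floating-point error. The base R function scale() performs the same computation.

```r
set.seed(42)  # arbitrary seed, an assumption; not from the original post
myRandomVariables <- rbeta(10000, 5, 2) * 10

mu    <- mean(myRandomVariables)
sigma <- sd(myRandomVariables)

# Standardise: subtract the mean, then divide by the standard deviation.
z <- (myRandomVariables - mu) / sigma

c(mean = mean(z), sd = sd(z))      # approximately 0 and 1
range(z)                           # how many sigmas the extremes are from the mean
```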

left skewed norm hist

The distribution still has exactly the same shape, but the scaling makes it easier to see how close to or far from the mean the data are.

Using the Central Limit Theorem to Construct the Sampling Distribution

So how can we use the CLT to construct the sampling distribution? We’ll use what we know about the population and our proposed sample size to sketch the theoretical sampling distribution.

The CLT states that:

  • Shape of the sampling distribution: As long as our sample size is sufficiently large (>=30 is the most common rule of thumb, but some textbooks use 20 or 50), we can assume the distribution of the sample means to be approximately normal, regardless of the shape of the original distribution.
  • The mean of the distribution (x̅): The mean of the sampling distribution should be equal to the mean of the original distribution.
  • The standard error of the distribution (σx): The standard deviation of the sample means can be estimated by dividing the standard deviation of the original population by the square root of the sample size: σx = σ/√n

 

Let’s check it then!

I will first draw 100 sample means from the original population, each from a sample of the minimum size commonly recommended for the CLT, 30.
Here is the code to generate the sample means:
sampling
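The sampling code was shown as an image, so here is only a minimal sketch of how the sample means could be generated. The seed and the names sample.size, n.samples and sample.means are assumptions for illustration, and sampling without replacement (the default of sample()) is a choice of this sketch, not something stated in the post.

```r
set.seed(42)  # arbitrary seed, an assumption; not from the original post
myRandomVariables <- rbeta(10000, 5, 2) * 10

sample.size <- 30    # observations per sample
n.samples   <- 100   # number of samples, hence number of sample means

# Draw a sample of 30 values, take its mean, and repeat 100 times.
sample.means <- replicate(
  n.samples,
  mean(sample(myRandomVariables, sample.size))
)

mean(sample.means)                          # compare with the population mean
sd(sample.means)                            # observed standard error
sd(myRandomVariables) / sqrt(sample.size)   # standard error predicted by the CLT
```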

So according to the CLT, the following three statements should be true:

  1. The mean of our sample means distribution should be around 7.13
  2. The shape of the sample distribution should be approximately normal
  3. The standard error (σx = σ/√n) should be equal to 0.29 (1.61/√30)

sampling 100

  1. The mean is 7.15, hence nearly 7.13
  2. The shape is approximately normal, though still a little bit left-skewed
  3. The standard error is 0.3, hence nearly 0.29

The CLT also states that as you increase the number of samples, the distribution of the sample means will better approximate a normal distribution.

Let’s draw more samples.

Now I take 1,000 sample means and plot them.

sample 1000

  1. The mean is still 7.15 and not exactly 7.13
  2. The shape is approximately normal but still a little bit left-skewed
  3. The standard error is equal to 0.29, as estimated by the CLT

Let’s take even more sample means.

This time I take 10,000 sample means and plot them.

sample 10000

  1. The mean is now 7.13, matching the population mean
  2. The shape of the distribution is now very close to normal
  3. The standard error is equal to 0.29, as estimated by the CLT
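The three experiments above can be repeated in one go with a small helper. This is a sketch under assumptions: the seed, the helper draw.sample.means and the summary matrix results are hypothetical names introduced here, and the exact figures will differ slightly from those quoted in the lists above because the original seed is unknown.

```r
set.seed(42)  # arbitrary seed, an assumption; not from the original post
myRandomVariables <- rbeta(10000, 5, 2) * 10

# Hypothetical helper: draw n.samples samples and return their means.
draw.sample.means <- function(population, n.samples, sample.size = 30) {
  replicate(n.samples, mean(sample(population, sample.size)))
}

# Summarise the sample means for 100, 1,000 and 10,000 samples.
n.samples <- c(100, 1000, 10000)
results <- t(sapply(n.samples, function(k) {
  m <- draw.sample.means(myRandomVariables, k)
  c(n.samples = k, mean = mean(m), std.error = sd(m))
}))

results

# Reference values from the population and from the CLT formula.
c(mean = mean(myRandomVariables),
  std.error = sd(myRandomVariables) / sqrt(30))
```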

Just for fun, let’s do another example, this time with a different sample size, to see if we get the standard error right.
According to the CLT, σx should be 0.11 (1.61/√200)

sample size 200 10000

We have increased the size of each sample to 200, instead of 30 in the previous examples. Hence the variance of the sample means has decreased and we now have a smaller standard error.
This confirms that as we increase the sample size, the distribution becomes more normal and the curve becomes taller and narrower.
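The effect of the sample size on the standard error can be sketched by comparing 10,000 sample means for n=30 and n=200. The seed and the names means.30, means.200 and comparison are assumptions for illustration. The observed values should land near σ/√n for each sample size.

```r
set.seed(42)  # arbitrary seed, an assumption; not from the original post
myRandomVariables <- rbeta(10000, 5, 2) * 10
sigma <- sd(myRandomVariables)

# 10,000 sample means for each of the two sample sizes.
means.30  <- replicate(10000, mean(sample(myRandomVariables, 30)))
means.200 <- replicate(10000, mean(sample(myRandomVariables, 200)))

# Observed standard error next to the CLT prediction sigma / sqrt(n).
comparison <- rbind(
  n30  = c(observed = sd(means.30),  predicted = sigma / sqrt(30)),
  n200 = c(observed = sd(means.200), predicted = sigma / sqrt(200))
)

comparison
```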

Summary

The CLT confirms the intuitive notion that as we take more samples or increase the sample size, the distribution of the sample means will begin to approximate a normal distribution, even if the original variables themselves are not normally distributed. The mean of this distribution is equal to the mean of the original population (x̅=µ), and its standard deviation is equal to the standard deviation of the original population divided by the square root of the sample size (σx = σ/√n).

 

In the next post I’ll talk more in depth about the CLT and we’ll see how and where we can use the CLT.