Coding for Statistics: Develop an Intuition for Statistics Using Programming

I have always been fascinated by statistics. In my opinion, it is one of those subjects that is useful regardless of what your domain is. It is not only helpful as a tool to understand the world better but also captivating in its own right. Topics like Bayes' theorem can really go against our intuitive tendencies and reveal the shortcomings in our thinking.

If you don't know what Bayes' theorem is, you should read up on it. In short, it tells us that the probability of an event, given some piece of evidence, depends heavily on how common that event is in the general population to begin with. It sounds like an obvious statement when it is put that way, but we have a strong tendency to overlook this.

Say you are exhibiting some inexplicable medical symptoms and search the internet to understand what they might mean. The internet lets you know, in typical fashion, that some form of rare cancer might explain your symptoms. Your inclination might be to believe that you have a high chance of having that disease since there is an overlap in symptoms. But, all else being equal, the probability of you having that disease is still very low even though you have some of those symptoms. Remember that this is a rare disease. You still need to be very unlucky to actually have it. Having those symptoms doesn't increase the probability as much as your imagination might lead you to think, assuming those symptoms can have other, more common, explanations. This is not to say you shouldn't seek medical advice, but you can keep a cool head until you get an explanation and not stress over something you have read on the internet.
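To put rough numbers on this, here is a minimal sketch of Bayes' theorem in JavaScript. The prevalence, sensitivity, and false-positive rate are made-up values chosen only for illustration, not real medical statistics, and posteriorProbability is a hypothetical helper name.

```javascript
// Bayes' theorem:
// P(disease | symptoms) = P(symptoms | disease) * P(disease) / P(symptoms)
function posteriorProbability(prevalence, sensitivity, falsePositiveRate) {
  // People who have the disease and show the symptoms.
  const truePositives = sensitivity * prevalence;
  // People who do not have the disease but show the same symptoms.
  const falsePositives = falsePositiveRate * (1 - prevalence);
  return truePositives / (truePositives + falsePositives);
}

// Illustrative, made-up numbers (assumptions, not medical data):
const prevalence = 0.0001; // 1 in 10,000 people has the disease
const sensitivity = 0.9; // 90% of people with the disease show the symptoms
const falsePositiveRate = 0.05; // 5% of people without it show them too

const posterior = posteriorProbability(prevalence, sensitivity, falsePositiveRate);
console.log(`Chance of disease given the symptoms: ${(posterior * 100).toFixed(2)}%`);
```

With these assumed inputs the result stays well under 1%, because the few true cases are swamped by the many healthy people who happen to share the symptoms.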

The Monty Hall Problem

The Monty Hall Problem, inspired by a popular 1960s game show, is a probability puzzle that can be equally confusing.

"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?" (from Wikipedia)

I don't know about you, but my intuitive answer to this question would be not to switch. I have already made a choice, and what the host does shouldn't have any bearing on my initial choice. Why should I care if some other door turned out to be a dud when I didn't pick it anyway?

It turns out you should always switch to the other door. You have a 1/3 chance of winning the car if you don't switch and a 2/3 chance if you do. You double your chances by switching! Don't worry if this doesn't make any sense. This is a famous problem that has confused many people, including hundreds of PhDs and even prolific mathematicians. My first response to reading about the problem was to write a program to illustrate the result. If I can see the result with my own eyes, then I can develop an understanding of it. Programming brings the abstract into reality. Here is how to go about it.

We will first create a function that sets up the game. We can simplify the game setup to be represented by an array that has two strings with the value goat and one string with the value car.

function getDoorSetup() {
  const setup = ['goat', 'goat', 'car'];
  return setup;
}

Remember that the order of these items should be random. We will pick an index number, and that will determine the result. If the value at the picked index of the array is the car, then we win; if not, we lose. To randomize the order of this array, I will be using a JavaScript library called lodash. lodash is a popular library with a bunch of utility functions that make working with JavaScript data types easier, and it is conventionally available as _ in code. It has a function called shuffle that takes an array and returns a new array with the order of the items changed randomly.

function getDoorSetup() {
  const setup = ['goat', 'goat', 'car'];
  return _.shuffle(setup);
}

Next, we want a function representing the choice we will make in the game. We will call this function getDoorPick. It will generate a whole number from 0 to 2. These numbers correspond to the indices of the door setup array. We will use a lodash function called random that returns a random integer between the two given bounds, both inclusive.

function getDoorPick() {
  const doorPick = _.random(0, 2);
  return doorPick;
}
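If you would rather not depend on lodash, both helpers can be approximated with Math.random. The sketch below is an assumption about how one might do that, not lodash's own source; randomInt and shuffle are hypothetical stand-ins for _.random and _.shuffle, and the shuffle uses the Fisher-Yates algorithm.

```javascript
// Random integer between min and max, both inclusive (stand-in for _.random).
function randomInt(min, max) {
  return min + Math.floor(Math.random() * (max - min + 1));
}

// Fisher-Yates shuffle that returns a new array and leaves the input
// untouched (stand-in for _.shuffle).
function shuffle(items) {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    // Swap the item at i with a randomly chosen item at or before i.
    const j = randomInt(0, i);
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

console.log(shuffle(['goat', 'goat', 'car'])); // same three items, random order
console.log(randomInt(0, 2)); // 0, 1, or 2
```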

Now we can create a function called playGame that will simulate playing this game. In this function, we will set up the game and pick a door number. If the door number we picked has the value car, we win; if not, we lose. We will return 1 from the function if we win, 0 if we lose.

function playGame() {
  const doorSetup = getDoorSetup();
  const doorPick = getDoorPick();

  if (doorSetup[doorPick] === "car") {
    return 1;
  }

  return 0;
}

This is a fairly simple setup, and it is the point where we can start using the power of programming. We can already simulate what happens when we play this game once. But we can just as easily simulate what happens when we play it 100 times, 1,000 times, or even millions of times.

let counter = 0;
let rounds = 100;
let wins = 0;

while (counter < rounds) {
  const result = playGame();
  wins = wins + result;
  counter = counter + 1;
}

console.log(`You have won ${wins} games`);

We have used a while loop that runs as many times as the rounds variable specifies, which is 100 here. We use the wins variable to count the number of times we win the game, and we console.log the result. Try running the program a couple of times. You will notice that the number of wins hovers around 33, which is about 1/3 of 100. The result makes sense: since there are 3 doors and only one car, we have a 1 in 3 chance of winning the game.
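To watch the estimate settle as the number of rounds grows, a small sketch like the following reports the win rate instead of the raw count. It is a dependency-free rewrite rather than the code above: it draws a random car position with Math.random instead of shuffling with lodash, which is assumed to be equivalent for this purpose, and estimateWinRate is a hypothetical helper name.

```javascript
// One round of the simple game: pick a door, no host, no switching.
function playGame() {
  const carPosition = Math.floor(Math.random() * 3); // door hiding the car
  const doorPick = Math.floor(Math.random() * 3); // door we choose
  return doorPick === carPosition ? 1 : 0;
}

// Fraction of games won over the given number of rounds.
function estimateWinRate(rounds) {
  let wins = 0;
  for (let i = 0; i < rounds; i++) {
    wins += playGame();
  }
  return wins / rounds;
}

for (const rounds of [100, 10000, 1000000]) {
  console.log(`${rounds} rounds: win rate ${estimateWinRate(rounds).toFixed(4)}`);
}
```

The more rounds we run, the closer the printed rate tends to sit to 1/3.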

To simulate the host, we now need a function that returns a door number from doorSetup that is neither our pick nor the door with the car. I will call this function getGoatNotAtDoorNumber.

function getGoatNotAtDoorNumber(setup, doorNumber) {
  let goatDoorNumber;

  setup.forEach((item, itemNumber) => {
    if (item === "goat" && itemNumber !== doorNumber) {
      goatDoorNumber = itemNumber;
    }
  });

  return goatDoorNumber;
}

We will call this function with the doorSetup and doorPick to simulate the host revealing some other door with a goat behind it.

const revealedGoatPosition = getGoatNotAtDoorNumber(doorSetup, doorPick);
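One detail worth noting: when our pick happens to be the car, two goat doors qualify, and getGoatNotAtDoorNumber always returns the later one. That does not change the overall win rates for staying or switching, but if you want the host to choose between the two at random, a variant could look like the sketch below. getRandomGoatNotAtDoorNumber is a hypothetical name, and it uses Math.random so that it runs without lodash.

```javascript
// Collect every goat door other than our pick, then reveal one at random.
function getRandomGoatNotAtDoorNumber(setup, doorNumber) {
  const candidates = [];

  setup.forEach((item, itemNumber) => {
    if (item === 'goat' && itemNumber !== doorNumber) {
      candidates.push(itemNumber);
    }
  });

  return candidates[Math.floor(Math.random() * candidates.length)];
}

// We picked the car at door 0, so the host may reveal door 1 or door 2.
console.log(getRandomGoatNotAtDoorNumber(['car', 'goat', 'goat'], 0));
// We picked a goat at door 0, so the host can only reveal door 2.
console.log(getRandomGoatNotAtDoorNumber(['goat', 'car', 'goat'], 0));
```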

We will create one more function called getFinalPick. This function simulates the act of switching or not switching. If we choose not to switch, we just stick to our original pick. If we switch, we pick the one door that is neither our original pick nor the door the host revealed. Here is what that function looks like:

function getFinalPick(ourPick, revealedPick, changePick) {
  if (!changePick) {
    return ourPick;
  }

  // The only door left is the one that is neither our pick nor the revealed goat.
  const possibleDoorNumbers = [0, 1, 2];
  return possibleDoorNumbers.find(
    (doorNumber) => doorNumber !== ourPick && doorNumber !== revealedPick
  );
}

Using these two functions, we can now build a more complete version of the game.

function playGame() {
  const setup = getDoorSetup();
  const ourDoorPick = getDoorPick();
  const revealedGoatPosition = getGoatNotAtDoorNumber(setup, ourDoorPick);
  const switchDoors = false;
  const ourFinalPick = getFinalPick(
    ourDoorPick,
    revealedGoatPosition,
    switchDoors
  );

  if (setup[ourFinalPick] === "car") {
    return 1;
  }

  return 0;
}

Try changing the switchDoors variable from false to true and running the simulation loop again. You will see that you are suddenly winning about twice as many games on average.
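To compare the two strategies side by side without editing the variable each time, one option is to make switching a parameter. The following is a self-contained sketch under a few assumptions: it uses Math.random in place of lodash, it represents the setup only by the car's position, and winRate is a hypothetical helper name.

```javascript
// One full round of the game. Returns 1 for a win and 0 for a loss.
function playGame(switchDoors) {
  const doors = [0, 1, 2];
  const carPosition = Math.floor(Math.random() * 3);
  const ourPick = Math.floor(Math.random() * 3);

  // The host opens a door that is neither our pick nor the car.
  const revealed = doors.find((door) => door !== ourPick && door !== carPosition);

  // Switching means taking the one door that is still closed and not ours.
  const finalPick = switchDoors
    ? doors.find((door) => door !== ourPick && door !== revealed)
    : ourPick;

  return finalPick === carPosition ? 1 : 0;
}

// Fraction of games won with the given strategy.
function winRate(switchDoors, rounds) {
  let wins = 0;
  for (let i = 0; i < rounds; i++) {
    wins += playGame(switchDoors);
  }
  return wins / rounds;
}

const rounds = 100000;
console.log(`Stay:   ${(winRate(false, rounds) * 100).toFixed(1)}%`);
console.log(`Switch: ${(winRate(true, rounds) * 100).toFixed(1)}%`);
```

The two percentages should land near 33% and 67%, though the exact digits vary from run to run.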

Isn't this amazing? We have created a computer simulation that can help us build a feel for the problem before developing an understanding of it.
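Once the feel is there, the understanding can follow from counting. Because there are only three doors, we can list every combination of car position and first pick instead of sampling at random. The sketch below does that; countWins is a hypothetical helper name, and each of the nine combinations is assumed to be equally likely.

```javascript
// Count wins over all 9 equally likely (car position, first pick) pairs.
function countWins(switchDoors) {
  const doors = [0, 1, 2];
  let wins = 0;
  let total = 0;

  for (const carPosition of doors) {
    for (const ourPick of doors) {
      // The host opens a goat door that is not our pick. When we picked the
      // car there are two such doors; which one is opened does not matter,
      // because switching then loses either way.
      const revealed = doors.find((door) => door !== ourPick && door !== carPosition);
      const finalPick = switchDoors
        ? doors.find((door) => door !== ourPick && door !== revealed)
        : ourPick;

      if (finalPick === carPosition) {
        wins += 1;
      }
      total += 1;
    }
  }

  return { wins, total };
}

console.log(countWins(false)); // { wins: 3, total: 9 }
console.log(countWins(true)); // { wins: 6, total: 9 }
```

Staying wins only in the 3 cases where the first pick was already the car. Switching wins in the other 6, because the host is forced to leave the car behind the only remaining door.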


Let's take a look at another statistical problem called the Birthday Paradox.

Understanding the Birthday Paradox

The Birthday Paradox, according to Wikipedia, states that in a group of randomly chosen people, the probability that at least two of them share a birthday passes 50% once the group reaches 23 people. The counterintuitive part is that 23 seems like far too few people to produce such a high probability. But the math checks out. We can easily show it using programming again.
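Before simulating, it can help to see the exact figure. The usual approach is to compute the chance that everyone has a different birthday and subtract it from 1. The sketch below assumes 365 equally likely birthdays and ignores leap years; probabilityOfSharedBirthday is a hypothetical helper name.

```javascript
// Exact probability that at least two people in the group share a birthday.
function probabilityOfSharedBirthday(groupSize, daysInYear = 365) {
  let probabilityAllDifferent = 1;

  // Person i must avoid the i birthdays already taken by the people before.
  for (let i = 0; i < groupSize; i++) {
    probabilityAllDifferent *= (daysInYear - i) / daysInYear;
  }

  return 1 - probabilityAllDifferent;
}

console.log(probabilityOfSharedBirthday(23).toFixed(4)); // 0.5073
console.log(probabilityOfSharedBirthday(22).toFixed(4)); // 0.4757
```

So 23 is the smallest group size for which a shared birthday is more likely than not.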

We can first generate a random number between 1 and 365 to represent a birthday. We can again use lodash to generate the random number.

function getRandomBirthday() {
  return _.random(1, 365);
}

Next, we can create a function called getRandomBirthdays to generate an array of the desired length consisting of these random numbers. We can think of this array as representing the birthdays of a roomful of people.

function getRandomBirthdays(length) {
  const randomBirthdays = [];

  for (let i = 0; i < length; i++) {
    const randomBirthday = getRandomBirthday();
    randomBirthdays.push(randomBirthday);
  }

  return randomBirthdays;
}

Now the question is: what is the probability of finding a pair of numbers that match, meaning two people who share a birthday? We need a function that tells us whether there are matching numbers in our array.

function hasSameBirthday(arr) {
  const arrWithNoDuplicates = [...new Set(arr)];
  if (arrWithNoDuplicates.length !== arr.length) {
    return true;
  }
  
  return false;
}

There are probably a lot of ways of identifying if there is a repeating number in an array. Here, we are first converting the given array to a set. A Set is a collection that is similar to an array but can't have any duplicate entries. Then, we are converting the set back to an array on the same line.

const arrWithNoDuplicates = [...new Set(arr)];

This operation removes any duplicate entries from a given array. It is a neat trick that I sometimes use in my code. Then we compare the length of the resulting array to the original array. If there is a difference, that means the original array had some duplicate entries that got removed. This means the function should return true to indicate there are matching birthdays in the array. If not, we return false.
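As one of those other ways, the check can also stop at the first repeat instead of building the whole de-duplicated array. The sketch below is an alternative, not a replacement for the function above; hasSameBirthdayEarlyExit is a hypothetical name.

```javascript
// Returns true as soon as a birthday shows up for the second time.
function hasSameBirthdayEarlyExit(arr) {
  const seen = new Set();

  for (const birthday of arr) {
    if (seen.has(birthday)) {
      return true;
    }
    seen.add(birthday);
  }

  return false;
}

console.log(hasSameBirthdayEarlyExit([12, 200, 45, 200])); // true
console.log(hasSameBirthdayEarlyExit([12, 200, 45])); // false
```

For arrays of 23 numbers the difference in speed is negligible, so the choice here is mostly a matter of taste.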

Now we can create our simulation to see how the probabilities work:

function simulate() {
  const rounds = 100;
  let counter = 0;
  let matches = 0;

  while (counter < rounds) {
    const arr = getRandomBirthdays(23);
    const hasDuplicate = hasSameBirthday(arr);
    if (hasDuplicate) {
      matches = matches + 1;
    }
    counter = counter + 1;
  }

  // toFixed avoids floating-point noise such as 56.99999999999999.
  console.log(`There is a ${((matches / rounds) * 100).toFixed(1)}% chance that there is a match`);
}

simulate();

Running this code, we notice that the number hovers around 50% for 23 birthdays. Cutting the group roughly in half, to 12 people, drops the probability to about 17%, which is about a third of what it was. We can play with other group sizes to see how the probability is affected by the input.
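To try several group sizes in one run, the simulation can take the group size and the number of rounds as parameters. The sketch below is a self-contained variation that uses Math.random instead of lodash; estimateMatchProbability and hasMatch are hypothetical names, and the list of group sizes is arbitrary.

```javascript
// Random birthday as a day number from 1 to 365 (leap years ignored).
function randomBirthday() {
  return 1 + Math.floor(Math.random() * 365);
}

// Draw birthdays one at a time and stop at the first repeat.
function hasMatch(groupSize) {
  const seen = new Set();
  for (let i = 0; i < groupSize; i++) {
    const birthday = randomBirthday();
    if (seen.has(birthday)) {
      return true;
    }
    seen.add(birthday);
  }
  return false;
}

// Fraction of simulated groups that contain a shared birthday.
function estimateMatchProbability(groupSize, rounds) {
  let matches = 0;
  for (let i = 0; i < rounds; i++) {
    if (hasMatch(groupSize)) {
      matches += 1;
    }
  }
  return matches / rounds;
}

for (const groupSize of [5, 12, 23, 30, 50]) {
  const probability = estimateMatchProbability(groupSize, 20000);
  console.log(`${groupSize} people: ${(probability * 100).toFixed(1)}%`);
}
```

Using more rounds than the original 100 makes the estimates wobble less between runs.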

Conclusion

Statistics can be unintuitive and confusing. Programming allows us to cut through the complexity and lets us develop a feel for these concepts. I can't claim to have a deep understanding of everything I wrote about in this post, but thanks to the awesome power of coding, I now know to switch doors if given the chance. If you are curious about other awesome applications of programming, you should check out Awesome Coding!