This lab will focus on the two armed bandit problem. Imagine you are in a casino and you are playing a slot machine with two arms. Each arm has a different probability of paying out, but you don't know the associated probabilities. On each trial, one arm will lead to a win and the other arm will lead to a loss. How do you decide which arm to pull and how often? This is a learning problem. You need to learn something about relative payoff rates of the two arms. In this lab, you will write a learning program that does just this. It will "pull the arms" and learn which strategy to follow. The best program should earn the most money gambling.
Write a program that plays games that last a variable number of trials (i.e., times pulling one of the two arms). At a minimum consider games that last 10, 100, or 500 trials. Also, at a minimum, consider arms in which the good vs. bad arm has a probability of payoff of 1.0 vs. .8, .8 vs. .6, and .8 vs. .1.
As for your learning model, it should learn the probability of each arm paying off based on the trials it has so far experienced (i.e., keep a running average). The two probabilities should add to 1. On each trial your model should pick the arm with higher probability of payoff with probability (1.0-alpha) where alpha is the exploration parameter. When alpha is large, the model will be more likely to pull the arm with the lower expected chance of paying off. Explore different levels of alpha for the different rates of payoff and number of trials mentioned above. Make sure to average your results over many simulations (like you did with your t-test lab) to get an accurate picture of what is going on. It might be interesting to use your variance function to get a handle on how different runs of the model lead to different results.
In addition to looking at what percentage of the trials your model made the right choice, also determine what percentage of the time your model figured out which arm was the better arm by the last trial. The random.uniform(0, 1) function (remember to import random) will be helpful for generating random numbers. Please write up a detailed account describing what you did, what you found, and what you think is going on.
Consider the case in which the good arm has a probability of payoff of .7 and the bad arm .5. Find the alpha parameter that maximizes payoff for 250 trials. In this simulation and all previous ones, trials were independent of one another. Let's imagine a case in which the last choice becomes .2 more likely or unlikely than the base probabilities depending on whether it was a winner or not on the last trial. For example, if the bad arm was chosen and it provided a reward, the probability of it returning a reward on the next trial would rise to .7. How can you improve your model to take advantage of the sequential dependency across trials? How does the new model do? How does the new model do on the original problem in which the probabilities stay at .7 and .5? Please write up a detailed account describing what you did, what you found, and what you think is going on.