12.3. Implementing k-armed bandit in Python#

For this simulation, we’ll work with a simple two-armed bandit problem. The code below has been adapted from this excellent article by Ankit Choudhary.

1. Setup#

Suppose you have two levers that return a possible win of 1 point with a given probability. Specifically, in this example:

  • Lever 0 returns 1 point 35 percent of the time it’s pulled; the rest of the time it returns 0

  • Lever 1 returns 1 point 45 percent of the time it’s pulled; the rest of the time it returns 0

We will build and explore the outcomes of three strategies discussed in lecture: random selection, epsilon-greedy, and upper confidence bound. For fun, and in the spirit of “frontiers” of data science, we’ll throw in some (lightly!) interactive plots as well.

import numpy as np
import pandas as pd
from scipy.stats import beta, bernoulli
import random
import math
ACTUAL_EV = [.35, .45]                               # Set the expected value (EV), or payoff, for each lever
print('Actual EV for Lever 0 is:', ACTUAL_EV[0])
print('Actual EV for Lever 1 is:', ACTUAL_EV[1])
Actual EV for Lever 0 is: 0.35
Actual EV for Lever 1 is: 0.45
n = 1000        # we'll do 1000 trials for each strategy
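Before simulating any full strategy, it can help to see what a single pull of a lever looks like. The snippet below is a small sanity check we've added (it isn't part of the adapted article, and the random_state value is arbitrary): it simulates ten pulls of Lever 1 directly with bernoulli.rvs.

# Quick sanity check (our addition): simulate ten pulls of Lever 1.
# Each pull returns 1 with probability ACTUAL_EV[1] = 0.45 and 0 otherwise.
sample_pulls = bernoulli.rvs(ACTUAL_EV[1], size=10, random_state=44)
print('Ten sample pulls of Lever 1:', sample_pulls)
print('Win rate in this tiny sample:', sample_pulls.mean())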

2. Defining a function we’ll use to evaluate each of the below three strategies/algorithms#

def algorithm_performance():
    '''
    Display the performance of the current strategy, using the globals
    index_list, total_reward, and n set by the simulation code below.
    '''
    ## calculate how many times each lever has been pulled (as a proportion of all pulls)
    count_series = pd.Series(index_list).value_counts(normalize=True)
    print('Lever 0 was pulled', round(count_series[0]*100, 1), '% of the time.')
    print('Lever 1 was pulled', round(count_series[1]*100, 1), '% of the time.')
    
    print('Total reward (number of points):', total_reward)  ## print the total reward
    print('Win percentage (number of wins/number of pulls):', (total_reward/n)*100, '%')

3. Simulating three strategies#

1. Random selection#
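Random selection is our baseline: at every trial we simply pick one of the two levers uniformly at random. Since each lever ends up pulled about half the time, we'd expect a long-run win rate near the average of the two payoffs, i.e., around 40 percent here.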

regret = 0 
total_reward = 0
regret_list = []       # list for collecting the regret values for each trial (i.e., pull of the lever)
ev = {0: [], 1: []}    # lists for collecting the calculated EV 
index_list = []        # list for collecting the number of the randomly chosen lever (0 or 1)

# set the initial values for pulls of the lever and wins 
pulls = [0,0] 
wins = [0,0]

np.random.seed(44)

for i in range(n):    
    
    random_index = np.random.randint(0,2,1)[0]     # randomly choose lever 0 or 1
    index_list.append(random_index)                # add that value to the list index_list
    
    pulls[random_index] += 1 # add 1 pull for the chosen lever
    did_win = bernoulli.rvs(ACTUAL_EV[random_index]) # simulate winning using the actual EV value
    
    if did_win:
        wins[random_index] += did_win         # if we won, add 1 win for the chosen lever
    
    # calculate the EV values and add them to list
    if pulls[0] == 0:
        ev_0 = 0
    else:
        ev_0 = wins[0]/pulls[0]
        
    if pulls[1] == 0:
        ev_1 = 0
    else:
        ev_1 = wins[1]/pulls[1]
        
    ev[0].append(ev_0)
    ev[1].append(ev_1)
    
    # calculate the regret and reward
    regret += max(ACTUAL_EV) - ACTUAL_EV[random_index]
    regret_list.append(regret)
    total_reward += did_win
algorithm_performance()  
Lever 0 was pulled 50.2 % of the time.
Lever 1 was pulled 49.8 % of the time.
Total reward (number of points): 413
Win percentage (number of wins/number of pulls): 41.3 %
## save the reward and regret values for future comparison
random_dict = {'reward':total_reward, 
               'regret_list':regret_list, 
               'pulls_count':pd.Series(index_list).value_counts(normalize=True)}   

2. Epsilon-greedy#
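The \(\epsilon\)-greedy idea is to mostly exploit whichever lever looks best so far, while still exploring the other lever with a small probability \(\epsilon\). In the version below, we first spend n_init trials pulling levers at random to estimate which lever is the winner, and then pull that lever \((1-\epsilon)\times 100\) percent of the time for the remaining n trials.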

e = .1          # set the Epsilon value
n_init = 200     # set the number of initial pulls to determine the winning lever
pulls = [0,0]
wins = [0,0]

np.random.seed(44)

for i in range(n_init):
    random_index = np.random.randint(0,2,1)[0]
    index_list.append(random_index)                # add that value to the list index_list
    
    pulls[random_index] += 1
    did_win = bernoulli.rvs(ACTUAL_EV[random_index])
    if did_win:
        wins[random_index] += did_win
        
ev_0 = wins[0] / pulls[0]
ev_1 = wins[1] / pulls[1]

ev[0].append(ev_0)
ev[1].append(ev_1)

win_index = np.argmax([ev_0, ev_1]) ## select the lever with the highest EV

print('After', n_init, 'initial trials, Lever', win_index, 'got the highest EV =', round(np.max([ev_0, ev_1]), 2), 
      '(Real EV is', ACTUAL_EV[win_index], ').'
      '\nIt will be pulled', (1-e)*100, '% of the time.')
After 200 initial trials, Lever 1 got the highest EV = 0.44 (Real EV is 0.45 ).
It will be pulled 90.0 % of the time.
regret = 0 
total_reward = 0
regret_list = [] 
ev = {0: [], 1: []}
index_list = [] 
pulls = [0,0] 
wins = [0,0]

random.seed(44)

for i in range(n):    
    
    epsilon_index = random.choices([win_index, 1-win_index], [1-e, e])[0]   # pull the winning lever with probability 1-e, the other lever with probability e
    index_list.append(epsilon_index)

    pulls[epsilon_index] += 1
    did_win = bernoulli.rvs(ACTUAL_EV[epsilon_index])
    if did_win:
        wins[epsilon_index] += did_win
    
    if pulls[0] == 0:
        ev_0 = 0
    else:
        ev_0 = wins[0]/pulls[0]
        
    if pulls[1] == 0:
        ev_1 = 0
    else:
        ev_1 = wins[1]/pulls[1]
        
    ev[0].append(ev_0)
    ev[1].append(ev_1)
    
    regret += max(ACTUAL_EV) - ACTUAL_EV[epsilon_index]
    regret_list.append(regret)
    total_reward += did_win
algorithm_performance()          # evaluate the performance of the epsilon-greedy strategy
Lever 0 was pulled 10.9 % of the time.
Lever 1 was pulled 89.1 % of the time.
Total reward (number of points): 443
Win percentage (number of wins/number of pulls): 44.3 %
epsilon_dict = {'reward':total_reward, 
                'regret_list':regret_list, 
                'pulls_count':pd.Series(index_list).value_counts(normalize=True)}

3. Upper Confidence Bound (UCB)#
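UCB chooses, at each trial \(t\), the lever \(k\) with the largest optimistic estimate of its payoff, \(\hat{EV}_k + \sqrt{2 \ln t / n_k}\), where \(\hat{EV}_k\) is lever \(k\)'s observed win rate so far and \(n_k\) is the number of times it has been pulled. The bonus term shrinks as a lever accumulates pulls, so a rarely tried lever (whose EV estimate is still uncertain) keeps getting chances. In the code below, a lever that has never been pulled gets an infinite upper bound, which guarantees each lever is tried at least once.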

regret = 0 
total_reward = 0
regret_list = [] 
index_list = [] 
pulls = [0,0] 
wins = [0,0]
ev = {0: [], 1: []}

random.seed(44)

for i in range(n):
    
    # choose the lever with the highest upper confidence bound
    index = 0
    max_upper_bound = 0
    for k in [0,1]:
        if (pulls[k] > 0):
            EV = wins[k] / pulls[k]                          # observed EV for lever k so far
            delta = math.sqrt(2 * math.log(i+1) / pulls[k])  # width of the confidence bound
            upper_bound = EV + delta
            ev[k].append(EV)
        else:
            upper_bound = float('inf')   # a lever that hasn't been pulled yet always gets chosen
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            index = k
    index_list.append(index)
    pulls[index] += 1
    reward = bernoulli.rvs(ACTUAL_EV[index])
    
    wins[index] += reward
    total_reward += reward
    
    regret += max(ACTUAL_EV) - ACTUAL_EV[index]
    regret_list.append(regret)
algorithm_performance()      # evaluate the performance of the ucb strategy
Lever 0 was pulled 31.4 % of the time.
Lever 1 was pulled 68.6 % of the time.
Total reward (number of points): 425
Win percentage (number of wins/number of pulls): 42.5 %
ucb_dict = {'reward':total_reward, 
             'regret_list':regret_list, 
             'pulls_count':pd.Series(index_list).value_counts(normalize=True)}

4. Compare all three algorithms#

print('The total reward for random was', random_dict['reward'])
print('The total reward for e-greedy was', epsilon_dict['reward'])
print('The total reward for UCB was', ucb_dict['reward'])
The total reward for random was 413
The total reward for e-greedy was 443
The total reward for UCB was 425
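
Since each strategy's dictionary also stores its cumulative regret at every trial, it's easy to compare the three visually. Here's a minimal (static, not interactive) sketch — matplotlib isn't imported above, so the import and styling choices are our additions:

import matplotlib.pyplot as plt

# Plot cumulative regret per trial for each of the three strategies.
plt.plot(random_dict['regret_list'], label='random selection')
plt.plot(epsilon_dict['regret_list'], label='epsilon-greedy')
plt.plot(ucb_dict['regret_list'], label='UCB')
plt.xlabel('Trial')
plt.ylabel('Cumulative regret')
plt.title('Cumulative regret by strategy')
plt.legend()
plt.show()

The flatter a curve, the less that strategy "wasted" on the worse lever.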

Based on our analysis, \(\epsilon\)-greedy is the winning strategy! Of course, this could vary depending on how we adjust different specifics in our strategies (e.g., the size of \(\epsilon\)), as well as the circumstances of the problem, including the number of trials, the payoff distributions, and of course the size of \(k\). We encourage you to use this base code to explore alternative versions – are there circumstances where UCB, for example, consistently outperforms \(\epsilon\)-greedy? What kinds?
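
As one possible starting point for that exploration, here's a compact helper that re-runs the \(\epsilon\)-greedy simulation for several values of \(\epsilon\) and reports the total reward. This is our own packaging of the recipe above, not code from the adapted article — the function name and the use of numpy's default_rng are our choices:

def run_epsilon_greedy(eps, seed=44, n_trials=1000, n_initial=200):
    '''Re-run the epsilon-greedy recipe above for a given epsilon; return the total reward.'''
    rng = np.random.default_rng(seed)
    pulls, wins = [0, 0], [0, 0]

    # Initial exploration phase: pull levers uniformly at random.
    for _ in range(n_initial):
        k = rng.integers(0, 2)
        pulls[k] += 1
        wins[k] += rng.random() < ACTUAL_EV[k]

    # Pick the apparent winner from the exploration phase.
    best = int(np.argmax([wins[0] / pulls[0], wins[1] / pulls[1]]))

    # Exploitation phase: pull the apparent winner with probability 1 - eps.
    total_reward = 0
    for _ in range(n_trials):
        k = best if rng.random() > eps else 1 - best
        total_reward += rng.random() < ACTUAL_EV[k]
    return total_reward

for eps in [0.01, 0.05, 0.1, 0.2, 0.5]:
    print('epsilon =', eps, '-> total reward:', run_epsilon_greedy(eps))

Because everything is driven by a single seed, you can also loop over seeds and average the results to smooth out run-to-run noise before drawing conclusions.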

Finally, we’ve also left out features of the gambler (or octopus). We’ve just assumed the gambler wants to earn the most money, and often that’s a realistic enough assumption. But, what if the gambler is risk-averse – perhaps they will lose morale if they go too many rounds without earning anything, and then drop out of the game? Or, what if the machines are really spread apart, thus increasing the cost of switching between them (either in terms of time, or effort, or something else)? Of course, in the context of a wall of Vegas slot machines, these matters may feel trivial, but in real life they might really impact what strategy is optimal for a person, creature, or machine(!). For example: switching between jobs in real life tends to come with a lot of friction, so it’s not like you can just move between them and “explore” quite so freely in the initial exploration period if you’re following an \(\epsilon\)-greedy strategy. And, many people become discouraged if they don’t see rewards from their decision – whether it’s to go to grad school, work at a particular company, or even move to a new town – which might cause them to stop before discovering that they’ve actually found the “best” (or at least very good) option.

Reinforcement learning is exciting for many reasons – it’s the backbone of a lot of advances in robotics, computers playing games, artificial intelligence, and more. And it’s also exciting because it helps us in turn better understand our own worlds and lives – for example, we can compare computer models to models of human and animal cognition, as Alan Turing first did decades ago. And it can also help us think through some pretty deep philosophical questions about how we all spend our, in the words of poet Mary Oliver, “one wild and precious life.”


(By the way: the correct answer to the famous question she poses at the end of her poem, “The Summer Day”:

“Tell me, what is it you plan to do
with your one wild and precious life?”

is obviously “\(\epsilon\)-greedy”.)