Multi-Armed Bandit Problem

The multi-armed bandit problem is a classic example in reinforcement learning and decision theory. It is framed around the idea of a gambler facing multiple slot machines (bandits), each with an unknown probability of payout. The gambler must decide which machine to play in order to maximize the total reward over time. This involves a trade-off between exploration (trying different machines to discover their payout rates) and exploitation (playing the machine that has the highest known payout rate).

Problem Setup:

  • We have three slot machines (actions \( a_1 \), \( a_2 \), \( a_3 \)).
  • The true expected rewards (which are unknown to us initially; see the code sketch after this list) are:
    • \( q(a_1) = 0.5 \)
    • \( q(a_2) = 0.7 \)
    • \( q(a_3) = 0.9 \)
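
To make the setup concrete, here is a minimal Python sketch of this three-armed bandit. The reward distribution is not specified above, so the Gaussian noise (standard deviation 0.1) is purely an illustrative assumption.

```python
import random

# True expected rewards for the three arms (the quantities the learner
# does not know and must estimate).
TRUE_MEANS = {"a1": 0.5, "a2": 0.7, "a3": 0.9}

def pull(action: str) -> float:
    """Return a noisy reward centred on the arm's true mean.
    Gaussian noise with std 0.1 is an assumption for illustration."""
    return random.gauss(TRUE_MEANS[action], 0.1)
```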

Initial Action-Value Estimates

Let’s consider two cases:

Case 1: Realistic Initial Values

We set the initial action-value estimates to 0:

  • \( Q_1(a_1) = 0 \)
  • \( Q_1(a_2) = 0 \)
  • \( Q_1(a_3) = 0 \)

Case 2: Optimistic Initial Values

We set optimistic initial action-value estimates:

  • \( Q_1(a_1) = 5 \)
  • \( Q_1(a_2) = 5 \)
  • \( Q_1(a_3) = 5 \)

Step 1: Action Selection and Learning

Realistic Initial Values

  1. In the first step, since all actions have the same initial value \( Q_1(a) = 0 \), the algorithm chooses one action at random. Let’s say it chooses \( a_1 \).
  2. The true expected reward for \( a_1 \) is \( q(a_1) = 0.5 \), but the actual observed reward \( r \) may vary around this true mean due to randomness. Suppose you observe \( r = 0.4 \).
  3. The estimate is updated using the constant-step-size rule (sketched in code just after this list):
    \[ Q_2(a_1) = Q_1(a_1) + \alpha \cdot (r - Q_1(a_1)) \]
    Assuming \( \alpha = 0.1 \) (a small learning rate):
    \[ Q_2(a_1) = 0 + 0.1 \times (0.4 - 0) = 0.04 \]
  4. Now \( Q_2(a_1) = 0.04 \), while \( Q_2(a_2) = 0 \) and \( Q_2(a_3) = 0 \). A greedy algorithm will keep picking \( a_1 \) because it has the highest estimate so far, leading to more exploitation of \( a_1 \) and little exploration of the other actions.
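
Here is a minimal sketch of this constant-step-size update, reproducing the numbers from the example above.

```python
def update(q_old: float, reward: float, alpha: float = 0.1) -> float:
    """Incremental update with a constant step size:
    Q_{t+1} = Q_t + alpha * (r - Q_t)."""
    return q_old + alpha * (reward - q_old)

# Realistic initialisation: Q_1(a_1) = 0, observed reward r = 0.4.
print(update(0.0, 0.4))   # -> 0.04
```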

Optimistic Initial Values

  1. Initially, all actions have an estimated value of \( Q_1(a) = 5 \). The algorithm randomly picks one action (let’s say \( a_1 \)).
  2. The true expected reward for \( a_1 \) is \( q(a_1) = 0.5 \), and you observe a reward \( r = 0.4 \) (as before).
  3. The estimate is updated:
    \[ Q_2(a_1) = 5 + 0.1 \times (0.4 - 5) = 5 - 0.46 = 4.54 \]
  4. Now \( Q_2(a_1) = 4.54 \), while \( Q_2(a_2) = 5 \) and \( Q_2(a_3) = 5 \). A greedy algorithm will pick \( a_2 \) or \( a_3 \) next because they still have higher estimates, leading to further exploration (see the sketch just after this list).
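
The same update applied to the optimistic case shows why the greedy choice shifts away from \( a_1 \) after a single pull. This is a small self-contained sketch; the observed reward 0.4 is carried over from the example above.

```python
# Optimistic initialisation: every estimate starts at 5.0.
Q = {"a1": 5.0, "a2": 5.0, "a3": 5.0}

# One pull of a1 with observed reward 0.4 and step size 0.1.
Q["a1"] += 0.1 * (0.4 - Q["a1"])    # 5 + 0.1 * (0.4 - 5) = 4.54

# A greedy learner now prefers a2 or a3, which still sit at 5.0.
print(Q, "-> next greedy pick:", max(Q, key=Q.get))
```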

Step 2: Convergence Over Time

  • Realistic Initial Values: The algorithm may get stuck in a suboptimal action because the initial estimates do not encourage exploration.
  • Optimistic Initial Values: The algorithm quickly explores the other actions because the high initial estimates lead to “disappointment” when actual rewards come in lower, driving it to try other options (the simulation sketch below illustrates both behaviours).
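
The contrast can be simulated directly. The sketch below runs a purely greedy learner with each initialisation; the Gaussian reward noise and the 1,000-step horizon are assumptions made only for illustration.

```python
import random

TRUE_MEANS = {"a1": 0.5, "a2": 0.7, "a3": 0.9}

def run_greedy(q_init: float, steps: int = 1000, alpha: float = 0.1) -> dict:
    """Purely greedy learner with a constant step size; ties broken at random."""
    Q = {a: q_init for a in TRUE_MEANS}
    for _ in range(steps):
        best = max(Q.values())
        action = random.choice([a for a, q in Q.items() if q == best])
        reward = random.gauss(TRUE_MEANS[action], 0.1)  # assumed noise model
        Q[action] += alpha * (reward - Q[action])
    return Q

random.seed(0)
print("realistic :", run_greedy(0.0))  # often locks onto whichever arm it tried first
print("optimistic:", run_greedy(5.0))  # explores all arms early, then settles on the best arm, a3
```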

Summary:

  • With realistic initial values, learning is driven by the rewards actually observed, which can mean too little exploration and premature commitment to a suboptimal action.
  • With optimistic initial values, the algorithm is pushed to explore more actions early on, which can lead to quicker discovery of the best action.

Optimistic Initial Values

This section discusses how the initial estimates of the action values \( Q_1(a) \) (the expected reward for taking action \( a \)) can influence the performance of the learning algorithm.

  1. Initial Action-Value Estimates:
    • In many learning methods, the initial estimates of the action values \( Q_1(a) \) are crucial. If these estimates are set too high or too low, they can bias the learning process.
    • For example, setting \( Q_1(a) \) to a high value (optimistic initial values) encourages the algorithm to explore the different actions more thoroughly, because it initially expects high rewards from all of them. This exploration can help the algorithm discover the true action values more quickly.
  2. Bias and Exploration:
    • Methods with a constant step-size parameter \( \alpha \) retain this initial bias throughout learning, because the bias only decays gradually and the estimates never fully converge to the true action values.
    • The optimistic initial values essentially act as a built-in mechanism to encourage exploration. If the initial action values are optimistic (e.g., \( Q_1(a) = 5 \)), the learner will initially be “disappointed” with the rewards it receives, since they are likely lower than expected. This disappointment drives the algorithm to try other actions.
  3. Greedy vs. \( \varepsilon \)-Greedy Methods:
    • The section compares the performance of a greedy method with optimistic initial values (\( Q_1(a) = 5 \), \( \varepsilon = 0 \)) to a more realistic \( \varepsilon \)-greedy method (\( Q_1(a) = 0 \), \( \varepsilon = 0.1 \)). The greedy method with optimistic values explores heavily at first because of its inflated expectations, so it tends to perform worse early on, but it eventually settles into exploiting the best action. The \( \varepsilon \)-greedy method explores at a constant rate throughout, so it starts off more steadily but keeps paying the cost of exploration (a rough sketch of this comparison follows the list below).
  4. Practical Implications:
    • The technique of setting optimistic initial values is simple but effective, especially in stationary problems (where the reward probabilities don’t change over time). It is less effective in nonstationary problems (where the reward probabilities can change), because the exploration it encourages is only temporary.
    • The figure illustrates the effect of optimistic initial values on the performance of a learning algorithm in a 10-armed bandit problem. The black line shows the performance of the greedy method with optimistic initial values \( Q_1(a) = 5 \), while the gray line shows the performance of the \( \varepsilon \)-greedy method with more realistic initial values \( Q_1(a) = 0 \).
    • The optimistic method performs worse at first because of its heavy early exploration, but once the initial bias wears off it performs at least as well as, and often better than, the \( \varepsilon \)-greedy method.
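
For reference, the comparison can be reproduced with a short experiment. The sketch below is in the spirit of the standard 10-armed testbed: arm means drawn from a unit normal distribution and unit-variance Gaussian rewards are assumptions, not details taken from the figure itself.

```python
import random

def fraction_optimal(q_init: float, epsilon: float, steps: int = 1000,
                     alpha: float = 0.1, n_arms: int = 10) -> float:
    """One run on an assumed 10-armed testbed; returns the share of steps
    on which the truly best arm was chosen."""
    means = [random.gauss(0.0, 1.0) for _ in range(n_arms)]
    best_arm = means.index(max(means))
    Q = [q_init] * n_arms
    hits = 0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(n_arms)                                  # explore
        else:
            top = max(Q)
            a = random.choice([i for i, q in enumerate(Q) if q == top])   # exploit
        hits += (a == best_arm)
        reward = random.gauss(means[a], 1.0)
        Q[a] += alpha * (reward - Q[a])
    return hits / steps

random.seed(0)
runs = 200
optimistic = sum(fraction_optimal(5.0, 0.0) for _ in range(runs)) / runs
eps_greedy = sum(fraction_optimal(0.0, 0.1) for _ in range(runs)) / runs
print(f"optimistic greedy: {optimistic:.2f}   epsilon-greedy: {eps_greedy:.2f}")
```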

Conclusion

Optimistic initial values are a simple yet effective way to encourage exploration in reinforcement learning algorithms. By setting higher-than-realistic initial values for expected rewards, the algorithm is nudged into exploring different actions before settling into exploiting the best-known action. This approach is particularly useful in stationary environments but may not be as beneficial in nonstationary environments where exploration needs to be ongoing.

By Ashis
