- Key Sub-elements:
- Beyond the agent and the environment, four main sub-elements define a reinforcement learning system:
- Policy
- Reward Signal
- Value Function
- Model of the Environment (optional)
1. Policy:
- Definition:
- A policy is the agent’s strategy for behaving at any given time.
- It maps perceived states of the environment to actions to be taken when in those states.
- Relation to Psychology:
- A policy can be likened to a set of stimulus-response rules or associations in psychology.
- Complexity:
- Policies can range from simple functions or lookup tables to complex computations, such as those involving search processes.
- Core of RL:
- The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior.
- Policies may also be stochastic, specifying probabilities for selecting each action rather than a single fixed choice (see the sketch below).
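A minimal sketch of one way a policy can be represented: a lookup table mapping each perceived state to a probability distribution over actions. The states, actions, and probabilities here are illustrative assumptions, not from the text.

```python
import random

# A tabular stochastic policy: each perceived state maps to a probability
# distribution over the available actions.
policy = {
    "hungry": {"eat": 0.9, "wait": 0.1},
    "full":   {"eat": 0.1, "wait": 0.9},
}

def select_action(state):
    """Sample an action according to the policy's probabilities for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(select_action("hungry"))  # usually "eat", occasionally "wait"
```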
2. Reward Signal:
- Definition:
- The reward signal defines the goal in a reinforcement learning problem.
- At each time step, the environment provides the agent with a reward (a single number).
- Objective:
- The agent’s primary goal is to maximize the total reward over the long run.
- The reward signal thus defines what counts as a good or bad event for the agent, i.e., the immediate and defining features of its problem.
- Reward Influence:
- The reward received by the agent depends on its actions and the current state of the environment.
- The agent cannot directly alter the reward process but can influence it through its actions.
- Example:
- Phil’s internal reinforcement learning agent might receive different reward signals depending on how hungry he is or his mood when eating breakfast.
- Adaptation:
- If an action yields low reward, the policy may be adjusted to select a different action in the future.
- Reward signals may also be stochastic functions of the state of the environment and the actions taken.
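As a hedged illustration of the points above, a reward signal can be written as a (possibly stochastic) function of the current state and the chosen action. The states, actions, and numbers below are made up to echo the breakfast example.

```python
import random

def reward_signal(state, action):
    """The environment returns a single number at each time step.
    The reward depends on the current state and the action taken, and
    here it is also stochastic (e.g., mood adds noise)."""
    if state == "hungry" and action == "eat":
        return 1.0 + random.gauss(0.0, 0.2)   # usually rewarding, with noise
    if state == "full" and action == "eat":
        return -0.5                            # overeating yields a low reward
    return 0.0                                 # waiting is neutral

# The agent cannot change this function directly; it can only influence the
# rewards it receives by choosing different actions.
print(reward_signal("hungry", "eat"))
```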
3. Value Function:
- Immediate vs. Long-term:
- The reward signal indicates what is beneficial in the short term, while the value function specifies what is good in the long run.
- Definition:
- The value of a state is the total expected reward an agent can accumulate over time, starting from that state.
- Relation to Rewards:
- Rewards determine the immediate desirability of environmental states.
- Values, however, indicate the long-term desirability, considering the states that are likely to follow and the rewards available in those states.
- Example:
- A state with low immediate reward may still have a high value if it regularly leads to states that yield high rewards (a tiny numerical sketch follows this subsection).
- Conversely, a state with high immediate reward might have a low value if it leads to less desirable future states.
- Human Analogy:
- Rewards are similar to pleasure (high rewards) and pain (low rewards).
- Values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.
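The low-reward/high-value example can be made concrete with an assumed three-state chain and a discount factor; all numbers here are illustrative.

```python
# A deterministic three-state chain: A -> B -> C (terminal).
reward = {"A": 0.0, "B": 10.0, "C": 0.0}      # immediate reward in each state
next_state = {"A": "B", "B": "C", "C": None}
gamma = 0.9                                   # discount factor

def value(state):
    """Discounted sum of the rewards collected from `state` onward."""
    total, discount = 0.0, 1.0
    while state is not None:
        total += discount * reward[state]
        discount *= gamma
        state = next_state[state]
    return total

print(value("A"))  # 9.0: A's own reward is 0, yet its value is high
print(value("B"))  # 10.0
```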
- Rewards vs. Values:
- Primary vs. Secondary:
- Rewards are primary in reinforcement learning because they are directly provided by the environment and serve as the foundation for defining values.
- Values are secondary as they are predictions of future rewards.
- Purpose of Values:
- Without rewards, there would be no values, and the main purpose of estimating values is to achieve more rewards in the long run.
- Decision-Making Based on Values:
- When making and evaluating decisions, values are more important because they guide actions that lead to states with the highest long-term rewards.
- Agents seek actions that bring about states of highest value, not necessarily highest immediate reward, to maximize total rewards over time.
- Difficulty in Estimating Values:
- Determining values is harder than determining rewards because values must be estimated and constantly updated based on observations over the agent’s entire lifetime.
- Importance of Value Estimation:
- Value estimation is a crucial component of most reinforcement learning algorithms.
- Efficient methods for estimating values are considered one of the most significant advancements in reinforcement learning over recent decades.
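One simple way to estimate values from experience is a tabular temporal-difference-style update, in which each estimate is nudged toward the observed reward plus the estimated value of the next state. The random-walk environment below is a hypothetical example for illustration, not one of the specific methods discussed later.

```python
import random

# Tabular value estimation with a TD(0)-style update on a hypothetical
# 5-state random walk (states 0 and 4 are terminal; +1 for exiting right).
gamma, alpha = 1.0, 0.1          # discount factor and step size
V = {s: 0.0 for s in range(5)}   # value estimates, refined from experience

def step(s):
    """Environment dynamics: move left or right at random."""
    s_next = s + random.choice([-1, 1])
    r = 1.0 if s_next == 4 else 0.0
    return s_next, r

for _ in range(5000):            # many episodes of experience
    s = 2                        # every episode starts in the middle
    while s not in (0, 4):
        s_next, r = step(s)
        target = r + gamma * (0.0 if s_next in (0, 4) else V[s_next])
        V[s] += alpha * (target - V[s])   # nudge the estimate toward the TD target
        s = s_next

print({s: round(v, 2) for s, v in V.items()})  # roughly 0.25, 0.5, 0.75 for states 1-3
```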
4. Model of the Environment:
- Definition:
- A model of the environment is a system that mimics how the environment behaves or allows inferences to be made about future states and rewards.
- Use in Planning:
- Models are used for planning, which involves deciding on actions by considering possible future situations before they are actually experienced.
- Given a state and action, the model predicts the next state and the next reward.
- Model-Based vs. Model-Free Methods:
- Methods that use models and planning are referred to as model-based methods.
- Model-free methods, in contrast, rely on explicit trial-and-error learning without a model and are viewed as nearly the opposite of planning.
- Learning with Models:
- Modern reinforcement learning spans from simple trial-and-error approaches to advanced methods that involve both learning a model and using it for planning.
- Later we will explore reinforcement learning systems that simultaneously learn by trial and error, develop a model of the environment, and use the model for planning, demonstrating the range from low-level learning to high-level planning.
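A rough sketch of the model-plus-planning idea, in the spirit of the systems mentioned in the last bullet; the states, actions, numbers, and helper names here are assumptions for illustration only.

```python
import random

# A learned tabular model stores, for each (state, action) pair experienced,
# the next state and reward that followed. Planning then updates the
# action-value estimates Q from simulated experience drawn from the model.
model = {}                 # (state, action) -> (next_state, reward)
Q = {}                     # (state, action) -> estimated value
actions = ("left", "right")
gamma, alpha = 0.95, 0.1

def record(s, a, s_next, r):
    """Learn the model from a single piece of real experience."""
    model[(s, a)] = (s_next, r)

def plan(n_updates):
    """Planning: replay remembered transitions as if they were real experience."""
    for _ in range(n_updates):
        (s, a), (s_next, r) = random.choice(list(model.items()))
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * best_next - q)

# Real experience feeds the model; planning squeezes more learning out of it.
record("A", "right", "B", 0.0)
record("B", "right", "goal", 1.0)
plan(500)
print(round(Q[("B", "right")], 2), round(Q[("A", "right")], 2))  # near 1.0 and 0.95
```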
This detailed breakdown highlights the distinction between rewards and values, the importance of value estimation, and the role of models in planning within reinforcement learning systems.