Carrot and Stick

A Framework to Learn Reinforcement Learning

A while ago I went to a Meetup about Reinforcement Learning (RL), I got into a conversation with some one that sat next to me. He asked me several question about the subject - What is the difference between RL and supervised/unsupervised learning? What is the difference between several types of algorithms? When would you choose this framework over another one?

I must admit, it took me a while to answer these questions, and not in the most precise manner. I realized that I really need to give my RL tool belt a spit shine. And so, I started searching for a framework where I can easily implement RL algorithms/agents, test them on different scenarios and compare between them.

I went through multiple frameworks, but none where the perfect fit. All the frameworks where either too complicated or lacked key features that I was looking for. At this point there was no other choice but to build my own framework - Carrot and Stick - the Reinforcement Learning Playground for all your Reinforcement Learning needs.

In this post I wish to do several thing:

  1. Answer the questions I was asked in the Meetup
  2. Introduce the Carrot and Stick framework
  3. A walk-through implementing the Hill Climb algorithm using the Carrot and Stick framework

Lets start with answering the questions.

What is Reinforcement Learning?

Reinforcement Learning is a machine learning paradigms for learning sequential decision making tasks. It differs from supervised learning and unsupervised learning in terms of goals.

In a reinforcement learning problems, the goal is to find a good policy, an action for each state, maximizing some notion of cumulative reward.

In unsupervised learning the goal is to find similarities and differences between data-points.

In supervised learning, similar to reinforcement learning, the goal is to maximize a “score” (accuracy, recall, ETC…). However, because each decision is independent, we have a label associated with each decision, opposed to reinforcement learning, where labels are associated with sequences.

What is this Framework

When talking about a RL algorithm we are almost always referring to an agent, an entity interacting with the environment, trying to reach some goal.

This framework aims to give you the ability to implement an agent, from scratch or by tweaking an existing one.

The framework does not stop there - implementing an agents is nice, but how do we know that it actually works? How do we know that it performs? The framework lets agent play a game in an environment to see how well it does.

In this framework, I have implemented several agents and worlds where the agent will play, to be used as reference or as a benchmark. In this post and the next ones I will pick a RL algorithm, and go over it’s implementation and show how it performs in a chosen world.

Modules in the Framework

The Carrot and Stick framework is comprised of four different modules. Each module holds a base class with several methods and properties. In this section I will go over each of the modules, explain how they are used, and elaborate on their base classes.


As stated above, the agent interacts with the environment according to the feedback it has received so far. Each agent has a decision module, this is the heart of the agent and what distinguishes it from other agents. The agent holds two main functions:

  1. get_action - this function receives a state and returns the chosen action according to the decision model.
  2. reinforce - this function receives a transition(s) - comprised of the current state, the action, the reward, the next state and a signal stating once the game has ended, and updated the decision model.

Decision Models

The decision model is essentially the current policy holding the following functions:

  1. get_action - this function returns the chosen action according to the decision model and the provided state.
  2. get_random_action - this function returns a random action, usually for exploration.
  3. update_model - this function receives a transition(s) and updates the decision model accordingly.


As we said earlier, we wish to test our agent and see how it performs. For this we will run a game, comprised of two main attributes:

  1. A world - the environment in which the agent act. This is essentially an OpenAI gym environment of your choice.
  2. The number of episodes that we wish the agent to play.

Each game is comprised of episodes in which the agent plays. Each episode is comprised of steps, where in each step the agent interacts with the environment, receives the feedback, which includes the reward, the next step and the done signal (indicating if and episode is done). Once the done signal is on the episode ends and a new one starts.


Each world module holds the following functions:

  1. reset - the first step in each episode. This method returns the initial state of the agent.
  2. step - this method receives an action and returns the feedback to the agent.
  3. render - renders the current state of the environment (visualizes the current step).
  4. interact_with_world - a wrapper function for the step function, this is specific for each world.

Example - Hill climb

Now that we know what does what in our framework, we can move to the interesting part of our post - a live example.

We will start with one of the most simple RL algorithms - the Hill Climb algorithm. For this example we will choose the Cart Pole environment, where the object of the game is to balance a pole on a cart for the longest period of time. We will not get into too many details about the environment itself and only describe it briefly:

  • The agent actions are basically applying a force of +1 or -1 to the cart.
  • A reward of +1 is provided for every step that the pole remains upright.
  • The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
  • We are using a discrete version of the environment in order to reduce the amount of steps.

We will not get into the details of Cart Pole environment. One of the main idea of this post is to show how we can implement an agent without too much understanding of what the features of the environment are. However, if you do wish to dive deeper please have a look at the repository here.


The main idea of the algorithm is to start with a random policy, play according to it and save the cumulative reward at the end of the episode. At each episode the policy is update randomly, will show exactly how shortly, and if the new policy achieves a better cumulative reward it replaces the old one. This policy update method is used till the last episode.


In this section we will deep dive and go into details on all the moving parts.

Lets start with the game itself.

In the following code snippet we focus on the run method of the Hill Climbing Cart Game.

This is the method that we execute when we wish to run the game:

As we previously mentioned, in each episode the agent receives the feedback from the World and chooses the action according to the Decision Model. At the end of each episode the Decision Model is updated using the agent’s reinforce method. The reinforce method’s input is a Transition - this is a named tuple used to package all the relevant information.

Now that we know how the game is played , lets focus on the agent’s part. The agent job is very simple, given the state and what it has already learned - choose the right action to take. The agents implementation is very straight forward, there are two methods:

  1. get_action - which receives the state as an input and chooses the action according to the Decision Model.
  2. reinforce - which receives the Transition with all the relevant data and updates the Decision Model.

And now we dive deep to the interesting part (pam pam pam) - The Hill Climb Decision Model. The Decision Model holds the current policy which is implemented as a F by A weight matrix. Where F is the number of features (4 in our case) and A is the number of available actions (2 in our case).

Whenever the agent needs to take an action, it is given a state comprised of the F features. The best action according to the current policy is chosen with the following function, where the input is a state:

We will not get into a deep analysis on why this yields the right action and only explain this roughly. In each episode we modify the policy (weights) that the agent acts according to, if the new policy achieves a better cumulative reward - the weights are updated. The idea is to explore a new policy and try to “Climb” to a better cumulative reward, hence the name…

The update function is also very straight forward. The decision model maintains the best_reward, which is the best cumulative reward that the agent has achieved in an episode. And the best_weights which is the weights matrix that achieved the best_reward.

At the end of each episode, the decision model is updated. If the new cumulative reward is better then the best_reward we update the best_reward with the new one along with best_weights.

And now we need to explore a new policy. We do this by adding a random matrix to the best_weights matrix. The “amount of exploration” (similar to learning rate) is governed by a factor α.

If we are “climbing” in the right direction, meaning that the best_reward improves we decrease α, but if we do not increase at a certain episode, we want to “explore” hence we will increase α (α is capped at a low and high bound).

Now lets see how:


The objective of the current game is to balance the pole on the cart for as long as possible. In order to measure our algorithm’s performance we can look at the mean cumulative reward and also at the number of episodes it took us to reach a certain score.

In the Game module, at the end of the run function we print several stats about the game. In order to get an understanding of how good this algorithm performs we will need to compare it to an alternative - and this we will do in the next posts. For now, clone the repository, and run the game to see how it performs.

Cart Pole
Image Credits

Conclusion and next Steps

In this post we have started with answering what is Reinforcement Learning, how it differs form other machine learning paradigms. We have introduced the Carrot and Stick framework and showed an example implementing the Hill CLimb algorithm. In the next posts we will go over more algorithms and compare between them.

I encourage you to clone the repository, run several examples and try to implement your own. Learning by practicing always got me the fastest results.

© 2019. All rights reserved