Reinforcement Learning

In the previous lecture, Professor Barreto gave an overview of artificial intelligence. The lecture covered a variety of techniques, but one in particular seems increasingly prevalent in the media and piqued my interest: "reinforcement learning". Having limited exposure to machine learning, I wanted to learn more about how reinforcement learning works, what differentiates it, and what has caused its recent surge in popularity.

Reinforcement learning is a category of machine learning algorithms that focuses on the connection between an action and an outcome, without necessarily analyzing or learning anything about the complex relationships between the actions taken and the subsequent results[1]. In other words, at the most basic level, reinforcement learning can be thought of as an exhaustive trial-and-error approach. Generally, reinforcement learning algorithms revolve around three concepts: states, actions, and rewards:
state = the current situation, such as the positions of the pieces on a Go board
action = what can be done in each state; typically a finite set
reward = an abstract concept describing feedback from the environment, positive or negative, where positive feedback is often simply referred to as "reward"[1]
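To make these three concepts concrete, the sketch below shows the bare interaction loop they describe. The tiny corridor environment and the purely random agent are made-up stand-ins for illustration, not the API of any particular library.

```python
import random

# Hypothetical toy environment, purely for illustration: a corridor of five cells
# where the goal is the rightmost cell. The names and dynamics are assumptions made
# up for this sketch.
class TinyCorridor:
    def __init__(self):
        self.state = 0                        # state: which cell the agent is in
        self.actions = ["left", "right"]      # actions: a small, finite set

    def step(self, action):
        self.state = min(self.state + 1, 4) if action == "right" else max(self.state - 1, 0)
        reward = 1 if self.state == 4 else 0  # reward: feedback from the environment
        return self.state, reward, self.state == 4

env, done = TinyCorridor(), False
while not done:                               # pure trial and error: no learning yet
    action = random.choice(env.actions)
    state, reward, done = env.step(action)
    print(state, action, reward)
```

Nothing is learned in this loop; the Q table discussed next is what turns raw trial and error into behavior that improves over time.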

When the state and action sets are manageable and only short-term reward matters, the technique is relatively simple. A lookup table, referred to as a Q table or Q function, is populated with a reward value for each state-action pair. After the table has been populated, the program or robot can simply consult the table and retrieve the optimal action for its current state. For problems of manageable scope with a dedicated training period, an exhaustive approach can be used: with sufficient training, all possible states can be visited to populate the Q table.

Often, however, this is not the case, and populating the Q table requires balancing a trade-off between exploration and exploitation[1]. When a program finds itself in a given state, does it take the best currently known option (found by consulting the Q table, "exploitation"), or does it experiment with an unknown option in an attempt to add information to the Q table and hopefully discover a more rewarding state-action pair ("exploration")? The answer depends heavily on the implementation. If the program undergoes a learning or training phase before deployment, it can be programmed to lean toward exploration in order to populate the Q table. If the program is learning while operating, one must consider the consequences of a suboptimal action, as the real-world implications of an "experimenting" self-driving car in rush hour traffic are very different from those of a computer attempting to beat the high score on an Atari machine.
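As a rough illustration of how such a table might be populated, here is a minimal tabular Q-learning sketch on the same hypothetical corridor problem, using an epsilon-greedy rule to balance exploration and exploitation. The constants and the toy dynamics are arbitrary illustrative choices, not values taken from the cited sources.

```python
import random
from collections import defaultdict

# Toy corridor dynamics again (cells 0..4, goal at cell 4).
ACTIONS = ["left", "right"]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2         # learning rate, discount, exploration rate

def step(state, action):
    state = min(state + 1, 4) if action == "right" else max(state - 1, 0)
    return state, (1 if state == 4 else 0), state == 4

Q = defaultdict(float)                        # the Q table: (state, action) -> estimated value

def choose_action(state):
    if random.random() < EPSILON:             # exploration: try something at random
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])   # exploitation: best known action

for episode in range(200):                    # repeated trials gradually populate the table
    state, done = 0, False
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        # One-step Q-learning update: nudge the entry toward reward plus discounted future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

print(sorted(Q.items()))                      # after training, "right" dominates in every cell
```

Raising or lowering EPSILON is the knob that shifts the balance between exploring and exploiting.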

This seemingly straightforward concept becomes even more convoluted when applications require long-term "memories" of actions. Take as an example a robot programmed with a reinforcement learning algorithm attempting to navigate a maze. If the maze is one turn, the example is trivial; the longer and more complex the maze gets, the more actions must be "remembered" and considered as part of the state. The robot cannot hit a dead end after taking a left turn and populate its Q table with a low reward value for all left turns; it must take into account the turns taken previously. This means that problems involving long-term or even intermediate-term reward drastically increase the size and complexity of the set of states. Furthermore, the set of actions can also easily become unmanageable. A popular example of reinforcement learning is AlphaGo, a machine learning program taught to play the complex game of Go. In Go there can be as many as 10^170 possible future states, meaning that attempting to store and compute an exhaustive list of state-action pairs is unrealistic given the limitations of modern computing[1]. When the number of states and actions is unmanageable, techniques such as the Monte Carlo method can be used to reduce the memory and computation required. The Monte Carlo method derives an estimate by averaging repeated samples rather than exploring the entire space, and is very similar to the technique used to program AlphaGo[1]. Even with more advanced techniques, the startling speed at which problem spaces become unmanageable is a large part of why reinforcement learning is not a panacea, even though the concept is relatively old.
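To give a feel for the Monte Carlo idea mentioned above (estimating a value by averaging sampled outcomes rather than enumerating every possibility), here is a deliberately simplified sketch. The coin-flip "playout" is a made-up stand-in for a real game simulation and is not how AlphaGo's rollouts actually work; it only shows the averaging.

```python
import random

def random_playout(depth=10):
    """Simulate one random continuation to the end and score it +1 (win) or -1 (loss)."""
    return 1 if sum(random.choice([0, 1]) for _ in range(depth)) > depth / 2 else -1

def monte_carlo_estimate(num_samples=10_000):
    """Estimate the current position's value as the average outcome of sampled playouts."""
    return sum(random_playout() for _ in range(num_samples)) / num_samples

print(monte_carlo_estimate())   # approaches the true expected outcome as samples grow
```

The point is that a few thousand samples can stand in for a space of states that could never be enumerated outright.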

That the core concepts behind reinforcement learning are not new is not surprising, as they are based on the way humans and animals learn new behaviors in the real world. Over 100 years ago the psychologist Edward Thorndike documented an experiment in which he placed cats inside boxes that could only be escaped by pressing a lever. Initially the cats would simply pace around until they pressed the lever by accident, but after repeated trials they learned to associate the lever with the opening of the box and escaped more quickly each time. As far back as 1951, a Harvard undergraduate named Marvin Minsky, who would later become a well-known MIT professor and a pioneer of artificial intelligence, built a machine "that used a simple form of reinforcement learning to mimic a rat learning to navigate a maze"[2]. Why then, if these concepts are no longer considered novel, are they making their way into the business plans of recently funded startups and stealing headlines?

The answer (as it so often seems to be in the AI space these days) is deep learning using neural networks. Newer methods of reinforcement learning incorporate neural networks (Q networks[3]) to "learn successful policies directly from high-dimensional sensory inputs for end-to-end reinforcement learning"[3], giving them the ability to generalize from past experiences to new situations. The project described in the journal Nature[3] used screen pixels processed by a Q network as state information to train a machine to play a set of 49 Atari 2600 games and achieve the competency of professional human players. The sheer number of possible screen states means this would be unmanageable using traditional Q table methods.

Osaro, a popular startup in the space, raised $3.3 million in seed funding using a similar combination of deep learning for perception and reinforcement learning for control[4]. One way this particular startup is differentiating itself is by modifying its algorithms to account for an initial human-given demonstration. A demonstration of a competent approach by a human gives the algorithms a starting point and can drastically reduce the number of learning trials needed, allowing the machine to reach competency quickly and then focus training time and resources on mastery[4].

Using deep learning to process state-related data and recognize patterns, as opposed to attempting to populate Q tables with all possibilities, has drastically increased the scope of tasks that reinforcement learning can address. Examples of previously unmanageable tasks currently being tackled include training robots in factories, improving the energy efficiency of "smart" datacenters, and, of course, training self-driving cars[2]. The incorporation of deep learning with reinforcement learning has expanded the space of addressable problems, but also our ambitions and expectations; as a result, cutting-edge approaches are still struggling to cope with unmanageable datasets and reward models. There is a vast difference between the complexity of the algorithms and long-term reward systems needed to train a program to beat an Atari game and those needed to train a car to simultaneously plot efficient routes, avoid accidents, and not cause nearby drivers to get into accidents as a result of its actions.
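The sketch below is only meant to show, in the roughest terms, what replacing the Q table with a Q network means: a small network maps a state vector to one Q value per action and is nudged toward the same one-step target used in tabular Q-learning. The dimensions, the random stand-in "environment", and the bare two-layer network are my own illustrative assumptions; the DQN described in the Nature paper[3] additionally uses convolutional layers over raw pixels, experience replay, and a separate target network.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, HIDDEN = 16, 4, 32      # arbitrary sizes for illustration
W1 = rng.normal(0, 0.1, (STATE_DIM, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS))
GAMMA, LR = 0.99, 0.01

def q_values(state):
    """Forward pass: state vector -> one estimated Q value per action."""
    h = np.maximum(0, state @ W1)             # ReLU hidden layer
    return h, h @ W2

for step in range(1000):
    state = rng.normal(size=STATE_DIM)        # stand-in for a preprocessed game screen
    h, q = q_values(state)
    action = int(np.argmax(q)) if rng.random() > 0.1 else int(rng.integers(N_ACTIONS))
    reward = rng.normal()                     # stand-in for the game's score change
    next_state = rng.normal(size=STATE_DIM)
    _, next_q = q_values(next_state)
    target = reward + GAMMA * np.max(next_q)  # same one-step target as tabular Q-learning
    # Gradient step on the squared error between q[action] and the target.
    error = q[action] - target
    grad_W2 = np.outer(h, np.eye(N_ACTIONS)[action]) * error
    grad_W1 = np.outer(state, (W2[:, action] * (h > 0)) * error)
    W2 -= LR * grad_W2
    W1 -= LR * grad_W1
```

Because the network generalizes across similar state vectors, it never needs to store an entry for every possible screen, which is what makes problems on the scale of Atari tractable.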

Like many areas of current research, the amalgamation of cutting-edge techniques and age-old concepts is producing exciting results and expanding technological horizons. Reinforcement learning has been dusted off and reanimated at the forefront of research and industry, opening the door to a previously unattainable future of increasingly smart and capable machines.

http://dilbert.com/strip/2016-06-21

[1] https://www.oreilly.com/ideas/reinforcement-learning-explained
[2] https://www.technologyreview.com/s/603501/10-breakthrough-technologies-2017-reinforcement-learning/
[3] http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
[4] http://www.osaro.com/advanced-machine-learning-company-osaro-launches-with-33-million-seed-funding/

4 comments on “Reinforcement Learning”

  1. Great introductory article to RL. Andrej Karpathy's post (http://karpathy.github.io/2016/05/31/rl/) is also worth checking out – it's quite fascinating to see how RL fits in with the rest of ML – particularly supervised learning. He compares RL to supervised learning, except with RL, we kind of simulate the labels based on the reward function. It's cool to think that all we need to do is define the rules of the game, the reward function, and just let the simulation run over and over again in order to train for RL. It's also interesting to see how supervised learning can be combined with RL – AlphaGo used supervised learning to teach the model to predict human moves from expert Go games, and then combined it with RL to form the policy.

    1. Thanks Aaron,

      That post is definitely worth checking out, I like the level of detail it goes into about different techniques. It seems like everything I read about recently stops at just “AI” and what I find really interesting are the different techniques and how they are used in combination to achieve more ambitious goals. RL is such a simple idea, but to actually use it for something like self driving cars has proven to be very complicated and requires the incorporation of other strategies. The note about the dual techniques for AlphaGo is a great example

  2. Hello Kyle,

    I just wanted to say that I always knew you as one of the first people to post each week, and it was very nice to virtually meet you in class this week.

    Good work,
    Sara D.

  3. Thanks Sara,

    I try to get most of my schoolwork done over the weekend so I have less to juggle with work. Taking classes remotely is a strange experience so I was also happy to interact with the class live for a change.

