The Legend of AlphaGo Part II: How does it work?

Clare Teng

Before we get into the ‘nuts and bolts’ of the inner workings of AlphaGo, let’s first define some terminology mentioned in part I, namely deep learning (DL), reinforcement learning (RL) and supervised learning (SL).

Deep Learning (DL) refers to models built from many stacked, connected layers of artificial neurons. To illustrate how this works, let’s look at an example of an image classifier trying to distinguish between a dog and a cat. Broadly speaking, each layer learns a different feature: for example, ear shape in the 1st layer, face shape in the 2nd and eye colour in the 3rd. Together, these connected layers learn the information that is most useful for telling a dog apart from a cat. Deeper models have more layers and can therefore learn more features. In contrast, a shallow model might have just 1 or 2 layers, where the only information learnt for classification is the ear and face shape. In some applications shallow models are sufficient, but deep models are much more powerful and expressive.
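
To make the contrast concrete, here is a minimal sketch in PyTorch of a shallow and a deeper classifier for the hypothetical dog-vs-cat example. The layer sizes and the 32×32 image size are invented for illustration and have nothing to do with AlphaGo’s actual networks.

```python
# A minimal sketch (not AlphaGo): a shallow vs. a deeper dog-vs-cat classifier.
import torch.nn as nn

# Shallow model: a single layer mapping raw pixels straight to the 2 class scores.
shallow = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 2),   # 32x32 RGB image -> dog/cat scores
)

# Deeper model: stacked layers, each free to learn a different kind of feature
# (ear shapes early on, face shapes and eye colour further in).
deep = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256), nn.ReLU(),  # 1st layer of features
    nn.Linear(256, 128), nn.ReLU(),          # 2nd layer
    nn.Linear(128, 64), nn.ReLU(),           # 3rd layer
    nn.Linear(64, 2),                        # final dog/cat scores
)
```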

Reinforcement Learning (RL) is a sub-field of AI that builds systems which learn a series of actions to reach an optimal solution. An easy way to think of RL is to picture a child learning how to ride a bicycle without training wheels. Initially the child gets on the saddle and almost immediately loses their balance. After falling off, they get up and try again. Over time, the child learns how to balance through trial and error. Here, an optimal series of actions could look like the following: hold the handlebars, keep one foot on the ground, place the other on the pedal. Each individual action returns a reward, resulting in a cumulative score at the end. A less efficient series of actions might still end with a balanced bicycle but will return a lower score: for example, if the child became distracted by a nearby dog walker and spent a few actions playing with the dog.
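
The bicycle analogy can be written out as a toy score-keeping loop. Everything below (the action names, the reward values, the number of actions per attempt) is made up purely to illustrate how a cumulative score rewards an efficient series of actions.

```python
# A toy sketch of the trial-and-error loop described above (hypothetical rewards).
import random

ACTIONS = ["hold handlebars", "foot on ground", "foot on pedal", "push off", "play with dog"]
REWARD = {"hold handlebars": 1, "foot on ground": 1, "foot on pedal": 1,
          "push off": 2, "play with dog": -1}   # distractions lower the score

def ride_attempt(choose_action):
    """One attempt at riding: take a series of actions and add up the rewards."""
    score = 0
    for _ in range(4):
        score += REWARD[choose_action()]
    return score

# Trial and error: a random policy at first...
random_score = ride_attempt(lambda: random.choice(ACTIONS))
# ...versus a learnt policy that has discovered an efficient sequence of actions.
learnt = iter(["hold handlebars", "foot on ground", "foot on pedal", "push off"])
learnt_score = ride_attempt(lambda: next(learnt))
print(random_score, learnt_score)   # on average, the learnt sequence scores higher
```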

Supervised Learning (SL), in contrast, looks slightly different. Now imagine that, instead of having no prior experience, the child has a few examples written out for them on flashcards, like those illustrated below. Instead of learning from scratch, the child already knows some series of actions that lead to a balanced or an unbalanced bicycle. Since the inputs (flashcards) and outputs (success/failure) are already defined, the child learns to predict the label (success/failure) by studying the patterns in the given flashcards. The final goal is achieved when the child is given a different type of bicycle and still knows how to balance: they have learnt to generalise and apply their skills in a similar but new scenario.



The first row shows 3 flashcards with instructions on how to ride a bicycle, followed by a final flashcard labelled success or failure depending on the actions taken. The first row gives an example of success, while the second row gives an example of failure.
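
In code, the flashcard idea might look like the toy sketch below: each ‘card’ pairs a set of actions (the input) with a success/failure label (the output), and the ‘training’ step simply looks for the action that separates the two groups. All of the examples and action names are made up.

```python
# A toy sketch of supervised learning on flashcards (all examples invented).
flashcards = [
    ({"hold handlebars", "foot on pedal", "push off"}, "success"),
    ({"let go of handlebars", "foot on pedal", "push off"}, "failure"),
    ({"hold handlebars", "foot on ground", "push off"}, "success"),
    ({"let go of handlebars", "foot on ground", "push off"}, "failure"),
]

# 'Training': find the action that appears in every success but in no failure.
successes = [actions for actions, label in flashcards if label == "success"]
failures = [actions for actions, label in flashcards if label == "failure"]
learnt_rule = set.intersection(*successes) - set.union(*failures)

# 'Testing' on a new bicycle, i.e. an unseen combination of actions: generalisation.
new_attempt = {"hold handlebars", "pedal slowly"}
print("success" if learnt_rule <= new_attempt else "failure")   # -> success
```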

To summarise, RL focuses on the optimal process of learning via trial and error, whilst SL learns by using training examples. Both reach the same final goal but RL cares about the process it took to get there, while SL does not. 

Okay, phew. We’ve now done the hard work of getting all that terminology out of the way, so we are ready to delve into the different components of AlphaGo which come together to create a master Go player.

There are 3 main components to becoming a master:
i) an apprentice learning from a master how to play winning moves in Go;
ii) the apprentice playing against themselves at a Go table to improve their strategies;
iii) the apprentice evaluating which moves work best for winning the final game.

In short, the list tells us the apprentice needs to first learn how to play the game (i); then they need to improve their strategies (ii); and finally, they need to evaluate which of their strategies would be successful in a real game (iii).

Part I: Learning from the experts

The first part of AlphaGo builds an SL model which learns from a database of expert moves (30 million board positions). To continue with our earlier example of SL, imagine an apprentice who has been provided with flashcards containing series of actions taken by experts. However, instead of the label being success/failure (as in our bicycle example), the label is the next move the expert actually played, given the moves made so far.

Now, this apprentice has many, many flashcards to learn and take in. So, to ease their burden, two different SL models were trained on the same data. The difference between them is their size: the second has far fewer parameters to update and tune, which directly reduces the computation power and time required. The 1st apprentice predicts the expert’s next move with 55% accuracy, outperforming previous benchmarks by around 10%. The 2nd apprentice predicts with only 24% accuracy, but computes its output orders of magnitude faster than the former.
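
As a rough sketch only (not AlphaGo’s actual architecture or training code), the two apprentices can be pictured as a bigger and a smaller network that each map a 19×19 board to a score for every one of the 361 possible moves. The layer sizes below are invented for illustration.

```python
# A rough sketch of the two Part I models (layer sizes are illustrative, not AlphaGo's).
import torch
import torch.nn as nn

# 'Apprentice 1': a deeper network -- many more parameters, more accurate, slower.
accurate_policy = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 19 * 19, 361),
)

# 'Apprentice 2': a much smaller model -- fewer parameters, less accurate, far quicker.
fast_policy = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 361))

# One supervised training step: given a board position, predict the expert's next move.
board = torch.zeros(1, 1, 19, 19)    # a blank board, standing in for a real position
expert_move = torch.tensor([60])     # index of the move the expert actually played
loss = nn.CrossEntropyLoss()(accurate_policy(board), expert_move)
loss.backward()                      # nudge the parameters towards the expert's choice
```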

Part II: Learning how to be a better player

Now that the apprentice has learnt strategies from Part I, it’s time to improve their gameplay. The eventual aim is for the player to create new moves for themselves, rather than only mimicking experts. Hence, the player takes the more accurate model from Part I as their starting point and continues to train it. We no longer require the original database of expert moves.

To improve, the apprentice now plays against themselves, pitting their current strategy against a randomly selected earlier version of itself. This form of self-play is RL. Each new game allows the player to iterate on and update their prior strategies based on what worked and what didn’t. Remember that in RL the process of reaching the goal is just as important as the goal itself, which means the player is learning to win the game with an optimal series of moves that returns the highest score.
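
A highly simplified picture of this self-play loop is sketched below. The ‘game’ here is a coin-flip stand-in rather than Go, and the update rule is a placeholder rather than a real RL update, but the shape of the loop is the one described above: play a randomly chosen earlier version of yourself, then adjust based on the result.

```python
# A toy self-play loop: not Go, and not a real RL update, just the overall shape.
import copy
import random

def play_game(current, opponent):
    """Stand-in for a full game of Go: returns +1 if 'current' wins, -1 otherwise."""
    win_chance = current["skill"] / (current["skill"] + opponent["skill"])
    return 1 if random.random() < win_chance else -1

current_player = {"skill": 1.0}                  # the strategies learnt in Part I
past_versions = [copy.deepcopy(current_player)]  # a pool of earlier selves to play against

for game in range(1000):
    # Pit the current strategy against a randomly selected earlier version of itself.
    opponent = random.choice(past_versions)
    outcome = play_game(current_player, opponent)
    # Placeholder update: reinforce what worked, discourage what didn't.
    current_player["skill"] = max(0.1, current_player["skill"] + 0.01 * outcome)
    # Every so often, snapshot the current strategy into the pool of past versions.
    if game % 100 == 0:
        past_versions.append(copy.deepcopy(current_player))
```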

Part III: Evaluating their gameplay

Alright, so we have a player who has learnt their game first from experts, and then improved by playing against themselves repeatedly. The last bit to tackle is evaluating how well they played – what series of moves returns a win? 

The naive solution is to build a model that uses the expert database from Part I to evaluate the player’s moves. However, since the database’s positions are highly correlated (they come from whole games rather than independent moves), the evaluation model would end up memorising entire games rather than generalising to new positions. This phenomenon is known as overfitting. You can think of it as studying for a test by only memorising past exam papers. If the final exam looks exactly like one of your practice papers, you’ll score really well! However, if it looks different, you won’t be able to answer the questions, since you never learnt why the answers worked in the first place.

Instead, for evaluation, the model uses the games played in Part II as its new database. Only 1 position per game is used for training, which mitigates the problem of correlation and yields 30 million unique game positions.
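
A small sketch of this data-preparation step, using made-up data structures, might look like the following: each self-play game from Part II contributes exactly one position, labelled with that game’s final result, so the evaluation model never sees two highly correlated positions from the same game.

```python
# A sketch of building the Part III evaluation dataset (data structures are hypothetical).
import random

def build_value_dataset(self_play_games):
    """Each game is (list_of_positions, final_result); keep one position per game."""
    dataset = []
    for positions, result in self_play_games:
        position = random.choice(positions)   # a single position from this game
        dataset.append((position, result))    # label it with who eventually won
    return dataset

# Toy usage: two 'games', each a list of board positions plus a win (+1) or loss (-1).
games = [(["pos_a1", "pos_a2", "pos_a3"], +1),
         (["pos_b1", "pos_b2"], -1)]
print(build_value_dataset(games))   # two uncorrelated (position, outcome) pairs
```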

And that’s it! Parts I, II and III – learning, improving and evaluating – are repeated, refined and repeated again until eventually we have a master Go player. In the third and final part of the series we will see AlphaGo put to the test against human players for the first time, and explore what may come next…

There is no consensus on the definition of AI. Broadly speaking, AI programs are systems that think and act like humans.
