Generative Deep Learning - The Key To Unlocking Artificial General Intelligence Meetup #LondonAI
This meetup video was recorded in London on February 28th, 2020.
Generative modeling is one of the hottest topics in AI. It's now possible to teach a machine to excel at human endeavors such as painting, writing, and composing music. In this talk, we will cover:
- A general introduction to Generative Modelling
- A walkthrough of one of the most utilized generative deep learning models: the Variational Autoencoder (VAE)
- Examples of state-of-the-art output from Generative Adversarial Networks (GANs) and Transformer-based architectures
- How generative models can be used in a reinforcement learning setting (World Models paper)
- Why I believe generative models will play a crucial part in the quest to build Artificial General Intelligence (AGI)
Bio:
David Foster is a Founding Partner of Applied Data Science Partners (https://adsp.ai/), a data science consultancy building innovative AI solutions for clients. He holds an MA in Mathematics from Trinity College, Cambridge, UK and an MSc in Operational Research from the University of Warwick. David has won several international machine learning competitions and is the author of the best-selling book "Generative Deep Learning: Teaching Machines to Paint, Write, Compose and Play". He has also authored several successful blog posts on deep reinforcement learning including "How To Build Your Own AlphaZero AI using Keras".
Read the Full Transcript
David Foster:
My name is David Foster. I am the author of a book called Generative Deep Learning. I decided about a year and a half ago that I wanted to write a book about something I felt was going to be absolutely huge in the next year, two years, three years. And that has certainly come true. It's in the public eye now more than ever before, I think, with things like deepfakes, and with things like GPT-2, the model that was released by OpenAI to do text generation, and what that could mean for things like fake news.
It really is more prominent than ever, and we have a couple of copies here this evening, so get thinking of questions. There will be a giveaway at the end for the best question. It's a first edition. They're all first editions because I haven't run a second edition yet, but if you put it on eBay, you get more if you say it's the first edition. Cool. So, Generative Deep Learning.
This book really came out of the desire to write about something that I think most of us as data scientists, if that's what you do as your profession, don't perhaps get the chance to build as part of our daily work: generative deep learning models. You may build machine learning models for sure. But things like deep learning and generative deep learning are hobbyist topics at the moment. And I think in years to come, that may change, but I wanted to write a book that was just about the pure fascination of building things that make machines seem human.
So, whether that's text generation, whether that's image generation, or music generation, this is a subject that just captures the imagination. When I was writing this book... I'm the co-founder of a company called Applied Data Science. We're a London-based data science consultancy. Please do visit our website. We're ADSP.AI, Applied Data Science Partners. You'll see there some of the case studies of things that we do. I know you're all on the wifi now, so do check it out.
And we basically, we build bespoke data science solutions for companies. So we're not a platform, we just hire fantastic data scientists, data engineers and data analysts, and we build bespoke machine learning and data science solutions for companies. So we are hiring as well, so if any of you are looking for a move, please do check out our website even more for those job specs. So the subject of this talk really: we're going to go on a real expedition through generative deep learning, right the way from first principles.
So we don't assume any knowledge of what generative deep learning is, all the way through to cutting-edge GDL, generative deep learning, today. And we'll cover these five topics, with zero indexing there. So, intro to generative deep learning: we'll just cover what we're trying to achieve by doing this. We'll then cover something called a variational autoencoder, which I think is a really great entry point into this subject. It's very easy to get bogged down by GANs quite quickly if that's the first thing you come across with generative deep learning. I'd recommend, if you're getting started on this, start with VAEs.
We will cover GANs as well in this talk, before moving on to something a bit more speculative: we're looking at something called the World Models paper by David Ha and Jürgen Schmidhuber that came out in 2018, and this is a fascinating example of not just using generative deep learning for the creation of something, but actually using it within a reinforcement learning setting. And they showed that it was actually possible for a machine to learn within its own dream of what the environment might do in the future.
And that involved actually a variational autoencoder to do that. So we'll take a look at that. I want to finish with something that's going to get you thinking on the way home on the tube about what this might mean for artificial general intelligence. It's a subject everyone talks a lot about, but it's so intangible, we don't really know what this means at the moment. But I think generative deep learning is something that we can get a grasp of, and it's something that I think will form the basis of our endeavors into this in the future. And so we'll just talk speculatively about what that might mean. So we're going to start with the intro to GDL.
Well, what is it? Well basically, a generative deep learning model, or generative modeling in general, is where we're trying to create new examples that could have come from some training set. So, you can see here down on the bottom left, we might have a training set of images, so CelebA would be a typical example, and then in this video here on the bottom right, we're seeing the output from a GAN, a generative adversarial network, called StyleGAN, which is trying to create new examples of something that may have come from this training set.
So see how this isn't a discriminative learning problem. We're not trying to label images here, we're trying to create new ones altogether. Which in itself is a much harder problem, as we will see. So, this field has been evolving over some time, but particularly in the last five, six years I would say, since the invention of the GAN, the rate of progress is astonishing. This image on the right is a real example of an image generated by a machine of something that might come from the CelebA data set. It's from StyleGAN2, which was released by Nvidia just, I think, late last year. Late 2019.
And so you can see here that the rate of progress is really astonishing, and when we talk about generative modeling, the obvious comparison to make is with discriminative modeling, as I said, where we're trying to put an emotion on an image, for example. So you would give the model this image, and it would pass through a convolutional neural network, usually, to produce in this case five possible responses, which could be shock, happiness, or anger. And these would be numbers that the model is outputting.
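In code, that kind of discriminative model might look roughly like the following Keras sketch (the input size and the five emotion classes here are illustrative assumptions, not the exact network from the slides):

```python
# A minimal discriminative model: labeled image in, class probabilities out.
from tensorflow.keras import layers, models

emotion_classifier = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(5, activation="softmax"),  # e.g. shock, happiness, anger, ...
])

# Labeled data: every training image comes with its emotion label.
emotion_classifier.compile(optimizer="adam",
                           loss="sparse_categorical_crossentropy",
                           metrics=["accuracy"])
# emotion_classifier.fit(images, emotion_labels, epochs=10)
```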
So the whole point of doing this is that you want these numbers to be as accurate as possible, and there's a well-defined metric that you can use to tell. And that is because this data set is labeled. You know from the training set that there are a certain number of images that are happy faces, there are a certain number that are angry, and so on. With generative modeling, we don't really have that luxury. Generative modeling works on unlabeled data. So you just have the images themselves, and the model has to work out what it is about those images that makes them belong to that set, despite never having seen anything outside the set.
So you can think of it as everything being labeled as "this is in the set", so find me something else that would belong to this set. A much more difficult problem. So mathematically, what we're trying to do here is: we have this unlabeled data set on the left-hand side. We've just got two dimensions here, just to play with a toy example. So x1, x2, and we're trying to find the underlying distribution of this data set, p(x). And what we want to do with this p(x) is sample from it.
We want to be able to say: given the distribution (and we all know how to sample from something like a normal distribution), we want to sample from this p(x) distribution to generate a new data point inside of the set. So it might say, "I think (-0.6, -0.8) belongs to this set." Again, just to make the point, the discriminative model is not trying to do this. It is trying to predict p(y|x). So y is our response column, the emotion of the image for example. So what's the probability of a face being happy given this image?
And yeah, you can see here the difference in the training sets is that the right-hand set is labeled and the generative modeling data set is not labeled. So let's play with this toy example. So here we have a set of points that I have generated on the grid according to a rule. Does anyone want to hazard a guess what the rule is? First of all, has anyone seen this in the book? Because this is also in the book. Okay. You're not allowed to answer. That's great, one person's got the book. Awesome. Great.
So yes, anyone want to hazard a guess as to how this data set has been generated? Nope. Okay. And that's no problem, because it's quite difficult. So this is a generative model that we could build to sample other data points from this two-dimensional grid, and you would be perfectly reasonable to suggest this as a generative model, because we can sample from it. Very important for a generative model. We can pick a data point within this box and never pick one outside the box. So again, mathematically what we're doing is putting a uniform distribution within the box, and outside the box is zero probability.
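As a minimal sketch in code, that box model might look something like this (the box bounds are made up for illustration):

```python
# A toy generative model: uniform probability inside a box, zero outside.
import numpy as np

X1_MIN, X1_MAX = -4.0, 4.0   # assumed box bounds, for illustration only
X2_MIN, X2_MAX = -3.0, 3.0

def sample(n):
    """Sample n new points; every one lands inside the box, never outside."""
    x1 = np.random.uniform(X1_MIN, X1_MAX, size=n)
    x2 = np.random.uniform(X2_MIN, X2_MAX, size=n)
    return np.stack([x1, x2], axis=1)

def p_x(x1, x2):
    """Model density p(x): constant inside the box, zero probability outside."""
    if X1_MIN <= x1 <= X1_MAX and X2_MIN <= x2 <= X2_MAX:
        return 1.0 / ((X1_MAX - X1_MIN) * (X2_MAX - X2_MIN))
    return 0.0
```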
So that's a generative model. But, as you can see here, this is the true data generating distribution. This is what we were trying to model, and obviously I knew this up front, because I am playing God here and saying, "Yes, this is the true data generating distribution." But obviously in real life, you don't know this. You don't know what the true face-generating distribution is. And you can see there are some instances where the model gets it very wrong.
So for example, like here, this is a point that just isn't in the data generating distribution, but our model is estimating that it is. But equally, there are some points, such as up in Alaska, that are in the true distribution, but our model, the one that we have produced, would never pick something in this top left-hand corner. And what we're trying to do with generative modeling is make these two match as closely as possible, so that our model never produces something that the human eye would notice is outside the distribution.
So a face that looks not like a face, and equally, we're trying to produce something that captures every kind of face. So it doesn't just produce certain male faces, for example, but also female faces if they are also in the data set. And this is a really key point to remember, whatever generative modeling you're doing, whether it's with text, images, or music: this is ultimately the whole point of doing the exercise. Why is it difficult?
Well, the problem is that you have this huge high-dimensional data set in millions of dimensions, not just the two that we've just seen. The fact is that you've got maybe a thousand by a thousand pixels, and for every single one of those pixels you need an RGB value, so that's another three that you're multiplying by, and we have to find the needle in a haystack here, because there are so many of these that are going to be obviously not faces, and only a tiny fraction that are.
So there are two problems that we come up against. I've mentioned the second one, that real observations are incredibly sparse in this space, but also there's this complex dependency between features. And features here are the individual pixels. So how does the model know, or how should it know, that a pixel in the top right corner, if that's green because they're on a green background, should be carried across to the other side of the image as well? So here's the problem: we've got this incredibly vast expanse of space in which to find true observations that were in the data set, and also, even when we think we've found one that looks decent, a human eye would tell that actually the right eye is brown and the left eye is blue.
So I know this isn't right. So it needs to find this very complex dependence across pixels. And deep learning is really where we've excelled recently, because it solves both of these problems, or at least goes a long way to solving them, as we saw with StyleGAN2. So we're not going to start with GANs, we're going to start with variational autoencoders. So hopefully by the end of this section, you will all be experts in how to build them, what they're trying to achieve, but more importantly actually, why.
Why we're taking this approach to generative deep learning. So, I want you to look at this data set here. This is a data set of cylinders, obviously, and I want you as humans to think, "How do I generalize what these cylinders are?" Am I looking at the individual pixel values? Am I looking at the colors? What am I trying to do here when I look at this data set? And I would imagine that most of you have realized that there are two features that are important here. The two features are the height, and the width, which naturally is also the depth.
So, we want our model to do this as well. We want it to be able to look at this data set of cylinders and realize there are two dimensions in which they can be embedded, and those two dimensions are, as you can see, the width on the horizontal axis here, and the height on the vertical axis. And crucially, as we said earlier, not only are these able to be embedded into that space, but equally any point in that space equals a cylinder that we haven't seen. And that's what generative learning's all about.
It's about finding these data points that we haven't seen, but still belong to the same set. And what we're doing as humans then is saying, well, we can decode those two numbers, a height and a width, into a picture of a cylinder, and given some cylinder, we can encode that cylinder into a two-dimensional latent representation. And I'll say that a lot of times in this talk: a latent representation is basically this lower-dimensional representation of what the image is.
So in this case, the latent space is two dimensions big: height, and width. But in things like face image generation, it can be maybe two hundred dimensions big. Is it okay if we take questions at the end actually? Just [inaudible] cool. Thanks. Okay. So this latent space, let's imagine it's two dimensions, so you can see here the cylinder on the left-hand side being encoded into the latent space. We then pick up the point in the latent space, and we can decode it back into pixel space.
So what we need in this variational autoencoder is two models. We need an encoder, and we need a decoder. Both of these are neural networks. I'm not going to go into how you train neural networks in this talk, but the gist of it is you show it lots of examples, it backpropagates any error through the network, and over time the weights adjust to make less error. And the error in this case is the difference between the original image and the image after it gets passed all the way through the network.
So, we take this pixel image here, we pass it all the way through the encoder, so it's now just a two-dimensional vector, and we decode that vector back into pixel space and ask how similar those two images are. And if it's doing its job well, if the encoder is doing its job well, it should be encoding what this image is into two dimensions so that the decoder can be like, "Oh, yeah, okay. I know what that is. It's something that I decode to look like this." And if it's not doing its job well, then there's a disconnect between these two things, and the model will train itself to be better over time.
So the loss function here might be something like root mean squared error between the two images, where you simply take the individual pixel values and ask what the RMSE between those two is. And you want to get that as low as possible over time. So you train this over many epochs by showing it images from the data set, and you notice there's no label here, because you're just training it on the image itself. It's learning just from the unlabeled data.
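Put together, a plain autoencoder of this kind might look roughly like the following Keras sketch (the layer sizes, the 28x28 input and the use of plain MSE as the reconstruction loss are my assumptions for illustration, not the book's exact code):

```python
from tensorflow.keras import layers, models

LATENT_DIM = 2  # squeeze each image down to just two numbers

# Encoder: pixel space -> two-dimensional latent vector
encoder = models.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(256, activation="relu"),
    layers.Dense(LATENT_DIM),
])

# Decoder: latent vector -> back to pixel space
decoder = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(LATENT_DIM,)),
    layers.Dense(28 * 28, activation="sigmoid"),
    layers.Reshape((28, 28, 1)),
])

autoencoder = models.Sequential([encoder, decoder])

# Reconstruction loss: how far is the decoded image from the original?
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)  # no labels needed
```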
And what you get here, if you look on the right-hand side... actually, start on the left. So this is a data set that you've probably seen before if you've done any deep learning. It's everywhere. It's called the MNIST data set. And when you train an autoencoder on these to produce things that still look like numbers, and you encode everything in the data set, you get something that looks like this. So it's in two dimensions, because we're encoding into a two-dimensional latent space, and every single point here is an image. And what I've done is colored these by the label that was in the MNIST data set.
We haven't trained on that label, it's just being used to color the image, and you can see here quite cleverly what the encoder has done is try to separate out things that look the same. So, all of the ones are grouped together, all of the sixes are grouped together, all of the twos are grouped together. And it's doing that to give the decoder an easier job at the other end, because what it's got to do is take one of these points and try to reproduce the image.
But there are two problems here, because this is actually not a variational autoencoder, this is an autoencoder, and autoencoders have the problem that the sample space is really poorly defined, first of all. If you were to pick a new point in this space, am I allowed to pick (100, -50)? And if not, why not? There's nothing stopping me from picking that point, and if I did pick that point in the latent space, am I guaranteed that it's going to decode to something sensible?
Autoencoders don't really answer that question. And you can see here as well, if I just take three points at random in this space and ask the decoder to decode them, you get some fuzziness, where there's no real continuity between points in the space. And that's because if you ask the decoder to go from this orange region here and decode points towards the blue region, is there anything really to tell it that it needs to move smoothly between those two points and gradually merge, say, a one into a seven?
Not really, and the problem with this is we need it to be able to do this, because if we're going to produce something like StyleGAN, where it's merging between facial images, then we don't want it to be discontinuous halfway through and produce complete noise, because that's a point that we may sample as a face. So we need this local continuity, number one, and we need the sample space to be well defined, so that we can justifiably have a region into which we can sample, and not have this problem of the space extending infinitely in every direction. The solution is the variational autoencoder. This came along surprisingly recently.
All of these things are recent in the grand scheme of things, but this was one of the sparks that generated the revolution in generative deep learning. What they realized was that we should include another term in the loss function, called the KL divergence, which our previous speakers already mentioned, actually. What this does is it basically says: instead of mapping an image to one single point in the latent space, what we do instead is map it to a normal distribution with a mean and a standard deviation in the latent space.
So you can think of this as a fuzzy region now that's being mapped to in the latent space, and we want that to be as close to a standard normal as possible, a standard normal being zero mean and standard deviation of one. And why do we want to do that? Well, firstly, it solves the sampling problem that we mentioned before. Now we can just sample from standard normals in order to generate new points. We don't have this problem of the infinite space that we could possibly sample from. We have a well-defined, known distribution, the normal distribution, that we can sample from when we want to push something through the decoder.
And secondly, it solves the local continuity problem, because now when the decoder samples a point, it could be anywhere in that region around the point that the encoder has pushed the image to. And that's really important, because this fuzziness almost creates the necessity for the decoder to be good at decoding not just this point: things around it need to also be decoded to something similar. And that was really the key to variational autoencoders becoming one of the first examples of models that got spookily good at things like generating faces.
And you can see here, as I mentioned, the loss function is now basically a sum of the root mean squared error and the KL divergence. This KL divergence just says, "Make sure that things are very close to the standard normal distribution." And what do we get now if we look at this scatter plot of the encoded points? Well, everything is around the standard normal with mean zero and standard deviation one, which is exactly what we want. Everything has been crushed in near the origin, so that you can sample from this two-dimensional normal distribution and get something that is very close to what a digit might look like.
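In code, the two pieces of that loss, plus the sampling of the fuzzy region (the reparameterization trick), might look roughly like this TensorFlow sketch (my own variable names, not the book's exact implementation):

```python
import tensorflow as tf

def sample_z(z_mean, z_log_var):
    """Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, 1)."""
    eps = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def vae_loss(x, x_reconstructed, z_mean, z_log_var, kl_weight=1.0):
    # Reconstruction term: pixel-wise error between original and decoded image
    recon = tf.reduce_mean(tf.square(x - x_reconstructed), axis=[1, 2, 3])
    # KL term: closed-form divergence between N(z_mean, exp(z_log_var)) and N(0, 1)
    kl = -0.5 * tf.reduce_sum(
        1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1)
    return tf.reduce_mean(recon + kl_weight * kl)
```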
And you can move a little bit to the left, and it will produce something that's a little bit different, not completely different, which is exactly what we want. And you can see how this is densely populated: if we take the z values and p values, this is very densely populated, and we don't have these vast areas of white space that we had before. So, this is one example of how a variational autoencoder might be used to generate digits. Let's just take a look actually.
If we imagine my finger is moving around the latent space, this is what the decoder is decoding points to, and you can imagine actually some of these might look a bit weird, but imagine our numeric system developed differently, then we may have some of these symbols appearing as digits. They are viable digit-looking things, even though they're not actual numbers. And that's what's important. We don't have these weird disconnects between points in the pixel space, for example. Cool.
So let's move on from digits, because it's a very boring data set. We can look at images instead. We can train exactly the same model on the CelebA data set. This is something you can do in the book; all of the examples are there for you to build these things at home. This is just a couple of epochs of training. It gets better obviously over time, but I just wanted to show you how quickly you can get set up with this sort of thing and build things that actually look fairly impressive on your laptop. So yeah, these are all examples of faces that have been generated, some obviously better than others, but these are just completely picked at random.
And these are where I just sample a point from the normal distribution, so pick a point somewhere around the origin, decode that, and you get a face back. And if you change the point, you get a different face, and it's quite fun to play around with. One of the other fun things you can do with variational autoencoders is latent space arithmetic. There's a few things you can do. You can add and subtract properties, for one thing.
So you take a point by encoding an image into the latent space, so if we're in two dimensions, we've got a two-dimensional vector, and we calculate the sunglasses vector, say, which is where we take every image in our data set with sunglasses on, and we take every image without sunglasses on, and subtract those two in the latent space. So you now have a vector along which you can travel to add sunglasses to any image. So let's say this is the vector that was calculated. What it will do is it will take the encoded image here, add sunglasses to it, and then you can decode that image into someone that has sunglasses on.
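A minimal sketch of that latent-space arithmetic, assuming a trained `encoder` and `decoder` (those names, and an encoder that returns the latent point directly, are my placeholders rather than the book's code; the `morph` helper anticipates the interpolation example that comes next):

```python
import numpy as np

def attribute_vector(encoder, images_with, images_without):
    """E.g. the sunglasses vector: mean encoding of images with the attribute
    minus mean encoding of images without it."""
    z_with = np.mean(encoder.predict(images_with), axis=0)
    z_without = np.mean(encoder.predict(images_without), axis=0)
    return z_with - z_without

def add_attribute(encoder, decoder, image, attr_vec, strength=1.0):
    """Encode an image, move it along the attribute vector, decode it back."""
    z = encoder.predict(image[np.newaxis, ...])[0]
    return decoder.predict((z + strength * attr_vec)[np.newaxis, ...])[0]

def morph(encoder, decoder, image_a, image_b, alpha=0.5):
    """Decode a point alpha of the way between two encoded images."""
    z_a = encoder.predict(image_a[np.newaxis, ...])[0]
    z_b = encoder.predict(image_b[np.newaxis, ...])[0]
    return decoder.predict(((1 - alpha) * z_a + alpha * z_b)[np.newaxis, ...])[0]
```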
And you can see I've calculated here some other vectors: the smiling vector, the blonde vector, the male vector, and the sunglasses vector, and just by moving along a vector more and more, you can change the input image along a certain property. And again, you don't even need huge amounts of compute to build something like this. This is all just done on my laptop with CPUs and stuff, but obviously GPUs would speed up the training. But the underlying technology is exactly the same. We're just doing good old-fashioned arithmetic in the latent space. And you get amazing things like this. So another thing you can do is you can interpolate between two images, so exactly the same idea.
Encode two images into the latent space, and you have some alpha along this vector that you travel between zero and one; decode the point partway along, and you get a merge between the two images. So, you could merge Prince Charles' wife over there and a lady on the left-hand side, and you blend the two images so that one slowly morphs into the other. That's quite fun, and then the last thing you can do here is you can build a model to generate Lego heads, which somebody did the other day. This is a blog post from some chap, but please do check it out. It looks really fun, and a good example with a really nice data set of Lego heads.
You can basically train a variational autoencoder to do exactly this and build your own examples of what a new Lego head might look like. Right, okay. So that is variational autoencoders. It's a great way to get started with generative deep learning. The thing that most people think about when they think of GDL is GANs. It's probably the best example these days of how to build really sophisticated generative deep learning models, especially where images are concerned. And we're going to take a little look at how they differ from VAEs, because they're not actually that different. So we're going to start with the VAE; as we've just seen, this is exactly the same architecture.
And we'll ask the question, what if we remove the encoder and we build a separate model to predict if the decoder is generating things that are real or fake? Sorry, to predict if a given image is real or fake, I should say. So it's fake when it's generated by the decoder, and it's real if it comes from our data set. So, that is what we're doing here. I'm going to play that again. It took me so long to build that slide transition, it's worth a second look. So, here this is the encoder.
Speaker 2:
[crosstalk] VAE
David Foster:
Should have done, yeah. You're right. It's only a few years away for me. So you can see here, the encoder, we're actually taking this and converting it into what is called in the GAN framework a discriminator. Because its job now is not to convert something into a latent representation, but to output a single number, which is how real it believes this image is. So is it something from the training set, or is it something that the decoder has produced?
The generator is identical to the decoder in a VAE, so there's very little between them really, and people always think of GANs as really difficult and VAEs as the simple younger brother, but that's not the case at all. They're actually both quite simple models at heart; it's just about how you think about them, and how you really understand what they're doing. So this is a GAN. The real difference with VAEs is that the VAE model is connected in the middle.
There's a latent space right in the middle there that data points get passed through, images get passed through, whereas the GAN is two separate networks that you have to train independently. So let's see how we do that. So let's talk about training the discriminator first. And remember, its job is to discriminate between real and fake images. So what we do is we generate a batch of images from the decoder, from the generator I should say in this framework, and of course these to begin with are going to be awful, because it has no idea how to generate faces.
But we're seeing here, maybe three quarters of the way through the training process, where it's generating something that's pretty good, and we're going to mix them with some real images that we have from the CelebA training set. And then we pass them all through the discriminator and ask it, what do you think is real and what do you think is fake? So there are its predictions, and there's the target that is actually the ground truth. This is just like regular machine learning, discriminative modeling.
And the loss there is something like cross-entropy loss, where we take the prediction and we take the target, and we have a metric that says how close those two things are together. We don't usually use something like root mean squared error here, because this is a binary response column; binary cross-entropy is a good choice for the loss metric. So you can see there, it's done pretty well at the top, because it's predicted this one is not very real, but it's got this one really wrong: it says actually, I thought this was a real face, because it predicted 0.89, but it was actually zero, so it's a high loss.
And you do that again and again, and you train a discriminator. That's how that works. Training the generator is a little bit more tricky, because what we have to do is first of all generate a batch of images, and then pass them through the discriminator to get out this number here, which is how real the discriminator thinks these generated images look. But now, notice that instead of a target of zero, which is what we were using for the training of the discriminator, because we wanted it to spot these were fake, we actually now say the target here is one.
Because we want this generator to generate things that are more likely to be true images. So when we backpropagate these errors, we must freeze the weights of the discriminator, because otherwise the discriminator is going to start training against this target variable, and we don't want it to do that. We want it to train only in its own training process. So this is just for training the generator. But we need to use the discriminator as a mechanism for generating numbers that the generator can be trained on.
So, these are the two training processes, and what we do is we just iterate them. So you train the discriminator for a bit, then you train the generator for a bit. Then the discriminator, then the generator, and you literally just play them off against each other. And you say to the discriminator, okay, get better at discriminating these terrible images from these real ones. And it gets the hang of that, and then you tell the generator, well, do a better job, because the discriminator is now pretty good at spotting your mistakes.
So then the generator gets better, and then the discriminator has a harder time. It's like two adversaries, which is where the name comes from, playing off against each other, and slowly getting better over time. And that's it. That's all there is to training generative adversarial networks. There are obviously a number of ways that the training process has been enhanced, developed, and improved, both in terms of the stability and the speed at which they're trained. And there's lots of information in the book actually about things like the Wasserstein GAN, which was one of the first ways that the stability was improved, and WGAN-GP, which is now the state-of-the-art way of training a GAN.
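As a rough sketch of that alternating training step in TensorFlow (my own code, assuming a `generator` and a `discriminator` that ends in a sigmoid; not the book's exact implementation):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def train_step(generator, discriminator, g_opt, d_opt, real_images, latent_dim=100):
    batch = tf.shape(real_images)[0]

    # 1) Train the discriminator: real images -> target 1, generated -> target 0.
    z = tf.random.normal([batch, latent_dim])
    fake_images = generator(z, training=False)
    with tf.GradientTape() as tape:
        real_pred = discriminator(real_images, training=True)
        fake_pred = discriminator(fake_images, training=True)
        d_loss = (bce(tf.ones_like(real_pred), real_pred)
                  + bce(tf.zeros_like(fake_pred), fake_pred))
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2) Train the generator: the target is 1 ("look real"). Only the generator's
    #    variables receive gradients here, so the discriminator stays frozen.
    z = tf.random.normal([batch, latent_dim])
    with tf.GradientTape() as tape:
        fake_pred = discriminator(generator(z, training=True), training=False)
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))

    return d_loss, g_loss
```

Iterating this step over the data set is exactly the back-and-forth between the two adversaries just described.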
And you can do similar things to what you can do with VAEs. So you can use them to generate new faces; you can see the quality of the faces here is incredible, really. This isn't even state of the art these days. This is just StyleGAN, so not even the second variation of it. So you can do that as you could before. You generally get sharper images with GANs than you do with VAEs, which tend to be a bit blurry. There are a number of reasons for that. But the other thing you can do, I haven't got it here, but obviously you can do the same playing around in the latent space that you could with variational autoencoders as well.
So, I'd encourage you to play around with it and just find out for yourself what these things can do. Okay, I want to very quickly now cover a topic called world models. So we're going to go back to the VAEs that we saw earlier, because we want to ask the question: can we use these in another context other than just generating cool faces? Because it's nice, but how can we actually use this in a practical sense? The answer is yes. So you can see here, this is something called reinforcement learning, where we're trying to train an agent to achieve a task in an environment. And what happens is the agent performs actions, so in this case it might be steering left or right or pressing the accelerator, and we give that action to the environment.
The environment then says, okay, I'm going to give you a reward if you do something well, and I'm going to give you the new state that you have to use to produce your next action. And so the loop continues. And there are obviously tons of techniques out there for training reinforcement learning algorithms. But what was impressive is how they applied generative deep learning to a reinforcement learning setting. So, I'm going to take you through this diagram. This is basically a very stripped-down version of what they did.
You can see here, this is the same diagram that we've just seen, where the environment gives a state back to the model, and the model is ultimately giving an action back to the environment, but right in the middle here is a variational autoencoder. And what this is doing is mapping this image into a latent space, and then the world model's job is to understand how that latent space evolves over time. And it's much easier to understand how the latent space evolves than the pixel space, because the pixel space is full of noise.
It's full of things like this barrier here that really doesn't do anything in the environment. It's full of these green pixels which actually mean nothing to this car. All it cares about is the road immediately in front of it. So, what they realized is they can build a latent space that the world model ultimately learns to model over time, and understand how it might evolve given an action that it has just taken, which is what this dotted line is here at the bottom. And then they use some other techniques.
They used evolutionary algorithms to build this controller over here, which then says, well, given what the world model is telling me, how should I translate that into an action? So that's all very good. Just to give you an idea about what this is doing, this is the output from the variational autoencoder once it's been trained. So you map here into, I think they used 32 dimensions, and then map back into the pixel space, and you can see here the two are very closely aligned, which means that with just 32 numbers, they could tell you pretty much exactly what the state of the environment was.
As opposed to the thousands of numbers that this pixel space represents, because it's in a huge grid, and there are three RGB channels, for example. And the other thing you can do obviously is play around with this latent space like we saw before. So you take the z, which is the hidden representation, tweak some of the variables, and you can create new tracks that may exist in the world. So you can see here, just playing around with a few bits, you can create bends. So what's important to understand here is that the model understands what is and isn't possible in this world in terms of states that it might come up against.
So all it needs to do now is understand how the z evolves over time. It doesn't really need to know anything about what the environment looks like. Its whole understanding of the world is in this latent space. So the agent learns a world model of how the latent space evolves given an action, but the crucial step, and this is where the paper made progress on what had been done before, is that they realized you could replace the entire environment with a copy of the trained world model, and then train the agent within its own dream of how the environment evolves, rather than actually asking the environment what happened.
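To make that concrete, here is a very stripped-down sketch of a "dream" rollout (my own pseudo-implementation with hypothetical `encoder`, `world_model` and `controller` interfaces, not the authors' code):

```python
LATENT_DIM = 32  # the paper encodes each frame into roughly 32 latent dimensions

def dream_rollout(encoder, world_model, controller, first_frame, steps=1000):
    """Roll out entirely inside the learned world model: after the first real
    frame, every new state and reward is imagined, never taken from the
    environment. (encoder/world_model/controller are assumed interfaces.)"""
    z = encoder(first_frame)           # VAE encoder: pixels -> latent z
    h = world_model.initial_state()    # recurrent hidden state of the world model
    total_reward = 0.0
    for _ in range(steps):
        action = controller(z, h)                            # latent state -> action
        z, h, reward, done = world_model.step(z, h, action)  # imagined next step
        total_reward += reward
        if done:
            break
    return total_reward
```

The controller can then be trained (the paper uses an evolutionary strategy) to maximize the total reward of these imagined rollouts, without touching the real environment again.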
That's the real magic of this. This is an example here of where you've used the environment to get a feel for the physics and what happens, but not to train it on any specific task. So when you're training the world model, you're basically saying, just play around and see how things evolve, much like a child might do. It knocks into stuff, realizes that certain things cause it to go off the track, certain things cause it to spin around in a circle and not really get very far, and so on.
Once it has that world model understanding, you can then say, now train yourself to go as fast as possible around the track. And it's like, "Oh yeah. Okay, I understand. Because now if I press the accelerator, I know that I go forwards. I don't need the environment anymore to give me feedback, because I've learnt that." So you can actually just replace the entire environment with its own world model. So this is what it means by learning in its own dreams. This is actually learning here, so once you've trained it, you can see here this is what it might imagine is coming up in the future, and it's taking action.
So you can see there, it turned left. None of this exists. We're not using the environment for any of these reconstructions. It's taking this z, imagining how it evolves over time, and just throwing this back out through the decoder so that we as humans can see what's going on. It doesn't need any of these images; it's just understanding how the latent z evolves and giving itself rewards as well, and that's really important. It learns not only how the environment evolves, it also learns what reward it would get if it did this, because it's learned that from the environment, too. And amazingly, this is not just as good as stuff that's come before, it's better.
If you learn in the dream, you achieve state of the art much quicker, and in fact even better than has previously been achieved. So this is the result straight from the paper. Quite amazing that it achieves almost a perfect score having learnt how to take on this task never in the environment, but only in its own dreams. Now, why does this help AGI, artificial general intelligence? So, let's just imagine what this might mean for the future, and why this is important.
First of all, I think what's important is to start from first principles and ask, what are we doing at the moment in reinforcement learning? What we're doing is we're generally asking the question, how do rewards that the environment gives back convert into actions that I should take? And in supervised learning, another massive field of machine learning, we're asking the question, how do labels that I give the model convert into predictions of things I haven't seen before? But, and this is what I've been thinking about, and I encourage you to as well: do any of these things really exist in the world that we put an agent into?
Or is it something we have to give it? And the more I think about this, the more I think, actually, the idea of labels and actions and rewards in an environment, they don't exist. Nature isn't so kind as to tell us what we need to do, and it's profound to think about things like that. But when you start thinking about it, you realize more and more that the only thing that the agent has to work with is this field of data that it exists within, and in my head, the only thing that I could think of that an agent can do is to imagine how that data might evolve over time. And perhaps if agents get good at that, then they naturally are better at doing things in the environment that make them stay around longer.
And that's the key idea here: instead of giving an agent a task that we as the human overlords want it to do, we need to find a way of building agents and environments that are free for the agent to explore, and explore generatively, and understand how this environment evolves over time given things that it might do in that environment. And I think that what would happen if you did that is that over time, with a powerful enough agent, it would realize that it exists in the environment, and that there's a direct correlation between things that it's seeing itself do and things that happen.
And that should come about not through us telling it that's what it needs to look for, but because that's something that it intuitively finds it needs to explore more, because it's something that it doesn't understand now. So this idea that agents need to be inquisitive is something that's come into the literature a lot recently, and that actually training agents to simply be inquisitive is enough for them to start displaying intelligence. And this is an example of three of the ways that generative models are starting to creep not only into the machine learning literature, but also into the psychology literature.
The idea that the brain may just be one big generative model, whether it's a generative adversarial network or some other form of generative model, that's just looking at things and trying to imagine what might happen in the future, because that's what it needs to be good at, is something that lots of people are exploring now. It's creeping more and more into people's ideas about what machine learning should really be trying to achieve, and we shouldn't be putting too much structure on that. So, what might this mean and how does this work in practice? Again, these are my own thoughts, and feel free to have a think about this on the tube on the way home; it's quite fun to think about how this might happen.
Well, what if we could train generative models that the agent builds up about its environment, just like the car example that we saw before, but instead of its sole goal being to solve just one task, it creates its own tasks to solve? And that would mean it has to create a reward system: a function that takes things like the state and converts it into a reward that it wants to learn, not because the environment has told it to learn that, or we have told it, but because it wants to.
And also, what it wants to do becomes an action, and what it wants to understand becomes a label. I think that's the way that machine learning needs to go if we're going to build something that's truly intelligent at multiple tasks. Obviously, this is a field that's very nascent and naïve at the moment, but I think we need to start thinking about this as data scientists if we really want to progress AGI. I just want to finish up with this video. Have a look at this agent in this environment and ask yourself, where is the reward coming from? Okay. So all it's done is just put a block on top of another block, and there seems to be some reward function there that's kicking in quite spectacularly.
Speaker 2:
Dopamine.
David Foster:
Yeah. Well, exactly. Dopamine is [crosstalk] Yeah.
Speaker 2:
Then eventually, that [inaudible] go and maybe stack something up or something.
David Foster:
Yeah, and I think a really interesting point on that is how does it control dopamine to achieve what it wants to achieve? Rather than the environment telling it how to use its own dopamine, if you like. So, there's obviously some sort of overdrive of enjoyment here, just from [crosstalk]
Speaker 2:
Possibly happen is that [crosstalk]
David Foster:
We need to build things that just intuitively know that they need to explore the world more in order to create situations like this. A great way to summarize it: Richard Feynman, one of my own personal heroes, said, "What I cannot create, I do not understand." I think equally true is that what AI cannot create, AI does not understand. And that's the end of my talk. Thank you very much.