Generative Deep Learning - The Key To Unlocking Artificial General Intelligence Meetup #LondonAI
This meetup video was recorded in London on February 28th, 2020.
Generative modeling is one of the hottest topics in AI. It’s now possible to teach a machine to excel at human endeavors such as painting, writing, and composing music. In this talk, we will cover:
- A general introduction to Generative Modelling
- A walkthrough of one of the most utilized generative deep learning models: the Variational Autoencoder (VAE)
- Examples of state-of-the-art output from Generative Adversarial Networks (GANs) and Transformer based architectures.
- How generative models can be used in a reinforcement learning setting (World Models paper)
- Why I believe generative models will play a crucial part in the quest to build Artificial General Intelligence (AGI)
David Foster is a Founding Partner of Applied Data Science Partners (https://adsp.ai/), a data science consultancy building innovative AI solutions for clients. He holds an MA in Mathematics from Trinity College, Cambridge, UK and an MSc in Operational Research from the University of Warwick. David has won several international machine learning competitions and is the author of the best-selling book ‘Generative Deep Learning: Teaching Machines to Paint, Write, Compose and Play’. He has also authored several successful blog posts on deep reinforcement learning including ‘How To Build Your Own AlphaZero AI using Keras’.
Read the Full Transcript
My name is David Foster. I am the author of a book called Generative Deep Learning. I decided about a year and a half ago that I wanted to write a book about something I felt was going to be absolutely huge in the next year, two years, three years. And that has certainly come true. It’s in the public eye more than ever before. I think with things like deep fakes, with things like GPT-2, the model that was released by OpenAI to do text generation, and what that could mean for things like fake news.
It really is more prominent than ever, and we have a couple of copies here this evening, so get thinking of questions. There will be a giveaway at the end for the best question. It’s a first edition. They’re all first editions because I haven’t run a second edition yet, but if you put it on eBay, you get more if you say it’s the first edition. Cool. So, Generative Deep Learning.
This book really came out of the desire really to write about something that I think most of us as data scientists, if that’s what you do as your profession, you don’t get the chance perhaps to build generative deep learning models as part of your daily work. You may build machine learning models for sure. But things like deep learning and generative deep learning are hobbyist topics at the moment. And I think in years to come, that may change, but I wanted to write a book that was just about the pure fascination of building things that make machines human.
So, whether that’s text generation, whether that’s image generation, or music generation, this is a subject that just captures the imagination. As well as writing this book, I’m the co-founder of a company called Applied Data Science. We’re a London-based data science consultancy. Please do visit our website. We’re ADSP.AI, Applied Data Science Partners. You’ll see there some of the case studies of things that we do. I know you’re all on the wifi now, so do check it out.
And we basically, we build bespoke data science solutions for companies. So we’re not a platform, we just hire fantastic data scientists, data engineers and data analysts, and we build bespoke machine learning and data science solutions for companies. So we are hiring as well, so if any of you are looking for a move, please do check out our website even more for those job specs. So the subject of this talk really, we’re going to go on a real expedition through generative deep learning, right the way from first principles.
So we don’t assume any knowledge of what generative deep learning is, all the way through to cutting edge GDL, generative deep learning today. And we’ll cover these five topics, with Python zero indexing there. So, intro to generative deep learning, we’ll just cover what we’re trying to achieve by doing this. We’ll then cover something called a variational autoencoder, which I think is a really great entry point into this subject. It’s very easy to get bogged down by GANs quite quickly if that’s the first thing you come across with generative deep learning. I’d recommend if you’re getting started on this, start with VAEs.
We will cover GANs as well in this talk, before moving on to something a bit more speculative. We’re looking at something called the World Models paper by David Ha and Jürgen Schmidhuber that came out in 2018, and this is a fascinating example of not just using generative deep learning for creation of something, but actually using it within a reinforcement learning setting. And they showed that actually it was possible for a machine to learn within its own dreams of what the environment might do in the future.
And that involved actually a variational autoencoder to do that. So we’ll take a look at that. I want to finish with something that’s going to get you thinking on the way home on the tube about what this might mean for artificial general intelligence. It’s a subject everyone talks a lot about, but it’s so intangible, we don’t really know what this means at the moment. But I think generative deep learning is something that we can get a grasp of, and it’s something that I think will form the basis of our endeavors into this in the future. And so we’ll just talk speculatively about what that might mean. So we’re going to start with the intro to GDL.
Well, what is it? Well basically, a generative deep learning model, or generative modeling in general, is where we’re trying to create new examples from some training set. So, you can see here down on the bottom left, we might have a training set of images. CelebA would be a typical example, and then in this video here on the bottom right, we’re seeing the output from a GAN, a generative adversarial network, called StyleGAN, which is trying to create new examples of something that may have come from this training set.
So see how this isn’t a discriminative learning problem. We’re not trying to label images here, we’re trying to create new ones altogether. Which in itself is a much harder problem, as we will see. So, this field has been evolving over some time, but particularly in the last five, six years I would say, since the invention of the GAN, the rate of progress is astonishing. This image on the right is an actual real image generated by a machine of something that might come from the CelebA data set. It’s from StyleGAN2, which was released by Nvidia just I think late last year. Late 2019.
And so you can see here that the rate of progress is really astonishing, and when we talk about generative modeling, the obvious comparison to make is with discriminative modeling as I said, where we’re trying to put an emotion on an image for example. So you would give the model this image, it would pass through a convolutional neural network, usually to produce in this case, five possible responses, which could be shock, happiness, or anger. And these would be numbers that the model is outputting.
So the whole point of doing this is that you want these numbers to be as accurate as possible, and there’s a well defined metric that you can use to tell. And that is because this data set is labeled. You know from the training set that there are a certain number of images that are happy faces, there are a certain number that are angry and so on. With generative modeling, we don’t really have that luxury. Generative modeling works on unlabeled data. So you just have the images themselves, and the model has to work out what it is about those images that makes them belong to that set, despite never having seen anything outside the set.
So you can think of it as everything is being labeled as this is in the set, so find me something else that would belong to this set. Much more difficult problem. So mathematically, what we’re trying to do here is we have this unlabeled data set on the left hand side. We’ve just got two dimensions here just to play with the toy example. So X one, X two, and we’re trying to find the underlying distribution of this data set. So P of X. And what we want to do with this P of X is sample from it.
We want to be able to say given the distribution, and we all know how to sample from something like a normal distribution, but we want to sample from this P of X distribution to generate a new data point inside of the set. So it might say, “I think minus 0.6, minus 0.8 belongs to this set.” Again, just to make the point, the discriminative model is not trying to do this. It is trying to predict P of Y given X. So Y is our response column. So the emotion of the image for example. So what’s the probability of a face being happy given this image.
And yeah, you can see here the difference in the training sets is that the right hand set is labeled and the generative modeling data set is not labeled. So let’s play with this toy example. So here we have a set of points that I have generated on the grid according to a rule. Does anyone want to hazard a guess what the rule is? First of all, has anyone seen this in the book? Because this is also in the book. Okay. You’re not allowed to answer. That’s great, one person’s got the book. Awesome. Great.
So yes, anyone want to hazard a guess as to how this data set has been generated? Nope. Okay. And that’s no problem because it’s quite difficult. So this is a generative model that we could build to sample other data points from this two dimensional grid, and you would be perfectly reasonable to suggest this as a generative model, because we can sample from it. Very important for a generative model. We can pick a data point within this box and never pick one outside the box. So again, mathematically what we’re doing is putting a uniform distribution within the box, and outside the box is zero probability.
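The box model described above can be written down in a few lines. This is a minimal sketch rather than the example from the book: the box bounds are hypothetical numbers chosen for illustration, and the names `density` and `sample` are ours.

```python
import random

# Hypothetical box bounds; the talk does not give the actual numbers.
X1_MIN, X1_MAX = -1.0, 1.0
X2_MIN, X2_MAX = -1.0, 1.0


def density(x1, x2):
    """p(x) for the box model: constant inside the box, zero outside it."""
    inside = X1_MIN <= x1 <= X1_MAX and X2_MIN <= x2 <= X2_MAX
    area = (X1_MAX - X1_MIN) * (X2_MAX - X2_MIN)
    return 1.0 / area if inside else 0.0


def sample(rng):
    """Draw one new point from the box model."""
    return rng.uniform(X1_MIN, X1_MAX), rng.uniform(X2_MIN, X2_MAX)


rng = random.Random(0)
new_points = [sample(rng) for _ in range(5)]
```

Every sampled point has non-zero density and nothing outside the box can ever be produced, which is what makes this a valid generative model, even if not a very good one.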
So that’s a generative model. But, as you can see here, this is the true data generating distribution. This is what we were trying to model, and obviously I knew this up front, because I am playing God here and saying, “Yes, this is the true data generating distribution.” But obviously in real life, you don’t know this. You don’t know what the true face generating distribution is. And you can see there are some instances where the model gets it very wrong.
So for example, like here, this is a point that just isn’t in the data generating distribution, but our model is estimating that it is. But equally, there are some points such as up in Alaska where the true distribution says this is in the set, but our model, the one that we have produced, would never pick something in this top left hand corner. And what we’re trying to do with generative modeling is make these two match as closely as possible so that our model never produces something that the human eye would notice is outside the distribution.
So a face that looks not like a face, and equally, we’re trying to produce something that captures every kind of face. So it doesn’t just produce certain male faces, for example, but also female faces if they are also in the data set. And this is a really key point to remember, whatever generative modeling you’re doing, whether it’s with text, images, or music, that this is ultimately the whole point of doing the exercise. Why is it difficult?
Well, the problem is that you have this huge high dimensional data set in millions of dimensions, not just the two that we’ve just seen. You’ve got maybe a thousand by a thousand pixels, and for every single one of those pixels you need an RGB value, so that’s another three that you’re multiplying by, and the fact is that we have to find the needle in a haystack here, because there’s so many of these that are going to be obviously not faces, and there are a tiny fraction that are.
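To put rough numbers on that haystack, here is a back-of-the-envelope calculation. The 8-bit colour depth is our assumption; the talk only gives the image size and the three colour channels.

```python
import math

# The talk's "maybe a thousand by a thousand pixels", each with an RGB value.
height, width, channels = 1000, 1000, 3
dimensions = height * width * channels

# Assumption: 8-bit colour, so each of those numbers can take 256 values.
# The count of distinct images is 256 ** dimensions; rather than build that
# number, report how many decimal digits it has.
digits = math.floor(dimensions * math.log10(256)) + 1

print(dimensions)  # number of dimensions in pixel space
print(digits)      # number of decimal digits in the count of possible images
```

Only a vanishingly small share of those possible images look like faces, which is why sampling pixels at random never works.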
So there are two problems that we come up against. I’ve mentioned the second one, that the world generated observations are incredibly sparse, but also there’s this complex dependency between features. And features here are the individual pixels. So how does the model know, or how should it know, that a pixel in the top right corner, if that’s green, because they’re on a green background, that should be carried across to the other side of the image as well. So here’s the problem, we’ve got this incredibly vast expanse of space in which to find true observations that were in the data set, and also even when we think we’ve found one that looks decent, a human eye would tell actually the right eye is brown, and the left eye is blue.
So I know this isn’t true. So it needs to find this very complex dependence across pixels. And deep learning is really where we’ve excelled recently, because this solves both of these problems, or at least goes a long way to solving them, as we saw with StyleGAN2. So we’re not going to start with GANs, we’re going to start with variational autoencoders. So hopefully by the end of this section, you will all be experts in how to build them, what they’re trying to achieve, but more importantly actually, why.
Why we’re taking this approach to generative deep learning. So, I want you to look at this data set here. This is a data set of cylinders, obviously, and I want you as humans to think, “How do I generalize what these cylinders are?” Am I looking at the individual pixel values? Am I looking at the colors? What am I trying to do here when I look at this data set? And I would imagine that most of you have realized that there are two features that are important here. The two features are the height, and the width, which naturally is also the depth.
So, we want our model to do this as well. We want it to be able to look at this data set of cylinders and realize there are two dimensions in which they can be embedded, and those two dimensions are as you can see on the horizontal axis here, the width, and on the vertical axis, the height. And crucially, as we said earlier, not only are these able to be embedded into that space, but equally any point in that space equals a cylinder that we haven’t seen. And that’s what generative learning’s all about.
It’s about finding these data points that we haven’t seen, but still belong to the same set. And what we’re doing as humans then is saying, well we can encode those two numbers, a height and a width into a picture of a cylinder, and given some cylinder, we can decode that cylinder into a two dimensional latent representation. And I’ll say that a lot of times in this talk, latent representation is basically this lower dimensional representation of what the image is.
So in this case, the latent space is two dimensions big. Height, and width. But in things like face image generation, it can be maybe two hundred dimensions big still. Is it okay if we take questions at the end actually? Just [inaudible] cool. Thanks. Okay. So this latent space, let’s imagine it’s two dimensions, so you can see here the cylinder on the left hand side being encoded into the latent space. We then pick up the point in the latent space, and we can decode it back into pixel space.
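To make the encode and decode idea concrete, here is a hand-written toy version. Nothing is learned here: the 8 by 8 grid, the flat rectangular silhouette, and the functions `encode` and `decode` are all assumptions made for illustration. In a real autoencoder both functions would be neural networks.

```python
GRID = 8  # hypothetical image size: 8 x 8 pixels


def decode(height, width):
    """Latent point (height, width) -> image: a filled rectangle on the bottom row."""
    image = [[0] * GRID for _ in range(GRID)]
    for row in range(GRID - height, GRID):
        for col in range(width):
            image[row][col] = 1
    return image


def encode(image):
    """Image -> latent point, by measuring the silhouette."""
    height = sum(1 for row in image if any(row))
    width = max(sum(row) for row in image)
    return height, width


z = (5, 3)                 # a point in the two dimensional latent space
image = decode(*z)         # 64 pixel values
recovered = encode(image)  # back to two numbers
```

The point of the toy is the compression: 64 pixel values carry only two numbers’ worth of information, and any other (height, width) pair in range decodes to a cylinder that was never in the data set.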
So what we need in this variational autoencoder is two models. We need an encoder, and we need a decoder. Both of these are neural networks. I’m not going to go into how you train neural networks in this talk, but the gist of it is you show it lots of examples, it back propagates any error through the network, and over time the weights adjust to make less error. And the error in this case is the difference between the original image and the image you get when it has been passed all the way through the network.
So, we take this pixel image here, we pass it all the way through the encoder, so it’s just now a two dimensional vector, and we decode that vector back into pixel space and ask how similar are those two images? And if it’s doing its job well, if the encoder is doing its job well, it should be encoding what this image is into two dimensions so that the decoder can be like, “Oh, yeah, okay. I know what that is. It’s something that I decode to look like this.” And if it’s not doing its job well, then there’s a disconnect between these two things, and the model will train itself to be better over time.
So the loss function here might be something like root mean squared error between the two images. Where you can simply take the pixel values of individual pixels and ask what’s the RMSE between those two. And you want to get that as low as possible over time. So you train this over many epochs by showing it images from the data set, and you notice there’s no label here, because you’re just training it on the image itself. It’s learning just from the unlabeled data.
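As a small illustration of that reconstruction loss, here is RMSE computed over the pixels of two tiny images. The 2 by 2 images are made-up values, and `rmse` is our own helper rather than a library call.

```python
import math


def rmse(original, reconstruction):
    """Root mean squared error over pixels for two same-sized images (nested lists)."""
    flat_a = [p for row in original for p in row]
    flat_b = [p for row in reconstruction for p in row]
    squared = [(a - b) ** 2 for a, b in zip(flat_a, flat_b)]
    return math.sqrt(sum(squared) / len(squared))


original = [[0.0, 1.0], [1.0, 0.0]]
perfect = [[0.0, 1.0], [1.0, 0.0]]  # what an ideal encoder and decoder would return
blurry = [[0.5, 0.5], [0.5, 0.5]]   # what a poor pair might return

loss_perfect = rmse(original, perfect)
loss_blurry = rmse(original, blurry)
```

Note that the target is the input image itself, so no label is needed anywhere.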
And what you get here, if you look on the right hand side, actually start on the left. So this is a data set that you’ve probably seen before if you’ve done any deep learning. It’s everywhere. It’s called the MNIST data set. And when you train a variational autoencoder on these, to produce things that look like numbers still, and you encode everything in the data set, you get something that looks like this. So it’s in two dimensions, because we’re encoding into a two dimensional latent space, and every single point here is an image. And what I’ve done is colored these by the label that was in the MNIST data set.
We haven’t trained on that label, it’s just being used to color the image, and you can see here quite cleverly what the encoder has done is try to separate out things that look the same. So, all of the ones are grouped together. All of the sixes are grouped together, all of the twos are grouped together. And it’s doing that to give the decoder an easier job at the other end, because what it’s got to do is take one of these points and try to reproduce the image.
But, there are two problems that this has because this is actually not a variational autoencoder, this is an autoencoder, and autoencoders have the problem that the sample space is really poorly defined, first of all. If you were to pick a new point in this space, am I allowed to pick 100, minus 50? And if not, why not? There’s nothing stopping me from picking that point, and if I did pick that point in the latent space, am I guaranteed that it’s going to decode to something sensible?
Autoencoders don’t really answer that question. And you can see here as well, if I just take three points at random in this space and ask the decoder to decode them, you get some fuzziness where there’s no real continuity between points in the space, and that’s because if you ask the decoder to go from this orange region here, and decode points to the blue region, is there anything really to tell it that it needs to move smoothly between those two points and gradually merge, say, a one into a seven?
Not really, and the problem with this is we need it to be able to do this, because if we’re going to produce something like StyleGAN where it’s merging between facial images, then we don’t want it to be discontinuous halfway through and produce complete noise because that’s a point that we may sample as a face. So we need this local continuity, number one, and we need the sample space to be well defined so that we can justifiably have a region into which we can sample, not have this problem of the infinite dimensions in either direction. Solution is the variational autoencoder. So this came in surprisingly recently.
All of these things are recent in the grand scheme of things, but this was one of the sparks that generated the revolution in generative deep learning. What they realized was that we can include another term in the loss function called the KL divergence, which our previous speaker already mentioned, actually. What this does is basically say that instead of mapping an image to one single point in the latent space, what we do instead is map it to a normal distribution with a mean and a standard deviation in the latent space.
So you can think of this as a fuzzy region now that’s being mapped to in the latent space, and we want that to be as close to a standard normal as possible. Standard normal being zero mean, and standard deviation of one. And why do we want to do that? Well, firstly, it solves the sampling problem that we mentioned before. Now we can just sample from standard normals in order to generate new points. We don’t have this problem of the infinite space that we could possibly sample from. We have a well defined known distribution, the normal distribution, that we can sample from, when we push something through the decoder.
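The sampling described above can be sketched as follows. The mean and standard deviation values stand in for what an encoder might output for one image and are purely hypothetical; `sample_latent` is our own helper name.

```python
import random


def sample_latent(mean, std, rng):
    """Draw z from a normal distribution per dimension: z = mean + std * epsilon."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


rng = random.Random(0)

# Training time: the encoder maps an image to a fuzzy region, not a single point.
# These numbers are made up for illustration.
z_train = sample_latent(mean=[0.3, -1.2], std=[0.1, 0.2], rng=rng)

# Generation time: no encoder is needed, we just sample from a standard normal
# and hand the result to the decoder.
z_new = sample_latent(mean=[0.0, 0.0], std=[1.0, 1.0], rng=rng)
```

Because the decoder sees a slightly different point each time the same image is encoded, it is pushed to decode the whole neighbourhood of a point to something similar.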
And secondly, it solves the local continuity problem, because now when the decoder samples a point, it could be anywhere in that region around the point that the encoder has pushed the point to. And that’s really important, because this fuzziness almost creates the necessity for the decoder to be good, not just decoding this point, but things around it need to also be decoded to something similar. And that’s really important, and that was the key really to variational autoencoders, becoming one of the first examples of things that got spookily good at things like generating faces.
And you can see here as I mentioned, the loss function is basically a sum now of the root mean squared error and the KL divergence. This KL divergence just says, “Make sure that things are very close to the standard normal distribution.” And what do we get now if we look at this scatter plot of the encoded points? Well, everything is around the zero one standard normal, which is exactly what we want. Everything has been crushed into near the origin, so that you can sample from this two dimensional normal distribution, and get something that is very close to what a digit might look like.
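For a normal distribution with a mean and standard deviation per latent dimension, the KL divergence to the standard normal has a well-known closed form, sketched below. The `kl_weight` parameter is an assumption on our part: the talk describes a plain sum, and a weighting factor is a common but optional refinement.

```python
import math


def kl_to_standard_normal(mean, std):
    """KL divergence from N(mean, std^2) to N(0, 1), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mean, std))


def vae_loss(reconstruction_error, mean, std, kl_weight=1.0):
    """Reconstruction error plus the (optionally weighted) KL term."""
    return reconstruction_error + kl_weight * kl_to_standard_normal(mean, std)


already_standard = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
far_from_origin = kl_to_standard_normal([3.0, -2.0], [0.1, 0.1])
```

The penalty is zero only when the encoder outputs exactly a standard normal, and it grows as encodings drift away from the origin or become too sharply peaked, which is what pulls the scatter plot in towards the origin.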
And you can move a little bit to the left, and it will produce something that’s a little bit different, not completely different, which is exactly what we want. And you can see how this is densely populated, if we take the Z values and P values, this is very densely populated, and we don’t have these vast areas of white space that we had before. So, this is one example of how a variational autoencoder might be used to generate digits. Let’s just take a look actually.
If we imagine my finger is moving around the latent space, this is what the decoder is decoding points to, and you can imagine actually some of these might look a bit weird, but imagine our numeric system developed differently, then we may have some of these symbols appearing as digits. They are viable digit looking things, even though they’re not actual numbers. And that’s what’s important. We don’t have these weird disconnects between points in the pixel space for example. Cool.
So let’s move on from digits, because it’s a very boring data set. We can look at images instead. We can train exactly the same model on the CelebA data set. This is something you can do in the book, all of the examples are there for you to build these things at home. This is just a couple of epochs of training. It gets better obviously over time, but I just wanted to show you how quickly you can get set up with this sort of thing and build things that actually look fairly impressive on your laptop. So yeah, these are all examples of faces that have been generated, some obviously better than others, but these are just completely picked at random.
And these are where I just sample a point in the normal distribution, so pick a point somewhere around the origin, decode that, and you get a face back. And if you change the point, you get a different face, and it’s quite fun to play around with. One of the other fun things you can do with variational autoencoders is you can do latent space arithmetic. There’s a few things you can do. You can add and subtract properties for one thing.
So you take a point by encoding an image into the latent space, so if we’re in two dimensions, we’ve got a two dimensional vector, and we calculate the sunglasses vector, say, which is where we take every image in our data set with sunglasses on, and we take every image without sunglasses on, and subtract those two in the latent space. So you now have a vector along which you can travel to add sunglasses to any image. So let’s say this is the vector that was calculated. What it will do, is it will take the encoded image here, add sunglasses to it, and then you can decode that image into someone that has sunglasses on.
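Here is a minimal sketch of that arithmetic with made-up two dimensional encodings. Averaging each group before subtracting is our reading of "take every image with sunglasses" and "every image without"; the helper names are ours.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length latent vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def attribute_vector(with_attr, without_attr):
    """Direction in latent space from 'without the attribute' to 'with it'."""
    a, b = mean_vector(with_attr), mean_vector(without_attr)
    return [x - y for x, y in zip(a, b)]


def add_attribute(z, direction, alpha=1.0):
    """Move the encoding z along the attribute direction by a factor alpha."""
    return [zi + alpha * di for zi, di in zip(z, direction)]


# Made-up encodings of images with and without sunglasses.
with_sunglasses = [[2.0, 1.0], [4.0, 1.0]]
without_sunglasses = [[1.0, 1.0], [1.0, 3.0]]

sunglasses = attribute_vector(with_sunglasses, without_sunglasses)
z_shifted = add_attribute([0.0, 0.0], sunglasses, alpha=0.5)
```

Passing `z_shifted` through the decoder would then give the edited image, and increasing `alpha` moves further along the same property.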
And you can see I’ve calculated here some other vectors. The smiling vector, the blonde vector, the male vector, and the sunglasses vector, and just by moving along this vector more and more, you can change the input image along a certain property. And again, you don’t even need huge amounts of compute to build something like this. This is all just done on my laptop with CPUs and stuff, but obviously GPUs would speed up the training. But, the underlying technology is exactly the same. We’re just doing good old fashioned arithmetic in some sample space. And you get amazing things like this. So another thing you can do is you can interpolate between two images, so exactly the same idea.
Encode two images into the latent space, and you have some alpha along this vector that you travel between zero and one, decode halfway along that point, and you get a merge between the two images. So, you could merge Prince Charles’ wife over there, and a lady on the left hand side, and you blend the two images so that one slowly morphs into the other. That’s quite fun, and then the last thing you can do here is you can build a model to generate lego heads, which somebody did the other day. This is a blog post from some chap, but please do check it out. It looks really fun, and a good example with a really nice data set on lego heads.
You can basically train a variational autoencoder to do exactly this and build your own examples of what a new lego head might look like. Right, okay. So that is variational autoencoders. It’s a great way to get started with generative deep learning. The thing that most people think about when they think of GDL is GANs. It’s probably the best example these days of how to build really sophisticated generative deep learning models, especially when images are concerned. And we’re going to take a little look at how they differ from VAEs, because they’re not actually that different. So we’re going to start with the VAE, as we’ve just seen, this is exactly the same architecture.
And we’ll ask the question, what if we remove the encoder and we’re going to build a separate model to predict if the decoder is generating things that are real or fake. Sorry, to predict if a given image is real or fake, I should say. So it’s fake when it’s generated by the decoder, and it’s real if it comes from our data set. So, that is what we’re doing here. I’m going to play that again. It took me so long to build that slide transition, it’s worth a second look. So, here this is the encoder.
Should have done, yeah. You’re right. It’s only a few years away for me. So you can see here the encoder, we’re actually taking this and converting it into what is called in the GAN framework, a discriminator. Because what its job is now is not to convert something into a latent representation, but it’s to output a single number, which is how real it believes this image is. So is it something from the training set, or is it something that the decoder has produced.
The generator is identical to the decoder in a VAE, so there’s very little between them really, and people always think of GANs as really difficult and VAEs as the simple younger brother, but that’s not the case at all. They’re actually both quite simple models at heart, it’s just about how you think about them, and how you really understand what they’re doing. So this is a GAN. The real difference really with VAEs is that the VAE model is connected in the middle.
There’s a latent space right in the middle there that data points get passed through, images get passed through, whereas the GAN is two separate networks that you have to train independently. So let’s see how we do that. So let’s talk about training the discriminator first. And remember its job is to discriminate between real and fake images. So what we do is we generate a batch of images from the decoder, from the generator I should say in this framework, and of course these to begin with are going to be awful, because it has no idea how to generate faces.
But we’re seeing here, maybe three quarters of the way through the training process, where it’s generating something that’s pretty good, and we’re going to mix them with some real images that we have from the CelebA training set. And then we pass them all through the discriminator and ask it what do you think is real and what do you think is fake? So there’s its predictions, and there’s the target that is actually the ground truth. This is just like regular machine learning, discriminative modeling.
And the loss there is something like cross entropy loss where we take the prediction and we take the target, and we have a metric that says how close are those two things together. So we don’t usually use something like root mean squared error here, because this is a binary response column. Binary cross entropy is a good choice for the loss metric. So you can see there, it’s done pretty well at the top, because it’s predicted this one is not very real, but it’s got this one really wrong, and it says actually I thought this was a real face, because it predicted 0.89, but it was actually zero, so it’s a high loss.
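The 0.89 example can be checked directly with the standard binary cross entropy formula. The second prediction of 0.10 is a made-up value standing in for the image the discriminator got right.

```python
import math


def binary_cross_entropy(prediction, target):
    """Loss for one prediction in (0, 1) against a target of 0 (fake) or 1 (real)."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))


# The mistake from the talk: a generated image (target 0) scored as 0.89 real.
fooled = binary_cross_entropy(0.89, 0)

# A hypothetical good prediction: a generated image scored as only 0.10 real.
spotted = binary_cross_entropy(0.10, 0)
```

The confident wrong answer costs roughly twenty times more than the confident right one, which is the signal that back propagation uses to correct the discriminator.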
And you do that again and again, and you train a discriminator. That’s how that works. Training the generator is a little bit more tricky, because what we have to do is first of all generate a batch of images, and then pass this through the discriminator to get out this number here, or what the discriminator thinks these generated images look like. But now, notice that instead of a target of zero, which is what we were using for the training of the discriminator, because we wanted it to spot these were fake, we actually now say, the target here is one.
Because we want this generator to generate things that are more likely to be true images. So when we back propagate these errors, we must freeze the weights of the discriminator. Because otherwise the discriminator is going to start training against this target variable, and we don’t want it to do that. We want it to train on its training process. So this is just for training the generator. But we need to use the discriminator as a mechanism for generating numbers that the generator can be trained on.
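The generator update can be sketched with a deliberately tiny stand-in for both networks: a one-parameter-pair generator and discriminator acting on single numbers instead of images. Everything here is an illustrative assumption, including the function names, the starting weights, and the hand-derived gradient. The point is only to show the target of one and the frozen discriminator.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def discriminator(x, w, c):
    """Stand-in discriminator: how real it believes the number x is."""
    return sigmoid(w * x + c)


def generator(z, a, b):
    """Stand-in generator: maps a latent z to a 'fake' number."""
    return a * z + b


def generator_loss(zs, a, b, w, c):
    """Binary cross entropy of the discriminator's scores against a target of 1."""
    scores = [discriminator(generator(z, a, b), w, c) for z in zs]
    return -sum(math.log(s) for s in scores) / len(zs)


def generator_step(zs, a, b, w, c, lr=0.1):
    """One gradient step on a and b only; w and c are read but never updated (frozen)."""
    grad_a = grad_b = 0.0
    for z in zs:
        d = discriminator(generator(z, a, b), w, c)
        dloss_dx = -(1.0 - d) * w  # derivative of -log(d) with respect to the fake x
        grad_a += dloss_dx * z / len(zs)
        grad_b += dloss_dx / len(zs)
    return a - lr * grad_a, b - lr * grad_b


zs = [-1.0, 0.0, 1.0]  # a fixed batch of latent points
a, b = 1.0, 0.0        # generator weights (trainable)
w, c = 2.0, -1.0       # discriminator weights (frozen during this step)

loss_before = generator_loss(zs, a, b, w, c)
a, b = generator_step(zs, a, b, w, c)
loss_after = generator_loss(zs, a, b, w, c)
```

In a deep learning framework the same effect is usually achieved by marking the discriminator’s layers as non-trainable inside the combined model, rather than by writing the gradients out by hand.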
So, these are the two training processes, and what we do is we just iterate them. So you train the discriminator for a bit, then you train the generator for a bit. Then the discriminator, then the generator, and you literally just play them off against each other. And you say to the discriminator, okay, get better at discriminating these terrible images from these real ones. And it gets the hang of that, and then you tell the generator, well do a better job. Because the discriminator is now pretty good at spotting your mistakes.
So then the generator gets better, and then the discriminator has a harder time. It’s like two adversaries, which is where the name comes from, playing off against each other, and slowly getting better over time. And that’s it. That’s all there is to training generative adversarial networks. There’s obviously a number of ways that the training process has been enhanced and developed, and improved. Both in terms of the stability and the speed at which they’re trained. And there’s a lot of information in the book actually about things like the Wasserstein GAN, which was one of the first ways that the stability was improved, and WGAN-GP, which is now the state of the art way of training a GAN.
And you can do similar things that you can do with VAEs. So you can use them to generate new faces, you can see the quality of the faces here is incredible really. This isn’t even state of the art these days. This is just StyleGAN, so not even the second variation of it. So you can do that as you could before. You generally get sharper images with GANs than you do with VAEs. They tend to be a bit blurry. There’s a number of reasons for that. But the other thing you can do, I haven’t got it here, but obviously you can do the same playing around in the latent space that you could with variational autoencoders as well.
So, I’d encourage you to play around with it and just find out for yourself what these things can do. Okay, I want to very quickly now cover a topic called world models. So we’re going to go back to the VAEs that we saw earlier, because we want to ask the question well, can we use these in another context other than just generating cool faces, because it’s nice, but how can we actually use this in a practical sense. The answer is yes. So you can see here, this is something called reinforcement learning, where we’re trying to train an agent to achieve a task in an environment. And what happens is the agent performs actions, so in this case it might be driving left right or pressing the accelerator, and we give that action to the environment.
The environment then says, okay, I’m going to give you a reward if you do something well, and I’m going to give you what the new state is that you have to use to produce your next action. And so the loop continues. And there’s obviously tons of techniques out there for training reinforcement learning algorithms. But what was impressive is how the authors of the World Models paper applied generative deep learning to a reinforcement learning setting. So, I’m going to take you through this diagram. This is basically a very stripped down version of what they did.
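The agent-environment loop described above can be written down in a few lines. This is only a sketch: `ToyTrack`, `run_episode` and `always_forward` are hypothetical names invented for illustration, and the environment is a one-dimensional stand-in rather than the car racing game shown in the talk. It shows the cycle of action, then reward plus new state, then next action.

```python
class ToyTrack:
    """Hypothetical 1-D environment: start at position 0, finish at `goal`."""

    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (move back) or +1 (move forward)
        self.position += action
        done = self.position >= self.goal
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at the end
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    """Run the loop: the policy picks an action, the environment answers."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent acts on the current state
        state, reward, done = env.step(action)   # environment returns reward and new state
        total_reward += reward
        if done:
            break
    return total_reward

def always_forward(state):
    return 1
```

For example, `run_episode(ToyTrack(goal=5), always_forward)` takes four penalised steps and one rewarded step, so the return is about 0.6.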
You can see here, this is the same diagram that we’ve just seen where the environment gives a state back to the model, and the model is ultimately giving an action back to the environment, but right in the middle here is a variational autoencoder. And what this is doing, is it’s mapping this image into a latent space, and then the world model, its job is to understand how that latent space evolves over time. And it’s much easier to understand how the latent space evolves than it is the pixel space, because the pixel space is full of noise.
It’s full of things like this barrier here that really don’t do anything in the environment. It’s full of these green pixels which actually mean nothing to this car. All it cares about is the road immediately in front of it. So, what they realized is they can build a latent space that ultimately the world model learns to model over time, and understand how it might evolve given an action that it has just taken, which is what this dotted line is here at the bottom. And then they use some other techniques.
They used evolutionary algorithms to build this controller over here, which then says well, given what the world model is telling me, how should I translate that into an action? So that’s all very good. Just to give you an idea about what this is doing, this is the output from the variational autoencoder once it’s been trained. So you map here into, I think they used 32 dimensions, and then remap back into the pixel space, and you can see here the two are very closely aligned, which means that with just 32 numbers, they could tell you pretty much exactly what the state of the environment was.
As opposed to the thousands of numbers that this pixel space represents, because it’s in a huge grid, and there’s three RGB channels for example. And the other thing you can do obviously is play around with this latent space like we saw before. So you take the Z, which is the hidden representation, tweak some of the variables, and you can create new tracks that may exist in the world. So you can see here just playing around with a few bits, you can create bends. So what’s important to understand here is that the model understands what is and isn’t possible in this world in terms of states that it might come up against.
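To put a rough number on "thousands of numbers" versus 32, here is the arithmetic under one assumption that is not stated in the talk: that each frame is a 64 by 64 pixel image with three colour channels. Only the 32 latent dimensions come from the talk itself.

```python
# Assumption (not from the talk): each frame is 64 x 64 pixels with 3 RGB channels.
height, width, channels = 64, 64, 3
latent_dims = 32  # the latent size quoted in the talk

pixel_values = height * width * channels         # numbers needed to describe one frame
compression_ratio = pixel_values / latent_dims   # how many pixel values per latent number

print(pixel_values, compression_ratio)
```

Under that assumption a frame holds 12,288 values, so the latent vector is 384 times smaller than the image it summarises.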
So all it needs to do now is understand how the Z evolves over time. It doesn’t really need to know anything about what the environment looks like. Its whole understanding of the world is in this latent space. So the agent learns a world model of how the latent space evolves given an action, but the crucial step, and this is where the paper made progress on what had been done before, is that they realized that actually you could replace the entire environment with a copy of the trained world model, and then train the agent within its own dream of how the environment evolves, rather than actually asking the environment what happened.
That’s the real magic of this. This is an example here of where, you’ve used the environment to get a feel of the physics and what happens, but not to train it on any specific task. So when you’re training the world model, you’re basically saying, just play around and see how things evolve, much like a child might do. Knocks into stuff, realizes that certain things cause it to go off the track, certain things cause it to spin around in a circle and not really get very far, and so on.
Once it has that world model understanding, you can then say, now train yourself to go as fast as possible around the track. And it’s like, “Oh yeah. Okay, I understand. Because now if I press the accelerator, I know that I go forwards. I don’t need the environment anymore to give me feedback, because I’ve learnt that.” So you can actually just replace the entire environment with its own world model. So this is what it means by learning in its own dreams. This is actually learning here, so once you’ve trained it, you can see here this is what it might imagine is coming up in the future, and it’s taking action.
So you can see there, it turned left. None of this exists. We’re not using the environment for any of these reconstructions. It’s taking this Z, imagining how it evolves over time, and just throwing this back out through the decoder so that us as humans can see what’s going on. It doesn’t need any of these images, it’s just understanding how the latent Z evolves and giving itself rewards as well, and that’s really important. It learns not only how the environment evolves, it also learns what reward would I get if I did this? Because it’s learned that from the environment, too. And amazingly, this is not just as good as stuff that’s come before, it’s better.
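The idea of learning inside the dream can be illustrated with a toy stand-in. Everything here is an assumption made for illustration: the real system in the paper learns a neural model of how the latent Z evolves, whereas this sketch simply memorises transitions in a dictionary, and `real_step`, `learn_world_model` and `dream_return` are hypothetical names. It shows the two phases described above: first explore the real environment with no task in mind, then evaluate behaviour using only the learned model, including the rewards it predicts.

```python
import random

def real_step(position, action, goal=3):
    """Hypothetical real environment: walk along a line, reward for reaching `goal`."""
    nxt = max(0, position + action)
    done = nxt >= goal
    return nxt, (1.0 if done else -0.1), done

def learn_world_model(episodes=200, max_steps=20, seed=0):
    """Phase 1: play around with random actions and record what happened."""
    rng = random.Random(seed)
    model = {}
    for _ in range(episodes):
        z = 0
        for _ in range(max_steps):
            a = rng.choice([-1, 1])
            nxt, reward, done = real_step(z, a)
            model[(z, a)] = (nxt, reward, done)  # next state AND reward are learnt
            z = nxt
            if done:
                break
    return model

def dream_return(model, policy, max_steps=20):
    """Phase 2: roll a policy forward using only the model, never the environment."""
    z, total = 0, 0.0
    for _ in range(max_steps):
        a = policy(z)
        if (z, a) not in model:
            break  # the model never saw this situation, so the dream stops
        z, reward, done = model[(z, a)]
        total += reward
        if done:
            break
    return total

def pick_best_in_dream(model, candidates):
    """Choose the candidate policy with the highest imagined return."""
    return max(candidates, key=lambda policy: dream_return(model, policy))
```

A usage example: after `model = learn_world_model()`, comparing `lambda z: 1` (always forward) with `lambda z: -1` (always back) through `pick_best_in_dream` selects the forward policy without a single further call to `real_step`.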
If you learn in the dream, you achieve state of the art much quicker and in fact even better than has previously been achieved. So this is the result straight from the paper. Quite amazing that it achieves almost a perfect score having never learnt in the environment, only in its own dreams, how to take on this task. Now, why does this help AGI? Artificial general intelligence. So, let’s just imagine what this might mean for the future, and why this is important.
First of all, I think what’s important is to start from ground principles and ask, what are we doing at the moment in reinforcement learning? What we’re doing is we’re generally asking the question, how do rewards that the environment gives back convert into actions that I should do? And in supervised learning, another massive field of machine learning, we’re asking the question, how do labels that I give the model, how should I convert those into predictions of things I haven’t seen before? But, and this is what I’ve been thinking about, and I encourage you to as well, is do any of the things really exist in the world that we put an agent into?
Or is it something we have to give it? And the more I think about this, the more I think actually, the idea of labels and actions and rewards in an environment, they don’t exist. Nature isn’t so kind as to tell us what we need to do, and that’s profound to think about things like that. But, when you start thinking about it, you realize more and more that the only thing that the agent has to work with is this field of data that it exists within, and in my head, the only thing that I could think of that an agent can do, is to imagine how that data might evolve over time. And perhaps if agents get good at that, then they naturally are better at doing things in the environment that make them stay around longer.
And that’s the key idea here is that instead of giving an agent a task that we as the human overlords give it to do, we need to find a way of building agents and environments that are free for the agent to explore and explore generatively and understand how this environment evolves over time given things that it might do in that environment. And I think that what would happen if you did that is that over time, with a powerful enough agent, is that it would realize that it exists in the environment, and that there’s a direct correlation between things that it’s seeing itself do, and things that happen.
And that should come about not through us telling it that’s what it needs to look for, but because that’s something that it finds intuitively that it needs to explore more about, because it’s something that it doesn’t understand now. So this idea that agents need to be inquisitive is something that’s come into the literature a lot recently, and that actually training agents to simply be inquisitive is enough for them to start displaying intelligence. And this is an example of three of the ways that generative models are starting to creep not only into machine learning literature, but also into psychology literature.
The idea that the brain may just be one big, whether it’s generative adversarial network, or whether it’s some form of generative model that’s just looking at things and trying to imagine what might happen into the future, because that’s what it needs to be good at, is something that lots of people are exploring now. It’s creeping more and more into people’s ideas about what machine learning should really be trying to achieve, and we shouldn’t be putting too much structure on that. So, what might this mean and how does this work in practice? Again, this is my own thoughts, and feel free to have a think about this on the tube on the way home, and it’s quite fun to think about how this might happen.
Well, what if we could train generative models that the agent builds up about its environment just like the car example that we saw before? But instead of its sole goal being to solve just one task, that it creates its own tasks to solve. And that would mean it has to create a rewards system. So a function that takes things like the state and converts them into a reward that it wants to learn, not because the environment has told it to learn that, or we have told it, but it wants to.
And also, to understand what it wants to understand is an action, what it wants to understand is a label. I think that’s the way that machine learning needs to go, if we’re going to build something that’s truly intelligent at multiple tasks. Obviously, this is a field that’s very nascent and naïve at the moment, but I think we need to start thinking about this as data scientists if we really want to progress AGI. I just want to finish up with this video. Have a look at this agent in this environment and ask yourself, where is the reward coming from? Okay. So all it’s done is just put a block on top of another block, and there seems to be some reward function there that’s kicking in quite spectacularly.
Yeah. Well, exactly. Dopamine is [crosstalk] Yeah.
Then eventually, that [inaudible] go and maybe stack something up or something.
Yeah, and I think a really interesting point on that is how does it control dopamine to achieve what it wants to achieve? Rather than the environment telling it how to use its own dopamine if you like. So, there’s obviously some sort of overdrive of enjoyment here, just from [crosstalk]
Possibly happen is that [crosstalk]
We need to build things that just intuitively know that they need to explore the world more in order to create situations like this. A great way to summarize it: Richard Feynman, one of my own personal heroes, said, “What I cannot create, I do not understand.” I think equally true is: what AI cannot create, AI doesn’t understand. And that’s the end of my talk. Thank you very much.