# Introduction to Deep Learning, Keras, and TensorFlow

This meetup was held in Mountain View on March 13, 2018. This fast-paced session starts with a simple yet complete neural network (no frameworks), followed by an overview of activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Next, we'll create a neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. For best results, familiarity with basic vectors and matrices, inner (aka "dot") products of vectors, and rudimentary Python is definitely helpful. If time permits, we'll look at the UAT, CLT, and the Fixed Point Theorem. (Bonus points if you know Zorn's Lemma, the Well-Ordering Theorem, and the Axiom of Choice.)

**Talking Points:**

- Highlights and Overview
- The Data/AI Landscape
- Gartner 2017: Deep Learning
- The Official Start of AI (1956)
- Neural Networks With 3 Hidden Layers
- Linear Classifier: Example 1
- Linear Classifier: Example 2
- Linear Regression
- Sample Cost Function
- Euler’s Function
- The Sigmoid Activation Function
- TanH Activation Function
- The ReLU Activation Function
- The softmax Activation Function
- Activation Functions in Python
- What’s the “Best” Activation Function?
- Cost Function
- How to select a Cost Function
- Setting up Data and the Model
- CNNs Vs RNNs
- CNNs: Convolution and Pooling
- CNNs: Convolution Calculations
- CNNs: Convolution Matrices
- CNNs: Max Pooling
- What is Keras?
- CNN in Python/Keras (fragment)
- TensorFlow
- TensorFlow “primitive types”
- TensorFlow: Constants (immutable)
- TensorFlow Arithmetic Methods
- TensorFlow placeholders and feed_dict
- TensorFlow and Linear Regression
- TensorFlow fetch/feed_dict
- Saving Graphs for TensorBoard
- TensorFlow Eager Execution
- TensorFlow and CNNs
- GANs: Generative Adversarial Networks
- Deep Learning and Art
- What do I Learn Next?
- About Me: Recent Books

**Speakers:**

Oswald Campesato, AI Instructor/Developer, iQuarkt

Audience member 2, Unknown, Unknown

Audience member 3, Unknown, Unknown

Audience member 4, Unknown, Unknown

**Read the Full Transcript**

**Oswald: **

Hi, everybody, glad you could make it. It's good to be here. I will be your tour guide for the next hour or so to go through some of the concepts of deep learning. Before getting into that, just with a show of hands, how many people are new to deep learning? Okay. I guess you're in the right place and the others are experts, so they will be correcting me as I go through this. I'm going to skip some of the slides, but they'll be online on SlideShare. Plus it'll be recorded. And I'm going to probably skip over some of the more mundane details, not mundane, the use cases, which are not mundane, and some of the history. So I'm going to try to get to the stuff that helps you understand the concepts, because after you understand the concepts, the APIs, well, guess what, it's just the code that implements the concepts.

And so if you've tried it the other way, and if you've been lucky that way, you've been more successful than me, because I did try doing it that way and I got nowhere. So with that in mind.

### Highlights and Overview

Here's just a little quick overview of some of the things we're going to be talking about. And they kind of come together and cluster some of these concepts, so it's not that it's exactly sequential, this is the kind of stuff that we'll be going over.

### The Data/AI Landscape

And so with that, I think this slide is a little bit out of date. I'm not sure. I don't remember where I got it, but you don't have to learn all of these things to do deep learning. We're going to concentrate on the red dot and the data science part is a little bit off. You can actually do deep learning with data science.

With RStudio, there's an interface to be able to do Keras and TensorFlow. And it looks very similar, and what it does, in brief: there's this bridge class that essentially delegates the work to Python. So it was actually very clever; the guy who wrote RStudio, I think he wrote that interface with the author of Keras. And the other thing is something that's not here, something called reinforcement learning. And if you've heard of those systems where they play a million times against themselves, like AlphaGo, that's reinforcement learning. So that's one thing that would probably be a good thing to have in there. And just so you know, the original one was AlphaGo. I don't remember the exact names of the next ones. There was AlphaGo Zero and then AlphaZero.

And it's interesting because AlphaGo had some human collaboration and AlphaZero was purely, completely all software, no human interaction. So the system learned how to play Go, then it played the original AlphaGo. And I think it won 75 out of a hundred matches. And the time it took from starting to finishing those games, any guesses, some of you might know: four hours. Wow. It just crushes the competition. It's kind of exhilarating in a way.

### Gartner 2017: Deep Learning

But anyway, here's one thing I put in because last year was the first time Gartner put deep learning separate. And I was a little bit surprised because deep learning's been kind of the driving force in AI the last five or six years. So glad to see that machine learning is a little bit to the right. So I guess it's further ahead to go wherever it's going to go. And so let's see what it does this year. I think they usually come out in October.

### The Official Start of AI (1956)

These people put together the first AI conference, summer of 1956, at Dartmouth. And in case you're wondering, John McCarthy, he just happened to be the inventor of LISP. There's Claude Shannon, who happened to be the inventor of information theory, also called the da Vinci of the 20th century. I almost said godfather, but that's Geoffrey Hinton. You'll get to him later. And there's also Marvin Minsky, one of the giants over there at MIT. And so what I really like about it is that they thought that they would get it all done by the end of the summer; that's optimism for you. And so basically in the fifties, by that time, you could sort of think of it as having been penciled in, if you will: traditional AI, which was based on expert systems that were very popular in the eighties, and they're still useful.

And versus machine learning and deep learning, which were about lots of data, lots of inexpensive computing power, algorithms. And of course, for deep learning, deep neural networks. And so that's it for the history; you can read more about it online if you want.

### Neural Networks With 3 Hidden Layers

This is the main thing we're going to look at, spend a few minutes on this, and you might be wondering, what is this thing for? Well, it's labeled, of course; you have the input layer, there's an output layer, and these hidden layers. So the idea is to come up with a set of numbers for the edges, so that you get a neural network that models the data well, whatever that means. And once you're done with that, you freeze the model and then you test it with your test data. And if the percentage accuracy is roughly the same, you've got a good model. Of course, the devil's in the details.

And so the question you might be wondering about is, how do you figure out how many layers to have? That's not obvious; it's actually an example of something called a hyperparameter, which is something that you set before the training process. So the number of hidden layers: hyperparameter. The number of nodes in the layers: hyperparameter. The initial values for the weights: hyperparameter. And there are lots more of them; we'll get to them. So, to make it simple, let's just assume that the initial weights are random numbers from a normal N(0,1) distribution. It's not that important for our purposes tonight. So what you do with these frameworks is provide input data, and it's in the form of a vector, and it's numeric. And so in between two layers, you see all those edges; you represent those weights with a matrix.

So when you have a set of input numbers, you multiply with the matrix, and then the next one, and the next one, and then you get to the end. And let's just pretend that there's only one node at the end, because the example I'm going to use, which we'll get into in more detail later, is something like housing. So you may have seen spreadsheets, Excel spreadsheets, where you have rows of data. And there are many attributes that you can have as features for a house: number of square feet, bedrooms, all that stuff. I think I saw one spreadsheet that had 30 features. So you pass in those features, the values for each row. And then when you get to the end, you get a number, and you compare that with the actual cost of the house that's in the row in the spreadsheet. They're going to be different.
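As a rough sketch of the forward pass just described, the following pure-Python snippet pushes one row of features through a chain of weight matrices and compares the result with the value in the row. The layer sizes, feature values, and price are made up for illustration, and activation functions are deliberately left out at this stage.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Initial weights drawn from a normal N(0, 1) distribution."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical architecture: 6 input features, 3 hidden layers, 1 output node.
layer_sizes = [6, 4, 4, 4, 1]
weights = [random_matrix(n_out, n_in, rng)
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

features = [1500.0, 3.0, 2.0, 1.0, 20.0, 0.5]  # one made-up spreadsheet row
actual_price = 300.0                            # made-up label for that row

# Forward propagation: multiply by each matrix in turn, left to right.
vector = features
for matrix in weights:
    vector = matvec(matrix, vector)

predicted_price = vector[0]
error = predicted_price - actual_price
print(predicted_price, error)
```

With random initial weights the prediction is essentially arbitrary, which is why the error has to be pushed back through the network to adjust the weights.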

And so going from left to right is forward propagation. What we need to do, or what the frameworks do for us, is somehow go the other direction, it's called backprop or backward error propagation, in such a way that we modify the weights to make it better, because we want to minimize the error. So we'll see more of this too. What happens is we have a cost function that's based on the parameters of the network. This is all done for us, and what we try to do, or what it does for us, is find a way to get toward whatever the minimum is for that curve, that surface. And if it's in multiple dimensions, we can't see it. So it uses gradient descent. If you haven't heard of that before, it's essentially partial derivatives, and it is computed using the chain rule: partial derivatives and the product of numbers.

And it involves something called a learning rate, which is also a hyperparameter, which kind of controls the rate at which you move forward. And so what happens is basically right over here, compute this number and then update the weights. They could increase, they could decrease, they could be zero. And then do that with the next layer back. So you get all the way to the beginning, you've modified the network. And so then you do that with the next row. So for example, something like MNIST has 60,000 rows for training. And each time you go through this whole set of rows, that's called an epoch. So it's actually quite common to go through the data set 20 times. So that means you've gone forward and back 1.2 million times. So you'd expect this to have something of value when you're done with that.

And when you're done, again, you get the test data, which is about 10,000 rows. And if the difference in the percentage accuracy is significant, it's probably a case of overfitting, which is not unique to deep learning. It happens with machine learning and other systems. And has anyone not heard of overfitting? Okay. I don't have to explain it. Good. So with that in mind, here's what I'm going to do. There are basically three large categories of algorithms. One type is clustering: k-means, k-nearest neighbor, in case you've heard of that. And there's also something called mean shift, which is an alternative where you don't have to specify the number of clusters. And we're not going to do that tonight, but we're going to look at classifiers, which are things that try to figure out which object or thing you have at the end of this whole process, from a list of things that you already know.

For example, you might have like a dog, cat, fish, bird. And so you've got some images and you want to figure out what's in there. Well, that's what that system will do. That's a classifier. There are also ones where you only have two outputs: true or false, spam or not spam, will the stock price go up or down? So it's binary. There's also the case of MNIST, where you have 10 different digits. So you're going to have a classifier trying to figure out which of those 10 digits. The other category is called regression. And those are the kind that are basically continuous values. So instead of saying, is the stock price going to go up or down? What's the stock price going to be? What's the temperature going to be? Barometric pressure, heartbeat, heart rate, that sort of stuff.

### Linear Classifier: Example 1

So with that in mind, I'm going to do a very simple example. You may not have seen this before. You won't have to remember this because you'll be doing this kind of stuff with frameworks, but just to give you an idea of what's happening; I think you'll be impressed later with what the frameworks do. So here's an example. We're going to try to figure out, pardon my LA artwork here, we've got these red dots. They're in the upper half of the plane, and then the blue dots in the lower half. And so the dividing line is Y=0. Notice there's a unit vector. Everyone is familiar with vectors? Okay, good. So it's a unit vector pointing in the direction of the data that we want, which determines whether a random point is going to be a red dot or not. So how do we convert this diagram into a network?

There it is. Again, my manual artwork here. You notice the value (0,1). Where does that come from? (0,1), 0 for X, 1 for Y, that's going to be the weight. So what we do, and what the systems do, is take the inner product of whatever value for X and Y is supplied, which is going to be a point in the plane. So (X,Y) inner product with (0,1). What does that give us? X times 0 is 0, plus Y times 1. We compare Y with 0, which is the threshold value. And when will that fire? When Y is bigger than or equal to zero. We knew that. This has to be the same. Pretty simple, straightforward, trivial. Okay. Let's do it four times.
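The single-node classifier described here can be sketched in a few lines of plain Python. The function and constant names are made up for this note; the weight (0, 1) and the threshold 0 come from the talk.

```python
def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


WEIGHT = (0, 1)   # 0 for X, 1 for Y: the unit vector pointing at the red dots
THRESHOLD = 0     # the dividing line is Y = 0


def fires(point):
    """Return 1 (red dot) if the inner product meets the threshold, else 0."""
    return 1 if dot(point, WEIGHT) >= THRESHOLD else 0


print(fires((3, 2)))    # 1: upper half plane
print(fires((3, -2)))   # 0: lower half plane
print(fires((5, 0)))    # 1: the boundary counts as a red dot
```

Because the weight is (0, 1), the inner product is just Y, so the node fires exactly when Y is greater than or equal to zero.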

### Linear Classifier: Example 2

Now take a look at this one. You see at the bottom, that's still the same thing. And now we've got B, C, and D: four lines. They are actually going to be half planes. The intersection's going to be this square. Each one of them has an inward pointing normal vector, perpendicular. And if we just take those numbers, we're going to have X, Y input, but we're going to have four nodes, four neurons, each one corresponding to a line. Everyone with me? Okay. So here's what it looks like.

Remember A was (0,1). I just moved that up to the top, and then B, C, and D, the numbers on the left side. Now let's just not worry about the numbers in the middle, the threshold values, for the moment, but we've got ones coming out. So when all four of those threshold values are met or exceeded, each node is going to emit a 1, and each of the four is multiplied by 1. The sum total is 4. Then we'll know that that point is inside that rectangle, actually the square. Does that make sense? So the only part that might be a little tricky is those threshold values. And I'll just tell you what it is for the first quadrant. It's a little bit different for the other ones, but for horizontal lines, you need to take the negative of the Y intercept. For vertical lines, the negative of the X intercept.

So going from A counterclockwise, the intercepts are the values 0, 1, 1, and 0. So the negatives are 0, -1, -1, 0. Right? Okay. So let's look back. Where did I put it? Right there. So you see from top to bottom 0, -1, -1, 0. Let's test this; let's take the origin. We're going to include the boundaries, the perimeter, as part of the red dots. So (0,0): if we supply X and Y of 0, 0, well, anything times 0 is 0; added up, it's all zero. So the column of numbers to the left of the middle nodes is going to be zeros. Do they all equal or exceed the threshold value? Yes. One is emitted from four of them. We get 4. It's a red dot. Let's try one more. Let's try (1,1). So now what happens when we put 1 and 1 for X and Y? What happens is with the inner product, you're just adding the two weights from X and Y to a particular node.

So in this case, it's 1, -1, -1, 1. And in all four cases, that number equals or exceeds the threshold value. So ones are emitted and we get a 4 and it works, right? Everyone convinced? You can try more values if you want, but I know it works so I'm going to leave it at that. Before we get too much farther on this one, what if we had a triangle? Then we'd have three nodes, three 1s, and a 3. What about a pentagon? It'd be five nodes, five 1s, and a 5. Now you have to figure out what the threshold values are and the weights there, because it's going to be the perpendicular vector pointing inward inside that shape. Little bit of work, if you want to do it. What if we had an n-gon? Well, then it would be N nodes, N 1s, and an N.
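The four-node network for the square can be sketched the same way. The slide itself is not reproduced in this transcript, so the weights below are a reconstruction from the description: inward pointing normals for lines A through D going counterclockwise, thresholds 0, -1, -1, 0, and inner products 1, -1, -1, 1 for the point (1, 1). All names are made up for this note.

```python
def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# (inward normal vector, threshold) for each side of the unit square.
NODES = [
    ((0, 1), 0),     # A: bottom side, fires when y >= 0
    ((-1, 0), -1),   # B: right side,  fires when -x >= -1, i.e. x <= 1
    ((0, -1), -1),   # C: top side,    fires when -y >= -1, i.e. y <= 1
    ((1, 0), 0),     # D: left side,   fires when x >= 0
]


def node_outputs(point):
    """Each node emits 1 if its inner product meets its threshold, else 0."""
    return [1 if dot(point, weight) >= threshold else 0
            for weight, threshold in NODES]


def inside_square(point):
    """The output node sums the four 1s and checks for a total of 4."""
    return sum(node_outputs(point)) == len(NODES)


print(inside_square((0, 0)))      # True: the origin, boundary included
print(inside_square((1, 1)))      # True: the opposite corner
print(inside_square((2, 0.5)))    # False: node B does not fire
```

For a triangle or pentagon the list would hold three or five entries, and the output node would look for a sum of 3 or 5.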

What if we had two rectangles, one the current one, and then one in the upper left upper right corner someplace. What would happen? Well, it turns out that vectors are invariant under translation. So those four vectors will work for the other rectangle that we have somewhere else. So that means what we do is we replicate those four nodes. The weights are the same. The ones are the same, and then there's another node with a four, but you have to get the threshold values based on what I was saying before. Does that make sense?

Sort of? So what if you have this really weird shape that isn't a polygon? Well, one thing you could do is to take the left and right extreme, take the difference. Say, divide it up into a partition of a hundred segments. And then what you could do is construct line segments for the top and the bottom. You have two polygons with 200 sides, which we talked about. What if someone says, I have a polygon that's not convex? No problem. Because every closed polygon can be decomposed into a set of closed convex polygons in the plane, if you need that. So basically we're done.

So you've now sort of completed exercise set number one. And this is also kind of interesting because these operators are basically the corners of the square, and that one at the bottom is very interesting. It's the XOR. And it turns out that Marvin Minsky, from the gang of five, and, I forgot his name, I always forget his name, the guy who invented Logo, Seymour Papert, they wrote a paper in the late sixties proving that when you have XOR, it's not linearly separable with one layer. And that could have also been part of the reason there was the AI winter. Remember, Marvin was part of that group. So it wasn't like he had an ax to grind, but I could imagine the conversation was: really interesting theoretical stuff, but kind of useless, because you can't even do this.

Fortunately things have changed. We've got algorithms, we've got a lot of other things. Otherwise we wouldn't be here today. So if you want an exercise, I haven't done it myself yet, but it's maybe worth doing that manually if you want to get a little bit more practice. Try this one; this one's a confidence builder, you're doing it in 3D. I've yet to do that one too, but I'll get around to it. So again, instead of having an inward pointing vector for a line, you're going to have an inward pointing vector for a plane. And there are six planes; there are four vertices at the top, four at the bottom. So you have those to deal with, and you are going to have the intercepts and all that other stuff. So that's what frameworks will do for you, so you don't have to sit down and do it manually.

Imagine if you had a polygon with 10,000 sides; think of the work that you would have to do. Not even Dustin Hoffman in Rain Man could do this kind of stuff a thousand times a day. It really spares us a tremendous amount of work having these frameworks. A couple of things to keep in mind: this network doesn't learn. There's no back propagation, no cost function, nothing. And it's because of the so-called, well, the cost function. You remember, I kept saying it's either going to emit a one, and I didn't say zero, but if it doesn't fire it's a 0. So 0 or 1, it's binary, and the interesting thing is we need something that, if you look at it, it's going to be a segment like this and something like that, one, zero, or for you it would be the other way. If we connect those two things smoothly, kind of approximating it, what would the shape be? Kind of an S shape?

What function does that bring to mind? Sigmoid. What is interesting is that the sigmoid function gives us intermediate values instead of all or nothing. And that's what you need when your network is going to learn, because you're going to be tweaking those numbers. When I say you, I mean you or the framework. What's interesting is because of those continuous values, it's really like an analog device, maybe a little counterintuitive, but that's what we need. Let's take a look at something that's kind of the opposite.

### Linear Regression

Instead of separating things into one group and another, we're going to try to get a cluster of numbers and try to fit something so they're opposite, not separate, but approximate. Linear regression has been around, I think about 200 years. I think Carl Gauss was the person who started that. This is the simple case, we're not looking at all the other ones where it could be quadratic and cubic, just a nice little cluster of numbers.

And this is not curve fitting; it has nothing to do with that. The ideal line might intersect all of them, most, some, or none. What we want is to find a line that is the least far away from the points, based on the vertical distance of those points from that line. So you take the difference for each of the Y coordinates for each of those points to that line, square it, so there are no negative values and no cancellation, add them up, divide by the number of points. What does that give you? It's a quadratic function. Everything's non-negative. If you look at this, it kind of looks like it's the best fitting line somehow, because if you move it up or down, that's changing the value of B, either increasing B or decreasing it.

If you rotate it, you're increasing M or decreasing M. Those two variables are independent, so they would be like in the plane. And whatever combination of M and B that you get will produce an error value. Obviously the error is not going to be zero. Otherwise it would have to be all the points on the line. So the error, other than that optimal line, is going to be bigger than zero and it's quadratic, I'll spare you the suspense, that's what it looks like.
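The error computation described above (vertical differences, squared, summed, divided by the number of points) can be sketched as follows. The sample points are made up and happen to lie exactly on the line y = 2x + 1, which is the exceptional case where the best error is zero.

```python
def mean_squared_error(m, b, points):
    """Average of the squared vertical distances from the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)


points = [(1, 3), (2, 5), (3, 7), (4, 9)]   # made-up data on y = 2x + 1

print(mean_squared_error(2, 1, points))   # 0.0: the line through every point
print(mean_squared_error(2, 2, points))   # 1.0: line shifted up by changing B
print(mean_squared_error(3, 1, points))   # 7.5: line rotated by changing M
```

Every combination of M and B gives one error value, so plotting the error over the (M, B) plane gives the surface discussed next.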

### Sample Cost Function

It's a convex surface. It has either a global maximum or global minimum. You don't have to worry about saddle points, that'll come up later. Or local minimum, local maximum, that kind of stuff. So the point at the bottom gives you the value of M and B for that best fitting line.

So imagine two perpendicular planes, intersecting, parallel to these axes. You get two parabolas; they intersect at that point. Why do you want to look at the parabolas? Because you can take the partial derivative. Everybody remember how to do derivatives? The slope of the tangent to a curve; it will be zero at that point in both the M and the B. You take the partial derivative with respect to M and with respect to B and set each equal to zero. There's a closed-form solution, basically done. Let's pretend we didn't know that. And we had a value of M and B that would put us somewhere on that curve. How would we go, in general terms, from whatever (M,B) value put us over here to the bottom, to the minimum? How would we get there? By something called gradient descent. Imagine at that point that there's a tiny little sphere and you release it. What path is it going to take?

Whichever way is the maximum descent. Think of yourself being on the side of a hill in a mountain range, and you want to go downward. Which way are you going to go? Where it's steepest. It's a greedy algorithm. That's really all there is to it, in essence. Of course, in practice, there are things that come up; just make a mental note of this point because that'll come up a little bit later. As an example, I mentioned real estate.
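A minimal sketch of gradient descent for the line-fitting case follows, assuming the mean squared error as the cost. The data, starting point, learning rate, and step count are illustrative choices made for this note, not values from the talk.

```python
def gradients(m, b, points):
    """Partial derivatives of the mean squared error with respect to m and b."""
    n = len(points)
    dm = sum(-2 * x * (y - (m * x + b)) for x, y in points) / n
    db = sum(-2 * (y - (m * x + b)) for x, y in points) / n
    return dm, db


def gradient_descent(points, learning_rate=0.05, steps=2000):
    """Start at (0, 0) and repeatedly step in the direction of steepest descent."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        dm, db = gradients(m, b, points)
        m -= learning_rate * dm   # move against the gradient
        b -= learning_rate * db
    return m, b


points = [(1, 3), (2, 5), (3, 7), (4, 9)]   # made-up data on y = 2x + 1
m, b = gradient_descent(points)
print(round(m, 4), round(b, 4))
```

On this data the loop settles very close to M = 2 and B = 1. A learning rate that is too large makes the steps overshoot and diverge, and one that is too small makes convergence slow, which is why it is treated as a hyperparameter.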

So let's say horizontal axis is the number of square feet, vertical is the cost of a house. A very coarse grained approximation. What we want is something with more features like that, I came up with six. As I mentioned there are some data sets that have 30 of them. So those are the numbers in the spreadsheet, each row is the values for a house, and the right most number is the actual cost.

Remember, you feed in all the values for a given row, the values of those features go through this network, and then compare the result with the cost that's in the spreadsheet. We talked a little bit about that before.

So just more equations, to see how it generalizes. Instead of Y=MX+B, we have X1 to XN, and then B is the bias, the intercept.
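Written out, the generalization might look like the following, where the weight symbols $w_1, \dots, w_n$ are notation assumed for this note (the talk names only the inputs and the bias):

$$ y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b $$

With $n = 1$ and $w_1 = M$ this reduces to $Y = MX + B$.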

### Neural Networks with 3 Hidden Layers

Just to go back, now when we're taking those numbers again, we go through all the way to the end. And there's one thing that I conveniently neglected to tell you, this is a linear system. By analogy, if you take the number ((2x3)x4)x5, what's that? 120. So if you're writing a program, are you going to put in ((2x3)x4)x5 every time you need it, or are you just going to use 120?

Obviously the latter. Well, the analogy I'm trying to make is that when we take that first matrix, we can immediately multiply by the next matrix, and all the way down to the end, and produce one matrix, which collapses this whole system to input and output. We want to prevent that from happening. So we need to introduce nonlinearity, and that's done by activation functions. One of which is sigmoid. Another one is TanH, another one is ReLU, there's ELU, the exponential one, and there is ReLU6, where it cuts off at six; that's specific to TensorFlow. All these systems have all the other ones. What happens is you have the numbers, the vector coming in, multiplied by that first matrix. Then that new vector, each one of the values you pass through the activation function to get a new vector. Then you multiply by the next matrix.
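The collapse argument can be checked numerically with a small sketch. The two 2x2 matrices and the input vectors are made up for illustration, and ReLU stands in for "some activation function".

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(vector):
    """Apply max(0, v) to each entry of the vector."""
    return [max(0, v) for v in vector]


W1 = [[1, 2], [3, 4]]   # made-up weights for the first layer
W2 = [[0, 1], [1, 1]]   # made-up weights for the second layer
collapsed = matmul(W2, W1)

# Without an activation function, two layers equal one precomputed matrix.
x = [5, 6]
print(matvec(W2, matvec(W1, x)))   # [39, 56]
print(matvec(collapsed, x))        # [39, 56]

# With ReLU between the layers, the single matrix no longer reproduces it.
x = [5, -6]
print(matvec(W2, relu(matvec(W1, x))))   # [0, 0]
print(matvec(collapsed, x))              # [-9, -16]
```

This mirrors the ((2x3)x4)x5 = 120 analogy: a chain of purely linear steps can always be replaced by one precomputed step.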

Does that make sense? Just by analogy: if you go driving on the highway and there's nobody around, you can drive at a constant speed. If you go in a parking lot where there are speed bumps, it interrupts your flow. You slow down, you move up, you can't go straight through. Or toll booths, or whatever analogy helps you with the concept. So we can't just immediately go all the way through to the end. That's what the activation function does. And it also enables us to find those computations with the smaller numbers so that we adjust the weights. It's all about the weights. That's what counts. And there's no a priori way of knowing which weights are the best. That's why you write these systems and you experiment with different numbers of layers and so forth. When we get to the end, which we're assuming is just one node, we have a cost function, kind of like the one that was there; that's called mean squared error.

And then you take these partial derivatives. It's called the gradient because the derivative, the slope of the tangent, only applies to two dimensions. When you're in multiple dimensions, you've got different axes, so it's going to be a partial derivative for each of those axes. And that is a vector of values. That's where you go. So when you see the diagrams online about going to the minimum, it kind of zigzags because of that. There is no straight line down, or rarely, unless the numbers just work out that way. Now we've got the idea: cost function, we need that. We need the gradient descent method, there's like five or six of them, also a hyperparameter, and the learning rate. We must have those three things to do back propagation. We also need an activation function that prevents it from collapsing down; it has to be non-linear.

So that's the minimum. And then if you have at least two hidden layers, that's deep learning. If you have at least 10 hidden layers, that's called very deep learning. Seriously, I thought it was going to be like 500. I thought, this is such a small number. By the way, the state of the art with neural networks: in 2011, somebody came up with a six-hidden-layer neural network that was state of the art, and then things kind of blew up. Then there was a competition in 2012, I think it was AlexNet, 150 hidden layers. Then Microsoft, a thousand layers. And then they have these massive networks. I have no idea how they test them. How long does it take? Jeff Dean, he's the head of Google Brain. Usually, legendary is the word that precedes his name.

He really is wicked smart. I was at the Maytread last year. He's coming up next weekend, I think. He mentioned, well, you probably want to avoid training neural networks that take more than three or four days. Thanks. So that's another factor. And then you get to TPUs with TensorFlow, the tensor processing unit, and on and on, all that kind of stuff. But this is the basic fundamental idea. Forward propagation, back propagation, go through an epoch, multiple epochs, shift the data around, they shuffle it, this, that, and the other. And then try to get the best number you can. How to come up with that? Go to Kaggle, go to GitHub, borrow what other people have done. Start with one layer, experiment with it, work with Python if you prefer, or with Java, or you can do Scala, and you can use Keras. I recommend Keras.

It's a lot more intuitive, as you'll see in a few minutes. And then when you really want to get the horsepower, you can use TensorFlow, or PyTorch as well. That's another one that's popular with some people. So these are just the equations, and skipping, skipping, and okay.

### Euler's Function

So Euler's function, does everyone remember Euler's constant? Or who does remember, I should say. Remember when you studied math, there was L O G, LOG, that's base ten, and then there was LN, that's base e, and this number's 2.718, whatever. Pardon? So what's interesting is this is the only non-zero function, differentiable function, in the plane that equals its own derivative. And it has a lot of applications in a lot of systems.

### The sigmoid Activation Function

There's sigmoid. If you multiply everything by e to the x, it's e to the x divided by e to the x plus one. So you can see that the denominator is just a little bit bigger. So it's going to be between 0 and 1, monotonically increasing. You'll hear the term squashing. You can take any set of numbers, pass them through that, and they'll be like probabilities, because they'll all be between zero and one. The softmax function is similar, except the difference is that the numbers that you pass through will also be between zero and one, but the sum will equal one. And that's important, especially for CNNs, which we'll see later.
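The step of multiplying through by $e^x$ can be written out as follows, using the usual definition of the sigmoid (the symbol $\sigma$ is notation assumed for this note):

$$ \sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x}\,(1 + e^{-x})} = \frac{e^{x}}{e^{x} + 1} $$

Since $0 < e^{x} < e^{x} + 1$ for every $x$, the value always lies strictly between 0 and 1.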

### TanH Activation Function

And so here's TanH.

### The ReLU Activation Function

And now this is the darling of the day of the year, I guess, ReLU. Very simple to compute. There's a point there at zero where it's continuous, but not differentiable. So not to worry, it all works.

### The softmax Activation Function

This is not completely correct, I spotted it a while ago, but I haven't updated the slide.

That's the softmax, so essentially instead of saying X1 over X1 plus all the way to XN, and then X2, and so on, raise everything so that it's e to that power. Does that make sense? If you drop the e's, that's just the proportional weight for the set of numbers. Of course, if you take the sum of X1 all the way to XN, that could be zero, but that won't happen when it's the exponent, because every term is positive, so the sum can't be zero.
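The contrast between the plain proportional weight and softmax can be sketched in plain Python. The function names and sample inputs are made up for this note, and this is the textbook definition rather than any particular framework's implementation.

```python
import math


def proportional(values):
    """Each value divided by the plain sum (fails when the sum is zero)."""
    total = sum(values)
    return [v / total for v in values]


def softmax(values):
    """Each value exponentiated, then divided by the sum of the exponentials."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)   # always positive, because every exponential is positive
    return [e / total for e in exps]


print(proportional([1, 2, 3, 4]))   # works: the sum is 10

scores = [1, -1, 0]                 # these sum to zero
try:
    proportional(scores)
except ZeroDivisionError:
    print("proportional weights undefined: the sum is zero")

probabilities = softmax(scores)
print(probabilities, sum(probabilities))
```

The softmax outputs are each between 0 and 1 and add up to 1 (up to floating-point rounding), which is what lets them be read as probabilities over the classes.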

### Activation Functions in Python

And just a real quick look at this, this is Python. You look at the middle one, what's the TanH activation function. It's called TanH. Very nice and convenient. The first, 1 over 1+E to the negative power. That's essentially that. And then the ReLU is the max of zero and the dot product. Kind of simple and straightforward. There are other ones as well, you can check them out online.
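The slide's code is not reproduced in this transcript. The following is a reconstruction from the description above, written with only the standard library; the function names and the sample weights and inputs are made up for this note.

```python
import math


def dot(w, x):
    """Dot product of a weight vector and an input vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid_unit(w, x):
    """1 over (1 + e to the negative dot product)."""
    return 1.0 / (1.0 + math.exp(-dot(w, x)))


def tanh_unit(w, x):
    """TanH of the dot product; the library function is simply called tanh."""
    return math.tanh(dot(w, x))


def relu_unit(w, x):
    """The max of zero and the dot product."""
    return max(0.0, dot(w, x))


w = [1.0, -1.0]   # made-up weights
x = [1.0, 3.0]    # made-up inputs; the dot product is -2.0

print(sigmoid_unit(w, x))   # a value between 0 and 1
print(tanh_unit(w, x))      # a value between -1 and 1
print(relu_unit(w, x))      # 0.0, because the dot product is negative
```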

### What's the "Best" Activation Function?

I mentioned this already, ReLU is the one.

### Cost Function

Now what about the cost functions? We saw this before. That's the simplest one. And here's another one that has a saddle point, because in one direction it's a minimum, and the other way it's a maximum, and there are techniques for getting away from those sorts of things. Remember, in three dimensions, it's easy to see. If it's a hundred dimensions, obviously you're not going to be able to see it because you can't draw it. There's something called momentum and there's Nesterov momentum, which is built into TensorFlow, and you can specify a value.

The way I think of it, when it first made sense, was you're in an airplane and there's turbulence and you're wondering how long can I take this before I vomit? And then the pilot switches on that extra power and you get out of there and you relax. So that's kind of what you need to do the momentum to get out of there. But the thing is, how do you know that it's the saddle point? Maybe it is really the global minimum. So you compare. You're going to give it the momentum you go out, but oh wait, it's actually increasing and it should be decreasing, oh we got to go back. So there's this kind of game and there's like five or six of them. They're each better than the one before. There's a RMSprop, AdaGrad, hyper parameter, I don't remember the names, they're all built in.

Here's another one, this is the cost function, which is not really intuitive, but it's a measure of the extent to which two probability distributions differ. I know that doesn't make a lot of sense, but it works and you can use it. Treat it as a black box until eventually it starts making sense. That's kind of how I did it. And there's going to be a lot of that when you're plugging in left and right, and trying things. It feels like seat of the pants programming. Deep learning is about heuristics and you try something, it does or doesn't work. And then you have an idea to do something and you ask somebody, what would happen if I did that? Try it. You do it and then it works really well, and then you go to the big standard data sets and you get better performance than they do. And then you write a paper and you put it on arXiv and everybody goes great, we got a new technique. That's basically how it works. There's not a lot of documentation, those papers can be difficult to read. So there you have it.
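The slide is not named in the transcript, but the description matches cross-entropy, which the next section also mentions. Assuming that is what was shown, a minimal sketch looks like this; the three-class target and the two predictions are made-up values.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i).

    p is the target distribution, q is the predicted distribution.
    eps guards against log(0) when a prediction is exactly zero.
    """
    return -sum(p_i * math.log(max(q_i, eps)) for p_i, q_i in zip(p, q))

target = [0.0, 1.0, 0.0]      # one-hot label: the true class is index 1
good = [0.05, 0.90, 0.05]     # confident and correct prediction
bad = [0.70, 0.20, 0.10]      # confident and wrong prediction

loss_good = cross_entropy(target, good)
loss_bad = cross_entropy(target, bad)
print(loss_good, loss_bad)
```

The closer the predicted distribution is to the target, the smaller the value, which is what makes it usable as a cost to minimize.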

### How to select a cost function

Selecting a cost function, just general rules. Mean squared error is usually for regression. Binary cross-entropy, categorical cross-entropy, this may not make a whole lot of sense right now, but there are some guidelines for selecting that cost function.

### Setting up Data and the Model

Now, something else I wanted to tell you about the data. Generally you try to keep things normalized. It just works better. So for example, with CNNs, pixel values are between zero and 255. You divide by 255 and it's between 0 and 1. One of the big time sinks with machine learning is feature extraction. Figuring out which ones are more important and the ones that are less important, you might have a hundred features and five of them are really important, and another five are sort of, and then the other 90, that long tail might be almost negligible. So a lot of time is spent figuring that out, cleaning the data, no duplicates, no incorrect data, no missing data. If you have data that's incorrect, what's the correct value? What do you do? Sometimes you just replace it by the average. Sometimes you put 0.

Sometimes you drop the row. What if that represents an outlier? Is the outlier significant? Well, if it's like the stock market, you bet it is. So you have to have a good solid understanding, or work with someone who has domain expertise with the data. Can you drop a column or add a column? It's not obvious. So you go through all this process and then what you do with deep learning is for each of those features you normalize. It's actually standardized, but you'll see normalized being used; it means something that's slightly different. And so you transform the data so that it's N(0,1), meaning it's a Gaussian distribution, mean 0, standard deviation 1. Well, you do that with the data, then it's all sort of a level playing field. How do you figure out which features are more important than the others if it's all N(0,1)?
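The difference between the two rescalings mentioned here can be sketched as follows. The function names and the sample feature values are illustrative assumptions.

```python
import math

def min_max_scale(values):
    """Normalization: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Pixel values lie in 0..255, so dividing by 255 is min-max scaling
# with a known minimum and maximum.
pixels = [0, 64, 128, 255]
scaled_pixels = [p / 255 for p in pixels]

feature = [50.0, 60.0, 70.0, 80.0, 90.0]   # made-up feature column
z = standardize(feature)
z_mean = sum(z) / len(z)
z_std = math.sqrt(sum(v ** 2 for v in z) / len(z))
print(scaled_pixels, z, z_mean, z_std)
```

Standardizing gives each feature mean 0 and standard deviation 1; it rescales the data but does not change the shape of its distribution.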

This is my favorite part. The answer is, what do you do? Nothing. Because deep learning does the feature extraction for us. That's the beauty of deep learning. That's why deep learning thrives on data. There's no such thing as too much data for deep learning. However, if you don't have enough data, maybe you have more columns than rows. What do you do? For example, with image recognition? Well, there are some standard machine learning algorithms that you can use. You could do something like K-Means. You could use SVM, support vector machine. So knowing what to do, when and how and what, means acquiring a certain amount of knowledge of deep learning, as well as machine learning to understand the nuances. Because some of the things that happen are not intuitive. For example, I didn't go into it, but there's something called the dropout rate, from Geoffrey Hinton.

I was talking about overfitting. That means that some of the noise is treated as though it were signal. So how do you fix it? One technique: you just drop nodes. You drop 20%, 30, 40. Seems a little crazy, but it works. So the things to do don't necessarily align with your intuition. And that's why having the experience of different situations is what will help to guide you. And that's what's time consuming, getting that information, because there's no one place that has it. And if you find it, please tell me.
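Dropping nodes can be sketched in a few lines. This version is "inverted" dropout, the variant commonly used by current frameworks, where surviving activations are scaled up during training; that choice, the function name, and the seeded generator are assumptions for illustration rather than something shown in the talk.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (training time only).

    Survivors are divided by the keep probability so the expected
    sum of the layer stays the same.
    """
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)            # seeded so the run is repeatable
hidden = [1.0] * 10               # made-up hidden-layer outputs
dropped = dropout(hidden, rate=0.3, rng=rng)
print(dropped)
```

At prediction time no nodes are dropped, so the function would simply not be applied.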

So here's the dropout rate. Just some of the things in there, these aren't really that important right now. But later on, if you want to go over this again. Dropout rate, we can skip that. How many hidden nodes we kind of went through that.

### CNNs vs RNNs

CNNs versus RNNs. I'm not going to go into a lot of detail with RNNs. They're a lot more complicated, not intuitive, difficult to train, and more difficult to describe. The main thing right now with RNNs is LSTMs, long short term memory, which gives you the ability to keep history. It's kind of like RNNs are stateful entities and CNNs are stateless, if you want to make an analogy. So for example, when you have a self-driving car, you've got images coming in. Each one is processed by the CNN, the convolutional network. Now, if you want to make sure you don't collide with anything, you've got to keep track of the history of where something's moving; that's where you have the LSTM. So you have the image stuff, you identify, the LSTM gives you the history, and then you manage to avoid the collisions and other things, in theory.

That doesn't always work because there was a car about a month ago driving, what was it? 55 or 65 miles an hour that hit a stationary fire truck. Does anybody remember that? So when your self-driving vehicle is following another vehicle and that vehicle moves out of the way, it sees a stationary object. It's like on the highway or the expressway, you know those signs up there, oh you can ignore them, they're not moving. That was essentially the logic from what I understand. Apparently all these systems have that flaw, and Elon Musk is convinced that it can all be done in software. Some people say it needs to be done using more hardware, more sensors. So I guess time will tell how that works out. So the thing with CNNs, as I mentioned, is they're mainly for image processing, but also for audio, and about 60% of all neural networks are CNNs. So this is probably worth your while to learn.

### CNNs: Convolution and Pooling

And I'll just give you the basic sort of minimalistic scenario, and then all the variations are the more interesting ones, the ones that you would actually do when you're solving a problem. What happens is that there's this filter process, it's a convolution, followed by a ReLU, followed by max pooling. Now the filter process is kind of interesting. You don't have to come up with the numbers, but typically you have an image and you'll have a three by three filter. The system generates it. Usually you request, give me eight three by threes. Usually it's a power of 2, 8 or 16 or whatever. And so you have nine numbers and you match it up on the top left corner with the image. So it's like an inner product of two three by three vectors. So you have nine products, eight sums and you get one number.

You move that filter across and generate these numbers. So you populate another array of numbers. And if you go over one at a time, it's the stride. Strides can be one or more horizontal, vertical; they're independent. And you can also, since it's going to be smaller, you can also pad it with 0s. And then that's done before the filtering, by the way, a little detail there. And so you end up with something called a feature map. And the idea it's actually based on the way our eyes work, which is different parts of our eye can recognize different shapes. Some vertical line or horizontal, or maybe like an oval shape. That's the idea it's emulating that, not the neuron stuff, just your actual eye. So that's how it was modeled. So you come up with these feature maps. They're not images. You could treat them as that.

You will see something. However, because the numbers in the filter are generally integers between -2 and 2, you could end up with values that are negative. So that's where ReLU comes in. Negative is replaced by 0.
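The filter, stride, zero padding, and ReLU steps described above can be sketched in plain Python. The 4 x 4 image and the 3 x 3 filter are made-up values, and for brevity the sketch uses one stride for both directions even though, as noted, they can be set independently.

```python
def pad_zeros(image, pad):
    """Surround the image with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in image]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom

def conv2d(image, kernel, stride=1, pad=0):
    """Slide a square kernel over the image; each position yields one number."""
    if pad:
        image = pad_zeros(image, pad)   # padding happens before filtering
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    feature_map = []
    for r in range(0, rows - k + 1, stride):
        out_row = []
        for c in range(0, cols - k + 1, stride):
            # k*k products, summed into a single value
            out_row.append(sum(image[r + i][c + j] * kernel[i][j]
                               for i in range(k) for j in range(k)))
        feature_map.append(out_row)
    return feature_map

def relu_map(feature_map):
    """Replace every negative value with 0."""
    return [[max(0, v) for v in row] for row in feature_map]

image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature = conv2d(image, kernel)              # no padding: 4x4 shrinks to 2x2
activated = relu_map(feature)
same_size = conv2d(image, kernel, pad=1)     # one ring of zeros keeps 4x4
print(feature, activated)
```

Without padding the 4 x 4 input shrinks to 2 x 2; with one ring of zeros the output stays 4 x 4.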

### CNNs: Convolution Calculations

Before getting too far with that, here's an example. So you see that green square and you see this one here, only the top row, there's the 1, and the product with the 42, the result is 42. It doesn't start there, but that's kind of partially through that process; that's what I was trying to explain before. That is a very simple filter, probably kind of useless. You need more stuff. Here are some examples.

### CNNs: Convolution Matrices

That will sharpen your image, this one will blur. The blur is because they're all the same values, as you move across it's like a neighborhood of this point and it takes into account its neighbors. So it smooths the peaks, but it also makes the image a little bit duller. So if you need to, that's what you use. Here's detecting edges and emboss. These filters are in Photoshop and those other things, all the tools, these are it. These are the filters they're using and others. So what would happen if you had just a 1 in the middle? What would that be? It would be kind of like the identity filter, because it would just pick up whatever's in that particular cell and replicate that. What if you had a one in the middle and a negative one in the left?

Well, when you have two consecutive adjacent pixels of the same color, the sum would be what? 0. So you're going across 0, 0, 0, 0, and then it changes. What does that mean? You hit a boundary. Otherwise it's all the same color and there's nothing in there. It's just one single same consistent color. Even that simple little filter can help you detect edges. So there's vertical, horizontal, and then cumulative. So with the edge stuff that's detected, the next layer in the neural network will then kind of figure out there's these polygons or ellipses. And then it starts getting into the features, putting them together to actually recognize: it's a head, it's an arm, and then finally, it's a man sitting at a table with a cat on the table, or whatever it is. That's the first two parts, here's max pooling.
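The two small filters just discussed, a lone 1 in the middle and a 1 in the middle with a -1 to its left, can be tried on a single row of pixels. The row values are made up, and the one-dimensional `filter_row` helper is an illustrative simplification of the two-dimensional case.

```python
def filter_row(row, kernel):
    """Slide a 1-D kernel across a row of pixel values."""
    k = len(kernel)
    return [sum(row[i + j] * kernel[j] for j in range(k))
            for i in range(len(row) - k + 1)]

row = [5, 5, 5, 9, 9, 9]          # one flat color, then another

identity = filter_row(row, [0, 1, 0])    # copies the center pixel
edges = filter_row(row, [-1, 1, 0])      # center minus its left neighbor
print(identity)   # [5, 5, 9, 9]
print(edges)      # [0, 0, 4, 0]
```

The only nonzero output is at the position where the color changes, which is the boundary.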

### CNNs: Max Pooling

Again, simplest scenario, two by two subdivision, take the largest number. And that gives you something that's half as wide and half as tall. You're throwing away 75% of the values. Why does this work? I'll give you an analogy with compression algorithms for binary files. There are two types, there's lossless and there's lossy. What is JPEG? It's lossy, but it works. So that's kind of the idea. However, put a big asterisk next to this, because Geoffrey Hinton, who was involved in coming up with this, he said, and I don't have the exact quote, but pretty close, he said "the success of max pooling has been a disaster for convolutional neural networks." He's one of those soft spoken, brilliant contrarians and he's been right so many times that when he says something he's probably onto something. And so there's something called capsule networks. We'll look at that a little bit later. So now, you know, oops, before we get there. So we do the filters, we get the feature maps, do the ReLU, max pooling and then do it again.
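The two by two case can be sketched like this; the feature map values are made up.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, stride)]
            for r in range(0, rows - size + 1, stride)]

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 3]]

pooled = max_pool(feature_map)
print(pooled)   # [[4, 2], [2, 7]]
```

Sixteen values go in and four come out, which is the 75% reduction mentioned above.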

The filters are for extracting features. Then we have to do something, the classification. That's another part. That's the fully connected layer. Because of the processing, those feature maps have to be stretched out into one dimensional vectors. They're all strung together. Each one of those points is a neuron. Each one of those neurons is connected to the output. So here it happens to be four of these things. But in MNIST, there'd be the digits 0-9, so these are like buckets and that's where the softmax comes in. So you got the whole thing, connected, softmax to the output. And then, I'm skipping details, you also have a modified version of backpropagation that we described earlier. It's a little bit different because max pooling isn't a differentiable function. So it does some internal stuff to keep track. It all works.
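The classification step, stretching the feature maps into one vector, connecting every value to each output bucket, and applying softmax, can be sketched as follows. The two tiny feature maps, the weights, and the four output classes are made-up values; in a real network the weights are learned.

```python
import math

def flatten(feature_maps):
    """String all feature maps together into one 1-D vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output neuron."""
    return [sum(w * x for w, x in zip(w_row, inputs)) + b
            for w_row, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw scores into numbers in (0, 1) that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

feature_maps = [[[1.0, 0.0], [0.0, 2.0]],
                [[0.0, 1.0], [3.0, 0.0]]]
x = flatten(feature_maps)                      # 8 values

weights = [[0.1] * 8,                          # one row of weights per class
           [0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.9, 0.0],
           [-0.3] * 8]
biases = [0.0] * 4

probs = softmax(dense(x, weights, biases))
predicted = probs.index(max(probs))            # take the maximum
print(probs, predicted)
```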

And again, supplying all these images and updating the values for those filters to get better feature maps, to get that whole thing fully connected, max pooled. And then there's a set of numbers between 0 and 1 whose sum is 1, take the maximum, it's a dog, or it's a 3. Remember, these are only approximations. It doesn't come out as 100% probability. It might be 80%, but it works. Even though it's not close to 100%, on average, in the aggregate, there's a high percentage of success. And so the idea is coming up with convolutional neural networks that are better. Now, two years ago, somebody won a competition. Before, I mentioned something about trying things, so what these guys did, they took that max pooling. They did that immediately, no processing, they threw away 90 or 75% of the image. They won the contest.

That was one of the things they did among other things. So it's these kind of simple, not necessarily intuitive combinations of different things that you do, and it works based on your data set. So there's really no a priori way of knowing what will be the best. That's where the creative work comes in. And of course all the processing time that's involved. So it can be very time consuming. So you get a team, one person does infrastructure, one does algorithms, one does modeling and then you have a fourth person maybe, and then you kind of pool your knowledge together. That usually works better than flying solo with those competitions. That gives the view of the first part of convolutional neural networks. At this point, I just want to pause and say one thing. You might be thinking, this deep learning stuff, maybe they just got lucky a few times, but most of it's just kind of fluff.

That's a fair assessment. I have two things to say. First, there's something called the universal approximation theorem, which states that any continuous function in the plane can be represented arbitrarily closely by a neural network. For those of you who remember, we had Taylor series, it's a polynomial expansion of a continuous function or differentiable. And then there's Fourier series, a combination of sine and cosine for partial differential equations. You know, those boundary value problems? Which I really liked, and so now we have this. I was actually quite surprised, but it really does work. Now, the thing is that there are a lot of continuous functions in the plane, and remember a subset of continuous functions is the differentiable ones. There's plenty of those. In fact, there is an uncountably infinite number of continuous functions in the plane, each of which can be represented arbitrarily closely by some neural network, which tells you that the expressive power of neural networks is immense.

If that doesn't convince you, that's okay. Last summer, I think it was, there was a startup company, they created a barcode scanner for blind people. And the state of the art up until then was a $1,300 device. Theirs was $20 using deep learning. And it didn't do much, it's only $20. And they trained it and apparently it would read, scan the barcodes and then after a while, what happened was this little barcode scanner learned how to read the dietary information. You know, the ingredients, the percentage, nobody trained it, nobody planned it, maybe it was version two of the product. I don't know. No one could explain it. My answer is the power of deep learning. So this is sort of a nice little feel good story. Some of you might be thinking, well, first we've got the barcode scanner and then it's Skynet two and I'm not worried.

So anyway, that's just a little anecdotal sort of background to give you the idea that maybe there is something to this stuff. And I think I may have taken this out. Maybe. I don't have it in here. I mentioned capsule networks. They are an alternative to the CNNs. Meaning without the max pooling, so you take that out and instead of having individual hidden layers, they're kind of grouped together in containers or capsules. And there's this routing mechanism and voting pattern, algorithm rather, and there's code on GitHub for this, and the purpose is to try to capture the relationship between the whole and the part. For example, if you have a face, two eyes, a nose, and a mouth, then there's something I call the Picasso face. You know, where the nose is in the mouth, the mouth is up in the eye.

If you look at a standard CNN, it's translation invariant. So it goes, oh yeah, there's a mouth and a nose and two ears, so it's a face. Capsule networks are not as prone to being deceived by that; they will detect the fact it's not a face. Why am I saying this? Well, because there's something called generative adversarial networks. I don't know if you've heard of those. There are ways to take an image, modify the pixels, vary them in a way that's imperceptible to the human eye and yet defeat any neural network. There are lots of techniques to defend against that; all have been defeated. So why is that important? Well, if you're in your self-driving vehicle and it's a stop sign and it thinks it's a speed limit, that could be catastrophic. And it gets worse. A few months ago, somebody put up a paper on arXiv with an algorithm that describes how to mess up the image by modifying one pixel.

It's obviously very important and capsule networks are more resistant to that sort of deception, if you will. They can also do some other things that are better than the other networks. However, they're more difficult to train, they're slower I think. More complicated. They're not perfect. There are flaws and so on. And so Geoffrey Hinton's been working on it since 2011. So he's been very tenacious about it and you can find stuff online about that. So that's generative adversarial networks. And the interesting thing is originally it was done by Ian Goodfellow four years ago to generate synthetic data. So if you don't have enough data, generate this nice stuff and mix it all in together, and then that was kind of the concomitant effect, if you will. Nobody expected that. I'm not sure who came up with that idea, but there you have it.

### What is Keras?

Keras is written by someone who's working at Google and it's a layer on top of TensorFlow. And it's, as I mentioned before, it's more intuitive. It's a lot easier really when you're first starting because the APIs, you don't have to understand what's going on with the graph underneath. Although it's on top of TensorFlow and also Theano as well as CNTK. So it's well, I'll just show you.

Remember, we've been talking about models for the last 45 minutes? So for Keras, we import Sequential. That's just like a container; there's another type. Not to worry about the other kind, it's a functional model. Layers: Dense, Activation. We talked about activation; dense means they're all connected. So between two layers, every node connected to every other node, that's dense. Look what we have; model = Sequential([ Dense(32, input_shape=(784,)),. See, when you have an image that's 28 x 28. Remember I said, you have to stretch it out to a one dimensional vector. 28 squared is 784. So that's the input data, the pixels, numbers between 0 and 255. Activation('relu'),. Another dense layer; Dense(10),. Activation('softmax'),. You already kind of know what this is doing, sort of.

And by the way, you put this in abc.py and then you type python abc.py, and you run it. And this is what you get for the summary. It tells you what's a dense layer and what's an activation. Gives you the parameters. Even that little neural network there, 25,000 parameters. It's not unusual to have 500,000 parameters, 10 million. And that's the stuff with the activation function, with the cost function and all that stuff. That's where all the work is being done. That's really number crunching.
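The parameter count quoted for this small network can be checked by hand: a dense layer has one weight per input per unit, plus one bias per unit. The helper name below is an illustrative assumption; the layer sizes 784, 32, and 10 are the ones read out from the slide.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [784, 32, 10]       # input pixels, hidden units, output classes

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)           # [25120, 330] 25450
```

The activation layers add no parameters, so the total of 25,450 is consistent with the roughly 25,000 mentioned.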

### CNN in Python/Keras (fragment)

Here's another one. We talked about CNNs, there's Sequential at the top. Dense, I just mentioned. Dropout, we talked about that. Flatten, straight one dimensional vector. Activation. From convolutional we import Conv2D, the convolution I was telling you about, the filters. Max pooling. Adadelta, the optimizer, the gradient descent technique. Input shape, it says we're going to have a 32 x 32 image. There's going to be three channels. I didn't go into it before, but it's separated into R, G, and B. The model is sequential. Look what we're doing; model.add(Conv2D(32, (3, 3), padding='same', input_shape=input_shape)), model.add(Activation('relu')). You know all of this already. So let's take a look at TensorFlow.

### TensorFlow

It's kind of a deferred computation graph is what it is. If you've looked at ASTs, abstract syntax trees, it's that on steroids. It involves using tensors, which are multidimensional arrays. This is a little bit non-intuitive, just more stuff about what it can do, the typical stuff that you expect. And I'll show you a little bit about TensorBoard in a couple minutes.

So here are the use cases, pretty much standard stuff, no surprises, going a little quickly here. And so what you have, I mentioned, there's a notion of the graph, edges, nodes, operations, lazy execution, a session. In order to actually make something happen, you have to invoke a session and its run method and then stuff happens. That's why it's deferred. There's also eager execution, which makes it look more Pythonesque and that's more recent.

It's not available in the standard download, which is pip install tensorflow. If you want the eager execution, pip install tf-nightly. You'll get that. That's in 141. The latest version of TensorFlow, I think right now is 1.6. And if you want it for the GPU, pip install tf-nightly-gpu. So what happens is, I mentioned already pretty much the same thing, you have a tf.Session object, and you invoke the run method.

For example, I'm just summarizing what I described here about the different order tensors. Generally you won't go past four dimensional tensors and those are actually what you use with CNNs. There are some people working with really large systems. Those are five dimensional tensors that they use, but it's unlikely that you're going to be doing that.

### TensorFlow "primitive types"

And so we have three types: constant, placeholder, and variable. Try not to use constants because they get saved as part of the graph and they can bloat it. So you tend to prefer to use variables. And also variables can be shared. This is some of the little details, not to worry. You don't have to memorize it. Let's see is the next one here? Little bit. Okay.

### TensorFlow: Constants (immutable)

So look what we have; we import tensorflow as tf, that's the standard, tf is TensorFlow. We have a constant, tf-const.py. Notice sess = tf.Session(). And then what do we do? We print(sess.run(aconst)), of that constant that we defined, which is a zero dimensional tensor. So the result is 3, and then you have to close it. Do you like this? There's a slightly simpler way. This saves you one line. You do with tf.Session() as sess, and then print. That doesn't really save you a lot. You'll see, it'll be better when we get to the eager execution.

### TensorFlow Arithmetic Methods

Little bit about arithmetic, the operators are full English words instead of the symbolic operators and you get the results that you would expect when you perform those operators. We can skip over this. A little bit about doing some other calculations. Notice we have some built in functions, approximating PI is that value. We have our sess defined. Let's go from the bottom up. First we're going to do tf.div(tf.sin(PI/4.), tf.cos(PI/4.)). Sine over cosine is tangent of PI/4 radians, 45 degrees. So that is going to be what? Tangent, so 1. What's cosine of PI radians? -1. What's sine of PI radians? 0. So it should be (1, -1, 0).

Not quite. The last two are correct, but there you have an approximate value, but we also are using an approximation for the value for PI. So in case you need to do highly precise calculations, just keep that in mind. Why does the tangent work by the way? That was approximate too, right? Because the sine and cosine of 45 degrees are equal. So the number's going to be the correct value plus error, some e. The bottom will be the correct value plus e, well it's 1. It's just coincidental serendipity, or something like that.

### TensorFlow placeholders and feed_dict

So here we have the part where we can feed in numbers. We can have placeholders. If you have written C programs, it's kind of like doing something like int x; and then later on you go x = 3. So you declare it and then you define the value. That's kind of the idea. So we have a feed dictionary and look what we do over here at the bottom. sess.run, of course. For c, we pass in this feed dictionary because c is the product of a and b. So we have to pass in values for a and b. That's how it works.

### TensorFlow and Linear Regression

And based on that idea, this probably makes more sense too, because we have, what are we doing here? We're defining W, X, B. What does that suggest? We're going to do W times X + B, linear regression. And so what we do is we define Wx, and it's called Wx. And then Y is Wx+B. This is all deferred, nothing's executed.
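The deferred style, where a graph is declared first and values are supplied only at run time, can be imitated in a few lines of plain Python. This is a toy stand-in to show the idea and is not TensorFlow's API: the classes `Placeholder`, `Constant`, and `Op` and the function `run` are hypothetical names invented for this sketch, and the numbers are made up.

```python
import operator

class Placeholder:
    """A declared slot with no value until one is fed in."""
    def __init__(self, name):
        self.name = name

class Constant:
    """A fixed value stored in the graph."""
    def __init__(self, value):
        self.value = value

class Op:
    """A node that records what to compute, without computing it."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

def run(node, feed_dict=None):
    """Evaluate a node, pulling placeholder values from feed_dict."""
    feed_dict = feed_dict or {}
    if isinstance(node, Constant):
        return node.value
    if isinstance(node, Placeholder):
        if node not in feed_dict:
            raise KeyError(f"no value fed for placeholder {node.name!r}")
        return feed_dict[node]
    return node.fn(*(run(arg, feed_dict) for arg in node.inputs))

# c is the product of a and b, so both must be fed.
a, b = Placeholder("a"), Placeholder("b")
c = Op(operator.mul, a, b)
product = run(c, feed_dict={a: 3.0, b: 4.0})

# y = W * x + bias: W is fixed, x and bias are fed at run time.
W = Constant(2.0)
x = Placeholder("x")
bias = Placeholder("bias")
Wx = Op(operator.mul, W, x)
y = Op(operator.add, Wx, bias)          # nothing has been executed yet

wx_value = run(Wx, feed_dict={x: 5.0})
y_value = run(y, feed_dict={x: 5.0, bias: 1.0})
print(product, wx_value, y_value)
```

Building `Wx` and `y` performs no arithmetic; the multiplication and addition happen only inside `run`, which plays the role the session's run method plays in the talk.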

### TensorFlow fetch/feed_dict

Here, see the feed dictionary. Wx, W is fixed, X didn't have a value. So we had to pass in a value. And then we get that result. If we want Y, Wx + B, we have to pass in something for X and something for B. Well, that's what that middle line is doing in the green. We get that. So you can see the start here, where you can do linear regression, pretty much anything you want. But you have to build up the stuff, and compare that with Keras. In two slides, you saw what a convolutional neural network looks like. Here we're just doing a line, right? I'm trying to be impartial because I think part of the purpose of a presentation like this one is to show you the different things that are available so that you can make a more informed decision about what you want to do and what you can do based on the constraints that you're in.

### Saving Graphs for TensorBoard

So here's an example, the line in the middle, I didn't mention global_variables_initializer, that's another method that has to be invoked to initialize all these things that have been declared at the beginning, but haven't been initialized. You have to invoke that. If you do this line in the middle, the file writer, it goes into that directory, puts in a file, and it saves what? Session.graph. And so when you go into your browser, and I think I have it here, TensorBoard, there's the graph. I mean, pretty trivial, but it shows you, and then you can highlight things and expand and drill. It's very nice when you're trying to do some debugging. And it gives you the information on the left, and we didn't give these names, but you can do that. And the thing is if you have 50 nodes, it's going to get a little bit messy, because the graph could be quite complex or a hundred.

What if you have 500? What you can do is define components where all these things are inside of this one component, and then all the others in another component. There's another benefit to doing that beyond just a cleaner graph, which is not intuitive, but when you hear it, it'll make sense. When you separate things into separate components, TensorFlow can execute those components on different CPUs and GPUs in parallel. You get that for free. That's nice. Another incentive for doing that. In case you need to be concerned about that. So just going back to this, let's see, what do we have?

### TensorFlow Eager Execution

Eager execution, as I mentioned before, you have to get the specific download when you install it. There it is right there. I already told you, you need Python 3.x. I already kind of told you what it does. So you have that line, tfe, you enable eager execution. Now we have X defined as a one dimensional array tensor, multiplied by itself, we get four, which is pretty much what we wanted in the first place. Which do you prefer? Eager execution or the regular style? Obviously this is going to be better. Now as far as performance, I don't know what the difference is in execution time if you have a really large, massive system, one with traditional TensorFlow, the other one with eager execution, but that's something you can try.

### TensorFlow and CNNs

So here's a little bit of TensorFlow and convolutional neural networks. We have a Python function where we're passing in all this stuff and it will construct the neural network. This is kind of like the decorator pattern, if you're familiar with that. Yeah, I think it is a decorator pattern. So what we do here, notice we have tf, and the other stuff is of course not shown, and with tf.layers.conv2d, notice we have the input layer, and this means that it's 28 by 28. There's one channel. This negative one is just a syntactic thing for Python. Let's not worry about that. Filters 32. Kernel size 5 by 5. Padding. relu. Now, one thing I didn't mention before about the variations, with the kernel size, the filter size, is 3 by 3 is kind of standard, but you could do 5 by 5 obviously. And you can do 1 by 1. Those guys that won the contest by doing the max pooling first, they also did 3 by 3, 5 by 5, 1 by 1, and then they merged it all together. Is that intuitively obvious? I don't think I would've thought of that, but that shows you kind of the idea: trying different sorts of things, see how they work. Generally they're odd sizes so that there will be one center point. That's a tiny little detail. Now we have the pooling layer at the bottom, after conv1, we do what? Pool size 2 by 2, strides 2. So it is the 2 by 2 subdivision, and it goes over by two, vertical, horizontal. Not so bad.
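How the image size changes through these layers can be worked out with the usual output-size formulas. The talk only says "Padding" without naming the mode, so the `"same"` setting below is an assumption, and the helper name is hypothetical; the 28 x 28 input, 5 x 5 kernel, and 2 x 2 pool with stride 2 come from the description above.

```python
import math

def output_size(n, kernel, stride=1, padding="valid"):
    """Width (or height) after a convolution or pooling layer.

    "valid": no padding, the window must fit entirely inside the input.
    "same":  zero padding chosen so that the output is ceil(n / stride).
    """
    if padding == "same":
        return math.ceil(n / stride)
    if padding == "valid":
        return (n - kernel) // stride + 1
    raise ValueError(f"unknown padding mode: {padding!r}")

after_conv1 = output_size(28, kernel=5, stride=1, padding="same")     # assumed mode
after_pool1 = output_size(after_conv1, kernel=2, stride=2)            # 2x2, stride 2
after_conv2 = output_size(after_pool1, kernel=5, stride=1, padding="same")
after_pool2 = output_size(after_conv2, kernel=2, stride=2)

without_padding = output_size(28, kernel=5, stride=1, padding="valid")
print(after_conv1, after_pool1, after_conv2, after_pool2, without_padding)
```

Under that assumption each convolution keeps the size and each pool halves it, giving 28, 14, 14, 7; without padding the first 5 x 5 convolution alone would shrink 28 to 24.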

And then a little bit more, there's another convolution layer. Then there's another pool, just like we did before. This code is actually not bad. And there's more stuff. If you want to see the full code it's right here. And actually what I did do is I did have GANs. I just had it in a different place.

### GANs: Generative Adversarial Networks

So we have, what? Panda on the left, the weird thing in the middle, and a panda on the right. But look at the precision there, it's a gibbon with 99% accuracy. Remember, do you see any difference between those, the left and the right? I don't, but look at all that stuff. And remember you can do one pixel modification and defeat the neural networks. I think we've actually already shown you all of this. Up until recently, the focus was on static images. You can also create your own if you want. And there's a GitHub link, get the code, and also with MNIST. And here's the part I wanted to show you. It's very hard to see. What people have done recently is applying GANs to audio.

So you can corrupt that sound file in the sense that it'll say something else than what you expected. I think it gives new meaning to fake news. So in my kind of warped mind, I'm thinking we're going to get to the day where we have fake images, fake news, we're going to wake up one morning and go, what's my name? I don't know. It's fake. Everything's fake. I mean, sorry. Off on a tangent here.

### Deep Learning and Art

And so there's also a nice little link over there. If you go there, and I have no affiliation with anything, I just found this, you upload two images and it does the convolution. There's a lot of public stuff there that people have done. It's really nice, and you can upload your own. And I won't go into the ones that I did, but I'm not an artist, so I took my SVG code with some JavaScript, generated some images, and then I took some celebrities and I kind of merged them together. Some of them were nice. Some of them, a little lame, but you can try it on your own and see.

### What Do I Learn Next?

And there's a ton of stuff you can learn. If this felt like a fire hose, it's just a trickle. That's what somebody told me actually about my presentation about three months ago. Lots of stuff that you can do. And there's the T model for learning: you learn something in depth and then you kind of go horizontal. I call it the pyramid model, where you've got a pile of sand, and if you want to add another foot or more vertically, you've got to pour a ton of sand because it spreads out. So you're learning horizontally and vertically. That's very time consuming, especially if you're doing that on your own to find stuff.

So I recommend Udacity stuff. Udemy, there's videos there. Videos on YouTube. Kaggle. Blog posts. Do a little this, do a little that. Go to meetups, talk to people, share the knowledge and get some reinforcement, because it really is a lot about reinforcement and repetition as well as the technical details. But it's the familiarity that's a very significant part of that. And just about done, last two slides.

### About Me: Recent Books

**Oswald Campesato:**

Just a few of the books I've written. The RegEx book is coming out, I think, in May. And I do some training, and that is basically it. Hope you got something out of it. Thanks for your attention.

**Audience member 2:**

Thank you. Thank you so much, Oswald. So if folks have any questions, I can bring you the microphone. If you're going to raise your hand, we'll start here in the back.

**Oswald Campesato:**

Only questions that I know the answer to please.

**Audience member 2:**

That's the requirement. Okay. Sweet. So I'll bring you the microphone, sir. There we go. Thank you.

**Audience member 3:**

Thank you for the nice presentation. Can you speak a little bit more about when one pixel can screw up the whole network?

**Oswald Campesato:**

I have not read the algorithm. I'm sort of worried that when I read it, I'll get scared that it's so simple, anybody can do it. I'm sort of half facetious, but if you read the paper online, I don't know the details, but apparently he has succeeded in constructing such an algorithm.

**Audience member 3:**

So has it been kind of trial and error, or did he do something like just...?

**Oswald Campesato:**

I don't know the details. But someday I will make myself read it.

**Audience member 2:**

Awesome. Any other questions? If you raise your hand. Gentleman in the back.

**Audience member 4:**

Slides?

**Audience member 2:**

Slides? Oh yes. So Oswald, we'll post the slides on SlideShare within like the next two weeks. This was recorded. Thank you, Mark. And we'll post it on YouTube as well within the next two weeks. Yeah. Anything else? All right. So actually, let me check. Oswald, I think we had two more questions. Let me see here. Somebody asked how to test a neural network? In the software world, it's either true or false.

**Oswald Campesato:**

Very good question. I should know the answer. I do not. There are a few blog posts you can find online that address that specifically. And so, yeah, I'll figure that one out too.

**Audience member 2:**

All right. And final question. Is the neural network training process similar to methods like the Newton-Raphson method to compute square roots, where you iterate and wait for close values?

**Oswald Campesato:**

Yes, you can use Newton's method. That is actually one technique that's used. The gradients that are computed here are all first-order derivatives, but you can use a quadratic kind of rate, and it's a second-order iteration. I forgot the exact name, but yes, there are those techniques.
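The comparison in the question can be made concrete with a small sketch. It computes a square root two ways: Newton-Raphson root finding on f(x) = x² − a, and plain gradient descent on the loss (x² − a)², which uses only the first derivative of the loss, as in typical network training. The loss, the learning rate `lr`, the tolerance, and the function names are illustrative choices, not something taken from the talk.

```python
def newton_sqrt(a, x0=1.0, tol=1e-12, max_iter=100):
    """Newton-Raphson on f(x) = x^2 - a. Returns (estimate, iterations used)."""
    x = x0
    for i in range(1, max_iter + 1):
        x_new = 0.5 * (x + a / x)          # x - f(x) / f'(x), simplified
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    return x, max_iter

def gradient_descent_sqrt(a, x0=1.0, lr=0.01, tol=1e-12, max_iter=100_000):
    """Gradient descent on L(x) = (x^2 - a)^2. Returns (estimate, iterations used)."""
    x = x0
    for i in range(1, max_iter + 1):
        grad = 4.0 * x * (x * x - a)       # dL/dx
        x_new = x - lr * grad              # fixed-size step against the gradient
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    return x, max_iter

newton_root, newton_iters = newton_sqrt(2.0)
gd_root, gd_iters = gradient_descent_sqrt(2.0)

print(f"Newton-Raphson:   {newton_root:.10f} after {newton_iters} iterations")
print(f"Gradient descent: {gd_root:.10f} after {gd_iters} iterations")
```

Both loops follow the "iterate until the values stop changing" pattern from the question. The difference is the convergence rate: the Newton update roughly squares the error at each step, while the fixed-step gradient update shrinks it by a roughly constant factor, so it needs many more iterations.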

**Audience member 2:**

Awesome. Okay. Any more questions? All right. Well thank you so so much. Thank you. And thank you everybody for coming.

**Oswald Campesato:**

All right. Thank you very much. Appreciate it.