Convex Optimization - Stephen Boyd, Professor, Stanford University
Stephen P. Boyd is the Samsung Professor of Engineering, and Professor of Electrical Engineering in the Information Systems Laboratory at Stanford University. He has courtesy appointments in the Department of Management Science and Engineering and the Department of Computer Science, and is a member of the Institute for Computational and Mathematical Engineering. His current research focus is on convex optimization applications in control, signal processing, machine learning, and finance.
Talking Points:
Applications of Convex Optimization In Business, Medicine, and Engineering
How Radiation Treatment Planning Utilizes Convex Optimization
How Dynamic Optimization Allows Different Models to Work Together
Speakers:
Stephen P. Boyd, Professor of Engineering, Stanford
Read the Full Transcript
Stephen Boyd:
So thank you very much. Before I start, Sri mentioned that that video interlude was to distinguish between the business and the math. So just, heads up, the math is starting. I'll give you some warning before we actually get into it. I just want to say, even before starting, that as a professor and researcher who works on a lot of the core algorithms, it is just a fantastic feeling to see the things we work on actually become beautiful open source implementations, and then things that run on GPUs that are unbelievably powerful. And then actually to have people use these things to make useful products. It's just fantastic for us. So, okay. Let me go ahead and start. I'm going to talk about Convex Optimization.
What is Convex Optimization?
Just a heads up, there will be some math in here. There won't be too much, frankly. And in fact, it's arranged in such a way that, even if you ignore the mathematical parts, you should get the basic ideas, because the basic ideas are actually quite simple. So I'm going to talk about optimization, which is in fact a bigger subject than just Machine Learning or AI. It relates to them very heavily, but it's actually a bigger subject than that. So first I'll just talk about what Mathematical Optimization is. And it'll be just English. Just like, what is it? What do you do with it? Why would you do it?
That kind of thing. And then I'll talk for a little bit about Convex Optimization. We'll get to what that is. It's a special kind, with special attributes. I'll show some very, very simple examples. And then I'll talk a little bit about some of the most recent research involving things like large scale distributed optimization. And in fact, that's exactly the kind of thing we're actually running now on GPUs. In fact, the very first demo you saw this morning, with the green lights and red bars going across, that was developing 4,000 models. That was exactly using the ideas of large scale distributed optimization, actually running on eight GPUs. So it all fits together very, very well. So we'll start with just what Mathematical Optimization is. It's an optimization problem that looks like this.
Your job is to make a choice of a vector. It's a set of some numbers. Could be a few. Usually it's a lot more, like a thousand, 10,000, 100,000, or a billion, something like that. And you're going to be judged. There are absolute constraints you must satisfy. So, it depends on what the problem is. We'll look at some examples shortly. But the idea is there'll be constraints you must satisfy. These could be legal requirements, technical requirements, physics, things like that. And then you're judged by an objective. And that says how much you, well, here we're minimizing. So f0(x) is going to tell you how much the choice x irritates you. And then the goal will be to make a choice which is least irritating among the choices that are possible, okay?
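For reference, the standard form being described here is usually written as follows (the slide itself isn't reproduced in the transcript, so this is the textbook form of the problem he's pointing at):

```latex
\begin{array}{ll}
\text{minimize}   & f_0(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1,\dots,m,
\end{array}
```

where x is the vector of choices, f0 is the objective ("how much the choice irritates you"), and f1 through fm encode the constraints that must be satisfied.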
What Can You Do With Convex Optimization?
And that's the goal. And there are lots of variations on this. You could maximize something like profit or utility. And you can look at multiple objectives and things like that. Okay? So now we're just going to talk about what you would use optimization for. Well, the first and most obvious use of it is simply to make good actions, right? It's to actually solve an optimization problem to determine something good to do. And in that case, the things that you are choosing, that vector x, the x_i's, these are things like, they could be trades in a portfolio. So in quantitative finance, you would solve an optimization problem. This is, in fact, universally done. You solve an optimization problem and that determines what to trade, in what amounts, where to trade, and that kind of stuff.
This is used universally for control of airplanes. So there, for example, the x_i's, what you're asking yourself is, how should you adjust the thrust on your engines? How should you change the control surfaces? Maybe for the next tenth of a second, for example. It could be a schedule or an assignment, right? Or it could be resource allocation at data centers. You would decide, maybe on a few-second time frame, how many more cores you should spin up, or how much memory you should allocate to one type of processor. And it's used for wireless stuff, where you would optimize the signal that you send, right? You would send a signal that is going to be most likely to be decoded correctly at the receiver.
Applications of Convex Optimization In Business, Medicine, and Engineering
Okay? Now, the constraints here are going to limit the actions you can take or impose requirements on the outcome. So, for example, they might be legal. If you're running a mutual fund, you're not allowed to have short positions, for example. If you are doing a schedule or an assignment, it tells you that yes, you do have to have one surgeon for a surgery, right? Or an anesthesiologist; that would be a requirement. And then the objective is simply something where the smaller it is, the better. And that's often something like total cost. It could be the deviation from a desired target outcome, if you don't think you can achieve it exactly. It could be risk in finance, for example, or fuel use in control. So, in these applications of optimization, the thing you're choosing is something to actually do.
It's an action, okay? Engineering design is something like that, except it's not an action in real time. It's an action maybe on an 18 month or two year time scale. So x represents the design. It's the physical design of a thing. It's how wide should that transistor be, or how big or how small should this component be. And the constraints usually come from the manufacturing process. You can't make it too small. You can't make it too big. Or, you have to satisfy the performance requirements. If I'm designing the steel frame structure for a building, it has to hold up under some earthquakes and things like that. And then the objective is typically, in that case, cost, weight, power. Okay? Now, we get to yet another use of optimization.
How AI and Machine Learning Uses Optimization
And this is the one that's closest to AI. And in fact, it's pretty much AI, statistics, and Machine Learning. In that case, this variable x does not represent an action. It represents parameters in a model. So any model you have, it could be any kind of model, a deep neural net, Logistic Regression, any of these models, you have parameters. You might have a small number, and you might have a very large number. So that's what you're asked to choose: these parameter values. Now the constraints impose requirements on them. If you are, for example, estimating a covariance matrix, it has to be positive semidefinite. If you're estimating powers or fluxes, they have to be non-negative, for example. And then the objective is a combination of two things. It's the prediction error on the data you have.
Yeah, obviously you want to choose parameters so that your model fits the data you've seen. That's clear. But there's a second term, and that second term penalizes model complexity, right? And that's there because what you really want is not a model that works well on the data you've seen, but a model that works well on next week's data, for example. That's what you really want. Okay? So this is another very broad application of optimization. Inversion is a variation on this. So in inversion, this would be like in imaging. I have an image, but it's blurred. It's got noise, optical aberrations, all sorts of things like that. And then my job is to take that blurred or imperfect image, maybe with noise, and make a guess of what was the actual image I was looking at, right?
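One common way to write this kind of inversion, assuming a linear forward model for the blur (the talk doesn't spell out the exact formulation, so this is just a sketch):

```latex
\underset{\hat{x}}{\text{minimize}} \quad \|A\hat{x} - y\|_2^2 \;+\; \lambda\, r(\hat{x}),
```

where y is the blurred, noisy observation, A models the optics, r is a regularizer expressing what plausible images look like, and the weight trades data fit against plausibility. It has the same fit-plus-complexity structure as the model-fitting objective described just above.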
Using Optimization in Worst-Case Analysis
So that's the idea. And now this is done very well, and pretty much universally, through optimization. So you write down an optimization problem and then you solve that, or attempt to solve that, to do your image inversion. I'll mention a couple of others. They're not part of the usual world of Machine Learning, but they're actually really interesting. So optimization is also used for worst case analysis. And this is actually quite interesting. I mean, the name optimization sounds like you're supposed to be doing good, right? You're supposed to be making good choices, make good trades, make a good trajectory for your airplane. But in fact, it's extremely useful in practice for doing the opposite, for basically finding out what is the absolute worst thing that can happen.
Now, in this case, of course, the idea is that the variables are not things you are doing, because why would you make things bad? The variables are actually actions which are under the control of an adversary. That's what it is. So for example, you might say, I hold this portfolio, and here's a bunch of things I don't know: interest rates, this thing, that thing, the correlation between these two financial sectors. And then you might say, with all of those things, what's the worst thing that could happen to me? And the answer might be: the interest rate goes up by the maximum amount, these two sectors become correlated, these two become independent. And that would put you in a very bad position, right? So that's the idea. And the idea behind worst case analysis is, if the worst thing that can possibly happen, under your assumptions about what can possibly happen, is not so bad, that's cool, and you can sleep well or reasonably well.
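Schematically, this worst-case analysis is itself an optimization problem, just with the roles flipped: you hold your decision fixed and let the uncertain parameters be the variables (a sketch, not a formulation from the slides):

```latex
\begin{array}{ll}
\text{maximize}   & \text{loss}\big(x^{\text{fixed}},\, u\big) \\
\text{subject to} & u \in \mathcal{U},
\end{array}
```

where u collects the quantities you don't know (interest rates, correlations, and so on) and the set on the right is everything you consider possible for them.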
Now what usually happens is, of course when you do this, you find that the worst thing that can happen is actually truly terrible. Then you look at it and you go, oh my God. And then the next thing you do is you say, well that could never happen. And then you move on, right? So actually what happens then is they say that could never happen and you add more constraints to make sure it can't happen, and then you're fine. But yeah, okay, so, alright. I'll mention one more. It's also very interesting. It's not a part of the normal mainstream AI. It's optimization based models. And so the idea here is that you model an entity as taking actions that optimize something, right? And an example would be, well, here's a good example.
How Convex Optimization Works in Economics and Business
Economics. Here's the example. You'd say, well, an individual or an agent takes actions that maximize an expected utility. Now, if you talk to someone, they go, first of all, I don't know what expectation means. I don't know what utility is. I'm just buying a car or something like that. So these are very bad models, right? They're not accurate. But you aggregate just a small number of agents and all of a sudden they have predictive power, right? So this is done in a lot of areas. For example, in biology right now they're making very sophisticated predictions about what happens to cells. And they simply make an incredibly simplistic assumption. They ignore all biology, they look at a tiny part of the chemistry, and they simply say, under these assumptions, the cell will maximize its growth rate.
Then they ask questions about what would happen if I knock out this gene or that one. And these models actually end up having extremely good predictive ability. Which is strange, because of course what's happening in the cell is not solving an optimization problem. It's something unbelievably complicated. It's called biology, right? So it's very interesting. So don't make fun of these models; even though they sound ridiculous, they're actually very effective. And so these often work very well. So the summary is, optimization comes up everywhere. It's in making good actions, it's in doing control, things like that. It's in finance, it's in all sorts of areas. And it's also in areas like Machine Learning, where you're actually choosing parameters in a model.
Can We Solve Optimization Problems?
Now, the bad news: in general, we can't solve optimization problems.
So that's the bad news. So, sorry. By the way, the correct way to teach a course on this is to make it really long, so that by the time you have to admit this, it's usually the last day. And you don't admit it directly like this; you make it a little bit more complicated and you mumble about NP-hardness. There's a trick to doing this, right? But the truth is, no one would dispute this. Now, there are two ways to get around this, okay? So one is to just totally ignore it: you run these algorithms, and if they don't find the point that has the lowest value, then you don't care.
Tricks For Getting Around Optimization Problems
I mean, and that's just fine. There's nothing wrong with that. That's what deep neural networks do. No one cares, right? Because you run a deep neural network, and besides, your job was not to minimize whatever function you were attempting to minimize; your job was to make a good model. And that is checked by out-of-sample validation. And how you got that model is actually no one's business at all, period. So, that's approach number one. Widely used, highly traditional, quite effective. However, it turns out there is a group of problems which you can actually solve. They're called convex problems, right? And we can basically solve them. Now, I don't want to make a big deal out of it. It doesn't mean it's like a fantastic thing. It just means it's one less thing that you have to look the other way about or be embarrassed about.
So, at least, if you're solving a convex problem, you do not have someone say, oh yeah, you're not really solving it. And you can say, weirdly, we are. So I'll say a little bit about that. It's going to get a little bit mathematical, but not for very long. So don't worry. And the basic ideas should be pretty clear anyway. So, a convex optimization problem. It's your basic optimization problem. You minimize an objective, you have constraints that have to hold, and you have equality constraints, but they have to be linear. And there's only one more requirement. And that is that this objective function and these constraints, they have to curve up. That's it. And this is actually really weird because you would think, I mean the old school idea was that basically things that were linear were easy, and things that were non-linear, that had any curvature at all were hard.
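To restate that definition in symbols (this is the standard textbook form, consistent with what was just described):

```latex
\begin{array}{ll}
\text{minimize}   & f_0(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1,\dots,m \\
                  & Ax = b,
\end{array}
```

where the equality constraints are linear and f0 through fm are convex, that is, they "curve up": f_i(\theta x + (1-\theta)y) \le \theta f_i(x) + (1-\theta) f_i(y) for all x, y and 0 \le \theta \le 1.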
That linear-versus-nonlinear idea is the old school thought. It's beautiful, makes a lot of sense, it just happens to be totally wrong. So, it turns out, the correct answer is weird and asymmetric. It's this. Curvature is just fine as long as things curve up. So positive curvature is a non-problem; negative curvature is a problem, or at least is a problem if you intend to solve the problem exactly. Okay, so why would you care about Convex Optimization? Well, the truth is, a lot of people probably don't and probably shouldn't, but it doesn't hurt people to know that there is this group of optimization problems that can actually be solved exactly. It's a beautiful theory. There are effective methods to solve them. That's actually very important. And you actually solve the problem. And it allows you to unify a lot of things. I mean, the Machine Learning course at Stanford used to say, oh, this is Logistic Regression, this is LASSO, this is that. It's week three, so now you're doing SVM.
Applications for Optimization Curves in Business and Tech
And in fact, what happens now is people just say, they're just all convex. That's it. So, it's easy, simple. It's much simpler. By the way, easy and unified is super good because, while we've been here this morning, there've been three new algorithms that were paraded out at NIPS or something, right? And it's actually very good. I mean, a lot of this is being generated. A lot of entropy is being generated, right? Some of those algorithms are actually really good. But it's good to have a very simple conceptual framework, so you can actually be effective in weeding out the stuff that doesn't work from what does, okay? There's a ton of applications. There's just a ton of them in Machine Learning and statistics; finance is filled with them.
Supply chain, revenue management, and advertising: actually, almost all of these have moved towards these methods. Control is one. So I don't know if you saw the Falcon 9, the SpaceX first stage landing, right? So actually this is a huge, terrible overstatement, but that was Convex Optimization, okay? I mean, of course the real answer is a team of hundreds of people doing all sorts of crazy stuff. But one of them was my former student who, 10 times a second, computed an optimal trajectory to the landing pad. That's who did it. So, okay. Areas like signal and image processing, networking, circuit design, combinatorial optimization, all sorts of areas are using these ideas.
How Radiation Treatment Planning Utilizes Convex Optimization
So okay, let me explain how it's actually used. So this is how it's fielded. The basic setup is, you have a real problem you want to solve. I don't know, it's a scheduling problem or something like that. And you try to formulate it as a convex problem, right? And if you succeed, well, then usually you can solve it. It may be a gigantic scale, it may be real time, where you must make a decision in every 30-millisecond period, whatever it is, right? Like, for example, in a real time control system or something like that. So that's good. But there are tricks, right? And the tricks would involve things like change of variables, approximation of the objective and constraints, and something called relaxation, which is a very fancy term for this: you simply ignore the parts that you can't handle.
But then it's not usually admitted that that's what it is, right? Because it's unsophisticated. So usually it's dressed up and made to sound very fancy and things like that. But now, the weird thing is how often it works. So that's the weird, interesting thing about it, as an observation in practice. Okay? So we're going to look at some examples now. And the first one is going to be from radiation treatment planning. So the idea is that you have a patient with a tumor that surgeons do not want to go in after. Surgeons are not known for their modesty, right? So that must be a pretty bad tumor right there. And the idea is you take the patient and then you simply aim very energetic beams at them.
And the goal, of course, is to disrupt or kill the tumor cells while at the same time killing as few of the healthy ones as possible. Okay? So the idea is very simple. Anyway, so here, the things you're going to optimize are these beam intensities. So this is very much an example of that first type of optimization, where the things you are deciding are actions. Because it is for sure an action to send some incredibly energetic beam aimed at someone's head, right? That is for sure an action. So what'll happen is there'll be a bunch of voxels in the patient. You'll have the treatment volume, and you'll have like 100x100x100, so a million voxels or something like that.
And then, because of the actions, there'll be radiation doses received. And that's linear, that's y = Ax. And this comes from beam geometry and physics. In fact, there's a whole subfield on how to model what happens when you shoot a beam into a person. The simplest models are purely geometric: the beam diverges as it travels, or it attenuates as it goes through tissue. But complicated ones actually handle things like the beam hitting bone and scattering. Okay? So you can have all these things, but that's just linear. And the goal is to deliver some prescribed radiation dose. And of course, it's not possible to deliver the prescribed radiation dose to the tumor and none to the healthy voxels. That's impossible. So there is going to be a compromise. And this is done, I guess pretty much universally now, by optimization, okay?
So this you can set up as a Convex Optimization problem. In fact, it's pretty straightforward. You simply charge for being over, for overexposing something, and you charge for underexposing. And the two charges, they're knobs that you turn to get the treatment design that you like. So this is the idea, and that's a Convex Optimization problem. So here's an example. This is a real one. This is a patient with two primary tumors here and here. And it's a very small problem. So it's got 360 beams and about a third of a million voxels. And this is one where we actually also got the actual treatment plan. That was done with my student who's an MD PhD.
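A minimal CVXPY sketch of that over/under-dose formulation, with made-up sizes, dose matrix, prescription, and penalty weights (the real planning code is of course far more involved):

```python
import cvxpy as cp
import numpy as np

# Toy stand-ins: m voxels, n beams. The real case had ~360 beams and ~300k voxels.
m, n = 2000, 60
A = np.abs(np.random.randn(m, n))        # dose matrix from beam geometry/physics (placeholder)
d = np.ones(m)                           # prescribed dose per voxel (placeholder)
w_over, w_under = 1.0, 10.0              # the "knobs": charges for over- and under-exposure

x = cp.Variable(n, nonneg=True)          # beam intensities -- the actions being chosen
dose = A @ x                             # delivered dose is linear in the intensities, y = Ax
objective = cp.sum(w_over * cp.pos(dose - d) + w_under * cp.pos(d - dose))
prob = cp.Problem(cp.Minimize(objective))
prob.solve()
```

In practice the charges differ by structure (tumor versus healthy tissue), which is exactly the knob-turning described above.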
Optimizing Image Inversion Problems
So, while I don't know what I'm talking about, he actually does, or I should say more accurately will since he's still a student, but yeah. Okay so this is just an example. I mean, this is a perfect example of one of these things where you're optimizing a set of actions. So the next one is going to be an inversion problem. So it's an image in-painting and that works like this. You have an image and some of the pixel values are just obscured or unknown. You just don't know what they are, and your job is to guess what they are. That's it. So, I mean, it's a reasonably straightforward problem. There'd be lots of ways you can guess how this should work.
For example, if you want to figure out what this pixel should be, you look at neighbors that are really there. And if all the pixels around there are orange, then that pixel should probably be orange. I mean, this is obvious. So it turns out, and this goes back even into the nineties, a very good way to do this is to minimize a function which is the sum of the norms of the gradient of this thing. So, by the way, there are way better ways to do this. But the point here is just that something this simple works shockingly well. So that's a convex problem. And we'll just look at an example. So here's an image. There's about a million variables, because it's the R, G, and B values at 800,000 pixels, right?
And then we simply remove, I forget what it is, maybe like 5% of the pixels. So everywhere you see white just means that we removed the value, and our job is to fill it in. And it's obvious what to do, right? If you're in an area like that, where that N is, of course you want those pixels to be orange, right? So, I mean, you could code something up and it wouldn't look too bad. By the way, with most of the things you would do, there would still be an artifact, and you would actually even be able to read this stuff. It wouldn't look as bad as if they were all white, but you'd still be able to read it. So here's what simply solving this optimization problem does. That's the recovered one on the right.
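A rough CVXPY sketch of that in-painting step, on a small synthetic grayscale image (the real example was color and much larger; the total-variation objective here is the "sum of norms of the gradient" just mentioned, and the image and mask are placeholders):

```python
import cvxpy as cp
import numpy as np

# Toy setup: a 64x64 grayscale image with ~5% of the pixel values removed.
np.random.seed(0)
img = np.random.rand(64, 64)                            # stand-in for the original image
known = (np.random.rand(64, 64) > 0.05).astype(float)   # 1 where a pixel survived, 0 where removed

U = cp.Variable((64, 64))
objective = cp.Minimize(cp.tv(U))                       # total variation: sum of norms of the gradient
constraints = [cp.multiply(known, U) == cp.multiply(known, img)]  # agree with the pixels we kept
cp.Problem(objective, constraints).solve()
recovered = U.value                                     # imputed values fill in the missing pixels
```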
Does Optimization Actually Solve Inversion Problems?
From where you're sitting, I don't think you can tell any difference. And you might ask all sorts of crazy questions like this. Like, first of all, it's crazy that you can't even read the text anymore; there are no artifacts from the letters. But if you look at that, you might ask, wow, did it actually get it exactly right? And the answer is, it did not. So it guessed and it imputed values, and they're not the exact values. What it did do was this: it imputed values that, to humans like us, look like a real image. That's what it did. And they were not the real ones, right? But they simply look good enough. So it's actually pretty impressive. By the way, these days you can do way better than this, but still, okay? The next one is from AI: the Support Vector Machine.
And I suspect a lot of people here know a lot about this, or have actually used these or built SVM models and things like that. But this is used for various things like fraud detection, or whether someone's going to click on an ad, or something like that. And the data is a bunch of pairs of a feature vector and then a boolean outcome. And your job is to choose a w and a v. The w is a vector. These are the parameters in the model. So this is one of those examples where the optimization variables are parameters in a model. They're not actions, okay? And SVM says you should minimize this function here. This makes total sense. I should mention, though, this is a plus, and that's the hinge loss, it goes like that, and it's non-differentiable.
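Presumably the function on the slide is the usual hinge-loss-plus-regularization objective, something like

```latex
\underset{w,\,v}{\text{minimize}} \quad \frac{1}{N}\sum_{i=1}^{N}\big(1 - y_i\,(w^{T}x_i - v)\big)_{+} \;+\; \lambda \|w\|_2^2,
```

where (u)_+ = max(u, 0) is the "plus" (the hinge) just mentioned, which is exactly why the objective is non-differentiable.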
So, maybe I should have said that earlier. But if you took an old school optimization class, the whole thing was about gradients and all this kind of stuff, right? That was old school. Like, the first class starts with, oh, here's the gradient method, and then you have Newton's method, quasi-Newton methods, blah, blah, all this stuff, right? So for Convex Optimization, differentiability is completely and totally irrelevant. It just doesn't matter at all. So that also means that if you took that old school class, it was totally useless for this purpose. So sometimes I tell students in class that, and then some people are like, really? You mean that class... I hated that class anyway. Okay, so this looks something like this.
How to Classify Things with Many Features
This is a case where we have two features, and here are some positives and some negatives. And then this would be the SVM classifier. Now, this is silly, right? Because the way you classify things with two features is you do a scatter plot like this, and you take out a pencil and you draw your decision boundary by hand. That's the correct way to do it. The point is, the same method works, though, if you have 20,000 features or a billion, okay? But this is just to give you a picture, okay? Another thing, which again is something that traces way back in history but has actually become very, very popular in the last, I don't know, 15 years, is an idea. And the idea is very simple. It says this: if you add a multiple of the sum of the absolute values of the variables to your objective, that means please make x small, but it actually does an additional thing.
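In symbols, the idea is just to add an L1 penalty to whatever you were already minimizing:

```latex
\text{minimize} \quad f_0(x) \;+\; \lambda \sum_{i=1}^{n} |x_i| \;=\; f_0(x) + \lambda \|x\|_1 .
```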
The Modern Way of Selecting Features
What it tends to do, when you minimize this, is that many of these x's will actually be zero. Okay? So that's actually interesting. That means it's sparse. That means all sorts of crazy stuff, right? For example, in Machine Learning, that means that you can do joint fitting and feature selection. And this is actually, if you think about it, the way modern Machine Learning works, right? Old school Machine Learning, or statistics, used to go like this. The hard job was for you to drop into some application. You hang out with the experts for a while, and then after a while you ask the experts, give me the 15 critical features that I should be using. And they'll say, ah, okay, it's the logarithm of this blood gas concentration, and it should be the ratio of these two things.
And you go, thank you. Or it should be the number of days since the last high, or the average volume of that asset in the last week, or something. So you hang out with the experts, you figure this out. Once you have the 17 features, you just turn the crank and you get a model, right? So, it doesn't work that way anymore. And a lot of that is because of this. The way people do it now is you throw in everything you might imagine is relevant, and the point is, these kinds of things will just sort it out and pick the features that are relevant, and they'll pick them based on the data, not based on the expert. By the way, you should still throw in all the features the expert suggests.
So actually, for no reason other than this: if you do that, you can't do worse than the expert, and doing worse than the expert would be extremely embarrassing, right? So just for no reason other than saving yourself some embarrassment, you should throw in their features. And then, by the way, if it's only slightly better, you say, well, okay, you had some pretty good features there, you had a pretty good model, great. We did a little bit better. Something like that. Okay? So this is used in a lot of fields. Okay, so I'll mention one: LASSO. This is actually something from the nineties, but it has grown and grown and grown. It was one of the first widely popularized methods for sparse model construction, right? And, in fact, the person who popularized it is Rob Tibshirani, and he'll be here tomorrow speaking.
What If You Don't Know All the Parameters of a Problem?
He's a colleague of mine at Stanford. It's cool because it combines with the most old school thing, regression, right? It's just least squares, it's regression. So it's basically sparse regression, okay? And it's actually now widely used across a lot of fields, and it's actually pretty impressive. I'll say a little bit about this. One thing I will mention is that this is generally not taught, not yet in undergraduate courses anyway. And the reason, I think, is partly intellectual snobbery. And the argument there is that there's no analytical solution to this, okay? There's an analytical solution when you do least squares, and people think that's fancy or something. I mean, that's silly, right?
But okay, here's the weird part. The cost of solving the LASSO problem is no different from the cost of solving least squares. And that's not obvious. And in fact, that's all that really matters. So, okay, we're going to do a quick example. So here's a case where I have a thousand features. I have 200 examples. So that means I'm giving you one fifth of the minimum number of measurements you would need to say something sensible about the parameters. I mean, this would be like someone walking up to you on the street and saying, oh, I'm thinking of like five numbers. They add up to 12. You're like, good, good for you. And then they say, what are the numbers? And you go, like, how the hell would I know? I don't know.
So it's hideously underdetermined, right? Okay. And sure enough, you're five to one underdetermined. Oh, but here's the secret. The secret is that the true x is sparse, but you don't know where the nonzeros are. Okay? So here's the true x and here's your L2 reconstruction. Someone says you did a crappy job, and you say, well, excuse me, you gave me one fifth of the minimum number of measurements I needed to get those things. Okay? If you do LASSO, this is what you get. And I know that maybe a bunch of you do this. And after a while you just sort of get used to this and you go, yeah, whatever. But when I hear people say that, I say, could you just sit down and take a moment? Just like a silent moment to reflect on how cool this is?
So what happened was, it's ridiculous. It's as if you had an answer for the person who said, I'm thinking of five numbers that add up to 12. It'd be as if you had an answer, right? So it's crazy. By the way, one of the first packages for this stuff was called L1-Magic. So that was a while ago. But now it's used just everywhere. And then students these days who do the Machine Learning courses and stuff like that, they see this and they go, whatever. And I'm like, you're not impressed by that? And they go, everybody knows that. And I'm like, dude, this is actually super impressive. Do you not think about that? And they're like, no, it's not. Everyone knows that. I mean, okay, fine.
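For concreteness, here's a small CVXPY sketch of that kind of underdetermined sparse recovery; the sizes mirror the example (200 measurements, 1,000 unknowns), but the data, sparsity level, and regularization weight are made up:

```python
import cvxpy as cp
import numpy as np

# 200 measurements of 1,000 unknowns -- five-to-one underdetermined -- with a sparse true x.
np.random.seed(0)
m, n, k = 200, 1000, 20
A = np.random.randn(m, n)
x_true = np.zeros(n)
x_true[np.random.choice(n, k, replace=False)] = np.random.randn(k)
b = A @ x_true + 0.01 * np.random.randn(m)

x = cp.Variable(n)
lam = 0.1                                           # regularization knob, chosen arbitrarily here
lasso = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x)))
lasso.solve()
print("nonzeros recovered:", int(np.sum(np.abs(x.value) > 1e-4)))
```

Swapping the L1 penalty for a squared L2 penalty gives the ridge-style L2 reconstruction shown for comparison, which does not recover the sparsity pattern.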
How to Solve Problems with Missing Features and Parameters
Okay, so I'll tell you a little bit about how you solve these problems. If you have thousands to tens of thousands of variables, I can solve these on my laptop. It's just not a problem. You exploit problem sparsity. It's not quite a technology, but it's getting there. So you're getting close. Something that is newer, and probably much more relevant, probably very relevant to some people here, is this.
Domain-Specific Languages in Convex Optimization
There are actually now domain specific languages for doing Convex Optimization. And the way that works is you describe the problem in a high level language. That is automatically compiled into a standard form, and, in fact, the whole point is that the vast majority of people using it don't even have to have a clue what it does or how it works, right?
And then that's solved by a standard solver and transformed back. So it's actually beautiful, because basically what it becomes is a declarative language, which I'll show you a snippet of in a minute. So this allows you to do rapid prototyping for small and medium problems. And for teaching, it's unbelievable. I mean, I teach a class on this, and it's a giant class and we have people from 25 departments, and it's actually really cool to talk about this stuff. And by the time you get out of that class, we won't have just talked about Machine Learning, control, signal processing, finance, or medical imaging. Everyone will have done it. I mean, they'll write a five line script in Python or something like that. And so it's actually really fun.
What is CVXPY and How Does it Help Convex Optimization?
Okay? So CVXPY, that's a relatively recent one, developed by someone who was actually an undergraduate at the time. And so this would be the Python specification for SVM. And this would do the trick right there. So you'd just type that in, and then you'd form a problem, and then you would call the solve method on it and it would solve it. So it's basically a declarative language. It means, basically, you focus on asking for what you want, not how to get it. And that is in fact the whole point of optimization, right? By the way, you don't think about that a lot, but all these methods in Machine Learning have shifted that way, right? If you look at some data and you had to come up with the coefficients in your classifier by hand, whatever it is, it'd be a huge pain.
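The actual slide isn't reproduced here, but a CVXPY specification of the SVM from earlier would look roughly like this (the data, sizes, and regularization weight are placeholders):

```python
import cvxpy as cp
import numpy as np

# Placeholder data: N examples, n features, labels in {-1, +1}.
np.random.seed(0)
N, n = 400, 2
X = np.random.randn(N, n)
y = np.sign(np.random.randn(N))

w = cp.Variable(n)                                  # model parameters (not actions)
v = cp.Variable()
lam = 0.1

hinge = cp.pos(1 - cp.multiply(y, X @ w - v))       # the non-differentiable hinge loss
objective = cp.Minimize(cp.sum(hinge) / N + lam * cp.sum_squares(w))
prob = cp.Problem(objective)
prob.solve()                                        # form a problem, call the solve method
```

You state what you want minimized; how it gets solved is the compiler's and the solver's business.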
I mean, you could maybe do it by hand, but not really. So, that's CVXPY, and that's roughly what it looks like. Okay? I'm now going to say a little bit about some stuff that's newer, though a lot of times when people say something is new, it still traces back to the fifties. And this is no exception. So these are ideas that trace back to the fifties, with a lot of work in the seventies and eighties. They're super interesting right now because a lot of these things work well, for example, on GPUs or in distributed implementations, right? So, okay, I'll say a little bit about that. The idea is that you want to do really large scale optimization, with maybe a billion variables, 10 billion, maybe more.
Dynamic Optimization Versus Convex Optimization
And some obvious examples would be just Machine Learning with huge data sets. Another would be dynamic optimization on large scale networks. Like, for example, I could schedule all of California's 3,100 generators and, whatever it is, the 7,000 major transmission lines; I could schedule that in 15 minute increments for tomorrow. In fact, people do something very close to that. That's not a small problem, just for the record, right? Because every transmission line, in every 15 minute interval, has a power flow. Every generator has its constraints, and it's all linked with all sorts of other stuff. So it's complicated. And of course, if you get into things like image and video processing, then you're automatically in the large data area, right? So actually my friends who do things like video processing are very pissed.
They're really pissed off when they hear people say big data because, for obvious reasons, they're like, well, what do you think we do? And we never went around saying, oh, we do big data. So it's like, okay, fine. Alright, so the standard method for solving these gigantic problems, actually, I'll tell you the real method. There's a secret. The real method is, you try to do all your feature engineering, or whatever it is if you're doing Machine Learning, so that it'll fit on a single GPU. That's actually the correct way to do it. Oh, by the way, and then we'll use distributed optimization, which I'm going to talk about. But nevertheless, okay, so here's what we do. The idea is you're going to split the problem up into a bunch of parts, and then you're going to solve all these problems independently.
Solving Real-World Problems with Dynamic Optimization
Now of course, that's not the solution to the problem, because they're all connected. Or it could be that you have an opinion about a variable and I do too, and if they're not consistent, then we haven't solved the problem, right? So the way that works is we'll then iterate. You can anthropomorphize this and say that we'll each solve our own problem and then we will negotiate with each other, right? Meaning, if you think there's a power flow that should be three gigawatts and I think it should be two, then maybe on the next step we'll each move towards the other, or something like that. So the idea of distributed optimization goes way back. And I'll tell you about what is a standard form for it. It's very easy to understand.
How Dynamic Optimization Allows Different Models to Work Together
It's really pretty cool. It's this. You just want to solve a problem with N objective terms. So you want to minimize a sum of functions. And if you want to think about this in Machine Learning terms, here's a perfect way to do it: this is N data stores, and each one is a petabyte or whatever. It doesn't matter. It's a big data store. And then x is the vector of statistical parameters in the model that you want to fit. Say I want to do Logistic Regression with a hundred different data stores, each very large. Now, one way to do it is to collect all the data into one place and then run something, right? But suppose I don't want to, or I cannot, or it's not needed. In fact, the whole point of this is that it's not needed.
So the idea is that each of these data stores will fit their own model to their own data, and then they'll exchange messages, and then they'll fit it again. And this will go on for a while, until you actually arrive at the model that you would have had, had you collected all the data in one place. So that is the idea. It's pretty straightforward. Okay? So there are several methods to do this. There's a method that dates exactly from the 1970s, and then was put back into vogue a couple of years ago; people started using it quite a lot. And it basically says that, entirely in parallel, each one should minimize its own term plus a term which is what enforces the coordination.
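The method being described is consensus ADMM (the algorithm named later, when the demo comes up). The problem is to minimize f_1(x) + ... + f_N(x), where f_i is the term owned by data store i, and the standard scaled-form updates, which match the "minimize your own term plus a coordination term" description, are:

```latex
\begin{aligned}
x_i^{k+1} &:= \operatorname*{argmin}_{x_i}\;\Big( f_i(x_i) + \tfrac{\rho}{2}\,\big\|x_i - \bar{x}^{k} + u_i^{k}\big\|_2^2 \Big)
  &&\text{(each worker, in parallel)}\\
\bar{x}^{k+1} &:= \tfrac{1}{N}\sum_{i=1}^{N}\big(x_i^{k+1} + u_i^{k}\big)
  &&\text{(gather and average: the negotiation step)}\\
u_i^{k+1} &:= u_i^{k} + x_i^{k+1} - \bar{x}^{k+1}
  &&\text{(each worker updates its price/dual variable)}
\end{aligned}
```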
Convex Versus Non-Convex Optimization Problems
It's what causes my next step to be closer to what you think it is, and for you to move too, for us to move closer to consistency. What's cool about it is that, although the algorithm's completely reasonable, you can understand every part of it, it makes total sense, what's not clear is that it always works. So guess what? It always works, actually, if the problem is convex. If it's not convex, people run it all the time too, in which case no one knows if it works. But that's fine, because if you're solving a non-convex problem, you don't know if whatever you're doing works either. And then the final comment on that is, it doesn't matter, because your job was not to minimize a function. Your job was to get a good model that comes out well on test data, okay?
Small Scale Consensus SVM Example
So, alright. That was long and a little weird, but okay, fine. So we'll do a quick example. We're going to do a consensus Support Vector Machine. So I have a tiny problem with two features and 400 examples. This is silly. I can solve that problem in about 60 microseconds on my laptop, and on my phone in not a lot longer. Okay? So that's a tiny little problem. But this is just to illustrate how this is going to work. And the way we're going to do it is we're going to take these 400 data samples. Let's make it a spam filter; it doesn't matter, right? I'll take them and I'm going to divide them out into 20 groups in an extremely evil way. So each group is going to get all positive or all negative examples. Everybody got that?
The Dangers of Incorrect Data Allocation
So each group has an unbelievably skewed view of the world, right? So if I give you a data set, it doesn't matter what the features are, right? And I say, here's your data set, ready? Here are the outcomes, ready? True, true, true, true, true, and so on, right? It's very easy to build a model based on that data. But someone else, I give them their data, and it's like, false, false, false, false. Okay, everyone got it? Okay. So they have to coordinate or it's not going to make any sense. Okay? So here it is. Here's the data. These are the original models, and of course they're terrible. But of course they're terrible, because we've done the opposite of hashing.
We've done the worst thing you could possibly do; we've distributed the data in a totally evil way. And so they're all terrible. Weirdly, by the way, the average of all of them, which is the first iteration, is not bad. And you do five iterations of this consensus method, and then you do 40, and you get this, right? So now, of course, this is not impressive. This is a problem I could have solved on my phone in under a millisecond, right? But of course, the same thing works for much larger scale problems, and in fact works on GPUs and things like that. And it's the basis of a bunch of stuff. Actually, it was the basis of the first demo that Sri put up, when you saw the red and the green bars. All the heavy lifting was actually ADMM running to fit models in that case.
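A toy sketch of that consensus SVM experiment, using the ADMM updates above with each local subproblem solved by CVXPY; the synthetic data, the "evil" one-label-per-group split, and the parameters rho, lambda, and iteration count are all made up here:

```python
import cvxpy as cp
import numpy as np

# 400 synthetic examples, 2 features, split into 20 groups that each see only one label.
np.random.seed(0)
per = 20
X_pos = np.random.randn(200, 2) + 1.0
X_neg = np.random.randn(200, 2) - 1.0
groups = [(X_pos[i*per:(i+1)*per], +np.ones(per)) for i in range(10)] + \
         [(X_neg[i*per:(i+1)*per], -np.ones(per)) for i in range(10)]

rho, lam = 1.0, 0.1
theta = np.zeros((len(groups), 3))      # each group's local (w1, w2, v)
u = np.zeros((len(groups), 3))          # scaled dual variables
zbar = np.zeros(3)                      # the consensus model everyone negotiates toward

for _ in range(40):                     # "do 40 iterations and you get this"
    for i, (Xi, yi) in enumerate(groups):
        t = cp.Variable(3)              # t[:2] is w, t[2] is v
        hinge = cp.sum(cp.pos(1 - cp.multiply(yi, Xi @ t[:2] - t[2])))
        local = hinge + lam * cp.sum_squares(t[:2]) \
                + (rho / 2) * cp.sum_squares(t - zbar + u[i])
        cp.Problem(cp.Minimize(local)).solve()
        theta[i] = t.value
    zbar = (theta + u).mean(axis=0)     # averaging / negotiation step
    u += theta - zbar                   # dual update pushes everyone toward consensus

print("consensus classifier (w, v):", zbar)
```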
Convex Optimization in Medium and Large Scale Problems
Okay? So I'm going to give a summary. Convex Optimization problems arise in a lot of applications in a lot of different fields. They can be solved effectively; if it's a medium scale problem, using general purpose methods. Small scale problems are solved at microsecond and millisecond time scales. I didn't get to talk about that, but in fact that's how they're used in control and areas like that. By the way, the second issue there is that they have to be solved not only at microsecond or millisecond timescales, but with a reliability that is perfect. Like, as in, it cannot fail, right? Because, if you think carefully about it, depending on what the application is, if your classifier misclassifies something, usually almost no one's going to die, right?
In control, that's not the case, right? So, okay. Now, for large scale problems, distributed optimization will work. And that's actually used now; it's in production at a bunch of companies. That's how they build a lot of big models: wherever the data is, if it's widely separated, they'll actually do these distributed model building things. But maybe the most important thing is that these domain specific languages make prototyping easy. So that's actually a big deal and very cool. So this slide, you don't need it anymore, right? If you want to find out more about this, what you do is you go to Google and you type anything remotely close to any of these terms, and you'll find everything. So that is the right way to find out about this stuff. And even if you're not that interested in it, it's probably good for you to know, because you may hear this sometime. Somebody will be blabbering about convexity or something, and it's not unusual. So it's good to know what it's about. And it's about nothing more than this: it's a bunch of Mathematical Optimization problems we can solve. Exactly. Period. That's it. It's not a big deal, but that's it. Thanks. So I'll be very happy to take questions if people have them.
Are There Any Good Open Source Optimization Solvers?
Moderator:
Sure. Yeah. So we've got a few minutes before lunch. This is a great time for me to go ahead and introduce our question answering platform. You should have gotten some flyers on your seat. It's really just an application. You go to slido.com. Enter the event H2OWORLD. For the folks that are watching out there on the stream, you can do it as well. And what's going to happen is, I have a few moderators that can select the questions. And also, if you log in, you'll see questions that other people asked. And if you like one, we'll see how often it's liked and we can pick it. So if there are any questions for Professor Boyd, please go ahead and ask them. And maybe I'll kick off one. Perfect. I had to use Professor Boyd's textbook in graduate school. You talked a little bit about some solvers. Are there particular solvers, not necessarily at the API level, but the actual engine that solves the optimization problems, that you can talk about? Which ones do you like for what kinds of problems?
Stephen Boyd:
Yeah, so my group has been working, in the last maybe five years or so, on building out the open source ecosystem for solvers. At least for some solvers, I am sort of quasi-sorry to report that, unfortunately, right now some of the best ones are commercial. Sorry about that. We're working to make sure that there are liberally licensed and open source solvers available. And we're actually making progress on that. And they're usually divided into the ones that work for embedded systems, so that's things you'd use in control, and things for very large scale systems. A bunch of them are single threaded, so they're not for that, but that's another story. So there's development going on. We're hoping to make sure that in the open source ecosystem there are very, very good solvers available.
Moderator:
Excellent. Yeah. Well, if you look at the screen, there are a few questions. Okay. A few got some likes; feel free to answer whichever one you want.
Does Each Subset of Data Need to Have all the Features?
Stephen Boyd:
Wow. Let's see. Let's see. I'll do the easy one. Do you mind? Okay, I'll pick one here. For large scale distributed optimization, does each subset of data need to have all the features? And the answer is no, it doesn't, right? So you can split across features, you can split across data and things like that, or you can split across both. And these things would work just fine. And in fact, that's what people actually do. Will these slides be available? Thanks, great talk. So, yes. I've given a talk like this, actually a longer one, at a whole bunch of places. And my suspicion is, with about 20 seconds of Google time, you'd find videos of a talk that was similar to this one. Yeah.
Moderator:
And for those of you who don't know, Professor Boyd has a very popular MOOC that's online. So you can get it from iTunes or go to Coursera or something. And a lot of the content is there as well.
Can Convex Optimization Be Applied to Deep Learning?
Stephen Boyd:
Ah, okay. So, deep learning, yes. Oh, I'm so happy, because I haven't given a talk in the last few years where that question wasn't asked. Can convex optimization be applied to deep learning? As far as I know, no. So it's great. No. Every now and then people claim it can be, or something like that, but generally, no. There's another question here, and it's a very good question. People always think of these convex methods in Machine Learning, we like to call them shallow methods, I like that, right? So that's things like SVM and these, whatever, GLM type models. So people often think there's some antagonism between that and deep learning, and it's in fact the exact opposite.
Deep learning methods are exquisitely good at generating good features from the data; they're exquisite. Now, my view is that at that point, you would probably use a shallow method on top to do whatever you were going to do. So they actually play very nicely together. Okay. What's an open problem in optimization, particularly relevant to Machine Learning? So that's a question I get often. To be honest with you, if you think of optimization as sort of a mathematical thing, I'm not sure that there are any real open problems, or some giant mathematical theorem that's going to solve the world or something like that. I actually think right now it's more of a technology question, right? So probably the real question is, are there good solvers that are compatible with TensorFlow, or that solve these kinds of problems, or that will give me modest accuracy quickly, or something like that.
So I actually think that is more important than the theory, even though the theory is what I do. What the implementations are is probably much more important. Here would be one: are there solvers that are airtight reliable, for example? That's an important one.
How do you Formulate Convex Problems for Complex Phenomena
Okay. Let's see. How do we formulate convex problems for complex phenomena? That's a great question. So the way we do that is, in the class I teach, everyone's in a sandbox first. We pretend everything is convex, right? So we just pretend that, and that's a good way to learn. So that's with your training wheels on: we'll give them problems that are all exactly convex, but we're hinting. Once you've mastered the basics, then we turn to street fighting tricks.
Some of them are simple, right? Like, if you see something that's got a bend in it and it's not quite convex, like a fuel use curve that's not quite the right shape, you just draw a line through it and say, okay, I'm going to use that instead, or something. But the weird part is that a lot of these approximation methods work well. They work shockingly well, even when the approximations are horrendous. And this is something that's played out through history over and over again. A good example is SVM. So, the Support Vector Machine, I can describe it exactly this way. The original problem is, I walk up to you and I say, here's a bunch of data, please fit me a separating hyperplane that misclassifies as few points as possible. I mean, what could be more reasonable than that?
I'd say, here's my data. Make me a linear model that has as few misclassifications as possible. Okay, fine. That's not a convex problem. And in fact, nobody can solve it. Okay? So then you make a ridiculously inaccurate approximation, and that's SVM. By the way, we have a name for the alternative, where people say, oh, you can't prove anything, what have you done, that's not right: it's called analysis paralysis. Anyway, if you just simply do that approximation, you actually get outstanding classifiers, right? So there are many, many examples like this. So those are the street fighting tricks.
Moderator:
All right, folks, let's give Professor Boyd another hand. Thanks.