Day 2: Concepts and Language

Authors: Nauka

Started: Last Modification: 2022-12-02 , 1998 words, est. reading time: 10 minutes

Edit from the future: I got covid on day 4 of writing this, which means I will move the schedule to “as soon as done”.


What is combinatorial optimization? Prof. Schrijver (Schrijver 2003) concisely summarizes it as

Combinatorial optimization searches for an optimum object in a finite collection of objects.

which is of course correct (Prof. Schrijver forgot more about Combinatorial Optimization than I have ever known) but might also raise some questions:

  1. As we are breathlessly reminded everytime OpenAI,Deepmind etc. release a new model, deep learning routinely solve optimization problem over infinitely many points in \(\mathbb{R}^n\) with very large \(n\). How could there be an issue in optimising over finite objects?
  2. Isn’t this tautological? Optimization means looking for an optimum? What does this mean?

Where 1) might be your reaction if you heard about optimization and deep learning but didn’t study computer science per se while 2) might be your reaction if technology doesn’t form a core part of your identity.

What is optimization?

In our modern world, optimization generally refers to having a computer find a setting which solves a problem we care about. Optimization theory is then the body of work that tries to invent and study problem formulations and procedures these computers can implement - algorithms - to solve these problems. We will go into the history of optimization and in particular combinatorial optimization tomorrow, but for today, it’s important to appreciate the fact that the jargon of this study of has been seeping into everyday language for almost 60 years now (check out this google ngram).

So a lot of the language we take for granted actually derives from the fact that these things became a lot more popular with the advent of computing machines which could efficiently perform these. Newton, Gauss and other pre-computer mathematicians tended to think in terms of finding solutions or maybe extrema, and early luminaries like George Dantzig or Leonid Kantorovitch used the mathematical term “program” as in “Linear Programming”.

For Newton, Cauchy etc., finding solutions was what mattered: you had a mathematical question (or example, if I put two sticks with length \(a\) and \(b\) into a \(90°\) angle, how far are they apart?) and you either find or don’t find a closed form solution (Answer: \(\sqrt{a^2+b^2}\)) which you can or cannot compute. Optimization, finding an optimum (or at least getting close to one) is really a notion which starts to make sense once you have problems for which many possible “solutions” exist, but they differ in quality and/or finding the closed form solution is tedious - Cauchy derived gradient descent, a workhorse of modern optimization algorithms as a why to quickly refine an initial guess and then use Newtons root finding method to finish the job.

What do we optimize?

So how do we do this? Well, we first find a formal way of writing down our problem. For this we use functions, which are simply fixed rules that map one value to another. What these values are is up to you, but if you grew up in a country with a working school system, you might have seen symbols like \(\mathbb{N},\mathbb{Z}\) and \(\mathbb{R}\) already, which stand for the natural (all strictly positive whole numbers,without zero), integer (all whole numbers, negative and zero included) and real numbers (the continuous number line, all numbers you need to care about unless it’s your job to know the others, includes even most of the weird ones like \(\pi\) and \(e\)), which are usually used to describe the domain of the function - the numbers that can go in - and the range - the values it can take.

For example, a function of the form \(f:\mathbb{R} \rightarrow \mathbb{Z}\) maps the real numbers to the integers and could describe any operation followed by rounding, truncation or operations like it. Or you might have functions of the form \(f:\mathbb{R}^2\rightarrow \mathbb{R}\) and \(f:\mathbb{R}\times\mathbb{R}\rightarrow \mathbb{R}\) which are two ways of writing a function which takes multiple inputs and maps them onto a single number. The first notation emphasizes the fact that we have \(2\) dimensions of the same type while the 2nd emphasizes the fact we have \(2\) separately varying inputs.

There are many fancy terms like injective, surjective and bijective, continuous,smooth, convex etc., but all of these are then already talking about what types of functions we are working with. The main concept you should take away now is that we can write down a problem (like finding the minimum energy configuration of a physical system) and write it in the following form:

\[\begin{align} \min_\theta &f(\theta)\\ \text{s.t. }&g(\theta)\leq 0 \end{align}\]

where \(f:?\rightarrow \mathbb{R}\) and \(g:?\rightarrow \mathbb{R}\) are our loss and constraint functions respectively and \(\theta\) is the domain we work in. For the physical system, \(f\) might be the total energy in a system, \(\theta\) the configuration of the electrons and atoms making up the system as well as their motion, temperature etc. and \(g\) might be some constraints we need the system to follow (e.g. enforcing atoms to occupy separate places in space). With enough creativity we can fit a lot of problems into this template, and a lot of optimization effort is in finding a good formulation.

One of the key choices in modeling your problem is in the domain we optimize over, where the current language of analysis has converged on using sets by default. The real numbers are an uncountable infinite set, the integers are countably infinite and a restricted subset (say, the numbers between \(0\) and \(100\)) is simply finite. Intuitively you might think that these get progressively easier to optimize over, but in the next section we will explain why that is in fact wrong or at least not always the case.

Why would optimising on a finite set be more difficult than on an infinite set?

A lot has to do with the interplay between domain, \(f\) and \(g\). If you think of the simplest possible real world problem, it might be the question of a craftsperson deciding what to produce. Given a list of \(n\) options, each requiring a certain amount of \(K\) resources (which we can write as a vector \(a_i \in \lbrace a:a\in \mathbb{R}^K \land a_k \geq 0\rbrace\) for \(i\in [0,\dots,n-1]\)), what amounts \(x_i\) should they produce for each product \(i\) to maximise their income given prices \(c_i\) and their resources \(r \in \mathbb{R}^K\)?

They want to find

\[\begin{align} \arg\min_x -c^T x\\ \text{s.t. }Ax \leq r \end{align}\]

or equivalently \[\begin{align} \arg\min_x -c^T x\\ \text{s.t. }Ax-r \leq 0 \end{align}\]

Where we form \(A\in\mathbb{R}^{k\times n}\) by combining the \(n\) resource cost vectors of length \(k\) into a \(k\times n\) matrix.

For solving this problem I refer you to the Simplex, Interior Point, Augmented Lagrangian methods (for small \(n,k\) you can even solve it by drawing lines). The point I want you to focus on, is that in the problem formulation I did a bit of sleight of hand: we implicitly assume we can create arbitrarily fine slices of wares and sell them. While this might be true to some degree if you are working with liquids, realistically, there will be a finite amount \(\epsilon\) which is the smallest amount you can cut your raw materials into or sell. This means that the problem really should be written in the following (eqivalent) forms:

\[\begin{align} \arg\min_{x\in \mathbb{Z}^n} &-c^T x\\ \text{s.t. }&Ax-r \leq 0 \end{align}\] \[\begin{align} \arg\min_{x\in \mathbb{R}^n} &-c^T x\\ \text{s.t. }&Ax-r \leq 0\\ &\text{rem } x /\epsilon = 0 \end{align}\]

where the 2nd form already gives a hint: we are solving the same problem with more constraints, i.e. it becomes more natural that it becomes harder.

The hardness comes from the fact that we are forced to take steps of at least size \(\epsilon\) where previously we had much more control about the exact amounts we produce and sell. In another perspective (and using a bit more jargon), we often exploit some assumptions about the functions and sets we are optimizing/optimizing on - for example, we often assume that \(f\) changes relatively slowly and smoothly if we take small enough steps. But small enough becomes impossible if \(\epsilon\) becomes large enough - hence we can make less assumptions to exploit when looking for a solution. And for a final intuitive explanation: observe the example given here and you will note that while finding the optimum ignoring the integrality constrained is “easy” (you follow the constraints as you try to go higher and higher, which is what Simplex and IP points do mathematically), deciding where to go from the LPOpt point is nontrivial: all you know is you are violating the constraints, but you don’t know where exactly the best feasible point is. Rounding to the nearest integer yields an infeasible point, and while in this particular case exploring the neighbourhood of integers might be feasible, as your dimension \(d\) grows you will have \(2^d\) points to search.

We will get to complexity in the middle of december, but in general, you want your changing parameters (like dimension) in the base position, not the exponential position to solve them (the fancy way of saying this is you want to be in P, not NP). Without further assumptions, integer linear programs like this are in NP, because the exponential will pop up somewhere.

Returning to Prof. Schrijvers definition, the full quote reads

Combinatorial optimization searches for an optimum object in a finite collection of objects. Typically, the collection has a concise representation (like a graph), while the number of objects is huge — more precisely, grows exponentially in the size of the representation (like all matchings or all Hamiltonian circuits). So scanning all objects one by one and selecting the best one is not an option. More efficient methods should be found.

Which captures the crux of the matter: if for some property describing the size of your problem the number of “finite” options grows exponentially, you have the worst of both worlds. You can’t rely on the nice continuity properties of infinite points to ride the constraints to the best value of the feasible set , but you also can’t hope to enumerate all options.

This pops up whenever we talk about discrete things relating to, combining with or being selected from amongst other discrete things, e.g.

and in fact, the classical enumerative combinatorics often provides the language about what structure the search space takes (combinations, permutations, multisets and partitions) as well as the term “combinatorial blowup”, which captures the fact that for any \(k,n \in \mathbb{Z},k,n > 1\), \(\lim_{n\rightarrow \infty} n^k\leq k^n \leq n!\), and usually you don’t need to go very large. Oh, and you can do this meaningfully for Reals too!

And this is what makes combinatorial optimization different: the discrete nature of the domain we operate on forces us to abandon algorithms which exploit continuity while the combinatorial blowup of the search space forces us to come up with new things to exploit.

Tomorrow we will talk a bit more on how this field developed and where some of the classic problems came from.

Happy Holidays! And Слава Україні!


References:

Schrijver, A. 2003. Combinatorial Optimization: Polyhedra and Efficiency. Algorithms and Combinatorics 24. Berlin ; New York: Springer.