Showing posts with label SOCP. Show all posts
Showing posts with label SOCP. Show all posts

Monday, March 16, 2015

Machine Learning with MOSEK Fusion: linear soft-margin Support Vector Machine

Machine-Learning (ML)  has become a common widespread tool in many applications that affect our everyday life. In many cases, at the very core of these techniques there is an optimization problem. In this post we will show how one of the most common ML technique, the Support-Vector Machine (SVM) is defined as the solution of a conic optimization problem.

The aim of this post is two-fold:
  1. To show that simple SVM models can be efficiently solved with a commercial general-purpose solver as MOSEK
  2. That the MOSEK Fusion API yield  a very compact and expressive code.
We are not claiming that a general-purpose solver is the tool of choice for solving SVMs problems. There is a huge literature about specialized algorithms, and software packages (see for instance LibSVM) that exploit their peculiar structure. 

In words, the basic SVM model can be stated as:
"We are given a set of points $m$ in a $n$-dimensional space, partitioned in two groups. Find, if any, the separating hyperplane of the two subsets with the largest margin, i.e. as far as possible from the points."

In mathematical form, that means to determine an hypeplane $w^T x = b$ that separate  two sets of points leaving the largest margin possible. It can be proved that the margin is given by $2/\|w\|$.

Therefore, we need to solve the problem of maximizing  $2/\|w\|$ with respect of $w,b$ with the constraints that points of the same class must lie on the same side of the hyperplane. Denoting with $x_i\in\mathbb{R}^n$ the i-th observation and assuming that each point is given a label $y_i\in \{-1,+1\}$, it is easy to see that the separation is equivalent to:

\[
     y_i (w^T x_i -b)\geq 1.
\]

A simple example in two dimension is the following:



It is easy to show (see [1]) that the separating hyperplane is the solution of the following optimization problem:

\[
\begin{array} {lll}
\min & \frac{1}{2} \| w \|^2  &\\
& y_i (w^T x_i -b ) \geq 1& i=1,\ldots,m
\end{array}
\]

If a solution exists, $w,b$ define the separating hyperplane and the sign of $w^Tx -b$ can be used to decide the class in which a point $x$ falls.

To allow more flexibility the soft-margin SVM classifier is often used instead. It allows for violation of the classification. To this extent a non-negative slack variable is added to each linear constraint and penalized in the objective function.

\[
\begin{array} {lll}
\min_{b,w} & \frac{1}{2} \| w \|^2  + C \sum_{i=1}^{m} \xi_i&\\
& y_i (w^T x_i -b ) \geq 1 - \xi_ i& i=1,\ldots,m\\
& \xi_i \geq 0 & i=1,\ldots,m
\end{array}
\]

In matrix form we have 

\[
\begin{array} {lll}
\min_{b, w, \xi} & \frac{1}{2} \| w \|^2  + C \mathbf{e}^T \xi\\
&    y \star ( X w - b \mathbf{e})  + \xi \geq \mathbf{e}\\
& \xi \geq 0
\end{array}
\]

where $\star$ denotes the component-wise product, and $\mathbf{e}$ a vector with all components equal to one. The constant $C>0$ acts both as scaling factor and as weight. Varying $C$ yields different trade-off between accuracy and robustness.

Implementing the matrix formulation of the soft-margin SVM in Fusion is very easy. We only need to cast the problem in conic form, which in this case only involves converting the quadratic term of the objective function in a conic constraint:


\[
\begin{array} {ll}
\min_{b, w, \xi,t} & t  + C \mathbf{e}^T \xi& \\
&    \xi + y \star ( X w - b \mathbf{e})  \geq \mathbf{e}\\
& (1,t,w) \in \mathcal{Q_r}^{n+2}\\
& \xi \geq 0
\end{array}
\]

where $\mathcal{Q_r}$ denotes a rotated cone of dimension $n+2$.

In our Fusion implementation we will allow the user to specify a set of values for $C$, and we will solve a problem for each one. We will show Python code for sake of simplicity. However, much better performance can be achieved with the Java or C# API.

Let assume now we are given an array y of labels, a matrix X where input data are stored row-wise and a set of values C for  we want to test. The implementation in Fusion of our conic model is the following



The code is so expressive that it deserves very few comments:


  1.  A sequence of problem can be solved just changing the objective function, as in loop starting in line 19: in this most of the model is build only once.
  2. Expression in line 13 shows how a linear expression can be store and reused.
  3. The construct Variable.repeat(b,m)  logically repeats variable $b$, i.e. forms an array of the form $(b,b,b,\ldots,b)^T \in \mathbb{R}^m$.
  4. The solution is recovered by the level method: if an optimal solution hasn't been found, an exception is thrown.

Let's make few tests. We generate two sets of random two dimensional points each from a Gaussian distribution: the first set centered at $(1.0,1.0)$; the second at $(-1.0,-1.0)$. We experiment standard deviation $\sigma$ of $1/2$ and $1$.

With $\sigma=1/2$ we obtain a separable sets of points and for $C=1$ we have  the following




For $\sigma=1$ separability is usually not possible. Larger value of $C$ emphasize good classification, but reduces the margin and hence robustness. For instance, we may obtain the following:



The black and red lines correspond to $C=1$ and $C=2^{10}$.

MOSEK solves this problems in almost no time:





Further improvements can be obtained exploiting sparsity: in most cases input data are supposed to be sparse, i.e. most entries in the X matrix are zero. If this is the case, then Fusion, and therefore MOSEK, can exploit such a property. You only need to feed the function with a Fusion sparse matrix:



where Xi,Xj,Xv are the arrays that store the sparse representation of X in triplet form. This will save memory and reduce the computational burden.


So to summarize: you can easily build your own linear SVM classifier with MOSEK Fusion! Give it a try and play with it. Fusion demonstrates again its simplicity and power of expressiveness.

[1] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine learning 20.3 (1995): 273-297.

Wednesday, November 19, 2014

The reason for the 2 in the definition of the rotated quadratic cone

MOSEK employs the definition
\begin{equation}
K^1 := \{ x \mid 2 x_1 x_2 \geq ||x_{3:n}||^2,  x_1, x_2 \geq 0 \}
\end{equation}
for the rotated quadratic cone, Occasionally users of MOSEK ask why there is the 2 in front of the product $x_1 x_2$.  Why not use the definition
\begin{equation}
K^2 := \{ x \mid  x_1 x_2 \geq ||x_{3:n}||^2,  x_1, x_2 \geq 0 \} ?
\end{equation}

The reason is that the dual cone plays an important role and the dual cone of $K^1$ is $K^1$ i.e. it is self-dual. That is pretty! Now the dual cone of $K^2$ is
\begin{equation}
\{ x \mid 4 s_1 s_2 \geq ||s_{3:n}||^2,  s_1, s_2 \geq 0 \}.
\end{equation}
Hence, $K^2$ is not self-dual! That is somewhat ugly and inconvenient. 

To summarize the definition $K^1$ for the rotated quadratic cone is preferred because the alternative definition $K^2$ is not self-dual

A couple of historical notes are:

  • In the classical paper  by Lobo et. al.  the cone $K^2$ is called a hyperbolic constraint in Section 2.3.  
  • MOSEK is highly inspired by the important work of the late Jos Sturm on the code SeDuMi . Now SeDuMi is short for self-dual minimization and for that reason Sturm employs the definition $K^1$ too. 



Friday, March 28, 2014

SwissQuant Math Challenge on LASSO Regression: the MOSEK way

The Swiss company SwissQuant publishes monthly a math challenge in which they call for people to elaborate and discuss problems of different kind, somehow related to the company business. This month they focus on (quadratic) convex optimization, asking people to proposed specialized algorithm for the OLS-LASSO problem that can beat the established commercial code for quadratic programming as MOSEK. You can find details here Math Challenge March 2014.

Clearly we are glad they appreciate somehow our product, but also we feel challenged in tackling the problem itself. MOSEK is not specialized for OLS-LASSO, but still it can be very efficient we thought, especially in dual form. Moreover, we are always eager to find good test problems, for the core solver but also for our APIs.

The OLS-LASSO problem

So let's start from the formulation given in the challenge, where we are given $T$ samples in dimension $m$:

\begin{equation}\label{main}\tag{OLS-LASSO}
  \begin{aligned}
    \min_{u,b} & \frac{1}{2} || u ||^2 + \sum_{j=1}^{m} \lambda_j |b_j|\\
    &\\
    s.t.&\\
    & u = y - X b
  \end{aligned}
\end{equation}

where $b,\lambda\in \mathbb{R}^m, u,y\in \mathbb{R}^T$. We reformulate problem \ref{main} so to have a linear objective function:

\begin{equation*}
  \begin{aligned}
    \min_{u,b,z,w} &\quad z + \lambda^T w\\
    &\\
    s.t.&\\
    & u + X b = y\\
    & s=1\\
    & (s,z,u) \in Q_R^{T+2}\\
    & w_i \geq |b_i|& \forall i=1,\ldots,m.
  \end{aligned}
\end{equation*}

where $Q_R$ is the rotated cone. There are two way to to get rid of the absolute value:

(1) Linear constraints 

\[ w_i\geq |b_i| \Rightarrow \left\{ \begin{array}{l}  w \geq b \\ w \geq -b \\\end{array}\right.\]

(2) Second-order conic constraint

\[ w_i \geq |b_i| \Leftrightarrow w_i\geq \sqrt{b_i^2} \equiv (w_i,b_i)\in Q^2\]


We will see which primal form works the best. But it's the dual that we want to really see at work. Using the standard procedure outlined in our post about duality, we derive the dual of problem (PQ). It is easier than PL, believe us.

First let's fix a partition of the variable set as $[s,z,u,w,b]$, and clearly states the basic vector and
matrices:


(1) cost vector: $c^T= [0 \quad 1 \quad 0_T \quad \lambda^T \quad 0_m]$;
(2) right-hand side: $b=^T[1 \quad y^T]^T$;
(2) matrix coefficients: $A= \left[ \begin{array}{ccccc} 1 & 0&0_T&0_m&0_m\\ 0&0&I_T& 0_m&X\end{array}\right]$;


We then introduce as many dual variables as primal linear constraints:
$\mu_0$ for the first constraint, and $\mu\in \mathbb{R}^T$ for the
second one. We also include additional slack variables $\gamma$
partitioned in the same way as the primal variables, i.e. $\gamma^T=[\gamma_s,\gamma_z,\gamma_u,\gamma_w,\gamma_b]$.

The dual problem is

\begin{equation*}
  \begin{aligned}
    \max_{\mu_0, \mu, \gamma} & \quad \mu_0 + y^T \mu \\
    &\\
    s.t.&\\
    & c - A^T \left[ \begin{array}{c}\mu_0\\\mu\end{array} \right]= \gamma\\
    & (\gamma_s,\gamma_z,\gamma_u) \in Q_R^{T+2}\\
    & (\gamma_w,\gamma_b) \in \Pi_{i=1}^m Q^2.
  \end{aligned}
\end{equation*}

After some calculations, we obtain:

\begin{equation}\tag{D}
  \begin{aligned}
    \max_{\mu_0,\mu,\gamma_z} &  \quad \mu_0 + y^T \mu \\
    &\\
    s.t.&\\
    &  X^T \mu \leq  \lambda\\
    & \gamma_z=  1\\
    & \left(-\mu_0, \gamma_z, -\mu \right) \in Q_R^{2+T}.
  \end{aligned}
\end{equation}

Try to do the same with PL, you'll get the same model.

The results

We implement these three models using the MOSEK Fusion Python API and run a batch of tests for $m=[10,25,75,100]$ and $T$ from $10000$ to $100000$. To make it easy, $X,\lambda$ and $y$ are all uniformly generated in $[0,1]$. The code that interface with the model looks like


Here you can see the running time for problem PL in green, PQ in red and D in blue. In the first  plot we fix $m$ and let $T$ varies:


and then if we fix $T$ either to $10000$ or $100000$ and we get



Again, the dual is again a clear winner, as in a previous post! No way the primal can even get close to the dual.

But it is also interesting to see how  PQ is as good as PL. Red and green lines are almost indistinguishable. Only for larger $m$, i.e. when the number of cones increases, we can see a significant difference. Otherwise the difference is very very small.




Thursday, February 20, 2014

Conic duality in practice: when dual beats the primal...


Despite its simplicity, the problem we have been discussing in the last two posts (see the first and second post) is a nice source of examples and insights about conic programming. The formulation we obtained is:

\begin{equation}
\begin{aligned}
\min r_0&\\
 s.t.&\\
 &\quad ( r_0\mathbf{e}, P-\mathbf{e}p_0^T) \in \Pi_{i=1}^k Q^{(n+1)}.
\end{aligned}
\end{equation}

Recall that $p_0,p_i$ are row vectors and $e$ is a vector with all ones. We might wonder whether the dual problem of (1) is harder or easier to solve. Using Conic Programming duality this can be done pretty easily.

So let's use the duality theory of Conic Programming to derive the dual problem of (1).  First, to derive the dual we restate (1) in a slightly different way:

\begin{equation}
\begin{aligned}
\min r_0&\\
 s.t.&\\
 &\quad ( r_0, p_0^T - p_i) \in Q^{(n+1)} & i=1,\ldots,k.
\end{aligned}
\end{equation}

Following [1], this is our primal formulation and the construction of the dual problem should be quite straightforward! Consider the standard primal form (see pp. 69 in [1]):

\begin{equation}
\begin{aligned}
\min c^T x&\\
 s.t.&\\
 &\quad A_i x_i -b_i \in K_i & i=1,\ldots,k,
\end{aligned}
\end{equation}

where $K_i$'s are cones and $x_i=(r_o , p_0)^T$: problem (2) follows by setting $ c=( 1 , 0_{n})^T$ and

\[
A_i= I_{n+1},\quad b_i= \begin{pmatrix} 0 \\ p_i^T \end{pmatrix},\quad K_i=Q^{(n+1)} , \quad i=1,\ldots,k..
\]

To formulate the dual,  for each cone $i$ we introduce a vector $y_i \in R^{n+1}$ of dual variables such that $y_i$ must belong to the self-dual cone of $Q^{(n+1)}$, i.e. $Q^{(n+1)}$ itself. The dual problem takes the form

\begin{equation}
\begin{aligned}
\max & \sum_{i=1}^k b_i^T y_i\\
 s.t.&\\
 &\sum_{i=1}^k y_i = c\\
 &y_i\in Q^{(n+1)} & i=1,\ldots,k.
\end{aligned}
\end{equation}

which is equivalent to

\begin{equation}
\begin{aligned}
\max & \mathrm{Tr} \sum_{i=1}^k B Y^T \\
 s.t.&\\
 & e_k^T Y = c\\
 &Y \in \Pi_{i=1}^k Q^{(n+1)}
\end{aligned}
\end{equation}

where \[ Y=  \begin{pmatrix} y_1^T \\ \vdots \\ y_k^T \end{pmatrix}, B=  \begin{pmatrix} p_1^T \\ \vdots \\ p_k^T \end{pmatrix}, e_k= \begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}\in R^k.\]

Let's implement the dual formulation (5) using the MOSEK Fusion API. The code looks like

The code is very simple, compact and self-commenting. Note how we also exploit the fact that the first column of $B$ is composed by zeros. What about performance? The dual beats the primal in every sense! Look at the plot below (tests have been done on my desktop machine Pentium(R) Dual-Core  CPU E5700  @ 3.00GHz under Linux) where we report the running time needed to build the optimization model and to solve it.



The dual is much faster both in terms of time spent in building the model, which is indeed much simpler, and also in terms of solver execution time! Both model building scale almost linearly, but also the solution time does. If the former is expected, the latter might come as a surprise! Note however, that model building time matters when tackling polynomially solvable problems....

Final take-away message:

- Solving the dual instead of the primal can be much more efficient!

- The conic formulation helps to derive the dual problem!


[1] Ben-Tal, A., & Nemirovski, A. (2001). Lectures on modern convex optimization: analysis, algorithms, and engineering applications (Vol. 2). Siam.

Friday, February 14, 2014

A nice example of conic optimization modeling

Few days ago, my boss Erling Andersen suggested me to look at this problem to learn the MOSEK Fusion API:
How to find the smallest sphere that encloses a set of $k$ points $p_i \in R^n$.
The intuitive formulation is to minimize the radius r of the sphere centered in $p_0 \in R^n$, i.e.

\begin{align}
& \min r \\
& s.t. \\
& & || p_i - p_0||_2 \leq r, & i=1,\ldots,k
\end{align}

The inspiration came from a recent book¹ where the problem, probably taken from the GAMS repository², is presented as a difficult test for nonlinear local optimizers.

At the same time (what a coincidence!), O'Neil and Puget published interesting post about the Chebyshev centers of polygons...which turns out to be equivalent to what I was looking at! As Puget pointed out, the problem is known to be simple: Megiddo³ showed back in 1982 that it can be solved in linear time by a specialized algorithm.

Anyway, we will use it as an example of conic programming modeling example. First we introduce auxiliary variables such that  $w_i = p_i - p_0$, yielding

\begin{align}
& \min r \\
& s.t.\\
& & || w_i ||_2 \leq r, & i=1,\ldots,k\\
& & w_i = p_i - p_0, & i=1,\ldots,k
\end{align}

Constraints (6) define quadratic (Lorentz) cones $Q^{(n+1)}$, i.e.

\begin{equation} || w_i ||_2   \leq r    \Rightarrow (r, w_i) \in  Q^{(k+1)}.\end{equation}

For the second order cones to be disjointed, we introduce auxiliary variables $r_i, i=1,\ldots,n$ that must be equal to $r$. The final formulation of our problem is

\begin{align}
& \min r\\
& s.t.\\
& & r_i = r,  & i=1,\ldots,k\\
& & w_i = p_i - p_0, & i=1,\ldots,k\\
& & (r_i,w_i) \in  Q^{(n+1)}, & i=1,\ldots,k
\end{align}

which is a fairly simple conic quadratic optimization problem!

In the next post we will show how to easily implement this problem using the Mosek Fusion API.

¹ Neculai, Andrei, "Nonlinear Optimization Application Using the GAMS Technology", Springer 2013
² http://www.gams.com/testlib/libhtml/circlen.htm
³ Megiddo, Nimrod. "Linear-time algorithms for linear programming in R3 and related problems." Foundations of Computer Science, 1982. SFCS'08. 23rd Annual Symposium on. IEEE, 1982.

Thursday, March 1, 2012

MOSEK version 6.0.0.135

MOSEK version 6.0.0.135 is available for download at the MOSEK website.

The most important new feature in this release is support for conic quadratic optimization (aka. SOCP) in the AMPL interface. Hence, it is now possible to specify conic quadratic constraints directly in AMPL using a MOSEK specific suffix named cone. Please see the AMPL documentation for details.

The other major changes are:
  • Fixed a number of bugs in the mixed integer optimizer.
  • Fixed some bugs in the AMPL interface. 
  • Fixed a bug occurring under special circumstances in the interior-point optimizer when multiple threads are employed.
  • Fixed a bug in the dual simplex optimizer.