Friday, March 28, 2014

SwissQuant Math Challenge on LASSO Regression: the MOSEK way

The Swiss company SwissQuant publishes monthly a math challenge in which they call for people to elaborate and discuss problems of different kind, somehow related to the company business. This month they focus on (quadratic) convex optimization, asking people to proposed specialized algorithm for the OLS-LASSO problem that can beat the established commercial code for quadratic programming as MOSEK. You can find details here Math Challenge March 2014.

Clearly we are glad they appreciate somehow our product, but also we feel challenged in tackling the problem itself. MOSEK is not specialized for OLS-LASSO, but still it can be very efficient we thought, especially in dual form. Moreover, we are always eager to find good test problems, for the core solver but also for our APIs.

The OLS-LASSO problem

So let's start from the formulation given in the challenge, where we are given $T$ samples in dimension $m$:

\begin{equation}\label{main}\tag{OLS-LASSO}
  \begin{aligned}
    \min_{u,b} & \frac{1}{2} || u ||^2 + \sum_{j=1}^{m} \lambda_j |b_j|\\
    &\\
    s.t.&\\
    & u = y - X b
  \end{aligned}
\end{equation}

where $b,\lambda\in \mathbb{R}^m, u,y\in \mathbb{R}^T$. We reformulate problem \ref{main} so to have a linear objective function:

\begin{equation*}
  \begin{aligned}
    \min_{u,b,z,w} &\quad z + \lambda^T w\\
    &\\
    s.t.&\\
    & u + X b = y\\
    & s=1\\
    & (s,z,u) \in Q_R^{T+2}\\
    & w_i \geq |b_i|& \forall i=1,\ldots,m.
  \end{aligned}
\end{equation*}

where $Q_R$ is the rotated cone. There are two way to to get rid of the absolute value:

(1) Linear constraints 

\[ w_i\geq |b_i| \Rightarrow \left\{ \begin{array}{l}  w \geq b \\ w \geq -b \\\end{array}\right.\]

(2) Second-order conic constraint

\[ w_i \geq |b_i| \Leftrightarrow w_i\geq \sqrt{b_i^2} \equiv (w_i,b_i)\in Q^2\]


We will see which primal form works the best. But it's the dual that we want to really see at work. Using the standard procedure outlined in our post about duality, we derive the dual of problem (PQ). It is easier than PL, believe us.

First let's fix a partition of the variable set as $[s,z,u,w,b]$, and clearly states the basic vector and
matrices:


(1) cost vector: $c^T= [0 \quad 1 \quad 0_T \quad \lambda^T \quad 0_m]$;
(2) right-hand side: $b=^T[1 \quad y^T]^T$;
(2) matrix coefficients: $A= \left[ \begin{array}{ccccc} 1 & 0&0_T&0_m&0_m\\ 0&0&I_T& 0_m&X\end{array}\right]$;


We then introduce as many dual variables as primal linear constraints:
$\mu_0$ for the first constraint, and $\mu\in \mathbb{R}^T$ for the
second one. We also include additional slack variables $\gamma$
partitioned in the same way as the primal variables, i.e. $\gamma^T=[\gamma_s,\gamma_z,\gamma_u,\gamma_w,\gamma_b]$.

The dual problem is

\begin{equation*}
  \begin{aligned}
    \max_{\mu_0, \mu, \gamma} & \quad \mu_0 + y^T \mu \\
    &\\
    s.t.&\\
    & c - A^T \left[ \begin{array}{c}\mu_0\\\mu\end{array} \right]= \gamma\\
    & (\gamma_s,\gamma_z,\gamma_u) \in Q_R^{T+2}\\
    & (\gamma_w,\gamma_b) \in \Pi_{i=1}^m Q^2.
  \end{aligned}
\end{equation*}

After some calculations, we obtain:

\begin{equation}\tag{D}
  \begin{aligned}
    \max_{\mu_0,\mu,\gamma_z} &  \quad \mu_0 + y^T \mu \\
    &\\
    s.t.&\\
    &  X^T \mu \leq  \lambda\\
    & \gamma_z=  1\\
    & \left(-\mu_0, \gamma_z, -\mu \right) \in Q_R^{2+T}.
  \end{aligned}
\end{equation}

Try to do the same with PL, you'll get the same model.

The results

We implement these three models using the MOSEK Fusion Python API and run a batch of tests for $m=[10,25,75,100]$ and $T$ from $10000$ to $100000$. To make it easy, $X,\lambda$ and $y$ are all uniformly generated in $[0,1]$. The code that interface with the model looks like


Here you can see the running time for problem PL in green, PQ in red and D in blue. In the first  plot we fix $m$ and let $T$ varies:


and then if we fix $T$ either to $10000$ or $100000$ and we get



Again, the dual is again a clear winner, as in a previous post! No way the primal can even get close to the dual.

But it is also interesting to see how  PQ is as good as PL. Red and green lines are almost indistinguishable. Only for larger $m$, i.e. when the number of cones increases, we can see a significant difference. Otherwise the difference is very very small.