Foo, Chuan-Sheng and Do, Chuong and Ng, Andrew
We present a general Bayesian framework for hyperparameter tuning in $L_2$-regularized supervised learning models. Paradoxically, our algorithm works by first analytically integrating out the hyperparameters from the model. We solve the resulting non-convex optimization problem efficiently using a majorization-minimization (MM) algorithm, in which the non-convex problem is reduced to a series of convex $L_2$-regularized parameter estimation tasks. The principal appeal of our method is its simplicity: the updates for choosing the $L_2$-regularized subproblems in each step are trivial to implement (or even perform by hand), and each subproblem can be efficiently solved by adapting existing solvers. Empirical results on a variety of supervised learning models show that our algorithm is competitive with both grid-search and gradient-based algorithms, but is more efficient and far easier to implement.
Discussion
Hi,
You make multiple references to SVM-style objectives ($L_2$-regularized) in your paper, but you test your algorithm only on smooth probabilistic objectives. So my question is basically: what's the catch, if any?
It seems to me, from a practical point of view, that your wrapper could be applied,
but perhaps the majorization-minimization scheme is not so effective for the non-smooth SVM problem?
(You also lose the direct Bayesian interpretation via the Gamma prior, since the loss function is not (directly) a log-likelihood.)
Have you tried it?
Yes, you are right: the catch is that the algorithm no longer has a Bayesian interpretation when it is applied to max-margin (SVM) style models. However, the algorithm can also be viewed as a way to optimize a log-L2 style regularizer (e.g. log(1+0.5*||w||^2)), and from this point of view the algorithm is still justified. We are not sure whether using such regularizers in general will yield benefits over standard L2-norm regularization; this is something that could be explored further.
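To make the log-L2 view concrete, here is a minimal sketch of a majorization-minimization loop for the regularizer log(1+0.5*||w||^2). It is an illustration under stated assumptions, not the implementation from the paper: the loss is plain squared error (chosen only because each subproblem then has a closed-form ridge solution), and the weight C, the function names, and the synthetic data are all hypothetical. Because log(1+0.5*u) is concave in u = ||w||^2, its tangent at the current iterate is an upper bound, which turns each step into an ordinary L2-regularized problem with strength C / (1 + 0.5*||w_t||^2).

```python
import numpy as np

def objective(w, X, y, C):
    """Squared-error loss plus the log-L2 penalty C * log(1 + 0.5 * ||w||^2)."""
    residual = X @ w - y
    return 0.5 * residual @ residual + C * np.log1p(0.5 * w @ w)

def mm_log_l2(X, y, C=1.0, n_iter=50):
    """Majorization-minimization sketch (assumed setup, not the paper's code).

    Each iteration replaces the concave penalty by its tangent at the current
    iterate, giving a ridge problem with an updated regularization strength.
    """
    n_features = X.shape[1]
    w = np.zeros(n_features)
    history = [objective(w, X, y, C)]
    for _ in range(n_iter):
        # Slope of the tangent bound with respect to 0.5 * ||w||^2.
        lam = C / (1.0 + 0.5 * w @ w)
        # Convex subproblem: min_w 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2.
        w = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
        history.append(objective(w, X, y, C))
    return w, history

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)
w_hat, history = mm_log_l2(X, y, C=2.0)
```

The recorded objective values in `history` should never increase, which is the usual MM guarantee. For a hinge-loss (SVM) model, the same reweighting of the regularization strength would presumably wrap an existing SVM solver in place of the `np.linalg.solve` call; whether that works well in practice is the open question raised above.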