SVM Optimization: Inverse Dependence on Training Set Size (2008)

Authors

Abstract

We discuss how the runtime of SVM optimization should decrease as the size of the training data increases. We present theoretical and empirical results demonstrating how a simple subgradient descent approach indeed displays such behavior, at least for linear kernels.

Discussion

Barry Drake, 2008/07/06 12:49

I think this works because you can set a target level of accuracy that is worst than the optimal, therefore an anytime-algorithm can stop early once the desired level of accuracy is reached. Having more training data reduces the probability of repeated use of training examples, and assuming that a fresh training example generally improves a solution more than a repeated training example, then expected training time decreases.

Nice! But the target level of accuracy seems to be quite crucial.

John Langford, 2008/07/07 03:28

I wanted to expand on my comment during the talk a little bit. A basic claim is that for most learning algorithms, we can make the computational time required to achieve some level of performance be monotonically decreasing with the sample set size by adding a binary search step.

The trick is to create random subsets of the data Latex where the computational time required by the learning algorithm applied to Latex is twice the computation required on Latex, and then halt as soon as the desired error threshold is reached. Upon halting, this meta-algorithm requires only a factor of 2 more computation than if you had guessed the correct Graph in advance.

This meta-algorithm avoids the embarrassment of having the computation increase with dataset size, but it doesn't quite achieve a reduction in computational time with respect to dataset size(*). To achieve that, it seems necessary to have some parameter which controls the halting condition for learning, allowing the learning algorithm to halt sooner with larger datasets. In many cases such a parameter also exists.

(*) Thanks to Ben and Carlos for clarifying this last night.

Nati Srebro, 2008/07/07 18:10

John,

Your observation is indeed correct, and I think it is also practically important. In fact, this strengthens the point from the first slide which is that runtime should certainly not INCREASE with the amount of data, but at worst remain constant and hopefully even decrease. Consequently, we should study runtime this way, and not, as it is usually done, as an increasing function of data set size.

To achieve [a runtime reduction], it seems necessary to have some parameter which controls the halting condition
for learning, allowing the learning algorithm to halt sooner with larger datasets. In many cases such a parameter
also exists.

Indeed, you can make most methods stop early. However, with most methods this will not help you (that is, other than stochastic/bagging/online type methods). If you run on a subsample, there is no loss, but also no gain, and so you can't stop any earlier. If you use a bigger working data set size, the extra work per iteration outweighs the benefit of earlier stopping very quickly. This is what I was referring to as the “double edged sword” and what we study quantitatively and plot.

Nati

PS Thanks for the comments. This is a great way to receive feedback and exchange ideas.

Enter your comment (wiki syntax is allowed):
HTPJO
 
paper/2008/266.txt · Last modified: 2009/05/24 18:48 (external edit)
 
Driven by DokuWiki