Regularization and Feature Selection in Least-Squares Temporal Difference Learning (2009)

Authors

Abstract

We consider the task of reinforcement learning with linear value function approximation. Temporal-difference algorithms, and in particular the Least Squares Temporal-Difference (LSTD) algorithm, provide a method to learn the parameters of the value function, but when the number of features is large this algorithm can over-fit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of L_1 regularization, which is robust to irrelevant features and also serves as method for feature selection. Although the L_1 regularized LSTD solution cannot be expressed as as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally.

Discussion

Matt Gershoff, 2009/06/07 15:38

Interesting paper but a couple of quick questions. In the chain domain example, the irrelevant features are drawn from white noise. How would the results be affected in the presence of multicollinearity - maybe by constructing the irrelevant features by adding a small noise term to the relevant features? Also it looks as if the variance of the average discounted reward is increasing with sample size and is certainly larger than that of the L2 norm. Would this trend continue as we add samples - and if so any thoughts on why and the impact on the utility of the algorithm?

Zico Kolter, 2009/06/08 12:43

Hi Matt, Thanks for the comments. The irrelevant features for the chain domain were certainly chosen to demonstrate a point, that truly irrelevant features would be discarded by the algorithm, but I don't think that the algorithm would be negatively affected by “less random” features. Indeed, I think the primary effect would be that standard LSTD algorithms would then start to work perfectly well (for small enough noise added to the “relevant” features), since sparsity of the features would no longer be necessary for reconstructing a good estimate of the value function. Certainly by raising the amount of noise we could interpolate however we wanted between this case and the one we demonstrate, but I see no reason why the performance of the algorithms wouldn't interpolate smoothly as well. However, admittedly this is just my intuition for the time being.

The higher variance of the l1-regularization on the chain task is an interesting effect, and I think in this case largely due to the nature of LSPI-type policies (though perhaps made more extreme by the feature selection element): in particular, since we are construction state-action value functions using linear function approximation (and then executing a greedy policy), a small change in one of the state-action value functions near one of the reward states, causing the policy to swap its action in a single state, can greatly degrade the performance of the algorithm. This is what we saw over the 20 runs: while most learned policies were nearly optimal, one or two of the l1 regularized policies were significantly poorer, decreasing the mean and increasing the variance. In contrast, the l2 regularized policies had lower variance, but were uniformly poorer than the l1 regularized policies. As mentioned above, I do think that the feature selection we use may exacerbate this effect (this is also the case in Figure 3 (a) to a smaller degree, for instance), but I don't yet have a concrete theory of when/where this occurs, as it is certainly tied up in the issues of policy iteration as well.

Matt Gershoff, 2009/06/08 15:57

Zico, Thanks for the response. Yes, it was figure 3(a) that caught my attention. My question was less about the difference in variance between the L1 and L2 estimates and more about what appears to be growth in the variance for the L1 ave disc. reward with respect to sample size. I wasn't sure if that was just an artifact or if that is expected behavior - at 10,000 or 20,000 samples would we see even larger variances? If so, is there a reason why additional information would lead to greater uncertainty around the estimate and does that place some additional bounds on when to use the L1 vs the L2? Thanks.

Enter your comment (wiki syntax is allowed):
VZTNE
 
paper/2009/439.txt · Last modified: 2009/05/24 18:43 (external edit)
 
Driven by DokuWiki