We show how nonlinear embedding algorithms, popular for use with shallow semi-supervised learning techniques such as kernel methods, can be applied to deep multi-layer architectures, either as a regularizer at the output layer or on each layer of the architecture. This provides a simple alternative to existing approaches to deep learning, such as autoassociators or density estimation, whilst yielding error rates competitive with both those methods and existing shallow semi-supervised techniques.
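To make the idea of an embedding regularizer at the output layer concrete, here is a minimal sketch in Python. The abstract does not specify the loss, so everything here is an assumption chosen for illustration: the margin-based pairwise loss, the hinge loss for the supervised term, and the hypothetical names `embedding_loss`, `hinge_loss`, `combined_objective`, `lam`, and `margin`.

```python
import numpy as np

def embedding_loss(f_i, f_j, is_neighbor, margin=1.0):
    """Pairwise embedding loss on two network outputs (assumed form).

    Neighbors are pulled together; non-neighbors are pushed apart
    until they are at least `margin` away from each other.
    """
    dist = np.linalg.norm(f_i - f_j)
    if is_neighbor:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

def hinge_loss(score, label):
    """Supervised hinge loss for a label in {-1, +1} (assumed choice)."""
    return max(0.0, 1.0 - label * score)

def combined_objective(scores, labels, pairs, lam=0.1, margin=1.0):
    """Supervised loss plus a weighted embedding regularizer.

    scores, labels: outputs and targets for the labeled examples.
    pairs: iterable of (f_i, f_j, is_neighbor) built from outputs on
           (mostly unlabeled) examples.
    lam: hypothetical weight balancing the two terms.
    """
    supervised = sum(hinge_loss(s, y) for s, y in zip(scores, labels))
    regularizer = sum(embedding_loss(a, b, w, margin) for a, b, w in pairs)
    return supervised + lam * regularizer
```

In a layer-wise variant, the same pairwise term would be computed on the activations of each hidden layer rather than only on the final output; in practice the objective would be minimized by stochastic gradient descent in an autodiff framework rather than evaluated with plain NumPy as here.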
Discussion
Very interesting paper! The experimental SRL results are amazingly good, especially considering the very weak notion of similarity used in the embedding (if I understand it correctly). The similarity for RCV is a lot stronger, since there are so many classes, but it's surprising that just this contrastive-divergence sort of trick would be sufficiently constraining to help on an NLP problem.