Dasgupta, Sanjoy and Hsu, Daniel
We present an active learning scheme that exploits cluster structure in data.
Very interesting paper.
As noted in the Fallback analysis section, the algorithm can perform somewhat worse than random sampling when some clusters are hopelessly mixed, so that descending into smaller clusters doesn't help.
I was wondering whether it would help to give sampling preference not to the less pure clusters, but to those whose class distributions differ most from those of their subclusters. That way both pure clusters and completely mixed ones would be avoided.
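The heuristic above could be sketched roughly as follows. This is a minimal sketch of the idea, not anything from the paper; the helper names and the choice of total variation distance as the divergence measure are my own assumptions:

```python
from collections import Counter

def distribution(labels):
    """Empirical class distribution over a cluster's sampled labels."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items()}

def tv_distance(p, q):
    """Total variation distance between two class distributions."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)

def split_score(parent_labels, child_label_lists):
    """Score a cluster by how much its class distribution differs from
    those of its subclusters (size-weighted average TV distance).
    Both pure clusters and hopelessly mixed clusters, whose children
    look just like the parent, score low."""
    p = distribution(parent_labels)
    n = len(parent_labels)
    return sum(len(ch) / n * tv_distance(p, distribution(ch))
               for ch in child_label_lists)
```

For example, a cluster with labels [0, 0, 1, 1] that splits into the pure children [0, 0] and [1, 1] scores 0.5, while one that splits into two equally mixed children [0, 1] and [0, 1] scores 0.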
Would something like a k-means++ seeding scheme work for this kind of problem? It weights each instance by its squared distance from the nearest existing centroid, then samples proportionally to that weight. It seems to get at the same idea of balancing outliers against representative data, and its analysis assumes some degree of cluster coherence.
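For concreteness, the k-means++-style seeding being referred to (so-called D² sampling) looks roughly like this; a minimal sketch, with the function name and the Euclidean distance choice being my own:

```python
import random

def d2_sample(points, k):
    """k-means++-style seeding: pick the first centre uniformly at
    random, then repeatedly sample a point with probability
    proportional to its squared distance from the nearest centre
    chosen so far."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centres = [random.choice(points)]
    while len(centres) < k:
        # Each point's weight is its squared distance to the nearest centre.
        weights = [min(sq_dist(p, c) for c in centres) for p in points]
        total = sum(weights)
        if total == 0:  # every point coincides with a centre
            break
        # Sample one point proportionally to the weights.
        r = random.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, weights):
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres
```

Far-away (outlier) points get large weight and so tend to be picked, but dense regions collectively carry a lot of weight too, which is the outlier-vs-representative balance mentioned above.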
I think that could work in some cases. The analysis by Ostrovsky et al. (FOCS 2006) of a similar distance-weighted seeding scheme might suggest when this would be successful, and there may be a way to extend it to a hierarchical clustering in a manner similar to that of Dasgupta and Long (JCSS 2005).