Unsupervised Feature Subset Selection
Student thesis: Master's thesis and HD graduation project
- Nicolaj Søndberg-Madsen
- Casper Thomsen
4th semester, Computer Science, Master (Master's programme)
This master's thesis has been developed in the domain of Decision
Support Systems and it covers the sparsely researched area of
unsupervised feature subset selection for data clustering. In the
report we discuss what characterizes features that are relevant for
data clustering and we propose new relevance score measures which are
capable of producing a ranking of the features with respect to their
relevance. The relevance scores, combined with a threshold, can be
used in a filter approach where the uninformative features are
discarded. The report proposes two methods for setting a threshold, and
the score measures are tested empirically on three synthetic data sets and
four real-world data sets. In a second step we propose to use the
relevance rankings in a hybrid approach to performing unsupervised
feature subset selection. This method allows us to perform
unsupervised feature subset selection with fewer model inductions than
ordinary wrapper approaches. Empirical tests show that both the filter and
hybrid approaches perform satisfactorily.
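
The filter approach described in the abstract (score each feature, rank, discard everything below a threshold) can be illustrated with a minimal sketch. The abstract does not define the thesis's relevance score measures or its two threshold-setting methods, so everything specific here is an assumption: `split_score` is a hypothetical stand-in score (the largest fraction of a feature's variance explained by cutting its sorted values into two groups), `filter_features` is a hypothetical helper, and the threshold of 0.9 is an arbitrary fixed value chosen for the toy data.

```python
import random
from statistics import pvariance


def split_score(values):
    """Stand-in relevance score (NOT the thesis's measure): the largest
    fraction of variance explained by cutting the sorted values in two."""
    xs = sorted(values)
    total = pvariance(xs) * len(xs)  # total sum of squared deviations
    if total == 0:
        return 0.0  # a constant feature carries no cluster structure
    best = 0.0
    for i in range(1, len(xs)):
        left, right = xs[:i], xs[i:]
        within = pvariance(left) * len(left) + pvariance(right) * len(right)
        best = max(best, 1.0 - within / total)
    return best


def filter_features(columns, threshold):
    """Rank features by score and keep those at or above the threshold."""
    scores = {name: split_score(vals) for name, vals in columns.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    kept = [name for name in ranking if scores[name] >= threshold]
    return ranking, kept


# Toy data (assumed for illustration): one feature with two well-separated
# groups and one feature of uniform noise.
rng = random.Random(0)
n = 200
columns = {
    "clustered": [rng.gauss(0.0, 1.0) if i % 2 else rng.gauss(10.0, 1.0)
                  for i in range(n)],
    "noise": [rng.uniform(0.0, 1.0) for _ in range(n)],
}

ranking, kept = filter_features(columns, threshold=0.9)
print("ranking:", ranking)
print("kept:", kept)
```

Because the scores are computed once per feature without inducing any clustering model, a ranking of this kind is cheap to produce; a hybrid method could then evaluate only a few candidate subsets taken from the top of the ranking rather than searching over all subsets as a plain wrapper would.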
Language | English |
---|---|
Publication date | Jun. 2003 |