## Advanced Methods and Analysis for the Learning and Social Sciences


PSY505, Spring term 2012, March 12, 2012

### Today’s Class

- Clustering

### Clustering

- You have a large number of data points
- You want to find what structure there is among the data points
- You don’t know anything a priori about the structure
- Clustering tries to find data points that “group together”

### Clustering

- What types of questions could you study with clustering?

### Related Topic

- Factor Analysis
- Not the same as clustering
  - Factor analysis finds how data features/variables/items group together
  - Clustering finds how data points/students group together
- In many cases, one problem can be transformed into the other
- But conceptually still not the same thing
- Next class!

### Trivial Example

- Let’s say your data has two variables
  - Pknow
  - Time
- Clustering works for (and is equally effective in) large feature spaces

*(Scatter plot: time, −3 to +3, vs. pknow, 0 to 1)*

### k-Means

*(Scatter plot: time, −3 to +3, vs. pknow, 0 to 1)*

### Not the only clustering algorithm

- Just the simplest

### How did we get these clusters?

- First we decided how many clusters we wanted: 5
  - How did we do that? More on this in a minute
- We picked starting values for the “centroids” of the clusters
  - Usually chosen randomly
  - For instance…

*(Scatter plot: time vs. pknow, with starting centroids)*

### Then…

- We classify every point as to which centroid it’s closest to
- This defines the clusters
- This creates a “Voronoi diagram”

*(Scatter plot: time vs. pknow, points assigned to clusters)*

### Then…

- We re-fit the centroids as the center of the points in the cluster

*(Scatter plot: time vs. pknow, centroids re-fit)*

### Then…

- Repeat until the centroids stop moving

*(Series of scatter plots: time vs. pknow, one per iteration)*

### What happens?

- What happens if your starting points are in strange places?
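The loop the slides walk through (pick starting centroids at random, assign every point to its closest centroid, re-fit each centroid as the center of its points, repeat until the centroids stop moving) can be sketched in Python with NumPy. The function name and parameters below are illustrative, not from the slides:

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) array of data points.

    Assign each point to its nearest centroid, re-fit each centroid as
    the mean of its assigned points, and repeat until the centroids
    stop moving (or n_iter passes elapse)."""
    rng = np.random.default_rng(seed)
    # Starting centroids: chosen randomly from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid: (n, k) matrix.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # nearest centroid defines the cluster
        # Re-fit each centroid as the center of its cluster's points;
        # an empty cluster keeps its old centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels
```

Because the starting centroids are random, different seeds can give different final clusterings, which is exactly the "strange starting points" issue the slides raise next.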
- Not trivial to avoid, considering the full span of possible data distributions
- There is some work on addressing this problem

*(Scatter plots: time, −3 to +3, vs. pknow, 0 to 1)*

### Solution

- Run several times, involving different starting points
- cf. Conati & Amershi (2009)

### How many clusters should you have?

- Can you use goodness-of-fit metrics?

### Mean Squared Deviation (also called Distortion)

- MSD = the sum, over all points, of the squared distance from each point to its cluster’s center:
  - Take each point P
  - Find the center of P’s cluster C
  - Find the distance D from C to P
  - Square D to get D’
  - Sum all D’ to get MSD

### Any problems with MSD?

- More clusters almost always leads to smaller MSD
- Distance to nearest cluster center should always be smaller with more clusters

### What about cross-validation?

- Will that fix the problem?
- Not necessarily
- This is a different problem than classification
  - You’re not trying to predict specific values
  - You’re determining whether any center is close to a given point
- More clusters cover the space more thoroughly
- So MSD will often be smaller with more clusters, even if you cross-validate

### An Example

- 14 centers, ill-chosen (what you’d get on a cross-validation with too many centers)
- 2 centers, well-chosen (what you’d get on a cross-validation with not enough centers)

*(Scatter plots: time vs. pknow, the 14-center and 2-center solutions)*

- The ill-chosen 14 centers will achieve a better MSD than the well-chosen 2 centers

### Solution

- Penalize models with more clusters, according to how much extra fit would be expected from the additional clusters
- What comes to mind?
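The distortion computation, and the reason more clusters almost always shrink it, can be sketched in a few lines of NumPy. The function name and the random data below are illustrative; the key point is that adding any extra center can only shrink each point’s nearest-center distance, so MSD never goes up as you add centers:

```python
import numpy as np

def msd(points, centers):
    """Mean Squared Deviation (distortion): for each point P, find the
    distance D from P to its nearest center, square it, and sum over
    all points. (Nearest-center assignment is the k-means clustering.)"""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))

two_centers = rng.normal(size=(2, 2))
# Adding an extra center (even a randomly placed one) can only shrink
# each point's nearest-center distance, so MSD can only go down.
three_centers = np.vstack([two_centers, rng.normal(size=(1, 2))])
assert msd(points, three_centers) <= msd(points, two_centers)
```

In the extreme, placing a center on every data point drives MSD to zero, which is why raw MSD (even cross-validated) cannot choose the number of clusters by itself.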
- Common approach: the Bayesian Information Criterion

### Bayesian Information Criterion (Raftery, 1995)

- Assesses how much fit would be spuriously expected from a random N parameters
- Assesses how much fit you actually had
- Finds the difference
- The math is… painful
- See Raftery (1995); many statistical packages can also compute this for you

### So how many clusters?

- Try several values of k
- Find the “best-fitting” set of clusters for each value of k
- Choose the k with the best value of BIC

### Let’s do a set of hands-on exercises

- Apply k-means using the following points and centroids
- Everyone gets to volunteer!

*(Scatter plots: time, −3 to +3, vs. pknow, 0 to 1, with points and starting centroids)*
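The “try several k, penalize extra clusters” procedure can be sketched as follows. The score below uses a common simplified BIC-style heuristic (a log-distortion likelihood term plus a k·d·ln n penalty for the centroid parameters); it is NOT the exact criterion in Raftery (1995), and all names here are illustrative:

```python
import numpy as np

def fit_kmeans(points, k, seed=0):
    """Minimal k-means; returns the final distortion (MSD)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(100):
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return float(((points - centers[labels]) ** 2).sum())

def bic_for_k(points, k, seed=0):
    """BIC-style score (smaller is better): a log-likelihood term from
    the distortion of a spherical-Gaussian fit, plus a penalty for the
    k*d free centroid parameters. A simplification, not Raftery's exact
    formula."""
    n, d = points.shape
    dist = fit_kmeans(points, k, seed)
    return n * np.log(dist / n + 1e-12) + k * d * np.log(n)

# Two well-separated blobs: the penalized score should clearly prefer
# k = 2 over the underfit k = 1.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (40, 2)),   # blob near (0, 0)
                    rng.normal(5, 0.3, (40, 2))])  # blob near (5, 5)
scores = {k: bic_for_k(points, k) for k in (1, 2, 3, 4)}
```

In practice you would also rerun each k from several starting points (per the earlier slide) and keep the best fit before scoring it.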