Clustering and Classification

Summer School in Statistics for Astronomers

Eric D. Feigelson Spring 2021

Adapted from R scripts in Appendix B of Modern Statistical Methods for Astronomy With R Applications, Eric D. Feigelson & G. Jogesh Babu 2012, http://astrostatistics.psu.edu/MSMA. Improvements by Gabriel Caceres 2017.

The popular phrases data science and data mining do not have strict definitions, but a common theme is to separate heterogeneous datasets into distinct groupings. When only the dataset is available, without prior knowledge of the groups, the problem is called clustering. When prior knowledge in the form of training sets with labels for the groups is available, the problem is called classification. The availability of reliable training sets can be remarkably helpful: classification procedures often give far more effective results than clustering procedures.

I Nonparametric clustering

Here we exercise several nonparametric procedures to identify groupings in multivariate datasets; that is, tabular data with scalar quantities in rows and columns. These procedures are algorithmic: they are generally not based on principles like 'maximum likelihood estimation' founded on powerful theorems. They often have one or more adjustable parameters with little mathematical advice on setting their values. Consequently, the scientific results from nonparametric clustering often depend on the chosen procedure and thresholds, with reduced reproducibility and reliability.

We apply clustering techniques to a famous discovery in extragalactic astronomy. A 17-band photometric survey of galaxies, called COMBO-17, revealed a bifurcation in the color magnitude diagram of normal galaxies showing a red sequence and a blue cloud separated by a green valley (Bell et al. 2004). We study this here in a bivariate color-magnitude diagram using unsupervised methods. Note that we use a 2D dataset only for convenience; the statistical methods are designed for multi-dimensional problems, although some are effectively restricted to relatively few dimensions.

Visual examination of the scatter plot shows an elongated distribution where redder galaxies are more luminous. There are hints of structure in the diagram: a handful of galaxies in the extreme red-luminous corner, possible bimodality where the galaxies are most common, and a few outliers with very red colors but intermediate luminosity.

Let us first apply an often-useful nonparametric smoothing procedure known as kernel density estimation, where the point process is convolved with a kernel function, here a 2-dimensional Gaussian. We use the function kde2d from R's MASS package.
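A minimal sketch of such a smoothing, assuming the galaxy absolute magnitudes and colors are held in vectors named M_B and col_ub (illustrative names, not necessarily those of the original script), and with illustrative bandwidths:

```r
# Kernel density estimate of the COMBO-17 color-magnitude diagram.
library(MASS)                                   # provides kde2d
cmd_kde <- kde2d(M_B, col_ub, h = c(1.0, 0.3), n = 100)
image(cmd_kde, xlab = "M_B (mag)", ylab = "M_280 - M_B (mag)")
contour(cmd_kde, add = TRUE)                    # overlay density contours
```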

This smoothed distribution dramatically shows a bifurcation in the galaxy distribution. The text labels give the commonly used designations: blue cloud, green valley, red sequence, and bright cluster galaxies (BCGs).

II Two nonparametric unsupervised clustering procedures

Many clustering and classification procedures locate groupings based on some measure of distance in multivariate p-space. Formulating a distance metric is not trivial when the variables have different units and ranges. In addition, astronomers habitually apply logarithmic transformations to wide-range variables but do not transform narrow-range variables. This too affects the distance metric.

First, it is necessary to somehow normalize the distance measures so that variables with wide ranges do not dominate over variables with small ranges. The most common procedure is called standardization, where each variable is scaled by subtracting its mean and dividing by its standard deviation. All variables then have zero mean and unit variance.

Once the variables are standardized, a distance metric is devised. Typically this is the Euclidean distance based on the Pythagorean formula, but other metrics can be used. Again, the results of a clustering algorithm may depend on this choice.
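A brief sketch of these two steps, assuming the magnitude and color vectors above are collected into a two-column data frame combo (again an illustrative name):

```r
# Standardize each column, then compute a Euclidean distance matrix.
combo_std  <- scale(combo)                          # subtract mean, divide by std. deviation
combo_dist <- dist(combo_std, method = "euclidean") # pairwise distances between galaxies
```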

We now construct a dendrogram or tree for hierarchical clustering. The procedure is to join the closest two points into a cluster which is then considered to be a single point. This process is iterated until the entire sample is a single cluster. However, choices must be made on the meaning of distance between two clusters: Do we consider the closest members of the two clusters (single-linkage)? the centroid of each cluster (average-linkage)? the furthest members of each cluster (complete-linkage) or other choice? Here we use an intermediate choice known as Ward's method which leads to a minimum variance clustering.

We construct the resulting dendrogram using R's hclust. While one might think that the first split corresponds to the valley between the red sequence and the blue cloud, this split does not appear until one chooses k=5 clusters. We then plot the original data with five symbol shapes reflecting the five clusters.
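A sketch of the Ward-linkage clustering and the k=5 cut, using the illustrative objects defined above:

```r
# Hierarchical clustering with Ward linkage, cut into 5 clusters.
combo_hc  <- hclust(combo_dist, method = "ward.D2")
plot(combo_hc, labels = FALSE, main = "COMBO-17 dendrogram")
combo_cut <- cutree(combo_hc, k = 5)              # cluster label for each galaxy
plot(combo, pch = combo_cut, col = combo_cut)     # symbols and colors follow the 5 clusters
```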

The results are not very satisfying. The BCGs are blurred together with less luminous red outliers, the red sequence is divided into two portions, and the blue cloud includes two very blue outliers.

Exercise 1: Read the help files associated with hierarchical clustering and try different parameters.

A different approach, called DBSCAN (Density-Based Spatial Clustering of Applications with Noise), was developed in the 1990s by computer scientists (Ester et al. 1996, 21K citations). It starts by finding points with many close neighbors within a specified distance (the reach parameter), merges them, and percolates outward until no new neighbors within the reach are found. Then sparse clusters with fewer than minPts members are excluded. Not all objects need to be included in clusters; rather, the clusters are assumed to be embedded in low-density noise.
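A sketch of a DBSCAN run with the CRAN dbscan package; the eps (reach) and minPts values here are illustrative and must be tuned:

```r
# DBSCAN on the standardized color-magnitude data; cluster 0 denotes noise.
library(dbscan)
combo_dbs <- dbscan(combo_std, eps = 0.3, minPts = 10)
plot(combo, pch = combo_dbs$cluster + 1, col = combo_dbs$cluster + 1)
```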

In the plot above, we set both the color and symbol shapes by the resulting cluster vector. Despite the apparent advantages of DBSCAN over hierarchical clustering, it still has two parameters that must be arbitrarily tuned and often has weak performance. Here, attempts to include more galaxies in the red sequence or blue cloud resulted in unifying them, as the algorithm reached over the green valley.

DBSCAN was improved with a more complicated hierarchical DBSCAN (HDBSCAN) procedure by Campello et al. 2013 to give additional dynamic range and the ability to separate entangled clusters. We find here that the CRAN function hdbscan immediately identifies the principal red sequence and blue cloud groupings, and gathers a reasonable fraction of their members. No additional clusters are identified. Various outputs and graphics are provided, such as cluster stability scores, outlier scores, and individual cluster membership probabilities. HDBSCAN is one of the top-performing nonparametric clustering procedures, and since 2017 has been used in several dozen astronomical studies.
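A sketch of an HDBSCAN run from the same dbscan package; only minPts needs to be chosen, and the value here is illustrative:

```r
# HDBSCAN on the standardized data, with its diagnostic outputs.
library(dbscan)
combo_hdbs <- hdbscan(combo_std, minPts = 25)
print(combo_hdbs)                            # cluster sizes and stability scores
plot(combo, pch = combo_hdbs$cluster + 1, col = combo_hdbs$cluster + 1)
head(combo_hdbs$membership_prob)             # per-object membership probabilities
head(combo_hdbs$outlier_scores)              # GLOSH outlier scores
```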

Exercise 2: Read the help file of DBSCAN and the vignette for HDBSCAN. Make a color-magnitude diagram highlighting outliers.

In conclusion, it was difficult to get standard nonparametric clustering algorithms to perform well, even in relatively simple situations. Results differ with the choice of method, the clustering parameters had to be carefully tuned, and even then the results may not be very satisfactory. In our case here, simple kernel smoothing did a better job of showing the structure, although it does not define cluster membership.

Astronomers most commonly use single-linkage hierarchical clustering, nicknamed the friends-of-friends algorithm. It has been used in nearly 4000 studies in the past three decades. This is perhaps the worst choice because, as has long been established, single-linkage spuriously chains together unrelated groupings in the presence of noise. It is most effective in identifying outliers in multivariate datasets.

Exercise 3: Apply one of the best-performing recent (2008) clustering algorithms, t-distributed stochastic neighbor embedding or, more briefly, t-SNE. It is implemented in the CRAN packages tsne, Rtsne, M3C, and mmtsne. For background and cookbooks, see the online tutorials for these packages.

III Parametric clustering procedures

We do not have time in this tutorial to present model-based clustering methods, most commonly Gaussian mixture models (GMMs). Here the astronomer must assume that the cluster shapes follow multivariate normal distributions. Optimal solutions are found by maximum likelihood estimation, with $k$, the number of clusters, chosen by optimizing the Bayesian Information Criterion. The calculation can be challenging; a dataset with 6 variables and 5 clusters, for example, has a likelihood function with more than 60 parameters (6 means and 6 standard deviations for each of the 5 clusters, plus the mixing proportions).
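Exercise 4 below asks for a full treatment with the mclust package; a minimal sketch of the kind of call involved, using the illustrative data frame combo, is:

```r
# Gaussian mixture model; the number of clusters is chosen by maximizing BIC.
library(mclust)
combo_gmm <- Mclust(combo)
summary(combo_gmm)
plot(combo_gmm, what = "BIC")               # BIC vs. number of clusters
plot(combo_gmm, what = "classification")    # fitted ellipses and cluster assignments
head(combo_gmm$z)                           # cluster membership probabilities
```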

The result of parametric clustering analysis is often satisfying. Best-fit parameters give the mean location and standard deviation of each cluster in each variable. Each object's membership probability in each cluster is available. Residual analysis can reveal structure missing from the model.

The basic text for GMMs is Finite Mixture Models (McLachlan & Peel 2000, 11K citations). The volume Model-based Clustering and Classification for Data Science (Bouveyron et al. 2019) gives sophisticated extensions to the standard procedure, discussing strengths and pitfalls of applying GMMs in the R/CRAN environment. Review articles on GMMs, including on astronomical usage, appear in the CRC Handbook of Mixture Analysis (2020).

Exercise 4: Run mclust on the COMBO-17 color-magnitude diagram. Follow the procedures in Chapters 2 and 3 of Bouveyron's book.

IV Supervised classification methods

We next apply a number of important supervised classifiers to the problem of classifying point sources in the Sloan Digital Sky Survey into three groups: normal stars, white dwarf stars, and quasars that appear star-like but are actually distant extragalactic objects. We treat the problem in four dimensions of color indices between the five Sloan photometric bands. Note that we could add other variables, such as magnitude in a fiducial band and location in the sky.

In the script below, we first ingest a test set of ~12,000 point sources from the Sloan Digital Sky Survey, and then three training sets: 2000 quasars, 2000 white dwarfs, and 5000 main sequence and giant stars. The training samples are based on tedious spectroscopic confirmation with other telescopes. We combine the three training sets into a single R data.frame with a new column giving the known class identifiers (1=quasar, 2=star, 3=white dwarf). We then divide the training set into 80% for designing the classifiers and 20% for validation. Our science goal is to classify the test dataset based on the best classifiers.

We first create and plot 2D projections of the test set data frame, and then combine the three training sets into a single labeled data.frame as described above. Our science goal is to classify the test dataset using the best classifiers we can generate from the training sets.

To assist with evaluation of different classifiers, let's make a function that prints and plots the confusion matrix showing the number of objects correctly and incorrectly classified. The accuracy value summarizes the classifier's performance ... we want it to be as close to 1.00 as possible.
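A minimal sketch of such a helper function; the accuracy is the fraction of objects on the diagonal of the confusion matrix:

```r
# Print and plot a confusion matrix and report the overall accuracy.
confusion <- function(predicted, known) {
  conf_mat <- table(Predicted = predicted, Known = known)
  accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
  print(conf_mat)
  cat("Accuracy =", round(accuracy, 3), "\n")
  image(1:nrow(conf_mat), 1:ncol(conf_mat), unclass(conf_mat),
        xlab = "Predicted class", ylab = "Known class")
  invisible(conf_mat)
}
```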

We will use 80% of the training set for classification, and save 20% of the training set for classifier validation.
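A sketch of the split, assuming the combined training set is a data frame train_all with a factor column class (illustrative names):

```r
# Randomly split the training set: 80% to build classifiers, 20% for validation.
set.seed(123)                                             # reproducible split
idx      <- sample(nrow(train_all), round(0.8 * nrow(train_all)))
train_80 <- train_all[idx, ]
valid_20 <- train_all[-idx, ]
```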

Before supervised classification, we make two tests. First, we show that random classification assignments will give an accuracy of 0.33 for three classes. This is obvious. Second, we try an unsupervised clustering algorithm, k-means partitioning. Since clustering algorithms performed poorly for the simpler red sequence vs. blue cloud galaxy groups above, we do not expect them to do well discriminating these overlapping, multivariate distributions with very non-Gaussian morphologies. The k-means procedure did not begin to separate the groups until k>5, but then it created false groupings among the normal stars. Altogether, k-means partitioning does a terrible job of separating the three classes.
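A sketch of the k-means trial on the four color indices (the column names u_g, g_r, r_i, i_z are illustrative, and centers can be varied):

```r
# k-means with 3 centers; the cluster labels are arbitrary, so the confusion
# matrix must be read by eye rather than through its diagonal accuracy.
color_cols <- c("u_g", "g_r", "r_i", "i_z")
km <- kmeans(train_80[, color_cols], centers = 3, nstart = 25)
confusion(km$cluster, train_80$class)        # helper function defined above
```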

We now proceed with six multivariate classifiers commonly used in data mining: linear discriminant analysis; k-nearest neighbor classification; a simple neural network; Classification And Regression Trees; CART with Random Forests; and Support Vector Machine.

The R script is similar for each of the classification methods. We first bring the classifier into our R session, run it on the 80% training dataset, and save the results in an R object. We then produce two plots: one of the Sloan color-color diagrams with colors based on the new classifications; and a confusion matrix plot showing the predicted classes vs. the known classes for the 20% validation dataset. In some cases, we use R's predict function to apply the classifier to new datasets.

Exercise 5: As you proceed with each classifier, read the help files and run str, print and plot on the output of these classifiers to gain some understanding of their functioning.

Surprisingly, linear discriminant analysis did not do poorly, even though it is limited to a single hyperplane in the 4-dimensional color space that optimally separates the classes under the assumption of multivariate normal distributions. Even though the class shapes are far from normal and the true separators are not linear, the classification accuracy is 93%.
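A sketch of the LDA fit and its validation, using the illustrative names above:

```r
# Linear discriminant analysis on the four color indices.
library(MASS)
lda_fit  <- lda(class ~ u_g + g_r + r_i + i_z, data = train_80)
lda_pred <- predict(lda_fit, newdata = valid_20)$class
confusion(lda_pred, valid_20$class)
```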

Here we see that the k-nearest neighbor algorithm -- a simple voting procedure among neighbors in 4-dimensional color space -- performed very well for this classification problem, with only 2% misclassifications. Note that k-NN (like LDA and some other classifiers) depends on a distance metric. This makes sense when (as in this case) all variables have the same units and similar ranges, but can be problematic when variables of very different units and ranges are combined. Also, there is a worry that the k-NN voting will depend on the arbitrary sizes chosen for the training sets (2000 for quasars, 2000 for white dwarfs, and 5000 for stars). As we will see when we apply a classifier to the test set, these sample proportions are far from realistic. For example, the test set has many fewer white dwarfs than the other classes, so there is a danger of biasing the votes with too many white dwarfs in the training set.
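A sketch of the k-NN classification (k = 4 is illustrative and worth varying):

```r
# k-nearest neighbors vote directly on the validation objects.
library(class)
color_cols <- c("u_g", "g_r", "r_i", "i_z")      # illustrative column names
knn_pred <- knn(train = train_80[, color_cols], test = valid_20[, color_cols],
                cl = train_80$class, k = 4)
confusion(knn_pred, valid_20$class)
```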

Here we apply a very simple neural network with only one hidden layer of 5 nodes. Yet it gives a classification accuracy of 96%. Some quasars are incorrectly classified as horizontal branch giant stars. Investigation shows that the performance is improved with a larger and deeper network, with repeated back-propagation steps. This can take considerably more computing time.
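A sketch of such a network with the nnet package (one hidden layer of 5 nodes as in the text; maxit is illustrative):

```r
# Single-hidden-layer neural network; 'class' is assumed to be a factor.
library(nnet)
nn_fit  <- nnet(class ~ u_g + g_r + r_i + i_z, data = train_80,
                size = 5, maxit = 500, trace = FALSE)
nn_pred <- predict(nn_fit, newdata = valid_20, type = "class")
confusion(nn_pred, valid_20$class)
```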

This CART classification gave 95% accuracy. There are many options to CART, and performance may improve with tuning of its operating parameters. Below are two diagnostic diagrams, one showing the tree splits and the other the criterion for pruning the tree to ~12 nodes. A major advantage of CART is its single-variable splitting at each node, which improves interpretability. The astronomer might be quite interested, for example, that the most important splits are at $u-g=0.79$, $u-g=2.83$, and $r-i=0.03$.
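A sketch of the kind of rpart call that produces such a tree and its pruning diagnostics:

```r
# CART classification tree with pruning diagnostics.
library(rpart)
cart_fit  <- rpart(class ~ u_g + g_r + r_i + i_z, data = train_80, method = "class")
plotcp(cart_fit)                              # cross-validated error vs. tree size
plot(cart_fit); text(cart_fit, cex = 0.7)     # tree structure and split values
cart_pred <- predict(cart_fit, newdata = valid_20, type = "class")
confusion(cart_pred, valid_20$class)
```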

We now arrive at the two classification methods that gained the most widespread usage during the two decades 2000-2019: Random Forests and SVM.

We have run Random Forest, where many decision trees are constructed and averaged using strategies of bagging (bootstrap resampling) and random feature selection. We run the classification evaluation on three subsets of the training data: the 80% training data (the 100% accuracy here is not interesting), the out-of-bag (oob) subset representing a validation set internal to the Random Forest, and our 20% validation subset. The latter two give 98% accuracy. Random Forest has many options, and the performance may be improvable.
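A sketch of the Random Forest fit with the randomForest package (ntree is illustrative); the printed object includes the out-of-bag error estimate:

```r
# Random Forest classifier; 'class' is assumed to be a factor.
library(randomForest)
rf_fit  <- randomForest(class ~ u_g + g_r + r_i + i_z, data = train_80,
                        ntree = 500, importance = TRUE)
print(rf_fit)                                 # out-of-bag (oob) error estimate
rf_pred <- predict(rf_fit, newdata = valid_20)
confusion(rf_pred, valid_20$class)
```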

The Support Vector Machine calculation is computer intensive, as it performs many operations, some in higher-dimensional kernel spaces. But the performance is excellent, with 98% recovery of the true classes in the 20% validation set.
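A sketch of the SVM fit with e1071 using its default radial kernel (cost and gamma are tunable and left at defaults here); the final line applies the classifier to an illustrative test-set data frame sdss_test:

```r
# Support Vector Machine classifier applied to the validation and test sets.
library(e1071)
svm_fit   <- svm(class ~ u_g + g_r + r_i + i_z, data = train_80, kernel = "radial")
svm_pred  <- predict(svm_fit, newdata = valid_20)
confusion(svm_pred, valid_20$class)
svm_class <- predict(svm_fit, newdata = sdss_test)   # classify the ~12,000 test sources
```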

Exercise 6: Take one or more of the sophisticated classifiers -- neural nets, Random Forests, SVM -- and study its performance carefully for the problem here. Read the help files, and appropriate chapter in a book on machine learning. Examine the output R objects with str and plot various quantities. Tweak various parameters of the method, and study the effect on the classifier performance.

We are now ready to stand back from our statistical analysis and examine the situation from a scientific (astronomical) viewpoint. The k-NN, Random Forest and SVM classifiers all showed excellent performance on the 20% validation dataset, with scientifically reasonable results on the test dataset. But there were some methodology questions about the k-NN procedure (e.g. the effect of training set sizes on voting, the value of k). So it is reasonable to decide that the Random Forest and SVM classifiers are the best.

Exercise 7: Produce the 2x2 contingency matrix -- True Positive, False Positive, True Negative, False Negative -- for the SVM and Random Forest classifiers. Calculate (and try to interpret) various scalar measures of its performance such as Accuracy, Sensitivity, Specificity, Precision & F1 Score; see the details in the Wikipedia article on the confusion matrix.

The last step is to bring the results out of R for use by software in other languages, by colleagues, and for publication in a journal. We need publication-quality graphics and a table of the Sloan test set (12K lines) with a new column showing the derived classification of each object. Note that we could, using library(xtable), produce the table in LaTeX format.
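A sketch of the export step, using the illustrative sdss_test and svm_class objects from the SVM sketch above:

```r
# Write an ASCII table of the test set with its new classification column,
# and an optional LaTeX version of the first few rows.
sdss_out <- data.frame(sdss_test, class_pred = svm_class)
write.table(sdss_out, file = "SDSS_test_classified.dat", row.names = FALSE, quote = FALSE)
library(xtable)
print(xtable(head(sdss_out)), file = "SDSS_test_classified_sample.tex")
```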

Exercise 8: Random Forest often performs very well, but more recent related methods — such as XGBoost and LightGBM — are more computationally efficient and may give better treatment of complex problems. LightGBM, for example, has won many competitions since 2017 and was prominent in the 2018 astronomical LSST PLAsTiCC challenge. The exercise here is to apply XGBoost and LightGBM to the SDSS point source classification problem. See CRAN packages xgboost and lightgbm with additional features in EIX and SHAPforxgboost.