Adapted from R scripts in Appendix B, Modern Statistical Methods for Astronomy With R Applications, Eric D. Feigelson & G. Jogesh Babu 2012 http://astrostatistics.psu.edu/MSMA Improvements by Gabriel Caceres 2017.
The popular phrases data science and data mining do not have strict definitions, but a common theme is to separate heterogeneous datasets into distinct groupings. When only the dataset is available, without prior knowledge of the groups, the problem is called clustering. When prior knowledge in the form of training sets with labels for the groups is available, the problem is called classification. The availability of reliable training sets can be remarkably helpful: classification procedures often give far more effective results than clustering procedures.
# Setup
#install.packages('fpc', repos='https://cloud.r-project.org') # Flexible Procedures for Clustering
library(fpc)
#install.packages('class', repos='https://cloud.r-project.org') # Functions for Classification
library(class)
#install.packages('randomForest', repos='https://cloud.r-project.org') # Random Forests
library(randomForest)
#install.packages('e1071', repos='https://cloud.r-project.org') # Support Vector Machine
library(e1071)
Here we exercise several nonparametric procedures to identify groupings in multivariate datasets; that is, tabular data with scalar quantities in rows and columns. These procedures are algorithmic: they generally are not based on principles like 'maximum likelihood estimation' founded on powerful theorems. They often have one or more adjustable parameters with little mathematical advice on setting their values. Consequently, the scientific results from nonparametric clustering often depend on the chosen procedure and thresholds, with reduced reproducibility and reliability.
We apply clustering techniques to a famous discovery in extragalactic astronomy. A 17-band photometric survey of galaxies, called COMBO-17, revealed a bifurcation in the color-magnitude diagram of normal galaxies, showing a red sequence and a blue cloud separated by a green valley (Bell et al. 2004). We study this here in a bivariate color-magnitude diagram using unsupervised methods. Note that we use a 2D dataset only for convenience; the statistical methods are designed for multi-dimensional problems, although some are effectively restricted to relatively few dimensions.
COMBO_loz=read.table('COMBO17_lowz.dat', header=T, fill=T)
str(COMBO_loz)
dim(COMBO_loz)
names(COMBO_loz)
names(COMBO_loz) <- c('MB', 'M280-MB') ; names(COMBO_loz)
plot(COMBO_loz, pch=20, cex=0.5, xlim=c(-22,-7), ylim=c(-2,2.5),
xlab=expression(M[B]~~(mag)), ylab=expression(M[280] - M[B]~~(mag)),
main='')
Visual examination of the scatter plot shows an elongated distribution where redder galaxies are more luminous. There are hints of structure in the diagram: a handful of galaxies in the extreme red-luminous corner, possible bimodality where the galaxies are most common, and a few outliers with very red colors but intermediate luminosity.
Let us first apply an often-useful nonparametric smoothing procedure known as kernel density estimation, where a point process is convolved with a kernel function, here a 2-dimensional Gaussian. We use R's function kde2d.
# Two-dimensional kernel-density estimator
library(MASS)
COMBO_loz_sm <- kde2d(COMBO_loz[,1], COMBO_loz[,2], h=c(1.6,0.4),
lims = c(-22,-7,-2,2.5), n=500)
image(COMBO_loz_sm, col=grey(13:0/15), xlab=expression(M[B]~~(mag)),
ylab=expression(M[280] - M[B]~~(mag)), xlim=c(-22,-7), ylim=c(-2,2.5),
xaxp=c(-20,-10,2))
text(-16.5, -1, "Blue cloud", col='darkblue', pos=4, cex=0.8)
text(-17,-0.7, 'Green valley', col='darkgreen', pos=4, cex=0.8)
text(-13, -0.2, 'Red sequence', col='red', pos=4, cex=0.8)
text(-18.5, 1.7, 'Bright cluster galaxies', col='deeppink3', pos=4, cex=0.8)
dev.copy2pdf(file='COMBO17_CMD.pdf')
This smoothed distribution dramatically shows a bifurcation in the galaxy distribution. The text labels give the commonly used designations: blue cloud, green valley, red sequence, and bright cluster galaxies (BCGs).
Many clustering and classification procedures locate groupings based on some measure of distance in multivariate p-space. Formulating a distance metric is not trivial when the variables have different units and ranges. In addition, astronomers habitually apply logarithmic transformations to wide-range variables and do not transform narrow-range variables. This will affect the distance metric.
First, it is necessary to somehow normalize the distance measures so variables with wide ranges do not dominate over variables with small ranges. The most common procedure is called standardization, where each variable is scaled by subtracting its mean and dividing by its standard deviation. All variables then have zero mean and unit standard deviation.
Once the variables are standardized, a distance metric is devised. Typically this is the Euclidean distance based on the Pythagorean formula, but other metrics can be used. Again, the results of a clustering algorithm may depend on this choice.
# Standardize variables and compute a Euclidean distance matrix
Mag_std <- scale(COMBO_loz[,1])
Color_std <- scale(COMBO_loz[,2])
COMBO_std <- cbind(Mag_std,Color_std)
COMBO_dist <- dist(COMBO_std)
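As a quick check of the point about metric choice, one could recompute the distance matrix with a different metric; the Manhattan (city-block) choice below is just one illustrative alternative, not part of the original script.
# Alternative distance metric (illustrative): Manhattan (city-block) distances
COMBO_dist_man <- dist(COMBO_std, method='manhattan')
summary(as.vector(COMBO_dist_man) - as.vector(COMBO_dist))  # Manhattan distances are >= Euclidean ones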
We now construct a dendrogram, or tree, for hierarchical clustering. The procedure is to join the closest two points into a cluster, which is then treated as a single point. This process is iterated until the entire sample forms a single cluster. However, choices must be made on the meaning of distance between two clusters: do we consider the closest members of the two clusters (single linkage)? the centroid of each cluster (average linkage)? the furthest members of each cluster (complete linkage)? or some other choice? Here we use an intermediate choice known as Ward's method, which leads to a minimum-variance clustering.
We construct the resulting dendrogram using R's hclust. While one might think that the first split corresponds to the valley between the red sequence and blue cloud, this split does not appear until the tree is cut at k=5 clusters. We then plot the original data with five symbol shapes reflecting the five clusters.
# Hierarchical clustering
COMBO_hc <- hclust(COMBO_dist, method='ward.D')
plot(COMBO_hc, label=F)
# Cutting the tree at k=5 clusters
COMBO_hc5a <- rect.hclust(COMBO_hc, k=5, border='red')
COMBO_hc5b <- cutree(COMBO_hc, k=5)
# Show this hierarchical clustering solution
plot(COMBO_loz, pch=(COMBO_hc5b+18), cex=0.7, xlab=expression(M[B]~~(mag)),
ylab=expression(M[280] - M[B]~~(mag)), main='')
The results are not very satisfying. The BCGs are blurred together with less luminous red outliers, the red sequence is divided into two portions, and the blue cloud includes two very blue outliers.
Exercise 1: Read the help files associated with hierarchical clustering and try different parameters.
A different approach, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), was developed in the 1990s by computer scientists (Ester et al. 1996, 21K citations). It starts by finding points with many close neighbors within a specified distance (the reach parameter), merges them, and percolates outward until no new neighbors within the reach are found. Then sparse clusters with fewer than minPts members are excluded. Not all objects must be included in clusters; rather, clusters are assumed to be embedded in low-density noise.
# Density-based clustering algorithm
# install.packages('fpc')
library(fpc)
COMBO_dbs <- dbscan(COMBO_std, eps=0.1, MinPts=5, method='raw')
print(COMBO_dbs) ; COMBO_dbs$cluster
plot(COMBO_loz[COMBO_dbs$cluster==0,], pch=20, cex=0.7, xlab='M_B (mag)',
ylab='M_280 - M_B (mag)')
points(COMBO_loz[COMBO_dbs$cluster==2,], pch=2, cex=1.0, col='blue2')
points(COMBO_loz[COMBO_dbs$cluster==1 | COMBO_dbs$cluster==3,], pch=1, cex=1.0, col='orangered3')
In the plot above, we set both the color and symbol shapes by the resulting cluster vector. Despite the apparent advantages of DBSCAN over hierarchical clustering, it still has two parameters that must be arbitrarily tuned and often has weak performance. Here, attempts to include more galaxies in the red sequence or blue cloud resulted in unifying them, as the algorithm reached over the green valley.
DBSCAN was improved with a more complicated hierarchical DBSCAN (HDBSCAN) procedure by Campello et al. 2013 to give additional dynamic range and the ability to separate entangled clusters. We find here that the CRAN function hdbscan immediately identifies the principal red sequence and blue cloud groupings and gathers a reasonable fraction of their members. No additional clusters are identified. Various outputs and graphics are provided, such as cluster stability scores, outlier scores, and individual cluster membership probabilities. HDBSCAN is one of the top-performing nonparametric clustering procedures, and since 2017 it has been used in several dozen astronomical studies.
Exercise 2: Read the help file of DBSCAN and the vignette for HDBSCAN. Make a color-magnitude diagram highlighting outliers.
library(dbscan)
COMBO_hdbs <- hdbscan(COMBO_std, minPts=25)
print(COMBO_hdbs)
print(COMBO_hdbs$cluster_scores)
cat('COMBO-17 galaxy ', which.max(COMBO_hdbs$outlier_scores), ' is the most extreme outlier')
plot(COMBO_hdbs)
plot(COMBO_hdbs$hc)
In conclusion, it was difficult to get standard nonparametric clustering algorithms to perform well, even in relatively simple situations. Results differ with the choice of method, clustering parameters had to be carefully tuned, and even then the results may not be very satisfactory. In our case here, simple kernel smoothing did a better job of showing the structure, although it does not define cluster membership.
Astronomers most commonly use single-linkage hierarchical clustering, nicknamed the friends-of-friends algorithm. It has been used in nearly 4000 studies in the past three decades. This is perhaps the worst choice because, as has long been established, single-linkage spuriously chains together unrelated groupings in the presence of noise. It is most effective in identifying outliers in multivariate datasets.
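To see the chaining behavior for ourselves, here is a brief sketch (not in the original script) that reuses the COMBO_dist matrix from above with single linkage; the k=5 cut is chosen only to match the Ward's-method example.
# Single-linkage ('friends-of-friends') clustering for comparison with Ward's method
COMBO_sl <- hclust(COMBO_dist, method='single')
COMBO_sl5 <- cutree(COMBO_sl, k=5)
table(COMBO_sl5)  # inspect cluster sizes; chaining tends to produce one dominant cluster plus outliers
plot(COMBO_loz, pch=20, cex=0.5, col=COMBO_sl5, xlab=expression(M[B]~~(mag)),
ylab=expression(M[280] - M[B]~~(mag)), main='')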
Exercise 3: Apply one of the recent (2008) best-performing clustering algorithms, called t-distributed stochastic neighbor embedding or, more briefly, t-SNE. It is implemented in the CRAN packages tsne, Rtsne, M3C, and mmtsne. Background and cookbook tutorials are widely available online.
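A possible starting point for Exercise 3, assuming the Rtsne package; the perplexity value is an arbitrary default, not a recommendation.
# install.packages('Rtsne', repos='https://cloud.r-project.org') # t-SNE embedding
library(Rtsne)
set.seed(1)
COMBO_tsne <- Rtsne(COMBO_std, dims=2, perplexity=30, check_duplicates=FALSE)
plot(COMBO_tsne$Y, pch=20, cex=0.5, xlab='t-SNE 1', ylab='t-SNE 2', main='')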
We do not have time in this tutorial to present model-based clustering methods, most commonly Gaussian mixture models (GMMs). Here the astronomer must assume that the cluster shapes follow multivariate normal distributions. Optimal solutions are found by maximum likelihood estimation with $k$, the number of clusters, chosen by optimizing the Bayesian Information Criterion. The calculation can be challenging; a dataset with 6 variables and 5 clusters, for example, will have a 61-dimensional likelihood function (6 means and 6 standard deviations for each of the 5 clusters, plus $k$ itself).
The result of a parametric clustering analysis is often satisfying. Best-fit parameters give the mean location and standard deviation of each cluster in each variable. Each object's membership probability in each cluster is available. Residual analysis can reveal structure missing from the model.
The basic text for GMMs is Finite Mixture Models (McLachlan & Peel 2000, 11K citations). The volume Model-based Clustering and Classification for Data Science (Bouveyron et al. 2019) gives sophisticated extensions to the standard procedure, discussing strengths and pitfalls of applying GMMs in the R/CRAN environment. Review articles on GMMs, including on astronomical usage, appear in the CRC Handbook of Mixture Analysis (2020).
Exercise 4: Run mclust on the COMBO-17 color-magnitude diagram. Follow the procedures in Chapters 2 and 3 of Bouveyron's book.
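A possible starting point for Exercise 4, assuming the CRAN package mclust; by default Mclust tries k = 1 to 9 components and selects k and the covariance model by BIC.
# install.packages('mclust', repos='https://cloud.r-project.org') # Gaussian mixture models
library(mclust)
COMBO_mc <- Mclust(COMBO_std)      # number of components and covariance model selected by BIC
summary(COMBO_mc)
plot(COMBO_mc, what='classification')
plot(COMBO_mc, what='BIC')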
We next apply a number of important supervised classifiers to the problem of classifying point sources in the Sloan Digital Sky Survey into three groups: normal stars, white dwarf stars, and quasars that appear star-like but are actually distant extragalactic objects. We treat the problem in four dimensions of color indices between the five Sloan photometric bands. Note that we could add other variables, such as magnitude in a fiducial band and location in the sky.
In the script below, we first ingest a test set of ~12,000 point sources from the Sloan Digital Sky Survey, and then three training sets: 2000 quasars, 2000 white dwarfs, and 5000 main-sequence and giant stars. The training samples are based on tedious spectroscopic confirmation with other telescopes. We first create and plot 2D projections of the test-set data frame. Then we combine the three training sets into a single R data.frame with a new column giving the known class identifiers (1=quasar, 2=star, 3=white dwarf). We divide the training set into 80% for designing the classifiers and 20% for validation. Our science goal is to classify the test dataset based on the best classifiers we can generate from the training sets.
# SDSS point sources test dataset, N=17,000 (mag<21, point sources, hi-qual)
SDSS <- read.csv('SDSS_test.csv', h=T)
dim(SDSS) ; summary(SDSS)
SDSS_test <- data.frame( # create data frame of SDSS colors
u_g = SDSS$u_mag - SDSS$g_mag,
g_r = SDSS$g_mag - SDSS$r_mag,
r_i = SDSS$r_mag - SDSS$i_mag,
i_z = SDSS$i_mag - SDSS$z_mag
)
names(SDSS_test) <- c('u_g', 'g_r', 'r_i', 'i_z')
str(SDSS_test)
par(mfrow=c(1,3)) # plot test set in black because labels are not yet established
plot(SDSS_test[,1], SDSS_test[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8),pch=20,
cex=0.6, cex.lab=1.5, cex.axis=1.5, main='Test dataset', xlab='u-g (mag)', ylab='g-r (mag)')
plot(SDSS_test[,2], SDSS_test[,3], xlim=c(-0.7,1.8), ylim=c(-0.7,1.8), pch=20,
cex=0.6, cex.lab=1.5, cex.axis=1.5, main='', xlab='g-r (mag)', ylab='r-i (mag)')
plot(SDSS_test[,3], SDSS_test[,4], xlim=c(-0.7,1.8), ylim=c(-1.1,1.3), pch=20,
cex=0.6, cex.lab=1.5, cex.axis=1.5, main='', xlab='r-i (mag)', ylab='i-z (mag)')
par(mfrow=c(1,1))
# Quasar training set, N=2000 (Class 1)
qso1 <- read.table('SDSS_QSO.dat', h=T)
dim(qso1) ; summary(qso1)
bad_phot_qso <- which(apply(qso1[,c(3,5,7,9,11)] > 21.0, 1, any) | qso1[,3]==0) # identify rows with bad photometry
qso2 <- qso1[setdiff(1:nrow(qso1), bad_phot_qso),][1:2000,] # remove bad rows, keep the first 2000 quasars
qso3 <- cbind((qso2[,3]-qso2[,5]), (qso2[,5]-qso2[,7]), (qso2[,7]-qso2[,9]), (qso2[,9]-qso2[,11])) # cbind concatenates colums
qso_train <- data.frame(cbind(qso3, rep(1, length(qso3[,1]))))
names(qso_train) <- c('u_g', 'g_r', 'r_i', 'i_z', 'Class')
dim(qso_train) ; summary(qso_train)
# Star training set, N=5000 (Class 2)
temp2 <- read.csv('SDSS_stars.csv', h=T)
dim(temp2) ; summary(temp2)
star <- cbind((temp2[,1]-temp2[,2]), (temp2[,2]-temp2[,3]), (temp2[,3]-temp2[,4]),
(temp2[,4]-temp2[,5]))
star_train <- data.frame(cbind(star, rep(2, length(star[,1]))))
names(star_train) <- c('u_g','g_r','r_i','i_z','Class')
dim(star_train)
# White dwarf training set, N=2000 (Class 3)
temp3 <- read.csv('SDSS_wd.csv', h=T)
dim(temp3) ; summary(temp3)
temp3 <- na.omit(temp3) # remove objects with missing data
wd <- cbind((temp3[1:2000,2]-temp3[1:2000,3]), (temp3[1:2000,3]-temp3[1:2000,4]),
(temp3[1:2000,4]-temp3[1:2000,5]), (temp3[1:2000,5]-temp3[1:2000,6]))
wd_train <- data.frame(cbind(wd, rep(3, length(wd[,1]))))
names(wd_train) <- c('u_g', 'g_r', 'r_i', 'i_z', 'Class')
dim(wd_train)
# Combine and plot the training set (9000 objects)
SDSS_train <- data.frame(rbind(qso_train, star_train, wd_train)) # rbind concatenates rows
names(SDSS_train) <- c('u_g', 'g_r', 'r_i', 'i_z', 'Class')
str(SDSS_train)
par(mfrow=c(1,3)) # plot training set in colors representing labeled classes
plot(SDSS_train[,1], SDSS_train[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8), pch=20,
col=SDSS_train[,5], cex=0.6, cex.lab=1.6, cex.axis=1.6, main='Training dataset', xlab='u-g (mag)',
ylab='g-r (mag)')
legend(-0.5, 1.7, c('QSO','MS + RG','WD'), pch=20, col=c('black','red','green'),
cex=0.8)
plot(SDSS_train[,2], SDSS_train[,3], xlim=c(-0.7,1.8), ylim=c(-0.7,1.8), pch=20,
col=SDSS_train[,5], cex=0.6, cex.lab=1.6, cex.axis=1.6, main='', xlab='g-r (mag)',
ylab='r-i (mag)')
plot(SDSS_train[,3], SDSS_train[,4], xlim=c(-0.7,1.8), ylim=c(-1.1,1.3), pch=20,
col=SDSS_train[,5], cex=0.6, cex.lab=1.6, cex.axis=1.6, main='', xlab='r-i (mag)',
ylab='i-z (mag)')
par(mfrow=c(1,1))
To assist with evaluation of the different classifiers, let's make a function that prints and plots the confusion matrix showing the number of objects correctly and incorrectly classified. The accuracy value summarizes the classifier's performance; we want it to be as close to 1.00 as possible.
## Function to evaluate classification performance
class_eval <- function(pred, act, plot=TRUE, ...){
iact <- as.integer(act)
ipred <- as.integer(pred)
acc <- sum(ipred==iact)/length(iact) # accuracy
if (isTRUE(plot)){
plot(jitter(ipred), jitter(iact), pch=20, cex=0.5, xlab='Predicted Class', ylab='True class',lab=c(3,3,1), ...)
mtext(paste("Accuracy =", round(acc, 3)))
}
return(list("Confusion Table"=table("True Class"=iact, "Predicted Class"=ipred), Accuracy=acc))
}
We will use 80% of the training set for classification, and save 20% of the training set for classifier validation.
## Copy original full dataset before splitting
SDSS_train_full <- SDSS_train
set.seed(456)
## Save 20% of training set for validation
val_set <- sample(nrow(SDSS_train), round(nrow(SDSS_train)*0.2))
SDSS_val <- SDSS_train_full[val_set,]
SDSS_train <- SDSS_train_full[-val_set,]
Before supervised classification, we make two tests. First, we show that random classification assignments will give an accuracy of 0.33 for three classes. This is obvious. Second, we try an unsupervised clustering algorithm, k-means partitioning. Since clustering algorithms performed poorly for the simpler red sequence vs. blue cloud galaxy groups above, we do not expect them to do well discriminating these overlapping, multivariate distributions with very non-Gaussian morphologies. The k-means procedure did not begin to separate the groups until k>5, but then it created false groupings among the normal stars. Altogether, k-means partitioning does a terrible job of separating the three classes.
## Test of classification evaluation: random class assignment
set.seed(123)
SDSS_rand_train_pred <- sample(1:3, nrow(SDSS_train), replace=T)
par(mfcol=c(1, 1))
class_eval(SDSS_rand_train_pred, SDSS_train$Class, main="Random Assignment")
# Unsupervised k-means partitioning
SDSS.kmean <- kmeans(SDSS_test,6)
print(SDSS.kmean$centers)
plot(SDSS_test[,1], SDSS_test[,2], pch=20, cex=0.3, col=gray(SDSS.kmean$cluster/7),
xlab='u-g (mag)', ylab='g-r (mag)', xlim=c(-0.5,3), ylim=c(-0.6,1.5))
We now proceed with six multivariate classifiers commonly used in data mining: linear discriminant analysis; k-nearest neighbor classification; a simple neural network; Classification And Regression Trees; CART with Random Forests; and Support Vector Machine.
The R script is similar for each of the classification methods. We first bring the classifier into our R session, run it on the 80% training dataset, and save the results in an R object. We then produce two plots: one of the Sloan color-color diagrams with colors based on the new classifications, and a confusion-matrix plot showing the predicted classes vs. the known classes for the 20% validation dataset. In some cases, we use R's predict function to apply the classifier to new datasets.
Exercise 5: As you proceed with each classifier, read the help files and run str, print, and plot on the output of these classifiers to gain some understanding of their functioning.
# Linear discriminant analysis
library(MASS)
SDSS_lda <- lda(SDSS_train[,1:4], as.factor(SDSS_train[,5]))
SDSS_lda_train_pred <- predict(SDSS_lda)$class
SDSS_lda_val_pred <- predict(SDSS_lda, SDSS_val[,1:4])$class
SDSS_lda_test_pred <- predict(SDSS_lda, SDSS_test[,1:4])$class
plot(SDSS_val[,1],SDSS_val[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8), pch=20,
col=SDSS_lda_val_pred, cex=0.5, main='', xlab='u-g (mag)', ylab='g-r (mag)')
class_eval(SDSS_lda_val_pred, SDSS_val$Class, main="LDA Classification")
Surprisingly, linear discriminant analysis did not do poorly, even though it is limited to linear (hyperplane) boundaries in the 4-dimensional color space that optimally separate the classes assuming multivariate normal distributions. Even though the class shapes are far from normal and the true separators are not linear, the classification accuracy is 93%.
# k-nn classification
library(class)
SDSS_knn_test_pred <- knn(SDSS_train[,1:4], SDSS_test, as.factor(SDSS_train[,5]), k=5, prob=T)
SDSS_knn_val_pred <- knn(SDSS_train[,1:4], SDSS_val[,1:4], as.factor(SDSS_train[,5]), k=5, prob=T)
plot(SDSS_val[,1], SDSS_val[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8), pch=20,
col=SDSS_knn_val_pred, cex=0.5, main='', xlab='u-g (mag)', ylab='g-r (mag)')
class_eval(SDSS_knn_val_pred, SDSS_val$Class, main="kNN Classification")
Here we see that the k-nearest neighbor algorithm -- a simple voting procedure among neighbors in 4-dimensional color space -- performed very well for this classification problem, with only 2% misclassifications. Note that k-NN (like LDA and some other classifiers) depends on a distance metric. This makes sense when (as in this case) all variables have the same units and similar ranges, but can be problematic when variables of very different units and ranges are combined. Also, there is a worry that the k-NN voting will depend on the arbitrary choice of training-set sizes (2000 for quasars, 2000 for white dwarfs, and 5000 for stars). As we will see when we apply a classifier to the test set, these sample proportions are far from realistic. For example, the test set has many fewer white dwarfs than members of the other classes, so there is a danger of biasing the votes with too many white dwarfs in the training set.
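One informal way to probe the 'value of k' concern is to scan a few values of k and compare validation accuracy; this sketch is not part of the original script and the k values are arbitrary.
# Sensitivity of k-nn validation accuracy to the choice of k (illustrative values)
for (k in c(1, 3, 5, 11, 21)) {
SDSS_knn_k <- knn(SDSS_train[,1:4], SDSS_val[,1:4], as.factor(SDSS_train[,5]), k=k)
acc_k <- mean(as.integer(SDSS_knn_k) == as.integer(SDSS_val$Class))
cat('k =', k, ' validation accuracy =', round(acc_k, 3), '\n')
}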
# Neural network
library(nnet)
SDSS_nnet <- nnet(as.factor(Class) ~ u_g + g_r + r_i + i_z, SDSS_train,size=5)
SDSS_nnet_train_pred <- predict(SDSS_nnet, type="class")
SDSS_nnet_val_pred <- predict(SDSS_nnet, SDSS_val, type="class")
SDSS_nnet_test_pred <- predict(SDSS_nnet, SDSS_test, type="class")
plot(SDSS_val[,1], SDSS_val[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8), pch=20,
col=SDSS_nnet_val_pred, cex=0.5, main='', xlab='u-g (mag)', ylab='g-r (mag)')
class_eval(SDSS_nnet_val_pred, SDSS_val$Class, main="Neural Net Classification")
Here we apply a very simple neural network with only one hidden layer of 5 nodes. Yet it gives a classification accuracy of 96%. Some quasars are incorrectly classified as horizontal-branch giant stars. Investigation shows that the performance is improved with a larger and deeper network, with repeated back-propagation steps. This can take considerably more computing time.
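If you want to experiment along these lines, note that nnet supports only a single hidden layer, so 'deeper' networks require other packages; a larger single-layer fit can be sketched as below, where the size, decay, and maxit values are illustrative guesses rather than tuned choices.
# Larger single-hidden-layer network (illustrative settings, not tuned)
SDSS_nnet_big <- nnet(as.factor(Class) ~ u_g + g_r + r_i + i_z, SDSS_train,
size=10, decay=5e-4, maxit=500, trace=FALSE)
SDSS_nnet_big_val_pred <- predict(SDSS_nnet_big, SDSS_val, type="class")
class_eval(SDSS_nnet_big_val_pred, SDSS_val$Class, main="Larger Neural Net Classification")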
# Classification And Regression Tree
library(rpart)
SDSS_rpart <- rpart(SDSS_train[,5] ~., data=SDSS_train[,1:4], method="class")
summary(SDSS_rpart)
str(SDSS_rpart)
SDSS_rpart_train_pred <- predict(SDSS_rpart, type="class")
SDSS_rpart_val_pred <- predict(SDSS_rpart, SDSS_val, type="class")
SDSS_rpart_test_pred <- predict(SDSS_rpart, SDSS_test, type="class")
plot(SDSS_val[,1], SDSS_val[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8), pch=20,
col=SDSS_rpart_val_pred, cex=0.5,
main='', xlab='u-g (mag)', ylab='g-r (mag)')
class_eval(SDSS_rpart_val_pred, SDSS_val$Class, main="Tree Classification")
This CART classification gave 95% accuracy. There are many options to CART, and performance may improve with tuning of the operating parameters. Below are two diagnostic diagrams, one showing the tree splits and the other the criterion for pruning the tree to ~12 nodes. A major advantage of CART is its single-variable splitting at each node, which improves interpretability. The astronomer might be quite interested, for example, that the most important splits are at $u-g=0.79$, $u-g=2.83$, and $r-i=0.03$.
# Additional plots for decision tree
plot(SDSS_rpart, branch=0.5, margin=0.05)
text(SDSS_rpart, digits=3, use.n=T, cex=0.8)
plotcp(SDSS_rpart, lwd=2, cex.axis=1.3, cex.lab=1.3)
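If you wish to experiment with CART's operating parameters, one common recipe is to grow a deliberately large tree and then prune it back at a complexity parameter suggested by plotcp; the control values below are illustrative guesses, not recommendations.
# Grow a larger tree, then prune it (illustrative cp and minsplit values)
SDSS_rpart_big <- rpart(as.factor(Class) ~ u_g + g_r + r_i + i_z, data=SDSS_train,
method="class", control=rpart.control(cp=0.001, minsplit=10))
printcp(SDSS_rpart_big)
SDSS_rpart_prune <- prune(SDSS_rpart_big, cp=0.005)
SDSS_rpart_prune_val_pred <- predict(SDSS_rpart_prune, SDSS_val, type="class")
class_eval(SDSS_rpart_prune_val_pred, SDSS_val$Class, main="Pruned Tree Classification")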
We now arrive at the two classification methods that gained the most widespread usage during 2000-2019: Random Forests and SVM.
# CART with Random Forests
library(randomForest)
SDSS_rf <- randomForest(as.factor(Class) ~ u_g + g_r + r_i + i_z, data=SDSS_train, mtry=2, importance=TRUE, do.trace=TRUE, ntree=100)
print(SDSS_rf)
SDSS_rf_train_pred <- predict(SDSS_rf, SDSS_train)
SDSS_rf_oob_pred <- predict(SDSS_rf)
SDSS_rf_val_pred <- predict(SDSS_rf, SDSS_val)
SDSS_rf_test_pred <- predict(SDSS_rf, SDSS_test)
plot(SDSS_val[,1], SDSS_val[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8), pch=20,
col=SDSS_rf_val_pred, cex=0.5,
main='', xlab='u-g (mag)', ylab='g-r (mag)')
class_eval(SDSS_rf_train_pred, SDSS_train$Class, main="Random Forest (80% training set)")
class_eval(SDSS_rf_oob_pred, SDSS_train$Class, main="Random Forest (out-of-bag)")
class_eval(SDSS_rf_val_pred, SDSS_val$Class, main="Random Forest (20% validation set)")
We have run Random Forest, where many decision trees are constructed and averaged using bootstrap aggregation (bagging) with random feature selection. We run the classification evaluation on three subsets of the training data: the 80% training data (the 100% accuracy is not interesting here), the out-of-bag (OOB) subset representing a validation set internal to the Random Forest, and our 20% validation subset. The latter two give 98% accuracy. Random Forest has many options, and the performance may be improvable.
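Since importance=TRUE was set above, the variable importances can be inspected directly; the mtry scan below is one simple (illustrative) way to explore the tuning options.
# Variable importance from the fitted forest
importance(SDSS_rf)
varImpPlot(SDSS_rf)
# Illustrative scan over mtry using the out-of-bag error
for (m in 1:4) {
SDSS_rf_m <- randomForest(as.factor(Class) ~ u_g + g_r + r_i + i_z, data=SDSS_train,
mtry=m, ntree=100)
cat('mtry =', m, ' OOB error =', round(SDSS_rf_m$err.rate[100, 'OOB'], 4), '\n')
}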
# Support Vector Machine model, prediction and validation
library(e1071)
SDSS_svm <- svm(as.factor(SDSS_train[,5]) ~., data=SDSS_train[,1:4], cost=100, gamma=1) # factor response for classification
summary(SDSS_svm)
SDSS_svm_train_pred <- predict(SDSS_svm)
SDSS_svm_val_pred <- predict(SDSS_svm, SDSS_val)
SDSS_svm_test_pred <- predict(SDSS_svm, SDSS_test)
plot(SDSS_test[,1], SDSS_test[,2], xlim=c(-0.7,3), ylim=c(-0.7,1.8), pch=20,
col=round(as.numeric(SDSS_svm_test_pred)), cex=0.5, main='',
xlab='u-g (mag)', ylab='g-r (mag)')
class_eval(SDSS_svm_val_pred, SDSS_val$Class, main="SVM Classification")
The Support Vector Machine calculation is computationally intensive, as it performs many operations, some in higher dimensions. But the performance is excellent, with 98% recovery of the true classes in the 20% validation set.
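The cost and gamma values above were fixed arbitrarily; e1071 provides a cross-validated grid search that can be used to check them. This can be slow, the grid below is illustrative, and it may also serve as a starting point for Exercise 6 below.
# Optional cross-validated grid search over SVM cost and gamma (illustrative grid; slow)
SDSS_svm_tune <- tune.svm(as.factor(Class) ~ u_g + g_r + r_i + i_z, data=SDSS_train,
cost=10^(0:2), gamma=c(0.5, 1, 2))
summary(SDSS_svm_tune)
SDSS_svm_tune$best.parameters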
Exercise 6: Take one or more of the sophisticated classifiers -- neural nets, Random Forests, SVM -- and study its performance carefully for the problem here. Read the help files and the appropriate chapter in a book on machine learning. Examine the output R objects with str, and plot various quantities. Tweak various parameters of the method, and study the effect on the classifier performance.
We are now ready to stand back from our statistical analysis and examine the situation from a scientific (astronomical) viewpoint. The k-NN, Random Forest, and SVM classifiers all showed excellent performance on the 20% validation dataset, with scientifically reasonable results on the test dataset. But there were some methodology questions about the k-NN procedure (e.g. the effect of training-set size on voting, the value of k). So it is reasonable to decide that the Random Forest and SVM classifiers are the best.
Exercise 7: Produce the 2x2 contingency matrix -- True Positive, False Positive, True Negative, False Negative -- for the SVM and Random Forest classifiers. Calculate (and try to interpret) various scalar measures of their performance such as Accuracy, Sensitivity, Specificity, Precision, and F1 Score; see the details on Wikipedia.
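A possible starting point for Exercise 7 (not part of the original script): a small helper that collapses the three-class results into a one-vs-rest 2x2 table for a chosen class and computes the scalar measures. The function name binary_metrics is our own.
# One-vs-rest 2x2 counts and scalar performance measures for a chosen class
binary_metrics <- function(pred, act, positive=1) {
pred <- as.integer(pred) ; act <- as.integer(act)
TP <- sum(pred==positive & act==positive) ; FP <- sum(pred==positive & act!=positive)
TN <- sum(pred!=positive & act!=positive) ; FN <- sum(pred!=positive & act==positive)
sens <- TP/(TP+FN) ; prec <- TP/(TP+FP)
c(TP=TP, FP=FP, TN=TN, FN=FN, Accuracy=(TP+TN)/length(act), Sensitivity=sens,
Specificity=TN/(TN+FP), Precision=prec, F1=2*prec*sens/(prec+sens))
}
binary_metrics(SDSS_svm_val_pred, SDSS_val$Class, positive=1)   # quasars vs. rest, SVM
binary_metrics(SDSS_rf_val_pred, SDSS_val$Class, positive=1)    # quasars vs. rest, Random Forest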
The last step is to bring the results out of R for use by software in other languages, by colleagues, and for publication in a journal. We need publication-quality graphics and a table of the Sloan test set (12K lines) with a new column showing the derived classification of each object. Note that we could, using library(xtable), produce the table in LaTeX format.
# Final plot of test set results
par(mfrow=c(1,3))
plot(SDSS_test[,1], SDSS_test[,2], xlim=c(-0.7,3), col=round(as.numeric(SDSS_svm_test_pred)),
ylim=c(-0.7,1.8), pch=20, cex=0.5, main='Support Vector Machine', xlab='u-g (mag)',ylab='g-r (mag)')
plot(SDSS_test[,2], SDSS_test[,3], xlim=c(-0.7,1.8), col=round(as.numeric(SDSS_svm_test_pred)),
ylim=c(-0.7,1.8), pch=20, cex=0.5, main='Classification of', xlab='g-r (mag)',ylab='r-i (mag)')
plot(SDSS_test[,3], SDSS_test[,4], xlim=c(-0.7,1.8), col=round(as.numeric(SDSS_svm_test_pred)),
ylim=c(-1.1,1.3), pch=20, cex=0.5, main='SDSS Test Dataset', xlab='r-i (mag)',ylab='i-z (mag)')
par(mfrow=c(1,1))
# Write table of input data with output classifications onto disk
SDSS_test_svm_out <- cbind(SDSS[,6], SDSS[,7], SDSS_test, round(as.numeric(SDSS_svm_test_pred)))
names(SDSS_test_svm_out)[c(1,2,7)] <- c('R.A.', 'Dec', 'SVM Class')
write.table(format(SDSS_test_svm_out), file='SDSS_test_svm.out',sep='\t',quote=F)
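As noted above, the same table could be exported in LaTeX with the xtable package; a minimal sketch (showing only the first few rows) follows.
# install.packages('xtable', repos='https://cloud.r-project.org') # LaTeX tables
library(xtable)
print(xtable(head(SDSS_test_svm_out), digits=3), include.rownames=FALSE)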
Exercise 8: Random Forest often performs very well, but more recent related methods — such as XGBoost and LightGBM — are more computationally efficient and may give better treatment of complex problems. LightGBM, for example, has won many competitions since 2017 and was prominent in the 2018 astronomical LSST PLAsTiCC challenge. The exercise here is to apply XGBoost and LightGBM to the SDSS point-source classification problem. See CRAN packages xgboost and lightgbm, with additional features in EIX and SHAPforxgboost.
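A possible starting point for Exercise 8 with xgboost, using the core xgb.DMatrix/xgb.train interface; the max_depth, eta, and nrounds values are illustrative, not tuned.
# install.packages('xgboost', repos='https://cloud.r-project.org') # Gradient boosted trees
library(xgboost)
dtrain <- xgb.DMatrix(data=as.matrix(SDSS_train[,1:4]), label=SDSS_train$Class - 1)  # 0-based labels
SDSS_xgb <- xgb.train(params=list(objective='multi:softmax', num_class=3, max_depth=4, eta=0.3),
data=dtrain, nrounds=100)
SDSS_xgb_val_pred <- predict(SDSS_xgb, xgb.DMatrix(as.matrix(SDSS_val[,1:4]))) + 1   # back to labels 1,2,3
class_eval(SDSS_xgb_val_pred, SDSS_val$Class, main='XGBoost Classification')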