Regression

Summer School in Statistics for Astronomers

Eric D. Feigelson Spring 2021

In this tutorial, we exercise two types of regression using R and its CRAN packages. First, we consider cases where a global mathematical function is assumed to apply to the full dataset. These include linear and nonlinear regression, with options such as weighting by measurement errors and robust treatment of outliers. A crucial aspect is goodness-of-fit and residual analysis to confirm that the best-fit model is indeed a good fit. Second, we consider several local regression methods where a continuous curve is estimated from sequential pieces of the data.

1. Linear modeling

We start by running one of the most widely used functions in R, lm for linear modeling, and related procedures. But before we start, it is important to note that, in statistical parlance, 'linear' refers to a much broader range of models than $Y = \beta_0 + \beta_1 X + \epsilon$. It includes any model that is linear in the model parameters. Thus, for example, the polynomial model $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$ and the transformed-covariate model $Y = \beta_0 + \beta_1 \log X + \epsilon$ are 'linear' and are encompassed by the powerful theorems underlying linear modeling.

In contrast, models in which a parameter enters nonlinearly, such as $Y = \beta_0 e^{-\beta_1 X} + \epsilon$ or $Y = \beta_0 + \beta_1 X^{\beta_2} + \epsilon$, are nonlinear.

We will practice linear modeling with a collection of photometry of spectroscopically confirmed quasars from the Sloan Digital Sky Survey. We examine a relationship between the magnitudes in two bands; this is scientifically rather useless, but gives an opportunity to test methodology for simple linear regression with difficulties common in astronomical regressions: non-Gaussian scatter, heteroscedastic measurement errors, and outliers.

Note that the theory of linear modeling is not restricted to bivariate problems: it is intrinsically multivariate in the sense that a single response variable $Y$ can be a function of a vector of covariates $\bf{X}$ as in the simple model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon$.

When applied to a data.frame, the R function attach allows the user to access a column by its name (e.g. r_mag) without remembering its column number (e.g. qso[,3]).
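A minimal sketch of this idiom; the tiny data.frame and its column names here are stand-ins for the actual SDSS quasar table, not the tutorial's data:

```r
# Illustrative only: a tiny stand-in for the SDSS quasar table
qso <- data.frame(u_mag = c(19.2, 20.1, 18.7),
                  i_mag = c(18.0, 18.9, 17.6),
                  r_mag = c(18.5, 19.3, 18.0))
attach(qso)       # columns now accessible by name
mean(r_mag)       # same as mean(qso[, 3])
detach(qso)       # good practice: detach when finished
```

Beware that attach can silently mask variables already defined in the workspace; many R users prefer qso$r_mag or with(qso, ...) for this reason.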

Here we see a problematic dataset. There is wide asymmetrical scatter towards faint magnitudes in the SDSS u (ultraviolet) band. Much of the scatter is attributable to measurement errors, but not all of it. Let us now run lm with the formula $u = \beta_0 + \beta_1 i + \epsilon$, obtain 90% confidence intervals for the intercept and slope parameters, and show the scatter plot with the fitted line.
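A sketch of these steps, with simulated magnitudes standing in for the SDSS columns (the names i_mag, u_mag and the true slope of 1 are our assumptions for illustration):

```r
# Simulated stand-in for the quasar photometry
set.seed(1)
i_mag <- runif(500, 16, 21)
u_mag <- 1.5 + 1.0 * i_mag + rnorm(500, 0, 0.5)

fit_ols <- lm(u_mag ~ i_mag)       # ordinary least squares
summary(fit_ols)                   # coefficients, R^2, residual s.e.
confint(fit_ols, level = 0.90)     # 90% CIs on intercept and slope

plot(i_mag, u_mag, pch = 20, cex = 0.3)
abline(fit_ols, col = "red", lwd = 2)   # overplot the fitted line
```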

Note that, since the scatter is non-Gaussian, the assumptions underlying ordinary least squares are violated: the fit is not a maximum likelihood estimator and the parameter uncertainties may not be reliable. Visually, this is a terrible fit, missing most of the data points. We now try to improve it in three ways: weighting by measurement errors; applying robust downweighting of outliers; and applying both corrections.

This is the minimum chi-squared regression commonly used by astronomers; statisticians would call it weighted least squares regression with weights from heteroscedastic measurement errors. Here we see that most of the problems can be removed with measurement error weighting. However, the calculation is obviously wrong in this case because (in statistical parlance) the model is misspecified because only a portion of the scatter is caused by measurement error.
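In R this weighted least squares fit is one line, using lm's weights argument with weights $1/\sigma_i^2$. Simulated magnitudes and heteroscedastic errors stand in for the SDSS columns:

```r
# Weighted least squares (minimum chi-squared) with heteroscedastic errors
set.seed(2)
i_mag <- runif(300, 16, 21)
sig_u <- runif(300, 0.02, 0.5)             # per-point u-band errors (assumed)
u_mag <- 1.5 + i_mag + rnorm(300, 0, sig_u)

fit_wt <- lm(u_mag ~ i_mag, weights = 1/sig_u^2)
summary(fit_wt)     # note the weighted residual standard error
```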

Astronomers often do not carefully examine the accuracy and validity of their regression fits. Diagnostic graphics are very useful for this. Here are the plots produced automatically by R's lm function. For interpretation and details, see the text A Modern Approach to Regression with R (S. Sheather, 2009).
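The diagnostic plots come for free from R's plot method for lm objects; a minimal sketch on simulated data:

```r
# The four standard lm diagnostic plots (simulated data as a stand-in)
set.seed(10)
x <- runif(200)
y <- 2 + 3 * x + rnorm(200, 0, 0.2)
fit <- lm(y ~ x)

par(mfrow = c(2, 2))
plot(fit)    # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
```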

Other regression diagnostic tests can be applied to help evaluate the validity of the statistical model and its best-fit parameters. The most important is a nonparametric two-sample test between the cumulative distribution functions of the observed and fitted values of the response variable. The Kolmogorov-Smirnov test is commonly used, but the Cramer-von Mises and Anderson-Darling tests are more sensitive. However, theorems show that tabulated probabilities are inapplicable when the second distribution is from a model based on the first distribution. We thus use bootstrap replications to estimate probabilities. This capability is provided by CRAN package 'cramer' for the Cramer-von Mises test. To avoid excessive computational time, we only treat the first 1000 points here. The weighted linear fit is obviously rejected by the test.
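A sketch of the bootstrap two-sample test, assuming the cramer package's cramer.test interface (the replicates argument and simulated data are our assumptions; the tutorial's actual call may differ):

```r
# Bootstrap Cramer-von Mises two-sample test on observed vs fitted values
library(cramer)   # CRAN package
set.seed(11)
x <- runif(1000, 16, 21)
y <- 1.5 + x + rnorm(1000, 0, 0.5)
fit <- lm(y ~ x)

# Restrict to the first 1000 points to limit computation, as in the text
out <- cramer.test(y[1:1000], fitted(fit)[1:1000], replicates = 500)
out
```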

Another approach to non-Gaussianity and outliers is to apply robust regression techniques. There are many variants; here we use R's rlm (robust linear modeling) function, downweighting outliers with Huber's psi function, with and without measurement error weighting. Unfortunately, this code does not have a built-in line-plotting option, so we draw the lines manually from information in the rlm output. See various approaches in R at the CRAN Task View on Robust Statistics.
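A sketch of both rlm variants on simulated data with injected outliers (the contamination model is ours, for illustration):

```r
library(MASS)    # provides rlm and psi.huber
set.seed(12)
i_mag <- runif(300, 16, 21)
sig_u <- runif(300, 0.02, 0.5)
u_mag <- 1.5 + i_mag + rnorm(300, 0, sig_u)
u_mag[1:30] <- u_mag[1:30] + runif(30, 1, 4)   # asymmetric outliers

fit_rob  <- rlm(u_mag ~ i_mag, psi = psi.huber)     # robust only
fit_robw <- rlm(u_mag ~ i_mag, psi = psi.huber,
                weights = 1/sig_u^2)                # robust + error weights

# Draw the fitted lines manually from the coefficients
plot(i_mag, u_mag, pch = 20, cex = 0.4)
abline(coef(fit_rob),  col = "green3", lwd = 2)
abline(coef(fit_robw), col = "orange", lwd = 2)
```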

Here we see that a robust regression line that downweights outliers removes most of the problem with bad u-band photometry. And a line that treats both outliers and measurement errors does a fine job (orange line). This is probably the preferred result for this problem. But of course, the model is misspecified, as the residuals are far from homoscedastic Gaussian noise, so the scientific meaning of the result (slope = 1.02) is suspect.

Exercise 1. Try different robustification options in rlm such as MM estimation and different choices of removal of outlying points (options in lqs).

2. Nonlinear regression

Astronomers often fit data with nonlinear functions derived from astrophysical theory that we believe apply to the observed situation. Among important recent applications of nonlinear regression are the fitting of the consensus Lambda-CDM model of cosmology to the fluctuations in the cosmic microwave background (images above, WMAP results) and the fitting of Keplerian exoplanetary orbits to stellar radial velocity time series.

But astronomers also often fit data with heuristic nonlinear functions that do not have a clear astrophysical interpretation such as the stellar Initial Mass Function (several distributions used), galaxy luminosity function (Schechter function = gamma distribution), Navarro-Frenk-White Dark Matter profile, and various galaxy scaling relations.

Here we fit radial profiles of nearby Virgo Cluster elliptical galaxies with a heuristic nonlinear function proposed by Jose Luis Sersic in 1968, in which the surface brightness $B$ of an elliptical galaxy (or spiral galaxy bulge) as a function of radius $r$ follows $B \propto -2.5 \log_{10}\!\left(I_e \, 10^{-(0.868\,n - 0.142)\left[(r/r_e)^{1/n} - 1\right]}\right)$. The data are obtained from Kormendy et al. 2009. We fit using R's nls (nonlinear least squares) function; see also CRAN package nlme for maximum likelihood fitting.
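A sketch of the Sersic fit with nls; simulated data (true $n=4$, $r_e=30$, $I_e=1$, values we chose for illustration) stand in for the Kormendy et al. (2009) profiles:

```r
# Simulate a noisy Sersic profile in magnitude units
set.seed(4)
r  <- seq(0.5, 100, length.out = 80)    # radius (arcsec)
mu <- -2.5 * log10(1.0 * 10^(-(0.868*4 - 0.142) *
          ((r/30)^(1/4) - 1))) + rnorm(80, 0, 0.05)

# Nonlinear least squares fit; starting values must be roughly sensible
fit_sersic <- nls(mu ~ -2.5 * log10(I.e * 10^(-(0.868*n - 0.142) *
                      ((r/r.e)^(1/n) - 1))),
                  start = list(I.e = 1.5, r.e = 25, n = 3.5))
summary(fit_sersic)    # best-fit I_e, r_e, n with asymptotic errors
```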

We can examine various scalar quantities from the nls fit, and plot the residuals between the data and model. A nonparametric smoother is added to assist seeing the amazing structure in the residuals: periodic shells of stars in excess of the monotonic Sersic model. This is a well-known effect due to past galaxy mergers that form large elliptical galaxies. A similar residual plot appears in Kormendy's paper.
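The residual plot with an overlaid smoother can be sketched as follows; the wavy "residuals" here are simulated to mimic shell structure, not the real galaxy data:

```r
# Residual plot with a nonparametric (lowess) smoother overlaid
set.seed(6)
r   <- seq(1, 100, 1)
res <- 0.1 * sin(r/8) + rnorm(100, 0, 0.05)   # mock shell-like residuals

plot(r, res, pch = 20, xlab = "Radius (arcsec)", ylab = "Residual (mag)")
lines(lowess(r, res, f = 0.2), col = "red", lwd = 2)
abline(h = 0, lty = 2)
```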

We can perform more analysis of the residuals. First, we show the residuals are normally distributed (Shapiro-Wilk test) but exhibit strong spatial autocorrelation (Durbin-Watson test).
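Both tests are one-liners in R; here we apply them to a simulated autocorrelated residual vector (the AR(1) model is our stand-in), using dwtest from CRAN package lmtest:

```r
# Shapiro-Wilk normality test and Durbin-Watson autocorrelation test
library(lmtest)   # CRAN package providing dwtest
set.seed(5)
res <- as.numeric(arima.sim(list(ar = 0.6), n = 100))  # autocorrelated mock residuals

shapiro.test(res)   # tests H0: residuals are Gaussian
dwtest(res ~ 1)     # DW statistic well below 2 indicates serial correlation
```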

There is an oddity: the error on Sersic's n parameter from nls is much smaller than the error quoted by Kormendy. Reading Kormendy's appendix, I find that he did not know how to evaluate the uncertainty of a nonlinear fit and chose an ad hoc procedure that overestimated the error. His estimate of n was much more accurate than he thought.

Exercise 2. (a) See whether the best-fit model is significantly different using maximum likelihood estimation (CRAN package nlme) rather than Iteratively Weighted Least Squares (nls in R). (b) Estimate parameter confidence intervals using bootstrap techniques in CRAN package nlstools.

3. Nonparametric regression

We now turn to the common problem where there is no clearly applicable astrophysical model or simple heuristic function to describe the behavior of the data. Perhaps we seek to characterize how a response variable $Y$ depends on one or two covariates $X$, but the data exhibit some ill-defined nonlinear relationship and/or non-Gaussian (even asymmetrical) scatter. This is a common situation in extragalactic and stellar astronomy.

In such cases, we are guided by the famous phrase 'Let the data speak for themselves'. Our goal is to form continuous curves (or in 2+ dimensions, continuous surfaces) from many individual data points that reveal the behavior. However, no global formula for the curve is obtained; just a continuous curve where $Y$ can be evaluated at any value of $X$.

The statistical approach to this problem is commonly called local regression or nonparametric regression, although it is probably more accurate to say semi-parametric regression because simple functions (such as Gaussians or polynomials) are often used locally. These methods are also related to smoothing procedures such as kernel density estimation (e.g. convolving with a Gaussian) but are applied here to situations where a response variable $Y$ is a function of one (or a few) covariates $X$.

We start with a classical cubic smoothing spline fit, using R's function smooth.spline, which is based on the Fortran program GAMFIT by Stanford statisticians T. Hastie and R. Tibshirani. The problem is the choice of knot locations; with the default settings, too many knots are used here.
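A minimal sketch of smooth.spline on simulated data (the sinusoidal test function is our choice):

```r
# Cubic smoothing spline; smoothing level chosen by cross-validation
set.seed(7)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, 0, 0.3)

fit_ss <- smooth.spline(x, y)
plot(x, y, pch = 20)
lines(predict(fit_ss, x = seq(0, 10, 0.1)), col = "blue", lwd = 2)
```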

We then proceed with a more modern spline regression that prunes knots based on a Bayesian Information Criterion likelihood measure, and computes the spline function for any quantile of the dispersion in the response variable. This is a combination of spline and quantile regression implemented in CRAN package cobs, which has a strong methodological foundation (Ng 1996, He & Ng 1999, Ng & Maechler 2007). The cobs CRAN package has been used in several recent astronomical papers. For example, the figure above shows spline fits to the median $\pm$68% and $\pm$95% quantiles for H$\alpha$ extinction as a function of surface star formation rates for 0.6M spaxels in 977 galaxies (Li et al. 2019). In my opinion, the method in cobs can be useful for many astronomical studies.
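A sketch assuming the cobs interface, where the tau argument selects the quantile and predict returns a grid of abscissae and fitted values (simulated data again stand in for the real sample):

```r
# Quantile regression splines with CRAN package cobs
library(cobs)
set.seed(8)
x <- sort(runif(300, 0, 10))
y <- sin(x) + rnorm(300, 0, 0.3)

fit50 <- cobs(x, y, tau = 0.50)   # median spline
fit90 <- cobs(x, y, tau = 0.90)   # upper 90% quantile spline

plot(x, y, pch = 20, col = "grey")
lines(predict(fit50), col = "red", lwd = 2)
lines(predict(fit90), col = "red", lty = 2)
```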

We proceed with four well-established bivariate semi-parametric local regression estimators. Note that, in some procedures, we need to specify a smoothing bandwidth (e.g. span in loess, nn in locfit) without mathematical guidance.

  1. LOESS, 'LOcal regrESSion', in base R, widely used (0.3M Google hits). Local polynomial regression with robust treatment of outliers. It has been a widely used smoother since the 1980s, presented in Wikipedia and in the book by W. S. Cleveland, Visualizing Data (1993).

  2. Nonparametric regression with bootstrap errors in CRAN package np by econometricians Hayfield & Racine (2008).

  3. Likelihood-based local regression in CRAN package locfit, widely used (0.6M Google hits, >200 downloads/day). Local kernel regression methods including heteroscedastic weighting (unequal error bars), censoring (upper limits), and outliers. Presented in the book Local Regression and Likelihood, C. Loader, 2009 with 1700 citations. This is probably the most capable local regression package with a strong theorem-based foundation; I recommend it highly to astronomers.

  4. Gaussian Process regression, commonly known as kriging in geostatistics since the 1960s. Response variable errors and independent variable covariance are assumed to be normal; estimation proceeds by maximum likelihood or Bayesian inference. Astronomers often read about GP regression in the book by C. Rasmussen & C. Williams, Gaussian Processes for Machine Learning, 2006, but can also benefit from the developments in kriging in the volume by J.-P. Chiles & P. Delfiner, Geostatistics: Modeling Spatial Uncertainty, 2nd ed., 2012. Other Gaussian process regression codes are given in CRAN packages mlegp (Maximum Likelihood Estimates of Gaussian Processes) and gptk (Gaussian Processes Tool-Kit). See also a 2012 blog tutorial.

Finally, we plot all of these nonparametric regressions on the same plot, and we save one of the regression estimators in our working directory using R's predict function. We find that, in this case, all four estimators are quite compatible with each other.
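A sketch comparing two of these estimators on a common grid and saving one with predict; the simulated data, grid, and output file name are our illustrative choices:

```r
# Compare loess and locfit on the same simulated dataset
library(locfit)   # CRAN package
set.seed(9)
x  <- sort(runif(300, 0, 10))
y  <- sin(x) + rnorm(300, 0, 0.3)
xg <- seq(0, 10, 0.1)    # common evaluation grid

fit_lo <- loess(y ~ x, span = 0.5)       # base-R LOESS
fit_lf <- locfit(y ~ lp(x, nn = 0.5))    # locfit, nearest-neighbor bandwidth

plot(x, y, pch = 20, col = "grey")
lines(xg, predict(fit_lo, newdata = data.frame(x = xg)), col = "red",  lwd = 2)
lines(xg, predict(fit_lf, newdata = xg),                 col = "blue", lwd = 2)

# Save one estimator's predictions to the working directory
write.table(data.frame(x = xg,
                       y_loess = predict(fit_lo, data.frame(x = xg))),
            "loess_fit.txt", row.names = FALSE)
```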

Exercise 3. Repeat the local regression analysis above after introducing one or more gaps in the distribution of points along the X axis. Do the different local regression estimators produce compatible interpolations through the gaps?

Some useful books for regression