Applied Logistic Regression

2801 Citations · 2002
Joseph D. Conklin
Technometrics

Abstract

As a consultant, I am always on the lookout for new books that help me do my job better. I would recommend that practitioners of regression, that is, probably most of us, read and use this book. Anthony Atkinson and Marco Riani develop a novel methodology for examining the effect each observation has on the fitted regression model. Robust fitting procedures are combined with regression diagnostics, graphics, and a "forward" processing through the observations to provide a new way of identifying influential and/or outlying observations while simultaneously determining the best-fitting model. The method is initially introduced for simple linear regression, but individual chapters are devoted to applying the methodology to nonlinear models in general and to generalized linear models in particular. The role of data transformations to normality is also explored. Throughout the book, a large number of fully worked examples give the reader real insight into the power of the methodology. Theory is kept to a minimum, and matrix notation is used throughout.

Chapter 1 presents some regression examples that illustrate the need for a methodology to identify and account for outliers in simple and multiple regression. Chapter 2 starts with an appropriately short introduction to least squares estimation and associated hypothesis testing. The authors next derive many of the more common influence diagnostics and introduce a "mean shift outlier" model, wherein a dummy variable is added to the linear model to assess the impact of a specific observation on parameter estimation and model fit. This tool for examining the effect of a single deletion is then placed in the context of a "forward search" algorithm. The forward search algorithm is made up of three steps.
The first step addresses the choice of an initial subset of the observations that will be used with a robust estimation procedure (least median of squares, or LMS) to provide an initial model fit and a "good" estimate of the residual error. If the model contains p parameters, the initial subset will be the one subset, out of the n-choose-p potential subsets, that minimizes the sum of squared residuals from the LMS fit. If there are too many potential subsets, the choice is made after examining a large sample of subsets of size p. In the second step, the size of the subset is incrementally increased; at each increase, the subset with the smallest sum of squared residuals is kept. Typically only one new observation enters the fitting set at each increase, but there may be cases where some observations drop out and are replaced by others not originally in the subset. Note that as the subset size increases, those observations outside the fitting subset look less and less like outliers. What is important about this approach is that it starts with a subset that either is assumed to be outlier free or contains unmasked outliers that will be replaced as the subset size increases. This second step is repeated until all observations are included in the fitting subset. The third step of the algorithm is monitoring changes in fit statistics, specifically the residual mean square, parameter estimates, associated t-statistics, and even diagnostics such as Cook's distance, as observations are incrementally added to the fitting subset. Typically the residual mean square will increase as the subset size grows, but the increase will be smooth. Outliers, being added at the end of the Step 2 process, will tend to produce dramatic changes in these fit statistics, hence making identification easy. For smaller samples it is even possible to monitor the values of all residuals from the fit of the current model.
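The three steps just described can be sketched in a few lines of code. This is a minimal illustration, not the authors' S-Plus implementation: it refits by ordinary least squares at each step, uses the LMS median-of-squared-residuals criterion only to choose the starting subset, and samples random size-p subsets rather than enumerating all n-choose-p of them. The function names are hypothetical.

```python
import numpy as np

def initial_subset(X, y, n_trials=500, rng=None):
    """Step 1: choose a size-p starting subset by the LMS criterion, i.e.
    the exact fit through p observations whose median squared residual
    (over all n observations) is smallest."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    best_idx, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])  # exact fit through p points
        except np.linalg.LinAlgError:
            continue                                # skip singular subsets
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_idx, best_crit = idx, crit
    return best_idx

def forward_search(X, y, rng=None):
    """Steps 2 and 3: grow the fitting subset one observation at a time,
    refitting by least squares and recording the residual mean square."""
    n, p = X.shape
    subset = initial_subset(X, y, rng=rng)
    history = []
    for m in range(p + 1, n + 1):
        # Fit on the current subset, then keep the m observations with the
        # smallest squared residuals (members may swap in and out).
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        subset = np.argsort((y - X @ beta) ** 2)[:m]
        # Step 3 monitoring: residual mean square of the refit at size m.
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        s2 = np.sum((y[subset] - X[subset] @ beta) ** 2) / (m - p)
        history.append((m, s2))
    return history
```

With a handful of gross outliers planted in otherwise clean data, the recorded residual mean square stays small while the fitting subset is clean and jumps sharply when the outliers finally enter, which is exactly the signature the monitoring step looks for.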
Large residual values for observations not in the current fitting subset are indicative of potential outliers. The remainder of the book is devoted to the application and further development of this basic methodology. In Chapter 3, the algorithm is applied to a number of fairly common multiple regression problems to illustrate its power. Chapter 4 addresses the impact of normality transformations on outlier detection. Chapters 5 and 6 extend the methodology to nonlinear least squares and generalized linear models, respectively. The authors, in conjunction with Kjell Konis of StatSci (now Insightful), have provided a web site (http://stat.econ.unipr.it/riani/ar) that contains S-Plus functions enabling the user to implement the methodology presented in the book. The functions do the analysis as claimed and are as easy to use as most S-Plus functions. This is certainly a tool I plan to use extensively in the future.
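The "mean shift outlier" model mentioned in connection with Chapter 2 is also easy to sketch: augment the design matrix with a dummy column that is 1 for observation i and 0 elsewhere, and the t-statistic on the dummy coefficient tests whether observation i is an outlier, equivalently assessing the effect of deleting it. A minimal numpy sketch, with a hypothetical function name:

```python
import numpy as np

def mean_shift_t(X, y, i):
    """t-statistic for the mean-shift dummy variable of observation i.
    A large |t| flags observation i as a potential outlier."""
    n, p = X.shape
    d = np.zeros((n, 1))
    d[i] = 1.0
    Xa = np.hstack([X, d])                     # augmented design matrix
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ beta                      # residual for obs i is ~0
    s2 = resid @ resid / (n - p - 1)           # residual mean square
    cov = s2 * np.linalg.inv(Xa.T @ Xa)
    return beta[-1] / np.sqrt(cov[-1, -1])     # t-stat on the shift term
```

Because the dummy column absorbs observation i entirely, the residual mean square here is computed from the remaining observations, so the statistic coincides with the usual single-deletion diagnostic.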