
A Tutorial in Logistic Regression

A. DeMaris
Journal of Marriage and Family, 1995


Abstract

Logistic regression has, in recent years, become the analytic technique of choice for the multivariate modeling of categorical dependent variables. Nevertheless, for many potential users this procedure is still relatively arcane. This article is therefore designed to render this technique more accessible to practicing researchers by comparing it, where possible, to linear regression. I will begin by discussing the modeling of a binary dependent variable. Then I will show the modeling of polytomous dependent variables, considering cases in which the values are alternately unordered, then ordered. Techniques are illustrated throughout using data from the 1993 General Social Survey (GSS). Because these data are widely available, the reader is encouraged to replicate the analyses shown so that he or she can receive a "hands on" tutorial in the techniques. The Appendix presents coding instructions for an exact replication of all analyses in the paper.

BINARY DEPENDENT VARIABLES

A topic that has intrigued several family researchers is the relationship of marital status to subjective well-being. One indicator of well-being is reported happiness, which will be the focus of our analyses. In the GSS, happiness is assessed by a question asking, "Taken all together, how would you say things are these days--would you say that you are very happy, pretty happy, or not too happy?" Of the total of 1,606 respondents in the 1993 survey, five did not answer the question. Hence, all analyses in this article are based on the 1,601 respondents providing valid answers to this question. Because this item has only three values, it would not really be appropriate to treat it as interval. Suppose instead that we treat it as dichotomous, coding the variable 1 for those who are not too happy and 0 otherwise. The mean of this binary variable is the proportion of those in the sample who are "unhappy," which is 178/1,601, or .111.
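As a small arithmetic check (the counts are those reported in the text; the code itself is only an illustrative sketch, not the paper's own analysis), the mean of the 0/1 indicator reproduces the sample proportion:

```python
# Counts from the 1993 GSS as reported above: 178 "not too happy"
# respondents out of 1,601 valid answers (1,606 total minus 5 missing).
unhappy = 178
valid = 1606 - 5

proportion = unhappy / valid   # mean of the 0/1 "unhappy" indicator
print(round(proportion, 3))    # .111, matching the text
```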
The corresponding proportion of unhappy people in the population, denoted by pi, can also be thought of as the probability that a randomly selected person will be unhappy. My focus will be on modeling the probability of unhappiness as a function of marital status, as well as other social characteristics.

Why Not OLS?

One's first impulse would probably be to use linear regression, with E(Y) = pi as the dependent variable. The equation would be

pi = alpha + beta_1 X_1 + beta_2 X_2 + ... + beta_K X_K.

(There is no error term here because the equation is for the expected value of Y, which, of course, is pi.) However, the problems incurred in using OLS have been amply documented (Aldrich & Nelson, 1984; Hanushek & Jackson, 1977; Maddala, 1983). Three difficulties are paramount: the use of a linear function, the assumption of independence between the predictors and the error term, and error heteroskedasticity, or nonconstant variance of the errors across combinations of predictor values. Briefly, the use of a linear function is problematic because it leads to predicted probabilities outside the range of 0 to 1. The reason for this is that the right-hand side of the regression equation, alpha + Sigma beta_k X_k, is not restricted to fall between 0 and 1, whereas the left-hand side, pi, is. The pseudo-isolation condition (Bollen, 1989), requiring the error term to be uncorrelated with the predictors, is violated in OLS when a binary dependent variable is used (see Hanushek & Jackson, 1977, or McKelvey & Zavoina, 1975, for a detailed exposition of why this happens). Finally, the error term is inherently heteroskedastic because the error variance is pi(1 - pi). Since pi varies with the values of the predictors, so does the error variance.

The Logistic Regression Model

Motivation to use the logistic regression model can be generated in one of two ways. The first is through a latent variable approach.
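Before developing the logistic model, the OLS difficulties described above can be made concrete with a small simulation (simulated data, not the GSS analyses; the variable names and the logistic "truth" are assumptions for illustration): a linear probability model fit by least squares yields predicted probabilities outside [0, 1], and the binary-outcome error variance pi(1 - pi) changes with the predictor.

```python
import numpy as np

# Illustrative sketch: binary outcomes generated from a logistic curve,
# then fit with a straight line (the linear probability model).
rng = np.random.default_rng(0)

x = np.linspace(-3, 3, 500)
pi_true = 1 / (1 + np.exp(-2 * x))       # true P(Y = 1 | x), always in (0, 1)
y = rng.binomial(1, pi_true)             # observed 0/1 outcomes

slope, intercept = np.polyfit(x, y, 1)   # OLS: pi_hat = intercept + slope * x
pred_ols = intercept + slope * x

print(pred_ols.min() < 0, pred_ols.max() > 1)    # OLS predictions escape [0, 1]

# Heteroskedasticity: Var(Y | x) = pi(1 - pi) depends on x,
# peaking at 0.25 where pi = .5 and shrinking toward 0 in the tails.
var_y = pi_true * (1 - pi_true)
print(round(var_y.max(), 2), round(var_y.min(), 4))
```

The logistic curve itself, by contrast, is bounded between 0 and 1 by construction, which is one motivation for the model developed next.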
This is particularly relevant for understanding standardized coefficients and one of the R-squared analogs in logistic regression. …