Dummy variable (statistics): Difference between revisions

From Wikipedia, the free encyclopedia
Reference one-hot encoding alternative nomenclature
→‎See also: add {{Annotated link}} for short descriptions
 
(48 intermediate revisions by 31 users not shown)
{{short description|Numeric stand-ins in regression analysis}}
In [[statistics]] and [[econometrics]], particularly in [[regression analysis]], a '''dummy variable''' (also known as an '''indicator variable''', '''design variable''', '''[[one-hot encoding]]''', '''Boolean indicator''', '''binary variable''', or '''qualitative variable'''<ref name="G & S"/><ref name=Gujarati/>) is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.<ref>Draper, N.R.; Smith, H. (1998) ''Applied Regression Analysis'', Wiley. {{ISBN|0-471-17082-8}} (Chapter 14)</ref><ref name="Interpreting Coefficients">{{cite web|title=Interpreting the Coefficients on Dummy Variables|url=http://users.rcn.com/alancm/pp605/Interpreting_Dummy_Coefficients.pdf}}</ref> Dummy variables are used as devices to sort data into [[Mutually exclusive events|mutually exclusive]] categories (such as smoker/non-smoker, etc.).<ref name=Gujarati>{{cite book|last=Gujarati|first=Damodar N|title=Basic econometrics|year=2003|publisher=McGraw Hill|isbn=0-07-233542-4|pages=1002|url=http://www.mhhe.com/gujarati4e}}</ref> For example, in [[econometrics|econometric]] [[time series analysis]], dummy variables may be used to indicate the occurrence of wars or major [[Strike action|strikes]]. A dummy variable can thus be thought of as a [[truth value]] represented as a numerical value 0 or 1 (as is sometimes done in computer programming).
{{about|the usage in statistics|the usage in computing and math|Bound variable}}


In [[regression analysis]], a '''dummy variable''' (also known as an '''indicator variable''' or just a '''dummy''') is one that takes a [[Binary data|binary value]] (0 or 1) to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.<ref>Draper, N.R.; Smith, H. (1998) ''Applied Regression Analysis'', Wiley. ISBN 0-471-17082-8 (Chapter 14)</ref> For example, if we were studying the relationship between [[Sex|biological sex]] and [[income]], we could use a dummy variable to represent the sex of each individual in the study. The variable could take on a value of 1 for [[Male|males]] and 0 for [[Female|females]] (or vice versa). In [[machine learning]] this is known as [[One-hot#Machine learning and statistics|one-hot encoding]].
Dummy variables are "proxy" variables or numeric stand-ins for [[Qualitative data|qualitative]] facts in a [[Regression analysis|regression model]]. In regression analysis, the [[Dependent and independent variables|dependent variables]] may be influenced not only by quantitative variables (income, output, prices, etc.), but also by qualitative variables (gender, religion, geographic region, etc.). A dummy [[Dependent and independent variables|independent variable]] (also called a dummy explanatory variable) which for some observation has a value of 0 will cause that variable's [[coefficient]] to have no role in influencing the [[Dependent and independent variables|dependent variable]], while when the dummy takes on a value of 1 its coefficient acts to alter the intercept. For example, suppose membership in a group is one of the qualitative variables relevant to a regression. If group membership is arbitrarily assigned the value of 1, then all others would get the value 0. Then the intercept (the value of the dependent variable if all other explanatory variables hypothetically took on the value zero) would be the constant term for non-members but would be the constant term plus the coefficient of the membership dummy in the case of group members.<ref name="G & S">{{cite web|last1=Garavaglia|first1=Susan|last2=Sharma|first2=Asha|title=A SMART GUIDE TO DUMMY VARIABLES: FOUR APPLICATIONS AND A MACRO|url=http://www.ats.ucla.edu/stat/sas/library/nesug98/p046.pdf}}</ref>


Dummy variables are commonly used in regression analysis to represent categorical variables that have more than two levels, such as education level or occupation. In this case, multiple dummy variables would be created to represent each level of the variable, and only one dummy variable would take on a value of 1 for each observation. Dummy variables are useful because they allow us to include categorical variables in our analysis, which would otherwise be difficult to include due to their non-numeric nature. They can also help us to control for confounding factors and improve the validity of our results.
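As a rough illustration of the encoding described above, the following Python sketch builds one 0/1 column per level of a categorical variable. The function name <code>make_dummies</code>, its <code>drop_first</code> option and the sample data are hypothetical and are not taken from any cited source.

```python
def make_dummies(values, drop_first=False):
    """Return (levels, rows); each row holds one 0/1 entry per retained level."""
    levels = sorted(set(values))          # fixed, reproducible column order
    if drop_first:
        levels = levels[1:]               # the dropped level becomes the base category
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical sample of education levels.
education = ["primary", "tertiary", "secondary", "primary"]

levels, rows = make_dummies(education)
print(levels)
for row in rows:
    print(row)                            # exactly one entry per row equals 1
```

With <code>drop_first=True</code> the first level gets no column of its own, so observations in that level are coded as all zeros.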
Dummy variables are used frequently in [[time series analysis]] with regime switching, seasonal analysis and qualitative data applications. Dummy variables are involved in studies for [[economic forecasting]], bio-medical studies, [[Credit score|credit scoring]], response modelling, etc. Dummy variables may be incorporated in traditional regression methods or newly developed modeling paradigms.<ref name="G & S"/>


As with any addition of variables to a model, the addition of dummy variables cannot decrease, and will generally increase, the within-sample model fit ([[coefficient of determination]]), but at a cost of fewer [[Degrees of freedom (statistics)|degrees of freedom]] and a possible loss of generality of the model (out-of-sample model fit). Too many dummy variables can result in a model that does not provide any general conclusions.
==Incorporating a dummy independent==
[[File:Graph showing Wage = α0 + δ0female + α1education + U, δ0 0.jpg|thumb|right|400px|Figure 1 : Graph showing wage = α<sub>0</sub> + δ<sub>0</sub>female + α<sub>1</sub>education + ''U'', δ<sub>0</sub>&nbsp;<&nbsp;0.]]


Dummy variables are incorporated in the same way as quantitative variables are included (as explanatory variables) in regression models. For example, if we consider a [[Mincer earnings function|Mincer-type]] regression model of wage determination, wherein wages are dependent on gender (qualitative) and years of education (quantitative):


:<math>\ln \text{wage} = \alpha_{0} + \delta_{0} \text{female} + \alpha_{1} \text{education} + u</math>


where <math>u \sim N(0, \sigma^{2})</math> is the [[Errors and residuals in statistics|error term]]. In the model, ''female'' = 1 when the person is female and ''female'' = 0 when the person is male. <math>\delta_{0}</math> can be interpreted as the difference in log wages between females and males, holding education constant. Thus, δ<sub>0</sub> helps to determine whether there is wage discrimination between males and females. For example, if δ<sub>0</sub>>0 (positive coefficient), then women earn a higher wage than men (keeping other factors constant). Note that the coefficients attached to the dummy variables are called '''differential intercept coefficients'''. The model can be depicted graphically as an intercept shift between females and males. In the figure, the case δ<sub>0</sub><0 is shown (wherein men earn a higher wage than women).<ref name=Wooldridge>{{cite book|last=Wooldridge|first=Jeffrey M|title=Introductory econometrics: a modern approach|year=2009|publisher=Cengage Learning|isbn=0-324-58162-9|pages=865|url=https://books.google.com/books?id=64vt5TDBNLwC&dq=introductory+econometrics+wooldridge}}</ref>

Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: <math>D_{1} = 1</math> if the observation is for summer, and equals zero otherwise; <math>D_{2}=1</math> if and only if autumn, otherwise equals zero; <math>D_{3}=1</math> if and only if winter, otherwise equals zero; and <math>D_{4}=1</math> if and only if spring, otherwise equals zero. In the [[panel data]] [[fixed effects estimator]], dummies are created for each of the units in [[cross-sectional data]] (e.g. firms or countries) or periods in a pooled time-series. However, in such regressions either the [[constant term]] has to be removed or one of the dummies has to be removed, with its associated category becoming the base category against which the others are assessed, in order to avoid the '''dummy variable trap''':

The constant term in all regression equations is a coefficient multiplied by a regressor equal to one. When the regression is expressed as a matrix equation, the matrix of regressors then consists of a column of ones (the constant term), vectors of zeros and ones (the dummies), and possibly other regressors. If one includes both male and female dummies, say, the sum of these vectors is a vector of ones, since every observation is categorized as either male or female. This sum is thus equal to the constant term's regressor, the first vector of ones. As a result, the regression equation will be unsolvable, even by the typical pseudoinverse method. In other words: if both the vector-of-ones (constant term) regressor and an exhaustive set of dummies are present, perfect [[multicollinearity]] occurs,<ref>{{cite journal|first=Daniel B.|last=Suits|year=1957|title=Use of Dummy Variables in Regression Equations|jstor=2281705|journal=Journal of the American Statistical Association|volume=52|issue=280|pages=548–551|doi=10.1080/01621459.1957.10501412 }}</ref> and the system of equations formed by the regression does not have a unique solution. This is referred to as the '''dummy variable trap'''. The trap can be avoided by removing either the constant term or one of the offending dummies. The removed dummy then becomes the base category against which the other categories are compared.
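The trap can be checked numerically. The sketch below, which uses a made-up sample of five people and a hypothetical helper <code>gram_determinant</code>, computes the determinant of ''X''′''X'' exactly: it is zero when the constant and both dummies are present, and non-zero once either the constant or one dummy is dropped.

```python
from fractions import Fraction


def gram_determinant(columns):
    """Determinant of X'X for a design matrix given as a list of columns."""
    k = len(columns)
    # X'X: inner products of every pair of columns, kept exact with Fraction.
    m = [[Fraction(sum(a * b for a, b in zip(ci, cj))) for cj in columns]
         for ci in columns]
    det = Fraction(1)
    for i in range(k):
        # Find a row with a non-zero pivot in column i.
        pivot = next((r for r in range(i, k) if m[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)            # singular: no unique OLS solution
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            det = -det                    # a row swap flips the sign
        det *= m[i][i]
        for r in range(i + 1, k):
            factor = m[r][i] / m[i][i]
            for c in range(i, k):
                m[r][c] -= factor * m[i][c]
    return det


# Hypothetical sample of five people.
female = [1, 0, 1, 1, 0]
male = [1 - f for f in female]
ones = [1] * len(female)

print(gram_determinant([ones, male, female]))  # constant plus both dummies
print(gram_determinant([ones, female]))        # constant plus one dummy
print(gram_determinant([male, female]))        # both dummies, no constant
```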

==ANOVA models==

{{Main|Analysis of variance}}

A regression model in which the dependent variable is quantitative in nature but all the explanatory variables are dummies (qualitative in nature) is called an ''Analysis of Variance'' (ANOVA) model.<ref name=Gujarati/>

===ANOVA model with one qualitative variable===

Suppose we want to run a regression to find out if the average annual salary of public school teachers differs among three geographical regions in Country A with 51 states: (1) North (21 states) (2) South (17 states) (3) West (13 states). Say that the simple arithmetic average salaries are as follows: $24,424.14 (North), $22,894 (South), $26,158.62 (West). The arithmetic averages are different, but are they statistically different from each other? To compare the mean values, [[Analysis of variance|Analysis of Variance]] techniques can be used.
The regression model can be defined as:

:<math>Y_{i} = \alpha_{1} + \alpha_{2} D_{2i} + \alpha_{3} D_{3i} + u_{i}</math>,

where

: <math>Y_{i} =</math> average annual salary of public school teachers in state i
: <math>D_{2i} = 1</math> if the state ''i'' is in the North Region
:: <math>D_{2i} = 0</math> otherwise (any region other than North)
: <math>D_{3i} = 1</math> if the state ''i'' is in the South Region
:: <math>D_{3i} = 0</math> otherwise

In this model, we have only qualitative regressors, taking the value of 1 if the observation belongs to a specific category and 0 if it belongs to any other category. This makes it an ANOVA model.

[[File:Anova graph.jpg|thumb|left|400px|Figure 2 : Graph showing the regression results of the ANOVA model example: Average annual salaries of public school teachers in 3 regions of Country A.]]

Now, taking the [[Expected value|expectation]] of both sides, we obtain the following:

Mean salary of public school teachers in the North Region:

'''E(''Y''<sub>''i''</sub>|''D''<sub>2''i''</sub> = 1, ''D''<sub>3''i''</sub> = 0) = α<sub>1</sub> + α<sub>2</sub>'''

Mean salary of public school teachers in the South Region:

'''E(Y<sub>i</sub>|D<sub>2i</sub> = 0, D<sub>3i</sub> = 1) = α<sub>1</sub> + α<sub>3</sub>'''

Mean salary of public school teachers in the West Region:

'''E(Y<sub>i</sub>|D<sub>2i</sub> = 0, D<sub>3i</sub> = 0) = α<sub>1</sub> '''

(The error term does not get included in the expectation values as it is assumed that it satisfies the usual [[Least squares|OLS]] conditions, i.e., E(u<sub>i</sub>) = 0)

The expected values can be interpreted as follows: The mean salary of public school teachers in the West is equal to the intercept term α<sub>1</sub> in the multiple regression equation, and the differential intercept coefficients, α<sub>2</sub> and α<sub>3</sub>, explain by how much the mean salaries of teachers in the North and South Regions vary from that of the teachers in the West. Thus, the mean salaries of teachers in the North and South are ''compared'' against the mean salary of the teachers in the West. Hence, the West Region becomes the '''base group''' or the '''benchmark group''', i.e., the group against which the comparisons are made. The '''omitted category''', i.e., the category to which no dummy is assigned, is taken as the base group category.

Using the given data, the result of the regression would be:

: ''Ŷ''<sub>''i''</sub> = 26,158.62 − 1734.473D<sub>2''i''</sub> − 3264.615D<sub>3''i''</sub>

se = (1128.523) (1435.953) (1499.615)

t = (23.1759) (−1.2078) (−2.1776)

p = (0.0000) (0.2330) (0.0349)

R<sup>2</sup> = 0.0901

where, se = [[Standard error (statistics)|standard error]], ''t'' = [[t-statistic]]s, ''p'' = [[p value]]

The regression result can be interpreted as: The mean salary of the teachers in the West (base group) is about $26,158, the salary of the teachers in the North is lower by about $1734 ($26,158.62 − $1734.473 = $24,424.14, which is the average salary of the teachers in the North) and that of the teachers in the South is lower by about $3265 ($26,158.62 − $3264.615 = $22,894, which is the average salary of the teachers in the South).

To find out if the mean salaries of the teachers in the North and South are statistically different from that of the teachers in the West (the comparison category), we have to find out if the slope coefficients of the regression result are [[Statistical significance|statistically significant]]. For this, we need to consider the ''p'' values. The estimated slope coefficient for the North is not statistically significant as its ''p'' value is 23 percent; however, that of the South is statistically significant at the 5% level as its ''p'' value is only around 3.5 percent. Thus the overall result is that the mean salaries of the teachers in the West and North are not statistically different from each other, but the mean salary of the teachers in the South is statistically lower than that in the West by around $3265. The model is diagrammatically shown in Figure 2. This model is an ANOVA model with one qualitative variable having 3 categories.<ref name=Gujarati/>
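The link between the dummy coefficients and the group means can be illustrated with a small Python sketch. The seven salary figures below are invented for the example (they are not the data behind the regression above), and <code>ols</code> is a hypothetical helper that solves the normal equations exactly. The intercept comes out as the mean of the base group (West) and each dummy coefficient as the difference between a region's mean and the base mean.

```python
from fractions import Fraction


def ols(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.

    Assumes X'X is non-singular (no dummy variable trap).
    """
    k = len(rows[0])
    # Augmented matrix [X'X | X'y], kept exact with Fraction.
    a = [[Fraction(sum(r[i] * r[j] for r in rows)) for j in range(k)]
         + [Fraction(sum(r[i] * yi for r, yi in zip(rows, y)))]
         for i in range(k)]
    for i in range(k):
        pivot = next(r for r in range(i, k) if a[r][i] != 0)
        a[i], a[pivot] = a[pivot], a[i]
        a[i] = [v / a[i][i] for v in a[i]]            # scale pivot row to 1
        for r in range(k):
            if r != i:                                # clear column i elsewhere
                a[r] = [vr - a[r][i] * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] for i in range(k)]


# Hypothetical salaries in thousands of dollars, by region.
data = [("North", 24), ("North", 26),
        ("South", 21), ("South", 23), ("South", 22),
        ("West", 27), ("West", 25)]

# Columns: constant, D2 (North), D3 (South); West is the omitted base group.
X = [[1, int(region == "North"), int(region == "South")] for region, _ in data]
y = [salary for _, salary in data]

coef = ols(X, y)
print([float(b) for b in coef])   # intercept, North differential, South differential
```

Here the group means are 25 (North), 22 (South) and 26 (West), so the fitted coefficients are 26, −1 and −4.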

===ANOVA model with two qualitative variables===

Suppose we consider an ANOVA model having two qualitative variables, each with two categories: Hourly Wages are to be explained in terms of the qualitative variables Marital Status (Married / Unmarried) and Geographical Region (North / Non-North). Here, Marital Status and Geographical Region are the two explanatory dummy variables.<ref name=Gujarati/>

Say the regression output on the basis of some given data appears as follows:

:'''Ŷ<sub>i</sub> = 8.8148 + 1.0997D<sub>2</sub> − 1.6729D<sub>3</sub>'''

where,

:''Y'' = hourly wages (in $)

:''D''<sub>2</sub> = marital status, 1 = married, 0 = otherwise

:''D''<sub>3</sub> = geographical region, 1 = North, 0 = otherwise

In this model, a single dummy is assigned to each qualitative variable, one less than the number of categories included in each.

Here, the base group is the omitted category: Unmarried, Non-North region (Unmarried people who do not live in the North region). All comparisons would be made in relation to this base group or omitted category. The mean hourly wage in the base category is about $8.81 (intercept term). In comparison, the mean hourly wage of those who are married is higher by about $1.10 and is equal to about $9.91 ($8.81 + $1.10). In contrast, the mean hourly wage of those who live in the North is lower by about $1.67 and is about $7.14 ($8.81 − $1.67).
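The arithmetic for all four combinations of the two dummies can be reproduced with a short Python sketch that uses only the coefficients quoted above (the function name <code>predicted_wage</code> is hypothetical). Because this specification is additive, the fitted mean for a married person living in the North is obtained by applying both differentials to the base value.

```python
def predicted_wage(married, north):
    """Fitted mean hourly wage from the estimated equation quoted above."""
    return 8.8148 + 1.0997 * married - 1.6729 * north


for married in (0, 1):
    for north in (0, 1):
        print(f"married={married} north={north} "
              f"wage={predicted_wage(married, north):.2f}")
```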

Thus, if more than one qualitative variable is included in the regression, it is important to note that the omitted category should be chosen as the benchmark category and all comparisons will be made in relation to that category. The intercept term will show the expectation of the benchmark category and the slope coefficients will show by how much the other categories differ from the benchmark (omitted) category.<ref name=Gujarati/>

==ANCOVA models==

{{Main|Analysis of covariance}}

A regression model that contains a mixture of both quantitative and qualitative variables is called an ''[[Analysis of covariance|Analysis of Covariance]]'' (ANCOVA) model. ANCOVA models are extensions of ANOVA models. They statistically control for the effects of quantitative explanatory variables (also called covariates or control variables).<ref name=Gujarati/>

To illustrate how qualitative and quantitative regressors are included to form ANCOVA models, suppose we consider the same example used in the ANOVA model with one qualitative variable: average annual salary of public school teachers in three geographical regions of Country A. If we include a quantitative variable, ''State Government expenditure on public schools per pupil'', in this regression, we get the following model:

[[File:Ancova graph.jpg|thumb|right|400px|Figure 3 : Graph showing the regression results of the ANCOVA model example: Public school teacher's salary (Y) in relation to State expenditure per pupil on public schools.]]

:'''Y<sub>i</sub> = α<sub>1</sub> + α<sub>2</sub>D<sub>2i</sub> + α<sub>3</sub>D<sub>3i</sub> + α<sub>4</sub>X<sub>i</sub> + U<sub>i</sub>'''

where,

:Y<sub>i</sub> = average annual salary of public school teachers in state i

:X<sub>i</sub> = State expenditure on public schools per pupil

:D<sub>2i</sub> = 1, if the State i is in the North Region

::D<sub>2i</sub> = 0, otherwise

:D<sub>3i</sub> = 1, if the State i is in the South Region

::D<sub>3i</sub> = 0, otherwise

Say the regression output for this model is

:'''Ŷ<sub>i</sub> = 13,269.11 &minus; 1673.514D<sub>2i</sub> &minus; 1144.157D<sub>3i</sub> + 3.2889X<sub>i</sub>'''

The result suggests that, for every $1 increase in State expenditure per pupil on public schools, a public school teacher's average salary goes up by about $3.29. Further, for a state in the North region, the mean salary of the teachers is lower than that of West region by about $1673 and for a state in the South region, the mean salary of teachers is lower than that of the West region by about $1144. Figure 3 depicts this model diagrammatically. The average salary lines are parallel to each other by the assumption of the model that the coefficient of expenditure does not vary by state. The trade off shown separately in the graph for each category is between the two quantitative variables: public school teachers' salaries (Y) in relation to State expenditure per pupil on public schools (X).<ref name=Gujarati/>
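The parallel-lines property can be verified directly from the estimated coefficients. In the Python sketch below the function name <code>predicted_salary</code> and the two expenditure levels are assumptions made for illustration; the coefficients are those of the regression output above.

```python
def predicted_salary(region, expenditure):
    """Fitted average salary from the ANCOVA output quoted above."""
    d2 = 1 if region == "North" else 0
    d3 = 1 if region == "South" else 0    # West is the omitted base group
    return 13269.11 - 1673.514 * d2 - 1144.157 * d3 + 3.2889 * expenditure


for x in (3000, 5000):                    # hypothetical spending per pupil
    west = predicted_salary("West", x)
    north = predicted_salary("North", x)
    south = predicted_salary("South", x)
    # The gaps between regions do not depend on x: the lines are parallel.
    print(x, round(west - north, 3), round(west - south, 3))
```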

==Interactions among dummy variables==

Quantitative regressors in regression models often have an [[Interaction (statistics)|interaction]] among each other. In the same way, qualitative regressors, or dummies, can also have interaction effects between each other, and these interactions can be depicted in the regression model. For example, in a regression involving determination of wages, if two qualitative variables are considered, namely, gender and marital status, there could be an interaction between marital status and gender.<ref name=Wooldridge/> These interactions can be shown in the regression equation as illustrated by the example below.

With the two qualitative variables being gender and marital status and with the quantitative explanator being years of education, a regression that is purely linear in the explanators would be

:'''Y<sub>i</sub> = β<sub>1</sub> + β<sub>2</sub>D<sub>2,i</sub> + β<sub>3</sub>D<sub>3,i</sub> + αX<sub>i</sub> + U<sub>i</sub>'''

where

:i denotes the particular individual

:Y = Hourly Wages (in $)

:X = Years of education

:D<sub>2</sub> = 1 if female, 0 otherwise

:D<sub>3</sub> = 1 if married, 0 otherwise

This specification does not allow for the possibility that there may be an interaction that occurs between the two qualitative variables, D<sub>2</sub> and D<sub>3</sub>. For example, a female who is married may earn wages that differ from those of an unmarried male by an amount that is not the same as the sum of the differentials for solely being female and solely being married. Then the effect of the interacting dummies on the mean of Y is not simply ''additive'' as in the case of the above specification, but ''multiplicative'' also, and the determination of wages can be specified as:

:'''Y<sub>i</sub> = β<sub>1</sub> + β<sub>2</sub>D<sub>2,i</sub> + β<sub>3</sub>D<sub>3,i</sub> + β<sub>4</sub>(D<sub>2,i</sub>D<sub>3,i</sub>) + αX<sub>i</sub> + U<sub>i</sub>'''

Here,

:β<sub>2</sub> = differential effect of being a female

:β<sub>3</sub> = differential effect of being married

:β<sub>4</sub> = further differential effect of being ''both'' female ''and'' married

By this equation, with a zero error term, the wage of an unmarried male is β<sub>1</sub> + αX<sub>i</sub>, that of an unmarried female is β<sub>1</sub> + β<sub>2</sub> + αX<sub>i</sub>, that of a married male is β<sub>1</sub> + β<sub>3</sub> + αX<sub>i</sub>, and that of a married female is β<sub>1</sub> + β<sub>2</sub> + β<sub>3</sub> + β<sub>4</sub> + αX<sub>i</sub> (where any of the estimates of the coefficients of the dummies could turn out to be positive, zero, or negative).

Thus, an interaction dummy (product of two dummies) can alter the dependent variable from the value that it gets when the two dummies are considered individually.<ref name=Gujarati/>

However, the use of products of dummy variables to capture interactions can be avoided by using a different scheme for categorizing the data&mdash;one that specifies categories in terms of combinations of characteristics. If we let

:D<sub>4</sub> = 1 if unmarried female, 0 otherwise
:D<sub>5</sub> = 1 if married male, 0 otherwise
:D<sub>6</sub> = 1 if married female, 0 otherwise

then it suffices to specify the regression

:'''Y<sub>i</sub> = δ<sub>1</sub> + δ<sub>4</sub>D<sub>4,i</sub> + δ<sub>5</sub>D<sub>5,i</sub> + δ<sub>6</sub>D<sub>6,i</sub> + αX<sub>i</sub> + U<sub>i</sub>.'''

Then with zero shock term the value of the dependent variable is δ<sub>1</sub>+ αX<sub>i</sub> for the base category unmarried males, δ<sub>1</sub> + δ<sub>4</sub>+ αX<sub>i</sub> for unmarried females, δ<sub>1</sub> + δ<sub>5</sub>+ αX<sub>i</sub> for married males, and δ<sub>1</sub> + δ<sub>6</sub>+ αX<sub>i</sub> for married females. This specification involves the same number of right-side variables as does the previous specification with an interaction term, and the regression results for the predicted value of the dependent variable contingent on X<sub>i</sub>, for any combination of qualitative traits, are identical between this specification and the interaction specification.
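The correspondence between the two parameterisations can be written out by equating the four category means implied by each specification (a worked sketch using only the symbols defined above):

:<math>\delta_1 = \beta_1, \qquad \delta_4 = \beta_2, \qquad \delta_5 = \beta_3, \qquad \delta_6 = \beta_2 + \beta_3 + \beta_4 .</math>

Equivalently, the interaction coefficient can be recovered as <math>\beta_4 = \delta_6 - \delta_4 - \delta_5</math>.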

==Dummy dependent variables==

===What happens if the dependent variable is a dummy?===

A model with a dummy dependent variable (also known as a qualitative dependent variable) is one in which the dependent variable, as influenced by the explanatory variables, is qualitative in nature. Some decisions regarding 'how much' of an act must be performed involve a prior decision on whether to perform the act or not. For example, the amount of output to produce, the cost to be incurred, etc. involve prior decisions on whether to produce or not, whether to spend or not, etc. Such "prior decisions" become dependent dummies in the regression model.<ref name=Wabash>{{cite book|first1=Humberto|last1=Barreto|first2= Frank|last2=Howland |title=Introductory Econometrics: Using Monte Carlo Simulation with Microsoft Excel|chapter=Chapter 22: Dummy Dependent Variable Models|url=http://www3.wabash.edu/econometrics/EconometricsBook/chap22.htm|isbn=0-521-84319-7|year=2005|publisher=Cambridge University Press}}</ref>

For example, the decision of a worker to be a part of the labour force becomes a dummy dependent variable. The decision is [[Dichotomy|dichotomous]], i.e., the decision has two possible outcomes: yes and no. So the dependent dummy variable Participation would take on the value 1 if participating, 0 if not participating.<ref name=Gujarati/> Some other examples of dichotomous dependent dummies are cited below:

'''Decision:''' Choice of Occupation. '''Dependent Dummy:''' Supervisory = 1 if supervisor, 0 if not supervisor.

'''Decision:''' Affiliation to a Political Party. '''Dependent Dummy:''' Affiliation = 1 if affiliated to the party, 0 if not affiliated.

'''Decision:''' Retirement. '''Dependent Dummy:''' Retired = 1 if retired, 0 if not retired.

When the qualitative dependent dummy variable has more than two values (such as affiliation to many political parties), it becomes a multiresponse or a multinomial or [[Polychotomy|polychotomous]] model.<ref name=Wabash/>

===Dependent dummy variable models===

Analysis of dependent dummy variable models can be done through different methods. One such method is the usual [[Least squares|OLS]] method, which in this context is called the [[linear probability model]]. An alternative method is to assume that there is an unobservable continuous latent variable Y<sup>*</sup> and that the observed dichotomous variable Y = 1 if Y<sup>*</sup> > 0, 0 otherwise. This is the underlying concept of the [[Logistic regression|logit]] and [[Probit model|probit]] models. These models are discussed in brief below.<ref name=Maddala>{{cite book|last=Maddala|first=G S|title=Introduction to econometrics|year=1992|publisher=Macmillan Pub. Co.|isbn=0-02-374545-2|pages=631|url=https://books.google.com/books?id=nBS3AAAAIAAJ&dq=introduction%20to%20econometrics%20maddala}}</ref>

====Linear probability model====

{{Main|Linear probability model}}

An ordinary least squares model in which the dependent variable ''Y'' is a dichotomous dummy, taking the values of 0 and 1, is the [[linear probability model]] (LPM).<ref name=Maddala/> Suppose we consider the following regression:

:<math>Y_{i} = \alpha_{1} + \alpha_{2} X_{i} + u_{i}</math>

where

:<math>X</math> = family income

:<math>Y = 1</math> if a house is owned by the family, 0 if a house is not owned by the family

The model is called the ''linear probability model'' because the regression is linear. The [[Conditional expectation|conditional mean]] of Y<sub>i</sub> given X<sub>i</sub>, written as <math>\mathbb{E}(Y_{i}|X_{i})</math>, is interpreted as the [[conditional probability]] that the event will occur for that value of ''X''<sub>''i''</sub>, that is, Pr(''Y''<sub>''i''</sub> = 1 |''X''<sub>''i''</sub>). In this example, <math>\mathbb{E}(Y_{i}|X_{i})</math> gives the probability of a house being owned by a family whose income is given by ''X''<sub>''i''</sub>.

Now, using the [[Least squares|OLS]] assumption <math>\mathbb{E}(u_{i}|X_{i}) = 0</math>, we get

:<math>\mathbb{E}(Y_{i}|X_{i}) = \alpha_{1} + \alpha_{2} X_{i}</math>

Some problems are inherent in the LPM:
# The regression line will not be a [[Goodness of fit|well-fitted]] one, and hence goodness-of-fit measures, such as R<sup>2</sup>, will not be reliable.
# Models that are analyzed using the LPM approach will have [[Heteroscedasticity|heteroscedastic]] disturbances.
# The error term will have a non-normal distribution.
# The LPM may give predicted values of the dependent variable that are greater than 1 or less than 0. This will be difficult to interpret as the predicted values are intended to be probabilities, which must lie between 0 and 1.
# There might exist a non-linear relationship between the variables of the LPM, in which case the linear regression will not fit the data accurately.<ref name=Gujarati/><ref name=DD>Adnan Kasman, {{cite web|title=Dummy Dependent Variable Models|url=http://kisi.deu.edu.tr/evrim.gursoy/Dummy_Dependent_Variables_Models.doc}}. Lecture Notes</ref>
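Two of these problems, out-of-range predictions and heteroscedasticity, can be seen in a small Python sketch. The coefficients (−0.9 and 0.1) and the income values are invented for illustration and do not come from any estimated model. For a 0/1 outcome with success probability ''P''<sub>''i''</sub>, the error variance is ''P''<sub>''i''</sub>(1 − ''P''<sub>''i''</sub>), which changes with ''X''<sub>''i''</sub>.

```python
def lpm_probability(income, a1=-0.9, a2=0.1):
    """Fitted value of a linear probability model; a1 and a2 are made-up numbers."""
    return a1 + a2 * income


for income in (5, 10, 14, 20):            # hypothetical family incomes
    p = lpm_probability(income)
    if 0.0 <= p <= 1.0:
        # For a 0/1 outcome the error variance is p * (1 - p), so it varies with X.
        print(income, round(p, 2), "error variance:", round(p * (1 - p), 4))
    else:
        print(income, round(p, 2), "not a valid probability")
```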

====Alternatives to LPM====

[[File:CDF graph.jpg|thumb|right|400px|Figure 4 : A cumulative distribution function.]]

To avoid the limitations of the LPM, what is needed is a model in which, as the explanatory variable ''X''<sub>''i''</sub> increases, ''P''<sub>''i''</sub> = E(''Y''<sub>''i''</sub> | ''X''<sub>''i''</sub>) = Pr(''Y''<sub>''i''</sub> = 1 | ''X''<sub>''i''</sub>) remains within the range between 0 and 1. Thus the relationship between the independent and dependent variables is necessarily non-linear.

For this purpose, a [[cumulative distribution function]] (CDF) can be used to estimate the dependent dummy variable regression. Figure 4 shows an 'S'-shaped curve, which resembles the CDF of a random variable. In this model, the probability is between 0 and 1 and the non-linearity has been captured. The choice of the CDF to be used is now the question.

Two alternative CDFs can be used: the [[logistic distribution|logistic]] and [[Normal distribution|normal]] CDFs. The logistic CDF gives rise to the [[Logistic regression|logit model]] and the normal CDF gives rise to the [[probit model]].<ref name=Gujarati/>

====Logit model====

{{Main|Logistic regression}}

The shortcomings of the LPM led to the development of a more refined and improved model called the logit model. In the logit model, the cumulative distribution of the error term in the regression equation is logistic.<ref name=Maddala/> The regression is more realistic in that it is non-linear.

The logit model is estimated using the [[Maximum likelihood|maximum likelihood approach]]. In this model, <math>P(Y=1|X)</math>, the probability of the dependent variable taking the value 1 given the independent variable, is:

: <math>P_i = \frac{1}{1 + e^{-z_i}}\ = \frac{e^{z_i}}{1 + e^{z_i}}\ </math>

where <math>z_{i} = \alpha_{1} + \alpha_{2} X_{i}</math>.

The model is then expressed in the form of the [[odds ratio]]: what is modeled in the logistic regression is the natural logarithm of the odds, the odds being defined as <math>P/(1-P)</math>. Taking the natural log of the odds, the logit (''L''<sub>''i''</sub>) is expressed as

: <math>L_i = \ln\left(\frac{P_i}{1 - P_i}\right) = z_i = \alpha_1 + \alpha_2 X_i.</math>

This relationship shows that ''L''<sub>''i''</sub> is linear in relation to ''X''<sub>''i''</sub>, but the probabilities are not linear in terms of ''X''<sub>''i''</sub>.<ref name=DD/>
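The contrast between a linear logit and a non-linear probability can be checked numerically. The following is a minimal sketch, not part of the cited sources: the coefficient values <code>alpha1</code> and <code>alpha2</code> and the helper names <code>logistic</code> and <code>logit</code> are assumptions chosen only for illustration.

```python
import math


def logistic(z):
    """Logistic CDF: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds ln(p / (1 - p)), the inverse of the logistic CDF."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration.
alpha1, alpha2 = -2.0, 0.5

xs = [0, 2, 4, 6, 8]
probs = [logistic(alpha1 + alpha2 * x) for x in xs]
logits = [logit(p) for p in probs]

# The logit changes by a constant amount per step in X,
# while the change in the probability depends on where X is.
for x, p, l in zip(xs, probs, logits):
    print(f"X={x}  P={p:.4f}  L={l:+.4f}")
```

Under these assumed coefficients, each step of 2 in ''X'' moves the logit by the same amount (2 × <code>alpha2</code>), whereas the probability moves by unequal amounts and stays strictly between 0 and 1.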

====Probit model====

{{Main|Probit model}}

Another model that was developed to offset the disadvantages of the LPM is the probit model. The probit model uses the same approach to non-linearity as does the logit model; however, it uses the normal CDF instead of the logistic CDF.<ref name=Maddala/>
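As a rough illustration of how the two link functions compare, the sketch below evaluates the standard normal CDF (via the error function) and the logistic CDF at the same index values. The coefficients and the function names <code>normal_cdf</code> and <code>logistic_cdf</code> are hypothetical, and no estimation is performed; in practice the two models are fitted separately and their coefficients are on different scales.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_cdf(z):
    """Logistic CDF, as used by the logit model."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical index z_i = alpha1 + alpha2 * X_i, for illustration only.
alpha1, alpha2 = -2.0, 0.5

for x in [0, 2, 4, 6, 8]:
    z = alpha1 + alpha2 * x
    print(f"X={x}  probit P={normal_cdf(z):.4f}  logit P={logistic_cdf(z):.4f}")
```

Both functions are S-shaped and bounded by 0 and 1; for the same index value the normal CDF approaches its limits faster than the logistic CDF.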


==See also==
* {{Annotated link|Binary regression}}

* {{Annotated link|Chow test}}
* {{Annotated link|Statistical hypothesis testing|Hypothesis testing}}
* {{Annotated link|Indicator function}}
* {{Annotated link|Linear discriminant analysis|Linear discriminant function}}
* {{Annotated link|Multicollinearity}}
* {{Annotated link|One-hot}}
* [[Tobit model]]


==References==
{{notelist}}
{{Reflist}}


==Further reading==
*{{cite book |first1=Dimitrios |last1=Asteriou |first2=S. G. |last2=Hall |author-link2=Stephen G. Hall |title=Applied Econometrics |location=London |publisher=Palgrave Macmillan |edition=3rd |year=2015 |isbn=978-1-137-41546-2 |chapter=Dummy Variables |pages=209–230 }}
*{{cite book |last=Kooyman |first=Marius A. |year=1976 |title=Dummy Variables in Econometrics |location=Tilburg |publisher=Tilburg University Press |isbn=90-237-2919-6 }}


==External links==
{{Wikiversity|Dummy variable (statistics)}}
*{{cite web |first=Marloes |last=Maathuis|author-link=Marloes Maathuis |title=Chapter 7: Dummy variable regression |work=Stat 423: Applied Regression and Analysis of Variance |date=2007 |url=http://stat.ethz.ch/~maathuis/teaching/stat423/handouts/Chapter7.pdf |archive-date=December 16, 2011 |archive-url=https://web.archive.org/web/20111216051820/https://stat.ethz.ch/~maathuis/teaching/stat423/handouts/Chapter7.pdf }}
* http://www.stat.yale.edu/Courses/1997-98/101/anovareg.htm
*{{cite web |first=John |last=Fox |date=2010 |title=Dummy-Variable Regression |url=https://socialsciences.mcmaster.ca/jfox/Courses/SPIDA/dummy-regression-notes.pdf }}
* http://udel.edu/~mcdonald/statancova.html
*{{cite web |first=Samuel L. |last=Baker |title=Dummy Variables |date=2006 |url=http://hspm.sph.sc.edu/courses/J716/pdf/716-6%20Dummy%20Variables%20and%20Time%20Series.pdf |archive-date=March 1, 2006 |archive-url=https://web.archive.org/web/20060301032127/http://hspm.sph.sc.edu/courses/J716/pdf/716-6%20Dummy%20Variables%20and%20Time%20Series.pdf }}


{{DEFAULTSORT:Dummy Variable (Statistics)}}

Latest revision as of 01:19, 8 December 2023

In regression analysis, a dummy variable (also known as indicator variable or just dummy) is one that takes a binary value (0 or 1) to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.[1] For example, if we were studying the relationship between biological sex and income, we could use a dummy variable to represent the sex of each individual in the study. The variable could take on a value of 1 for males and 0 for females (or vice versa). In machine learning this is known as one-hot encoding.

Dummy variables are commonly used in regression analysis to represent categorical variables that have more than two levels, such as education level or occupation. In this case, multiple dummy variables would be created to represent each level of the variable, and only one dummy variable would take on a value of 1 for each observation. Dummy variables are useful because they allow us to include categorical variables in our analysis, which would otherwise be difficult to include due to their non-numeric nature. They can also help us to control for confounding factors and improve the validity of our results.
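A minimal sketch of this encoding, assuming a hypothetical helper <code>make_dummies</code> and made-up education data (neither comes from the article's sources):

```python
def make_dummies(values):
    """Return one 0/1 column per distinct level found in `values`."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical observations of a three-level categorical variable.
education = ["primary", "secondary", "tertiary", "secondary"]
dummies = make_dummies(education)

for level, column in dummies.items():
    print(level, column)
```

Each observation has exactly one dummy equal to 1, so the columns for a given row always sum to 1.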

As with any addition of variables to a model, adding dummy variables will not reduce, and will typically increase, the within-sample model fit (coefficient of determination), but at a cost of fewer degrees of freedom and loss of generality of the model (out of sample model fit). Too many dummy variables result in a model that does not provide any general conclusions.

Dummy variables are useful in various cases. For example, in econometric time series analysis, dummy variables may be used to indicate the occurrence of wars or major strikes. A dummy variable can thus be thought of as a Boolean, i.e., a truth value represented as the numerical value 0 or 1 (as is sometimes done in computer programming).

Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: D1=1 if the observation is for summer, and equals zero otherwise; D2=1 if and only if autumn, otherwise equals zero; D3=1 if and only if winter, otherwise equals zero; and D4=1 if and only if spring, otherwise equals zero. In the panel data fixed effects estimator, dummies are created for each of the units in cross-sectional data (e.g. firms or countries) or for each of the periods in a pooled time series. However, in such regressions either the constant term has to be removed, or one of the dummies has to be removed, making it the base category against which the others are assessed, for the following reason:

If dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient is the constant term; if the vector-of-ones variable were also present, this would result in perfect multicollinearity,[2] so that the matrix inversion in the estimation algorithm would be impossible. This is referred to as the dummy variable trap.
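The trap can be illustrated with a small numerical check. The sketch below is an assumption-laden example rather than a standard routine: the seasonal data are made up, and <code>gram_determinant</code> is a hypothetical helper that computes the determinant of X′X exactly using fractions. A zero determinant means X′X cannot be inverted.

```python
from fractions import Fraction


def gram_determinant(rows):
    """Determinant of X'X, computed exactly by Gaussian elimination."""
    k = len(rows[0])
    g = [[Fraction(sum(r[i] * r[j] for r in rows)) for j in range(k)]
         for i in range(k)]
    det = Fraction(1)
    for c in range(k):
        # Find a row at or below c with a non-zero entry in column c.
        pivot = next((r for r in range(c, k) if g[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular: no usable pivot in this column
        if pivot != c:
            g[c], g[pivot] = g[pivot], g[c]
            det = -det  # a row swap flips the sign of the determinant
        det *= g[c][c]
        for r in range(c + 1, k):
            factor = g[r][c] / g[c][c]
            for j in range(c, k):
                g[r][j] -= factor * g[c][j]
    return det


# Hypothetical observations; each row is [constant, D1, D2, D3, D4].
order = ["summer", "autumn", "winter", "spring"]
seasons = ["summer", "autumn", "winter", "spring", "summer", "winter"]
rows_full = [[1] + [1 if s == name else 0 for name in order] for s in seasons]

# Dropping the last dummy makes spring the base category.
rows_reduced = [row[:-1] for row in rows_full]

print("constant plus all four dummies:", gram_determinant(rows_full))
print("constant plus three dummies:   ", gram_determinant(rows_reduced))
```

With all four dummies and the constant, the columns are linearly dependent and the determinant is zero; after one dummy is dropped, the determinant is non-zero and the inversion can proceed.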

See also


References

  1. ^ Draper, N.R.; Smith, H. (1998) Applied Regression Analysis, Wiley. ISBN 0-471-17082-8 (Chapter 14)
  2. ^ Suits, Daniel B. (1957). "Use of Dummy Variables in Regression Equations". Journal of the American Statistical Association. 52 (280): 548–551. JSTOR 2281705.

Further reading

  • Asteriou, Dimitrios; Hall, S. G. (2015). "Dummy Variables". Applied Econometrics (3rd ed.). London: Palgrave Macmillan. pp. 209–230. ISBN 978-1-137-41546-2.
  • Kooyman, Marius A. (1976). Dummy Variables in Econometrics. Tilburg: Tilburg University Press. ISBN 90-237-2919-6.