Navigating the page

Linear regression with categorical regressors. Its quality parameters and interpretation

Loglinear analysis. A preliminary exploration of a deep interaction structure

Analysis of variance. Implementing the discovered deep interaction structure

Regression with main effects and deep interaction structure
Modeling the ESS database's variable "Trust in the election institution" ("National elections in the country are free and fair")
The case illustrates regression with deep interaction effects, analysis of variance both as a criterion of measurement level and as an exploratory method for finding strong effects, and descriptive statistics.
Required level of qualification in data analysis: high.
Competence in regression modeling, analysis of variance, loglinear analysis, and descriptive statistics is desirable.
Analysis of variance. A criterion for variables' level of measurement
Each video is shorter than 3 minutes.
The problem formulation. Researchers debate whether ordinal variables with 4-5 gradations should be considered categorical or interval. In the first case, such variables must be dichotomized before entering a regression; in the second, they can be used "as is". I propose analysis of variance as a scale criterion for resolving this contradiction. The method's results depend on the specifics of the case.
Analysis of variance makes it possible to model the dependence of a response on a predictor of interest both ignoring the distances among its categories and taking them into account (ANOVA with the predictor as a factor and as a covariate, respectively). The R-squared in the first case cannot be lower than in the second.
Precisely the difference between the by-factor and with-covariate R-squared values allows a conclusion about whether the distances among an ordinal variable's categories are "natural". In other words, if the difference between the two R-squared values is very small, we can conclude that the distances among the variable's categories correspond to the distances among its numeric codes, regardless of whether the model requires such a correspondence (ANOVA with a covariate) or not (ANOVA by factor). And since the distances are "natural", the variable can be considered interval.
In my case, the difference between the two R-squared values indicates that the variable "Interest in politics" can be considered interval, while the variables "Proximity to any party" and "Self-assessment of social activity" should be considered categorical. Based on experience, I propose a threshold difference between the two R-squared values of 10%. Keep in mind that using ANOVA as a scale criterion is context-dependent (it depends on which variable is the response); with another dependent variable, our 3 regressors might well behave differently.
Comments can be left directly on YouTube.
Analysis of variance and linear regression with categorical regressors. A preliminary exploration of the functional link's shape
Each video is shorter than 3 minutes.
I would like to build a model that forecasts Europeans' degree of trust in the election system. For this purpose, I use hypothetically relevant variables from the ESS database: the respondent's age, gender, and highest level of education; whether the respondent voted in the last national election; self-assessment of one's place in society; interest in politics; the importance of living in a democratically governed country; closeness to a party; and how often the respondent takes part in social activities. The dependent variable (the degree of trust in the election system) has 11 gradations; the modal gradation (the 10th) has a frequency of 22.3%. Therefore, forecasting by the modal gradation alone yields an accuracy of roughly 22.3%. Consequently, I need a model that forecasts the dependent variable more accurately.
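The modal-forecast baseline is easy to compute. A small Python sketch with an invented frequency table (only the 22.3% modal share matches the text; the other counts are made up for illustration):

```python
from collections import Counter

# hypothetical answers on an 11-point trust scale (codes 0..10);
# the counts are illustrative, not the actual ESS frequencies
answers = ([10] * 223 + [8] * 150 + [5] * 130 + [7] * 120 + [0] * 90 +
           [9] * 87 + [6] * 80 + [3] * 60 + [4] * 30 + [2] * 20 + [1] * 10)

counts = Counter(answers)
modal_value, modal_freq = counts.most_common(1)[0]
baseline = modal_freq / len(answers)
# predicting the modal category for everyone is right `baseline` of the time;
# any useful model must beat this share
```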
Analysis of variance can indicate the highest accuracy that may be attained. This unique property of ANOVA stems from its ability to model the dependence of a response on a predictor of interest without taking into account either the distances among its categories (ANOVA by factor) or the shape of the functional link. Moreover, ANOVA can take into account any interaction effects constructed from the variables in the set under consideration. Under these circumstances, ANOVA's R-squared indicates the highest accuracy attainable with that variable set. A model with interaction effects is more complicated than one without them, which is why interaction effects are usually ignored when a regression model is being developed. In that case, it is reasonable to use ANOVA's R-squared without interaction effects as the benchmark.
I run ANOVA without interaction effects. Categorical variables go in the factors window; scale and binary variables go in the covariates window. Then I customize the model to exclude interaction effects. R-squared equals 29.3% and all the effects are significant. Since this R-squared is higher than the modal forecast's accuracy, that is good news: I may proceed to develop a regression model. But first I need to dichotomize all the categorical regressors.
I start the ANOVA procedure with the "Parameter estimates" option. The parameter estimates table is analogous to a table of regression coefficients. In it, I can see which categories of the categorical variables have roughly equal influence on the dependent variable. Such categories may be merged; for instance, respondents who feel very close to a party and those who feel quite close to a party may be treated as one category.
To produce a script for an SPSS syntax file, I copy the table into an MS Excel file. I need to apply an expression to all the cells containing categorical variables; for this purpose, I use an Excel function that combines text blocks with cell contents.
The needed expression consists of 5 parts: 3 text blocks and 2 cell references. After filling the expression down the column, I move the resulting cells into an SPSS syntax file. Then I apply the frequencies command to the constructed binary variables.
The new variables should be checked for having enough values of "1". To do this, I move the "Statistics" table into an MS Excel file, transpose it, sort it, and find that the smallest frequency of "1" equals 38, which is acceptable. Then I recode the new variables' missing values into "0".
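The dichotomization and frequency check can be sketched in Python on invented data (the variable name prtdgcl comes from the text; the category counts and the minimum-count threshold here are illustrative):

```python
# hypothetical codes for "How close to a party" (1 = very close .. 4 = not close),
# with None for refusals; the counts are made up
party_close = [1] * 25 + [2] * 80 + [3] * 120 + [4] * 140 + [None] * 35

dummies = {}
for code in (1, 2, 3, 4):
    # a missing answer counts as 0, matching the recode step described above
    dummies[f"prtdgcl_{code}"] = [1 if v == code else 0 for v in party_close]

MIN_ONES = 30  # illustrative threshold for "enough values of 1"
kept = {name: d for name, d in dummies.items() if sum(d) >= MIN_ONES}
# the dummy for code 1 has only 25 ones, so it is dropped
```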
After the recoding, I can start the linear regression. In the "Independent(s)" window I place the regressors treated as scale along with all the new variables. I get an 8-step model whose R-squared equals 27.6%. It is smaller than ANOVA's R-squared without interaction effects, but larger than the modal forecast's accuracy.
The difference between the two R-squared values allows a conclusion about whether the linearity of the functional link is "natural". In other words, if the difference is very small (less than 10%), we may conclude that linearity holds regardless of whether the model requires such a functional link's shape (linear regression) or not (ANOVA). And since the linearity is "natural", linear regression may be applied.
Comments can be left directly on YouTube.
Linear regression with categorical regressors. Its quality parameters and interpretation
Each video is shorter than 3 minutes.
Since linear regression may be applied to my data, I continue constructing the model. I start the regression procedure; the regressors treated as scale and the binary variables are placed into the "Independent(s)" window. Other settings remain "as is".
First, I exclude weak regressors: those with significance above 0.05 or with rather small coefficients. If such regressors remain in a model, its R-squared may be overestimated. For this purpose, I use a stepwise regression procedure together with the product of the unstandardized and standardized coefficients. The stepwise procedure removed 7 binary variables.
Looking at the remaining regressors, I see that some are significant but weak: they exert an influence on the dependent variable small enough to be neglected. I exclude such weak regressors manually and restart the stepwise procedure. After 3 iterations, I obtain 27 strong regressors.
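The weeding-out of weak regressors can be imitated with a simple backward loop: refit, find the smallest standardized coefficient, drop its regressor if it falls below a cut-off. This is a sketch of the idea on synthetic data, not the SPSS stepwise procedure itself; the 0.10 cut-off and the data-generating formula are assumptions.

```python
import numpy as np

def standardized_betas(X, y):
    """OLS coefficients after z-scoring the predictors and the response."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta

# made-up data: only the first two of five regressors truly matter
rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 5))
y = 1.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 1, n)

names = [f"x{i}" for i in range(5)]
keep = list(range(5))
WEAK = 0.10  # illustrative cut-off for a "weak" standardized coefficient
while True:
    beta = standardized_betas(X[:, keep], y)
    weakest = int(np.argmin(np.abs(beta)))
    if abs(beta[weakest]) >= WEAK:
        break
    del keep[weakest]  # drop the weakest regressor and refit

strong = [names[i] for i in keep]
```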
Before interpreting the model, I examine 3 quality parameters: unbiasedness, homoscedasticity, and stability. To examine unbiasedness and homoscedasticity, I need to save the deleted residuals; they indicate which cases the model predicts poorly. The "Explore" procedure shows that the deleted residuals' mean does not equal zero in the population; thus, the model systematically overestimates the dependent variable. Consequently, I add the deleted residuals' mean to the model's constant.
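The unbiasedness check amounts to testing whether the residual mean differs from zero. A hedged sketch on made-up deleted residuals (the -0.2 shift and sample size are invented):

```python
import math
import random

random.seed(1)
# made-up deleted residuals whose mean sits below zero, i.e. the model
# systematically overestimates the response
residuals = [random.gauss(-0.2, 0.3) for _ in range(200)]

n = len(residuals)
mean = sum(residuals) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
t = mean / (sd / math.sqrt(n))  # one-sample t statistic for H0: mean = 0

# |t| well beyond ~2 says the nonzero mean is systematic, so the
# constant should be corrected by adding `mean` to it
biased = abs(t) > 2
```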
Homoscedasticity means that the residuals' spread does not vary from one regressor value to another. That is why many methods assess homoscedasticity by measuring the association or correlation between each regressor and the residuals. I think one of the most relevant and exhaustive methods for examining homoscedasticity is ANOVA, because it does not depend on the shape of the functional link, while many of its analogues do. In my case, although some regressors significantly influence the residuals, the cumulative influence is negligibly small because the ANOVA R-squared is roughly zero.
Since my model is homoscedastic, I can be sure that the forecast's accuracy does not vary across the values of each regressor. With a heteroscedastic model, one should switch to robust regression; I will consider that type in another series of videos. A model's stability means that it produces stable results when applied to another dataset. To examine stability, I divide my sample randomly into two subsamples by creating a new Bernoulli-distributed random variable.
I restart the regression procedure, marking which subsample is for training and which for testing. The R values calculated for the training and testing subsamples are roughly equal; their difference is smaller than 10%. Consequently, my model forecasts accurately for both subsamples. If a model is not stable (overtrained), it cannot be generalized to other datasets. Since my model is only slightly biased, homoscedastic, and stable, I may interpret it.
It is convenient to start interpreting with the constant, but often the constant characterizes nobody: this happens when some regressor cannot equal zero. I have such regressors, so I shift them to zero by recoding. After recoding, I replace the mentioned regressors in the regression procedure, which gives the final results: R-squared equals 0.274 and the constant equals 1.33. The constant characterizes those Europeans who have zeroes on all the regressors.
Hence, they are people from the bottom of society who are very interested in politics, voted in the last national election, feel no importance in living in a democratically governed country, are not satisfied with the national government, and refused to answer the question about their social activity. Their education cannot be identified exactly (there are several nonadjacent gradations). Europeans who demonstrate medium social activity or have the education level "General ISCED 4A/4B, access ISCED 5B/lower tier 5A" have a higher level of trust. In contrast, Europeans who are not interested in politics, did not vote, or have the education level "General ISCED 3A, access upper tier ISCED 5A/all 5" have a lower level of trust.
I can portray those Europeans who, according to my model, have the highest level of trust in the election system. They are people from the top of society who are very interested in politics, voted in the last national election, strongly feel the importance of living in a democratically governed country, are absolutely satisfied with the national government, demonstrate medium social activity, and have the education level "General ISCED 4A/4B, access ISCED 5B/lower tier 5A". In contrast, the Europeans with the lowest level of trust are people from the bottom of society who are not interested in politics at all, did not vote in the last national election, feel no importance in living in a democratically governed country, are absolutely dissatisfied with the national government, refused to answer the question about their social activity, and have the education level "General ISCED 3A, access upper tier ISCED 5A/all 5". What are my model's limitations? First, the education variable is rather detailed. Second, I did not include interaction effects, although models with such effects are usually more accurate. I will consider this type of regression in another series of videos.
Comments can be left directly on YouTube.
Loglinear analysis. A preliminary exploration of a deep interaction structure
Each video is shorter than 3 minutes.
Since I am not fully satisfied with the accuracy of the linear regression model without interaction effects (constructed in the previous series of videos), I try to find relevant interaction effects for my model. Estimating all effects of a full factorial model is a computationally demanding task when the variables contain many gradations, so I simplify it by transforming all my regressors and the response into variables with 2 or 3 gradations. I then include the untransformed regressors in the ANOVA model as main effects and the transformed regressors as parts of interaction effects. The variables "How close to party" and "Take part in social activities compared to others of same age", which the ANOVA procedure identified as categorical, should be treated as factors, as should the variable "Education". The other 6 original regressors and the 8 transformed ones should be treated as covariates.
Unfortunately, the ANOVA procedure's dialog interface does not allow constructing interaction effects of the 6th order and higher, so I use an SPSS syntax file and an MS Excel template.
The template comes with a manual on how to use it. After constructing all the needed interaction effects, I insert them into the ANOVA procedure through the syntax file.
I obtain an ANOVA model with all possible interaction effects (a "full factorial model"). Its R-squared equals 0.377; constructing a regression model with such a high R-squared would be challenging. The problem is that a full factorial model contains too many effects and is too difficult to interpret, so I must remove weak effects. A further problem is that the effects of a full factorial model often cannot be estimated (because of collinearity). How do I choose which effects to remove? I can use loglinear analysis, again simplifying the heavy computing task by working with the regressors and response transformed into variables with 2 or 3 gradations.
I examine whether a model is simple enough with the "General loglinear" procedure. If the number of sampling zeroes is less than half the number of cells in the multiway table, the model is acceptable. The more gradations the explored variables have, the more sampling zeroes appear. After the variables of interest are prepared, I launch the "Model selection" procedure via the "Paste" button.
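Counting sampling zeroes in a multiway table is mechanical. A sketch on a made-up three-variable table with 2 x 3 x 3 = 18 cells:

```python
from collections import Counter
from itertools import product

# made-up recoded cases: three variables with 2, 3, and 3 gradations
cases = [
    (0, 1, 2), (1, 0, 0), (0, 2, 2), (1, 2, 1), (0, 0, 0),
    (1, 1, 2), (0, 2, 1), (1, 0, 2), (0, 1, 0), (1, 2, 2),
]
levels = [(0, 1), (0, 1, 2), (0, 1, 2)]  # gradations per variable

table = Counter(cases)
n_cells = 1
for lv in levels:
    n_cells *= len(lv)

# a sampling zero is a cell of the multiway table with no observed cases
zero_cells = sum(1 for cell in product(*levels) if table[cell] == 0)
acceptable = zero_cells < n_cells / 2  # the rule of thumb from the text
```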
I change the "Model selection" procedure's settings in the syntax file because more iterations and steps are usually needed. To make the loglinear results more reliable, I change the significance level (P-value) from 0.05 to 0.01. The "Model selection" procedure completed in 473 steps. At the 473rd step, there are 44 significant partial effects containing the dependent variable, and the model's fit criterion is not significant. Hence, I may carry these partial effects into ANOVA modeling. Before returning to ANOVA modeling in the next series of videos, I should emphasize that the loglinear results are valuable in themselves: I can explore any of the significant partial effects in depth. It is most rewarding to take the partial effects with a large chi-squared value, e.g., the interaction effect of age, education, trust in the election system, satisfaction with the government, and voting.
The fact that this interaction effect was selected means that any higher-order interaction effects containing these five variables (age, education, trust in the election system, satisfaction with the government, and voting) are not significant. For example, the 6th-order interaction effect containing those five variables plus, say, interest in politics is not significant, so it is not worth considering; the simplest interaction effect worth considering is the 5th-order effect itself. Conversely, why may I not consider the 4th-order interaction effect that omits, say, the voting variable and contains only age, education, trust in the election system, and satisfaction with the government? Because by doing so I would run into Simpson's paradox.
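Simpson's paradox is easy to demonstrate with made-up counts (the numbers below are illustrative, not ESS data): within each voting stratum the satisfied trust more, yet the table collapsed over the vote variable says the opposite.

```python
# pairs are (number trusting the election system, group size)
voted        = {"satisfied": (81, 87),   "dissatisfied": (234, 270)}
did_not_vote = {"satisfied": (192, 263), "dissatisfied": (55, 80)}

def rate(pair):
    trusting, total = pair
    return trusting / total

# within each voting stratum the satisfied trust more...
within_ok = (rate(voted["satisfied"]) > rate(voted["dissatisfied"])
             and rate(did_not_vote["satisfied"]) > rate(did_not_vote["dissatisfied"]))

# ...but collapsing over the vote variable reverses the comparison
sat = tuple(a + b for a, b in zip(voted["satisfied"], did_not_vote["satisfied"]))
dis = tuple(a + b for a, b in zip(voted["dissatisfied"], did_not_vote["dissatisfied"]))
reversed_overall = rate(sat) < rate(dis)
```

This is why the voting variable cannot simply be dropped from the 5th-order effect: the stratified and collapsed associations can point in opposite directions.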
Exploring this particular effect apart from the whole model (which contains it along with 65 other partial effects), I want to figure out which categories of the five variables are associated. For this, I start the "General loglinear" procedure with these variables and customize the model design by excluding the explored interaction effect and including all its subeffects.
Such a model is called "reduced". For an interaction effect of the 7th order or higher, one may use my MS Excel template from Video 2. After saving this model's residuals, I standardize them. Positive standardized residuals exceeding 1.96 indicate directly associated categories.
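The residual logic can be sketched on a simple two-way table: under independence, standardized (Pearson) residuals above 1.96 flag directly associated cells. This is a simplified stand-in for the reduced loglinear model's residuals, and the counts are invented.

```python
import math

# made-up 2x2 table: rows = trusts the election system (0 = no, 1 = yes),
# columns = satisfied with the government (0 = no, 1 = yes)
observed = [[180, 120],
            [ 90, 210]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
total = sum(row_tot)

flagged = []
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / total
        z = (observed[i][j] - expected) / math.sqrt(expected)
        if z > 1.96:  # positive residual -> directly associated categories
            flagged.append((i, j, round(z, 2)))
# here the no/no and yes/yes cells are flagged: trusting goes with satisfied
```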
In my case, being old (above 60), holding a master's degree or higher, being satisfied with the government, not voting, and trusting in the election system are strongly and directly associated with one another.
Comments can be left directly on YouTube.
Analysis of variance. Implementing the discovered deep interaction structure
Each video is shorter than 3 minutes.
In this series of videos I return to preparing the regression modeling. I combine the loglinear results with the ANOVA procedure. To do so, I expand all 44 partial effects containing the dependent variable of trust in the election system into the respective successions of subeffects, turning once more to my MS Excel template. I insert all 44 partial effects into the template, separate each of them into simple variables by rows, and remove the dependent variable. Finally, I replace the dichotomized variables of age and education with the original age variable and the original four-category education variable. This change makes the model more complicated but more accurate, provided the computer can manage it.
The main effects for my model remain the same as in the model from the 3rd series of videos. After running the ANOVA syntax with the main effects and the constructed interaction effects, I get a table with the effects' sizes ("Sum of squares" and "Mean square"). I move the table to an MS Excel file and sort it in descending order by the "Mean square" column. Then I look through the table from top to bottom to find the first effect whose Sig. exceeds 0.1.
This effect is the boundary: the effects above it are considered strong. I move these strong effects into the ANOVA syntax and launch it again, repeating this step until all remaining effects may be considered strong. I then look through the remaining effects for those treated as factors and construct dummy variables for them. For this, one may use another MS Excel template named "Dichotomizer".
I dichotomize the categorical variables of education, closeness to a party, and social activity. The template comes with a manual on how to use it.
With the dummy variables constructed, I construct variables for the selected interaction effects. For this, it is useful to consult the "Parameter estimates" table of the ANOVA procedure, which contains all the interaction effects and their parts relevant to the categorical variables' gradations. This table provides preliminary linear regression coefficients for effects treated as covariates and for the categories of effects treated as factors. These coefficients help to simplify a model, because homogeneous effects with roughly equal coefficients may be considered jointly. Thus, I may join the categories [prtdgcl=1] (Europeans who are very close to a party) and [prtdgcl=2] (Europeans who are quite close to a party).
To construct the 337 variables, I use MS Excel commands and SPSS syntax.
Variables constructed by multiplying several binary variables are usually not distributed uniformly, because values of 1 occur less often than values of 0, whereas a uniform distribution is desirable for binary regressors. Once the dummy variables and the interaction-effect variables are constructed, I check whether they contain enough valid cases and whether their distributions are acceptable: I find and remove those with fewer than 30 valid cases and, among the constructed binary variables, those whose less frequent category occurs fewer than 30 times. The remaining effects, 337 in number, are ready for regression modeling. That is not a simple task for stepwise regression procedures, but it is computable.
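Constructing the product variables and screening them can be sketched in Python (the variable names follow the text's naming scheme; the data and category shares are invented):

```python
import random

random.seed(5)
n = 600
# hypothetical building blocks: two dummies and one interval variable (age)
edu_low = [1 if random.random() < 0.4 else 0 for _ in range(n)]
mid_society = [1 if random.random() < 0.5 else 0 for _ in range(n)]
age = [random.randint(18, 89) for _ in range(n)]

# product variables for the interaction effects, named as in the text's scheme
constructed = {
    "edulvlb_4__1.plinsoc_3__1": [a * b for a, b in zip(edu_low, mid_society)],
    "edulvlb_4__1.plinsoc_3__1.agea": [a * b * c
                                       for a, b, c in zip(edu_low, mid_society, age)],
}

def acceptable(values, min_count=30):
    """Keep a constructed variable only if both its zero and nonzero groups
    occur at least `min_count` times (the threshold used in the text)."""
    nonzero = sum(1 for v in values if v != 0)
    return min(nonzero, len(values) - nonzero) >= min_count

kept = {name: v for name, v in constructed.items() if acceptable(v)}
```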
All my constructed variables are easily interpretable. Thus, the variable "edulvlb_4__1.plinsoc_3__1.agea" holds the age of Europeans who have a lower level of education and a medium position in society, and the variable "edulvlb_4__1.plinsoc_3__1" indicates which Europeans combine a lower level of education with a medium position in society. Before starting the regression modeling, I return to the "Parameter estimates" table, whose preliminary coefficients help to join homogeneous effects with roughly equal coefficients; this simplification does not reduce the model's R-squared. I will consider this useful property of the "Parameter estimates" table in detail in another series of videos.
Comments can be left directly on YouTube.