This is part of the methodology tutorial (see its table of contents).
This tutorial is a short introduction to simple, mostly confirmatory statistics for beginners.
Quantitative data come in different types or forms, as we have seen in the descriptive statistics and scales tutorial.
Let's recall the three data types:
For each type of measure, or combination of types of measures, you will have to use particular analysis techniques. In other words, most statistical procedures only work with certain data types.
There is a wider choice of statistical techniques for quantitative (interval) variables. Therefore, scales like (1) strongly agree, (2) agree, (3) somewhat agree, etc. are usually treated as interval variables, although it is not totally correct to do so.
Data types are not the only technical constraints on the selection of a statistical procedure; sample size and data assumptions are others.
In addition to constraints on data types, many statistical techniques only work for given data distributions and relations between variables.
In practical terms this means that you not only have to adapt your analysis techniques to your types of measures, but you also (roughly) should respect other data assumptions.
The most frequent assumption about relations between variables is that the relationships are linear.
In the following example the relationship is non-linear: students who show weak daily computer use have bad grades, but so do the ones who show very strong use.
Popular measures like Pearson's r correlation will "not work", i.e. you will get a very weak correlation and therefore miss this non-linear relationship.
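A small sketch of this pitfall in plain Python (the numbers are made up for illustration): Pearson's r comes out as exactly zero for a perfectly symmetric inverted-U relation between daily computer use and grades, even though the pattern is obvious.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: moderate users get the best grades, weak and
# very strong users get the worst ones (inverted U).
use_hours = [0, 1, 2, 3, 4]
grades = [1.0, 4.0, 5.0, 4.0, 1.0]
r = pearson_r(use_hours, grades)  # r = 0: the clear pattern is invisible to r
```

A scatter plot would reveal the pattern immediately, which is why looking at the data before computing coefficients is always recommended.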
Most methods for interval data also require a so-called normal distribution (see the Methodology tutorial - descriptive statistics and scales).
If you have data with "extreme cases" and/or data that are skewed (asymmetrical), some individuals will have much more "weight" than the others.
Hypothetical example:
In addition, you should understand that extreme values already carry more weight in variance-based analysis methods (e.g. regression analysis, ANOVA, factor analysis, etc.), since distances are computed as squares.
The goal of statistical analysis is quite simple: find structure in the data. We can express this principle with two synonymous formulas:
DATA = STRUCTURE + NON-STRUCTURE
DATA = EXPLAINED VARIANCE + NOT EXPLAINED VARIANCE
Example: Simple regression analysis
In other words: regression analysis tries to find a line that will maximize prediction and minimize residuals.
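The DATA = EXPLAINED + UNEXPLAINED variance principle can be illustrated with a small made-up data set (plain Python, hypothetical numbers): for a least-squares line, the total variation splits exactly into an explained part and a residual part.

```python
# Hypothetical data: fit a least-squares line and decompose the variance.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.5, 3.5, 4.5, 4.5]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
pred = [intercept + slope * x for x in xs]

ss_total = sum((y - my) ** 2 for y in ys)                   # DATA
ss_explained = sum((p - my) ** 2 for p in pred)             # STRUCTURE
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, pred))   # NON-STRUCTURE
# ss_total == ss_explained + ss_residual (exactly, for least squares)
```

"Maximize prediction and minimize residuals" means choosing the slope and intercept that make `ss_residual` as small as possible.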
Let's have a look at what we mean by statistical analysis and what you typically have to do. We shall come back to most of these stages throughout this tutorial page:
Note: with statistical data analysis programs, you can easily do several steps in one operation.
All statistical analyses produce various kinds of coefficients, i.e. numbers that summarize certain kinds of information.
Always make sure to use only coefficients that are appropriate for your data.
There are four main kinds of coefficients, and you will find them in most analysis methods:
These four types are mathematically connected: e.g. the significance level depends not just on the size of your sample, but also on the strength of a relation.
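This connection can be sketched with the standard t test for a Pearson correlation (the correlation value and sample sizes below are made up): the same correlation of .30 is far from significant with 20 cases, but clearly significant with 200.

```python
import math

def t_for_correlation(r, n):
    """t statistic used to test whether a Pearson correlation differs from 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Same strength of relation, different sample sizes:
t_small_sample = t_for_correlation(0.30, 20)    # ~1.33, below the ~2.1 critical value
t_large_sample = t_for_correlation(0.30, 200)   # ~4.43, clearly significant
```

So a significant result does not by itself mean a strong relation: with enough cases, even a very weak association becomes significant.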
Statistical data analysis methods can be categorized according to the data types we introduced at the beginning of this tutorial module.
The following table shows a few popular simple bi-variate analysis methods for a given independent (explaining) variable X and a dependent (to be explained) variable Y.
| | Dependent variable Y: Quantitative (interval) | Dependent variable Y: Qualitative (nominal or ordinal) |
|---|---|---|
| Independent (explaining) X: Quantitative | Correlation and regression | Logistic regression |
| Independent (explaining) X: Qualitative | Analysis of variance | Crosstabulations |
Popular multi-variate analysis methods:
| | Dependent variable Y: Quantitative (interval) | Dependent variable Y: Qualitative (nominal or ordinal) |
|---|---|---|
| Independent (explaining) X: Quantitative | Factor analysis, multiple regression, SEM, cluster analysis | Logit; alternatively, transform X into a qualitative variable and see below, or split a variable into several dichotomous (yes/no) variables and see to the left |
| Independent (explaining) X: Qualitative | Analysis of variance (ANOVA) | Multidimensional scaling etc. |
Crosstabulation is a popular technique to study relationships between nominal (categorical) or ordinal variables.
Crosstabulation is simple, but beginners nevertheless often get it wrong. You have to remember the basic objective of simple data analysis: explain variable Y with variable X.
Since you want to know the probability (percentage) that a value of X leads to a value of Y, you will have to compute percentages in order to be able to "talk about probabilities".
In a tabulation, the X variable is usually put on top (i.e. its values show in columns), but you can do it the other way round. Just make sure that you get the percentages right!
Let's recall the simple experimentation paradigm in which most statistical analysis is grounded, since research is basically about comparison. Note: X is put to the left (not on top):
| Treatment | Effect (O) | Non-effect (O) | Total effect for a group |
|---|---|---|---|
| Treatment (group X) | bigger | smaller | 100% |
| Non-treatment (group non-X) | smaller | bigger | 100% |
You have to interpret this table in the following way: The chance that a treatment (X) leads to a given effect (Y) is higher than the chance that a non-treatment will have this effect.
A "real" statistical crosstabulation example will be presented below. Let's first discuss a few coefficients that can summarize some important information.
Pearson's chi-square is by far the most common. If simply "chi-square" is mentioned, it is probably Pearson's chi-square. This statistic is used to test the hypothesis of no association between columns and rows in tabular data. It can be used with nominal data.
We want to know if ICT training will explain use of presentation software in the classroom.
There are two survey questions:
Now let's examine the results
X = Did you receive some formal ICT training?

| Y = Do you use a computer to prepare slides for classroom presentations? | | No | Yes | Total |
|---|---|---|---|---|
| Regularly | Count | 4 | 45 | 49 |
| | % within X | 44.4% | 58.4% | 57.0% |
| Occasionally | Count | 4 | 21 | 25 |
| | % within X | 44.4% | 27.3% | 29.1% |
| Never | Count | 1 | 11 | 12 |
| | % within X | 11.1% | 14.3% | 14.0% |
| Total | Count | 9 | 77 | 86 |
| | % within X | 100.0% | 100.0% | 100.0% |
The probability that computer training ("Yes") leads to greater use of the computer to prepare slides is only marginally higher (you can see this by comparing the percentages line by line).
The statistics tell the same story:
Therefore: not only is the relationship very weak, but it cannot be interpreted. In other words, there is absolutely no way to assert that ICT training leads to more frequent use of presentation software in our case.
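To make the computation concrete, here is Pearson's chi-square for the table above, computed from scratch in plain Python (the counts are copied from the crosstab; 5.99 is the standard critical value for df = 2 at the .05 level):

```python
# Pearson chi-square for the ICT training crosstab.
# Rows = Regularly / Occasionally / Never, columns = No training / Yes training.
observed = [[4, 45],
            [4, 21],
            [1, 11]]

total = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(row[j] for row in observed) for j in range(len(observed[0]))]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under the "no association" hypothesis.
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# chi2 is about 1.15, well below the 5.99 critical value for df = 2 at
# the .05 level: not significant. Note also that expected counts in the
# "No" column fall below 5, a standard reason to distrust the test here.
```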
We want to know if the teacher's belief that students will gain autonomy when using Internet resources has an influence on classroom practice, i.e. on organizing activities where learners have to search for information on the Internet.
X = Learners will gain autonomy through using Internet resources (teacher belief)

| Y = Search information on the Internet | | 0 Fully disagree | 1 Rather disagree | 2 Rather agree | 3 Fully agree | Total |
|---|---|---|---|---|---|---|
| 0 Regularly | Count | 0 | 2 | 9 | 11 | 22 |
| | % within X | .0% | 18.2% | 19.6% | 42.3% | 25.6% |
| 1 Occasionally | Count | 1 | 7 | 23 | 11 | 42 |
| | % within X | 33.3% | 63.6% | 50.0% | 42.3% | 48.8% |
| 2 Never | Count | 2 | 2 | 14 | 4 | 22 |
| | % within X | 66.7% | 18.2% | 30.4% | 15.4% | 25.6% |
| Total | Count | 3 | 11 | 46 | 26 | 86 |
| | % within X | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
The statistical coefficient we use is Somers' d (a directional ordinal-by-ordinal measure):

| Values | Somers' d | Significance |
|---|---|---|
| Symmetric | -.210 | .025 |
| Y = Search information on the Internet (dependent) | -.215 | .025 |
Therefore, the teacher's belief does explain something, but the relationship is very weak.
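As a sketch of where that -.215 comes from, Somers' d (with Y dependent) can be recomputed from the crosstab counts: it compares concordant and discordant pairs of cases, ignoring pairs tied on X.

```python
# Somers' d (Y dependent) recomputed from the crosstab counts above.
# Rows: Y = 0 Regularly, 1 Occasionally, 2 Never
# Columns: X = 0 Fully disagree ... 3 Fully agree
counts = [[0, 2, 9, 11],
          [1, 7, 23, 11],
          [2, 2, 14, 4]]

cells = [(i, j, n) for i, row in enumerate(counts) for j, n in enumerate(row)]
concordant = discordant = tied_on_y_only = 0
for i1, j1, n1 in cells:
    for i2, j2, n2 in cells:
        if i2 > i1 and j2 > j1:        # both variables increase together
            concordant += n1 * n2
        elif i2 > i1 and j2 < j1:      # X increases while Y decreases
            discordant += n1 * n2
        elif i2 == i1 and j2 > j1:     # tied on Y but different on X
            tied_on_y_only += n1 * n2

# d(Y|X): pairs tied on X are ignored; pairs tied only on Y stay
# in the denominator.
d_yx = (concordant - discordant) / (concordant + discordant + tied_on_y_only)
# d_yx is about -0.215, matching the SPSS output above. The sign is negative
# because Y is coded 0 = Regularly ... 2 = Never: more agreement goes with
# more frequent activities.
```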
Analysis of variance (ANOVA, with its multivariate variant MANOVA) is the favorite tool of experimentalists. It is also popular in quasi-experimental research and survey research, as the following example shows.
X is an experimental condition (therefore a nominal variable) and Y usually is an interval variable.
Example: Does presence or absence of ICT usage influence grades?
Significance improves when:
Analysis of variance can be found in two different locations:
In this example we want to know if teacher trainees (e.g. primary teacher students) are different from "real" teachers regarding three kinds of variables:
COP1, COP2, COP3 are indices (composite variables) that range from 0 (little) to 2 (a lot)
Therefore we compare the average (mean) of the populations for each variable.
| Population | | COP1 Frequency of different kinds of learner activities | COP2 Frequency of exploratory activities outside the classroom | COP3 Frequency of individual student work |
|---|---|---|---|---|
| 1 Teacher trainee | Mean | 1.528 | 1.042 | .885 |
| | N | 48 | 48 | 48 |
| | Std. Deviation | .6258 | .6260 | .5765 |
| 2 Regular teacher | Mean | 1.816 | 1.224 | 1.224 |
| | N | 38 | 38 | 38 |
| | Std. Deviation | .3440 | .4302 | .5893 |
| Total | Mean | 1.655 | 1.122 | 1.035 |
| | N | 86 | 86 | 86 |
| | Std. Deviation | .5374 | .5527 | .6029 |
Standard deviations within groups are rather high (in particular for the teacher trainees), which is a bad thing: it means that the members of each group differ strongly among themselves.
At this stage, all you have to do is look at the sig. level, which should be below 0.05: you accept at most a 5% chance that the relationship is due to chance.
| Variables (Y) explained by population (X) | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| COP1 Frequency of different kinds of learner activities * Population | Between Groups | 1.759 | 1 | 1.759 | 6.486 | .013 |
| | Within Groups | 22.785 | 84 | .271 | | |
| | Total | 24.544 | 85 | | | |
| COP2 Frequency of exploratory activities outside the classroom * Population | Between Groups | .703 | 1 | .703 | 2.336 | .130 |
| | Within Groups | 25.265 | 84 | .301 | | |
| | Total | 25.968 | 85 | | | |
| COP3 Frequency of individual student work * Population | Between Groups | 2.427 | 1 | 2.427 | 7.161 | .009 |
| | Within Groups | 28.468 | 84 | .339 | | |
| | Total | 30.895 | 85 | | | |
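As a consistency check in plain Python, the COP1 F ratio can be reconstructed from the means, Ns and standard deviations in the comparison-of-means table above:

```python
# Reconstruct the COP1 one-way ANOVA from group summary statistics.
groups = [  # (n, mean, sd): 1 Teacher trainee, 2 Regular teacher
    (48, 1.528, 0.6258),
    (38, 1.816, 0.3440),
]

grand_n = sum(n for n, _, _ in groups)
grand_mean = sum(n * m for n, m, _ in groups) / grand_n

# Between-groups variation: how far each group mean sits from the grand mean.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
# Within-groups variation: recovered from the group standard deviations.
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)

df_between = len(groups) - 1          # 1
df_within = grand_n - len(groups)     # 84
f_ratio = (ss_between / df_between) / (ss_within / df_within)
# f_ratio is about 6.49, matching the 6.486 printed in the ANOVA table.
```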
Measures of Association

| | Eta | Eta Squared |
|---|---|---|
| Var_COP1 Frequency of different kinds of learner activities * Population | .268 | .072 |
| Var_COP2 Frequency of exploratory activities outside the classroom * Population | .164 | .027 |
| Var_COP3 Frequency of individual student work * Population | .280 | .079 |
Result: associations are weak and the explained variance is very small. The "COP2" relation is not significant.
We already introduced the principle of linear regression above. It is used to compute a trend between an explaining variable X and an explained variable Y. Both must be quantitative variables.
Let's recall the principle: Regression analysis tries to find a line that will maximize prediction and minimize residuals.
We have two parameters that summarize the model:
* The Pearson correlation (r) summarizes the strength of the relation.
* R square (R²) represents the variance explained.
The question: Does teacher age explain exploratory activities outside the classroom?
R | R Square | Adjusted R Square | Std. Error of the Estimate | Pearson Correlation | Sig. (1-tailed) | N |
---|---|---|---|---|---|---|
.316 | .100 | .075 | .4138 | .316 | .027 | 38 |
| | B | Std. Error | Beta | t | Sig. | Zero-order correlation |
|---|---|---|---|---|---|---|
| (Constant) | .706 | .268 | | 2.639 | .012 | |
| AGE Age | .013 | .006 | .316 | 1.999 | .053 | .316 |

Dependent Variable: Var_COP2 Frequency of exploratory activities outside the classroom
All this means:
Formally speaking, the relation is:
exploration scale = 0.706 + 0.013 * AGE
It can also be interpreted as: "only people over 99 are predicted a top score of 2" :)
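Plugging the coefficients from the table into the equation (a quick plain-Python check) shows where the "over 99" joke comes from:

```python
# The fitted model, using the B coefficients from the table above.
intercept, slope = 0.706, 0.013

def predicted_exploration(age):
    """Predicted COP2 exploration score for a teacher of the given age."""
    return intercept + slope * age

# Age at which the model would predict the top score of 2:
age_for_top_score = (2 - intercept) / slope   # about 99.5 years
```

A 40-year-old teacher, for instance, is predicted a score of about 1.23 on the 0-2 scale, so age moves the prediction only very slowly.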
Here is a scatter plot of this relation:
There is no need for statistical coefficients to see that the relation is rather weak, and why the prediction says it takes about 100 years to get there... :)
There are excellent statistics resources on the web. For starters we recommend:
See Research methodology resources for more pointers.