Data Analysis

So far we have looked at SYSTAT to develop a basic idea of how SYSTAT for Windows works. The next step is to examine a few other data analysis procedures (e.g., correlation, regression, t-test) using SYSTAT for Windows. Only a limited number of procedures are discussed in this document. Refer to the SYSTAT documentation for further information.

Downloading Data

The data set in our earlier example was only meant to get you started. Now we will examine another data set with more variables and cases, appropriate for the analysis techniques discussed here.

In this example, you will import an ASCII data file, clas1.txt, created and saved in text format, into SYSTAT for Windows. The data, collected from forty middle school students, contain 28 variables. The first four variables (ID, SEX$, EXP, SCHOOL) are background variables. SEX$ has two levels (m=male, f=female). EXP (prior computer experience) has three levels (1=less than one year, 2=1-2 years, 3=more than 2 years). SCHOOL (type of school system) has three levels (1=rural school, 2=suburban school, 3=urban school). The next 20 variables (C1...C10, M1...M10) are Likert-type responses to a computer opinion survey and a mathematics anxiety survey. The last four variables (MATHSCOR, COMPSCOR, MANX, CANX) are the mathematics test score, the computer test score, the mathematics anxiety grouping, and the cumulative computer opinion survey score. MANX is a dichotomous variable that codes the mathematics anxiety score as low (0) or high (1).

To obtain a copy of this data file:

  • Using a web browser (Netscape, Internet Explorer, lynx, etc.), download Sample SYSTAT Data.
  • Save it to a file (for example, a:\clas1.txt).

Contact an STC consultant if you need assistance.

Import the file into SYSTAT using the method described earlier. The data will be displayed in the Data window, and you are ready to begin your data analysis.

Correlation Analysis

A correlation analysis is performed to quantify the strength of association between two numeric variables. In the following task we will perform a Pearson correlation analysis (SYSTAT can also perform a Spearman rank correlation). The variables used in the analysis are mathscor, compscor, and canx.

Correlation dialog box
  • From the Statistics menu select Correlations->Simple
  • Highlight the variables, individually or collectively, MATHSCOR, COMPSCOR, and CANX, and click [Add-->]
  • Select Options to open the Correlations: Options dialog box
  • Check the Probabilities box and select Uncorrected
  • Click Continue and finally OK

A symmetric matrix of Pearson correlations, as shown below, will be displayed on the screen, followed by a matrix of the corresponding probability values (p-values). The output also includes a Quick Graph in the Graph window: a scatterplot matrix (SPLOM) with one plot for each entry in the correlation matrix. To suppress this feature, specify GRAPH=NONE in the command editor, or select Options from the Edit menu and deselect Statistical Quickgraphs.



Pearson correlation matrix
 
                  MATHSCOR     COMPSCOR         CANX
 MATHSCOR            1.000
 COMPSCOR            0.149        1.000
 CANX                0.068        0.657        1.000

Bartlett Chi-square statistic:    21.909 df=3 Prob= 0.000
 
Matrix of Probabilities
 
                  MATHSCOR     COMPSCOR         CANX
 MATHSCOR            0.000
 COMPSCOR            0.359        0.000
 CANX                0.676        0.000        0.000

Number of observations: 40
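
The probabilities in the second matrix depend only on the correlations and the number of observations. The following is a rough sketch in Python, not SYSTAT code; the function names and the small sample data are made up for illustration. It computes a Pearson coefficient from its definition and converts a coefficient into the t statistic used to test it against zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r = 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical scores, not taken from clas1.txt
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

# COMPSCOR vs. CANX from the output above: r = 0.657, N = 40
t_canx = t_from_r(0.657, 40)
```

For COMPSCOR and CANX this gives a t of about 5.37 on 38 degrees of freedom, which is why the corresponding probability prints as 0.000.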

Correlation plot

Linear Regression

A correlation coefficient tells you that some sort of relation exists between the variables, but it does not tell you much more than that. For example, a correlation of 1.0 means that all points fall exactly on a straight line, but it does not give the equation of that line. When the observations are not perfectly correlated, many different lines may be drawn through the data. To select the line that lies as close as possible to the points, you use regression analysis, which is based on the least-squares principle. In the following task you will perform a simple regression analysis with CANX as the dependent variable and COMPSCOR as the independent variable.
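
To make the least-squares principle concrete, here is a minimal Python sketch (an illustration only, not what SYSTAT runs internally). The function name and the data are hypothetical. It picks the intercept and slope that minimize the sum of squared vertical distances between the points and the line.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data, not taken from clas1.txt
x = [10, 20, 30, 40, 50]
y = [25, 26, 28, 27, 30]
intercept, slope = least_squares_line(x, y)

# Predicted values and residuals (observed minus predicted)
predicted = [intercept + slope * a for a in x]
residuals = [b - p for b, p in zip(y, predicted)]
```

The residuals of a least-squares line with an intercept always sum to zero, which is a quick sanity check on any fit.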

Regression dialog box
  • From the Statistics menu select Regression->Linear
  • Highlight CANX as the Dependent: variable and click Add -->
  • Highlight COMPSCOR as the Independent(s): variable and click Add -->
  • Click OK

The output, as shown below, will be displayed on the screen with regression statistics including the slope, intercept, and squared multiple R. The Quick Graph appears in the Graph window and shows a plot of the regression residuals against the predicted values. Use the same procedure as earlier to suppress this feature.

Dep Var: CANX   N: 40   Multiple R: 0.657   Squared multiple R: 0.432
 
Adjusted squared multiple R: 0.417   Standard error of estimate: 2.544
 
Effect       Coefficient    Std Error     Std Coef Tolerance     t   P(2 Tail)
 
CONSTANT          23.194        1.315        0.0        .      17.634    0.000
COMPSCOR           0.133        0.025        0.657     1.000    5.375    0.000
 
                             Analysis of Variance
 
Source             Sum-of-Squares   DF  Mean-Square     F-Ratio       P
 
Regression               186.922     1      186.922      28.891       0.000
Residual                 245.853    38        6.470
------------------------------------------------------------------------------
  
Durbin-Watson D Statistic     1.103
First Order Autocorrelation   0.406
Regression output graph
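
The figures in the regression output are tied together by simple arithmetic. As a hedged illustration in Python (the variable names are ours; the numbers are copied from the output above), the summary statistics can be recovered from the analysis of variance table:

```python
# Values copied from the regression output above
ss_regression, df_regression = 186.922, 1
ss_residual, df_residual = 245.853, 38
t_compscor = 5.375

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual                     # about 6.470
r_squared = ss_regression / (ss_regression + ss_residual)   # about 0.432
f_ratio = ms_regression / ms_residual                       # about 28.89
std_error_of_estimate = ms_residual ** 0.5                  # about 2.544

# With a single predictor, the F-ratio equals the squared t of the slope
t_squared = t_compscor ** 2                                 # about 28.89
```

Small differences in the last digit come from the rounding of the printed values.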

T-test

The t-test is a data analysis procedure for testing the hypothesis that two population means are equal. SYSTAT can compute both independent (unrelated groups) and dependent (related groups) t-tests. For independent t-tests, your grouping variable should have exactly two values (e.g., male/female, pass/fail). The grouping variable may be either numeric or character. If a grouping variable has more than two categories, you can use the Data/Select cases... menu to select the two values you want to compare. If you plan to use all the cases in subsequent analyses, make sure you turn the selection off afterwards to restore the full data set.

In the following task we will perform an independent t-test. The dependent variables are mathscor and compscor, and the independent (grouping) variable is manx. (If you do not select a grouping variable, a paired t-test will be performed by default.)

T-test Plot
  • From the Statistics menu select t-test->Two-Groups
  • Highlight the dependent variables MATHSCOR and COMPSCOR, and click Add-->
  • Highlight MANX for the grouping variable and click Add-->
  • Click OK

The output from the run will be displayed on the screen as shown below. The Quick Graph in the Graph window is a combined display of three graphs (a boxplot, a normal curve, and a dit plot) for each group. Use the same procedure as earlier to suppress this feature.

Two-sample t test on MATHSCOR grouped by MANX
 
  Group                N         Mean           SD
            0         28       53.750       13.845
            1         12       37.000       15.214

  
     Separate Variance t =        3.277 df =   19.2    Prob =        0.004
     Difference in Means =       16.750   95.00% CI =      6.058 to     27.442
  
       Pooled Variance t =        3.406 df =   38      Prob =        0.002
     Difference in Means =       16.750   95.00% CI =      6.793 to     26.707
T-test Plot

Two-sample t test on COMPSCOR grouped by MANX
 
  Group                N         Mean           SD
            0         28       51.000       15.253
            1         12       49.083       19.486

  
     Separate Variance t =        0.303 df =   17.1    Prob =        0.765
     Difference in Means =        1.917   95.00% CI =    -11.416 to     15.249
  
       Pooled Variance t =        0.335 df =   38      Prob =        0.740
     Difference in Means =        1.917   95.00% CI =     -9.671 to     13.505
T-test Plot
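
Both t statistics in the output can be reproduced from the group sizes, means, and standard deviations alone. The Python sketch below is an illustration, not SYSTAT code, and the function name is hypothetical. It assumes the separate-variance degrees of freedom come from the Welch-Satterthwaite approximation, which matches the printed value of 19.2 for MATHSCOR.

```python
import math

def two_sample_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled and separate-variance t statistics from group summaries."""
    diff = mean1 - mean2
    v1, v2 = sd1 ** 2, sd2 ** 2

    # Pooled variance: assumes both groups share one population variance
    df_pooled = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df_pooled
    t_pooled = diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Separate variance with Welch-Satterthwaite degrees of freedom
    a, b = v1 / n1, v2 / n2
    t_separate = diff / math.sqrt(a + b)
    df_separate = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_pooled, df_pooled, t_separate, df_separate

# MATHSCOR grouped by MANX, values copied from the output above
result = two_sample_t(28, 53.750, 13.845, 12, 37.000, 15.214)
```

The pooled test is the classical choice when the two group variances look similar; the separate-variance test is safer when they do not.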

Analysis of Variance

The statistical technique used to test the null hypothesis that several means are equal is called analysis of variance. The name reflects how it works: it examines the variability in the sample and, based on that variability, determines whether there is reason to believe the population means are not equal. In analysis of variance, the observed variability in the sample is divided, or partitioned, into two parts: the variability of observations within a group (around the group mean) and the variability between the group means. Each part provides an estimate of the population variance. If the between-groups estimate is substantially larger than the within-groups estimate, you can reject the null hypothesis that the population means are equal. The statistical test of the null hypothesis that all of the groups have the same mean in the population is based on the ratio of the two estimates, called an F statistic. The observed significance level is obtained by comparing the calculated F value to the F distribution (the distribution of the F statistic when the null hypothesis is true).

A significant F value only tells you that the means are probably not all equal. It does not tell you which pairs of groups appear to have different means. To pinpoint exactly where the differences are, multiple comparisons may be performed.
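
The partition described above can be sketched in a few lines of Python. This is an illustration under simple assumptions (a one-way design and made-up scores), not SYSTAT's implementation, and the function name is hypothetical.

```python
def one_way_anova(groups):
    """Partition variability; return (SS_between, SS_within, F)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Variability of the group mean around the grand mean
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Variability of observations around their own group mean
        ss_within += sum((v - group_mean) ** 2 for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_ratio

# Hypothetical scores for three groups, not taken from clas1.txt
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
ss_between, ss_within, f_ratio = one_way_anova(groups)
```

The two sums of squares always add up to the total sum of squares around the grand mean, which is what "partitioning" the variability means.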

In the following exercise you will perform an ANOVA with CANX as the dependent variable and EXP as the factor variable. A Tukey HSD test is used for the pairwise mean comparisons that identify which means differ from the others.

ANOVA Dialog Box
  • From the Statistics menu select Analysis of Variance (ANOVA)/Estimate Model. (Selecting General Linear Model/Estimate Model from the Statistics menu runs the same procedure.)
  • Highlight CANX as Dependent(s): and click [Add-->]
  • Highlight EXP as Factor(s): and click [Add-->]
  • Select Post hoc Tests and choose Tukey from the drop-down list. (Selecting General Linear Model/Pairwise Comparisons from the Statistics menu offers the same procedure plus Dunnett's test, but this option becomes active only after you run your ANOVA.)
  • Click OK

The output as shown below will be displayed on the Main window. In the Graph window, Quick Graph includes a plot of residuals from each estimated cell mean versus the estimated cell mean.

Effects coding used for categorical variables in model.
 
Categorical values encountered during processing are:
EXP (3 levels)
          1,        2,        3
 
Dep Var: CANX   N: 40   Multiple R: 0.406   Squared multiple R: 0.165
 
 
                             Analysis of Variance
 
Source             Sum-of-Squares   df  Mean-Square     F-ratio       P
 
EXP                       71.466     2       35.733       3.659       0.035
 
Error                    361.309    37        9.765
 

------------------------------------------------------------------------------
ANOVA Output plot


------------------------------------------------------------------------------
 
*** WARNING ***
Case           39 is an outlier        (Studentized Residual =        3.385)
 
Durbin-Watson D Statistic     1.769
First Order Autocorrelation   0.043
COL/
ROW EXP
  1  1
  2  2
  3  3
Using least squares means.
Post Hoc test of CANX
------------------------------------------------------------------------------
 
Using model MSE of 9.765 with 37 df.
Matrix of pairwise mean differences:
 
                         1           2           3
              1          0.000
              2          2.800       0.000
              3          2.709      -0.091       0.000
 
Tukey HSD Multiple Comparisons.
Matrix of pairwise comparison probabilities:
 
                         1           2           3
              1          1.000
              2          0.054       1.000
              3          0.087       0.997       1.000
------------------------------------------------------------------------------

The output shows a significant difference among the groups with different levels of computer experience at the .05 level (F = 3.659 with 2 and 37 df, P = 0.035).

The output for pairwise comparisons includes a table of mean differences and a table of probabilities. To identify significant differences, examine each pair and its probability. The difference between group 1 (exp=1) and group 2 (exp=2) is only marginally significant (P = 0.054). No pair differs significantly at the 0.05 level: the probabilities for groups 1 and 3 and for groups 2 and 3 are 0.087 and 0.997.

Using SYSTAT's Graph Menu

SYSTAT provides a wide selection of graphics for every stage of your project: exploration, research, and presentation. The graphics capabilities of SYSTAT include:

  • histograms with curve fitting
  • bar graphs, box plots, stem-and-leaf diagrams, pie charts
  • 3-D rotation, maps with geographic projections
  • mathematical function plots, log and power scales
  • confidence intervals, ellipses, and centroids
  • contour plots, control charts
  • case coding of labels and symbols
  • linear, quadratic, step, spline, polynomial, LOWESS, exponential, and DWLS smoothing in two and three dimensions
  • rectangular, spherical, polar, cylindrical, and triangular coordinates, perspective depth and projections.

Plotting Two Variables with SYSTAT

Looking at a plot is one of the best ways to examine relationships and patterns. For example, a scatterplot allows the visual representation of two separate distributions on a single diagram.

In the following task you will plot the variable CANX (dependent variable) against COMPSCOR (independent variable). We will also fit a straight line to the points on the scatterplot using the least-squares principle.

Scatterplot Dialog Box
  • From the Graph menu (Main window) select Plots and then choose Scatterplot
  • Highlight COMPSCOR as X-variable: and click Add -->
  • Highlight CANX as Y-variable(s): and click Add -->
  • Choose Linear as the smoother method from Options/Smoother and click Continue
  • Click OK

The plot, as shown below, will be displayed on the screen.

Scatterplot output

To print the graphics output, select File/Print... and respond to the queries that appear in the subsequent dialog boxes. You may save your graphics to a file using File/Save as... To remove the Graph window, select File/Close Window.

A detailed discussion of all of SYSTAT's graphics capabilities is beyond the scope of this document. Refer to SYSTAT's Graphics document to learn more about them.

