# 1. Introduction

1.1 Factor Analysis and Latent Variables

Factor analysis is a common statistical method used to find a small set of unobserved variables (also called latent variables, or factors) which can account for the covariance among a larger set of observed variables (also called manifest variables). For example, scores on multiple tests may be indicators of intelligence (Spearman, 1904); political liberties and popular sovereignty may measure the quality of a country’s democracy (Bollen, 1980); or issue emphases in election manifestos may signify a political party’s underlying ideology (Gabel & Huber, 2000). Factor analysis is also used to assess the reliability and validity of measurement scales (Carmines & Zeller, 1979).

1.2 Exploratory versus Confirmatory Factor Analysis

It is possible to distinguish between two categories of factor analysis depending on whether the investigator wishes to explore patterns in the data or to test explicitly stated hypotheses. Exploratory factor analysis (EFA), corresponding to the former task, is available in general purpose statistical software such as SPSS, SAS, and Stata. When carrying out an EFA no substantive constraints are imposed on the data. Instead, it is assumed that each common factor affects every observed variable and that the common factors are either all correlated or uncorrelated. Confirmatory factor analysis (CFA), on the other hand, is theory-driven. With CFA it is possible to place substantively meaningful constraints on the factor model, such as setting the effect of one latent variable to equal zero on a subset of the observed variables. The advantage of CFA is that it allows for testing hypotheses about a particular factor structure.

CFA is a special case of the structural equation model (SEM), also known as the covariance structure (McDonald, 1978) or the linear structural relationship (LISREL) model (Jöreskog & Sörbom, 2004). SEM consists of two components: a measurement model linking a set of observed variables to a usually smaller set of latent variables, and a structural model linking the latent variables through a series of recursive and non-recursive relationships. CFA corresponds to the measurement model of SEM and as such is estimated using SEM software. This document considers estimating confirmatory factor models using Amos (Arbuckle, 2005), LISREL (Jöreskog & Sörbom, 2004), and Mplus (Muthén & Muthén, 2006). CFA and SEM can also be estimated using the CALIS procedure in SAS. All four programs are supported by the Stat/Math Center at Indiana University. EQS, another popular SEM program, is currently not supported.

1.3 Model Specification and Identification

It is common to display confirmatory factor models as path diagrams in which squares represent observed variables and circles represent the latent concepts. Additionally, single-headed arrows are used to imply a direction of assumed causal influence, and double-headed arrows are used to represent covariance between two latent variables. Figure 1 below provides an example.

Figure 1: Path Diagram of a Confirmatory Factor Model

In factor analysis the researcher almost always assumes that the latent variables “cause” the observed variables, as shown by the single-headed arrows pointing away from the circles and towards the manifest variables. The two ξ (xi) latent variables represent common factors, with paths pointing to more than one observed variable. The circles labeled δ (delta) represent unique factors because they affect only a single observed variable. The δi incorporate all the variance in each xi not captured by the common factors, such as measurement error. In this model the two ξi are expected to covary, as represented by the two-headed arrow. Additionally, error in the measurement of x3 is expected to correlate to some extent with measurement error for x6. This may occur, for example, with panel data in which ξ1 and ξ2 represent the same concept measured at different points in time; if there is measurement error at t1 it is likely that there will be measurement error at t2.

To facilitate formal presentation it is conventional to treat the observed and latent variables as deviations from their means, thus eliminating the need to estimate an intercept. The confirmatory factor model can be summarized by the equation

x = Λξ + δ    (1)

in which x is the vector of observed variables, Λ (lambda) is the matrix of loadings connecting the ξi to the xi, ξ is the vector of common factors, and δ is the vector of unique factors. It is assumed that the error terms have a mean of zero, E(δ) = 0, and that the common and unique factors are uncorrelated, E(ξδ′) = 0. Equation 1 can be rewritten for Figure 1 as:

x1 = λ11ξ1 + δ1
x2 = λ21ξ1 + δ2
x3 = λ31ξ1 + δ3
x4 = λ42ξ2 + δ4
x5 = λ52ξ2 + δ5
x6 = λ62ξ2 + δ6

Here the similarities with regression analysis are evident, with each xi a linear function of one or more common factors plus an error term (there is no intercept when the variables are mean centered). The primary difference between these factor equations and regression analysis is that the ξi are unobserved. Consequently estimation proceeds in a manner distinct from the conventional approach of regressing each xi on the ξi.
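The data-generating process behind these factor equations can be sketched numerically. The following Python/NumPy snippet simulates mean-centered observed variables from a two-factor model with six indicators; all loading, factor-covariance, and unique-variance values here are hypothetical illustrations, not estimates from the document:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # sample size

# Hypothetical loading matrix: x1-x3 load on xi1, x4-x6 on xi2
Lambda = np.array([
    [0.8, 0.0],
    [0.7, 0.0],
    [0.6, 0.0],
    [0.0, 0.9],
    [0.0, 0.7],
    [0.0, 0.5],
])

# Common factors xi with unit variances and covariance 0.3 (Phi)
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])
xi = rng.multivariate_normal(np.zeros(2), Phi, size=n)

# Unique factors delta (diagonal Theta for simplicity; no
# correlated errors in this sketch)
theta_diag = np.array([0.36, 0.51, 0.64, 0.19, 0.51, 0.75])
delta = rng.normal(0.0, np.sqrt(theta_diag), size=(n, 6))

# x = Lambda xi + delta, with all variables mean centered
x = xi @ Lambda.T + delta
```

Because the ξi are never observed, the simulated x matrix is all an analyst would see; estimation must work from its covariance structure rather than from regressions of x on ξ.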

1.4 Estimation

When the x variables are measured as deviations from their means it is easy to show that the covariance matrix of x, represented by Σ (sigma), can be decomposed as follows:

Σ = ΛΦΛ′ + Θ    (2)

where Φ (phi) represents the covariance matrix of the ξ factors and Θ (theta) represents the covariance matrix of the unique factors δ (Bollen, 1989, pg. 236). Estimation proceeds by finding estimates of Λ, Φ, and Θ whose predicted Σ is as close to the sample covariance matrix of x as possible. Several different fitting functions exist for determining the closeness of the implied covariance matrix to the sample covariance matrix, of which maximum likelihood is the most common. This document includes examples using maximum likelihood, including full information maximum likelihood (FIML) for situations in which there are missing values in the raw data file. It will also describe a weighted least squares (WLS) approach suitable for situations in which the x variables are categorical.
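The model-implied covariance matrix ΛΦΛ′ + Θ is a direct matrix computation. As a minimal sketch, again with hypothetical parameter values for a two-factor, six-indicator model:

```python
import numpy as np

# Hypothetical parameter values (not estimates from the document)
Lambda = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.6, 0.0],
                   [0.0, 0.9],
                   [0.0, 0.7],
                   [0.0, 0.5]])
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])          # covariance of the common factors
Theta = np.diag([0.36, 0.51, 0.64,
                 0.19, 0.51, 0.75])   # covariance of the unique factors

# Model-implied covariance matrix: Sigma = Lambda Phi Lambda' + Theta
Sigma = Lambda @ Phi @ Lambda.T + Theta
```

A fitting function then measures the discrepancy between this implied Σ and the sample covariance matrix; the parameter values minimizing that discrepancy are the estimates reported by SEM software.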

One essential step in CFA is determining whether the specified model is identified. An unidentified model is one for which it is impossible to derive unique parameter estimates. As a simple example, the equation 10 = 2x + 3y is not identified because an infinite number of values for x and y could make the equation true. In CFA, a model is identified if all of the unknown parameters can be rewritten in terms of the variances and covariances of the x variables. A full discussion of the topic in the context of CFA is available in Bollen (1989, chapter 7), including some necessary and sufficient conditions for identification. When the sufficient rules are not met, however, identification can be definitively established only by carrying out the algebraic manipulations needed to rewrite the unknown parameters in terms of the observed variances and covariances.

Without introducing some constraints, no confirmatory factor model is identified. The problem lies in the fact that the latent variables are unobserved and hence their scales are unknown. To identify the model it therefore becomes necessary to set the metric of the latent variables in some manner. The two most common constraints are to set either the variance of the latent variable or one of its factor loadings to one.
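This scale indeterminacy can be demonstrated numerically: rescaling a latent variable while compensating in its loadings and variance leaves the implied covariance matrix unchanged, so the data cannot distinguish the two parameter sets. A minimal sketch with hypothetical values:

```python
import numpy as np

# One-factor model with three indicators (hypothetical values)
lam = np.array([[0.8], [0.7], [0.6]])   # loadings
phi = np.array([[1.0]])                  # factor variance
theta = np.diag([0.36, 0.51, 0.64])     # unique variances

Sigma1 = lam @ phi @ lam.T + theta

# Rescale the latent variable by c: divide the loadings by c and
# multiply the factor variance by c**2
c = 2.0
Sigma2 = (lam / c) @ (phi * c**2) @ (lam / c).T + theta

# Sigma1 and Sigma2 are identical, so without a scale constraint
# (e.g., phi = 1 or lam[0] = 1) the parameters are not unique
```

Fixing either the factor variance or one loading to one removes the free rescaling constant c and restores uniqueness.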

1.5 Goodness of Fit

After estimating a CFA, the next step is to assess how well the model matches the observed data. A large class of omnibus tests exists for determining overall model fit. It is conventional to report at least one of these as well as the individual regression weights (factor loadings) and some indication of their significance. The example used in this document reports a χ2 statistic, available in all packages, for which the null hypothesis is that the implied covariance matrix is equivalent to the observed covariance matrix. Failure to reject the null is therefore a sign of a good model fit. However, the χ2 test is widely recognized to be problematic (Jöreskog, 1969). It is sensitive to sample size, such that the null is too easily rejected given large samples. The χ2 test may also be invalid when distributional assumptions are violated, leading to the rejection of good models or the retention of bad ones.

Because of these drawbacks, many alternative fit statistics have been developed, though each has its own advantages and disadvantages. Another commonly reported statistic is the Root Mean Square Error of Approximation (RMSEA), a measure of fit introduced by Steiger and Lind (1980). The Amos 6 User’s Guide suggests that “a value of the RMSEA of about 0.05 or less would indicate a close fit of the model in relation to the degrees of freedom,” although “this figure is based on subjective judgement” and “cannot be regarded as infallible” (Arbuckle, 2005, pg. 496). Additional statistics, such as the Akaike Information Criterion (Akaike, 1987) and Schwarz’s Bayesian Information Criterion (Schwarz, 1978), can be used to compare two differently specified models.
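As a sketch of how two of these statistics are computed, the snippet below implements the standard maximum likelihood discrepancy function, the associated χ2, and the RMSEA formula; the sample covariance matrix used at the end is hypothetical:

```python
import numpy as np

def ml_fit_stats(S, Sigma_hat, n, df):
    """Chi-square and RMSEA from the maximum likelihood discrepancy
    between the sample (S) and implied (Sigma_hat) covariance
    matrices; n is the sample size, df the model's degrees of
    freedom."""
    p = S.shape[0]
    # F_ML = ln|Sigma_hat| + tr(S Sigma_hat^-1) - ln|S| - p
    f_ml = (np.log(np.linalg.det(Sigma_hat))
            + np.trace(S @ np.linalg.inv(Sigma_hat))
            - np.log(np.linalg.det(S)) - p)
    chi2 = (n - 1) * f_ml
    # RMSEA = sqrt(max(chi2 - df, 0) / (df * (n - 1)))
    rmsea = np.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))
    return chi2, rmsea

# If the implied matrix reproduces the sample matrix exactly,
# the discrepancy, chi-square, and RMSEA are all zero
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
chi2, rmsea = ml_fit_stats(S, S.copy(), n=500, df=1)
```

The sample-size sensitivity of the χ2 test is visible in the (n − 1) multiplier: the same discrepancy F_ML yields a larger χ2, and hence easier rejection, as n grows, while the RMSEA divides that discrepancy back out per degree of freedom.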

Appendix C of the Amos 6.0 User’s Guide provides summaries of many different fit measures (Arbuckle, 2005). For a thorough discussion of different tests, see Bollen and Long’s (1993) edited volume. Hu and Bentler (1999) provide rules of thumb for deciding which statistics to report and choosing cut-off values for declaring significance.

This document provides step-by-step examples for conducting a CFA with three commonly used packages: Amos, LISREL, and Mplus. The next section provides an example of a one- and two-factor CFA with six observed indicators. Section 3 extends Section 2 to cover cases involving missing data. Section 4 discusses the commonly encountered situation in which the observed variables are categorical rather than continuous. Section 5 provides a brief summary.