# Multivariate analysis of variance (MANOVA) with a specified formula
# by Aaron Albin
# Script began: 3/7/2012

###########################
# WHEN TO USE THIS SCRIPT #
###########################

# Use this script if all of the following are true:
# * You have multiple (i.e. two or more) sets of measurements that you wish to compare simultaneously across two different sets of non-numeric 'conditions'.
# * Your data are all stored in separate columns - two with the identifier labels and several more with the measurements.
# * You have a specific idea about the relationship between your variables that you can codify in the form of a formula.
# Example: You are investigating the impact of gender and profession (professor, construction worker, CEO, etc.) on average hours of sleep per night and body weight.
# Note: If you don't have approximately the same number of data points inside each condition (e.g. because of several NA data points), the results must be interpreted with care, since the statistical procedures underlying R's ANOVA function 'aov()' were designed for 'balanced designs'.

###############
# SELECT DATA #
###############

# Run this to have R tell you the names of each of the columns that it ended up with after importing the data frame:
#=====[1]=====
colnames(Dataframe)
#=============

# Below, list the names of the various columns that contain numbers for the analysis, separated by commas.
# You can have any number of them (as long as they are all separated by commas).
# Each of the names must be surrounded by double-quotes (").
# Here we put these inside the 'c()' ('combine') function, which stores multiple things inside one object in R.
#=====[2]=====
Names_NumberColumns = c("NumberColumn1","NumberColumn2")
#Names_NumberColumns = c("VOT1","ANS_B")
#=============

# Now run this to extract out the specific columns of data you selected.
#=====[3]=====
y = as.matrix( Dataframe[,c(Names_NumberColumns)] )
#=============

###################
# SPECIFY FORMULA #
###################

# In this part of the script, you'll be writing a formula expressing how you think your variables are related to one another. It's important that you understand how to specify models generally.
# Below, seven examples are given of relationships between a predicted ('response'/'dependent') variable 'y' and one or more predictor ('independent') variables 'a', 'b', and 'c'.
# The specific labels you'll use will come from the column names you obtained in code line [1].
# The syntax in these seven examples can be generalized to cases with 4 or more terms in the model.
# NOTE: The usage of the '+', '-', '*', '^', and ':' symbols in the specification of formulas here is special, diverging from the typical meaning of these symbols in R generally.

# [EXAMPLE 1] Formula = y ~ a
# This describes a simple linear relationship between y and a.

# [EXAMPLE 2] Formula = y ~ a + b
# This says that y is a combination of a and b, but there is no interaction term. The '+' is how you combine multiple terms one-by-one. This is how 'randomized block designs' are coded.

# [EXAMPLE 3] Formula = y ~ a + a:b
# This says that y is a combination of a and an interaction of a and b, but not b alone. This illustrates how ':' can be used to explicitly define an interaction term.

# [EXAMPLE 4] Formula = y ~ a + b + a:b
# Abbreviation: y ~ a*b
# This is a more typical example where y is described by a, b, and an interaction between a and b. Note the convenient abbreviation to describe this.

# [EXAMPLE 5] Formula = y ~ a + b + c
# This says that y is a linear combination of a, b, and c, but no interactions are included at all. Note this is just a simple expansion of Example 2.
# [EXAMPLE 6] Formula = y ~ a + b + c + a:b + b:c + a:c
# Abbreviation: y ~ (a+b+c)^2
# This says that y is a linear combination of a, b, and c, as well as their 'second-order interactions', but not their 'third-order interaction' a:b:c. Note the convenient abbreviation.

# [EXAMPLE 7] Formula = y ~ a + b + c + a:b + b:c + a:c + a:b:c
# Abbreviation: y ~ a*b*c
# This is a more typical example where you include a, b, c, and all combinatoric interactions in the model, including a:b:c. Note the very convenient abbreviation, as well as the parallelism with Example 4.

# Finally, any formula (including the above seven examples) can also be modified in either of the following ways:
# (1) An individual 'term' (i.e. component of a formula) can be removed by putting a '-' symbol in front of it. Thus, for example, 'y ~ a*b - a' is the same as 'y ~ b + a:b'.
# (2) All formulas have an implied intercept term unless it is removed by adding either '-1' or '+0' to the formula. Thus, for example, rather than 'y ~ a*b', you would write 'y ~ a*b - 1' or 'y ~ a*b + 0'.

# If any of your label columns contain numbers (1,2,3,etc.), like subject numbers, then surround just that specific variable name with as.factor().
# For example, if LabelColumn1 were numbers like that, then say: Formula = y ~ as.factor(LabelColumn1)*LabelColumn2

# Here, please specify your formula.
# Leave the left side of the '~' unchanged - that is the 'y' variable we created in code line [3].
#=====[4]=====
Formula = y ~ LabelColumn1*LabelColumn2
# Formula = y ~ SPKR * SYLLABIF
#=============

################
# RUN ANALYSES #
################

# Now, finally, run this code to fit a 'multivariate analysis of variance' model to the data using this formula.
#=====[5]=====
FittedModel = aov(formula=Formula, data=Dataframe, qr=TRUE, contrasts=NULL)
#=============

# The following code line displays each of these statistics for every term in the model:
# * "Df" = degrees of freedom
# * "Pillai" = the Pillai–Bartlett statistic
# * "approx F" = approximate F statistic from the F distribution
# * "num Df" and "den Df" = the degrees of freedom used in determining the F value
# * "Pr(>F)" = the p-value
# For advanced users: As mentioned above, the code here calculates the Pillai–Bartlett statistic. See [6b], [6c], and [6d] below for three alternatives (the Wilks' Lambda, Hotelling-Lawley Trace, and Roy's Greatest Root statistics, respectively).
#=====[6a]=====
summary.manova(object=FittedModel, test="Pillai", intercept=FALSE, tol = 1e-7, keep.zero.df=TRUE)
#=============

#######################
# INSPECT THE RESULTS #
#######################

# All of the above code utilizes the Pillai–Bartlett Trace statistic, which is recommended by Hand, D. J. and Taylor, C. C. (1987). Multivariate Analysis of Variance and Repeated Measures. Chapman and Hall.
# Here are three alternate ways of running the MANOVA based on different approximation methods. Run any one of these instead of code line [6a] above.

# The Wilks' Lambda statistic ('popular in the literature')
#=====[6b]=====
summary.manova(object=FittedModel, test="Wilks", intercept=FALSE, tol = 1e-7, keep.zero.df=TRUE)
#=============

# The Hotelling-Lawley Trace statistic
#=====[6c]=====
summary.manova(object=FittedModel, test="Hotelling-Lawley", intercept=FALSE, tol = 1e-7, keep.zero.df=TRUE)
#=============

# (Upper bound of) Roy's Greatest Root statistic
#=====[6d]=====
summary.manova(object=FittedModel, test="Roy", intercept=FALSE, tol = 1e-7, keep.zero.df=TRUE)
#=============
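# As a self-contained illustration of code lines [1] through [6a], here is a minimal
# sketch using a small invented data frame. All column names and values below
# ('Gender', 'Profession', 'Sleep', 'Weight', and the random numbers) are hypothetical
# stand-ins, not taken from any real dataset.

```r
# Hypothetical toy data standing in for the user's imported 'Dataframe'.
set.seed(1)
Dataframe = data.frame(
  Gender     = factor(rep(c("F", "M"), each = 10)),
  Profession = factor(rep(c("professor", "CEO"), times = 10)),
  Sleep      = rnorm(20, mean = 7.0, sd = 0.5),   # hours of sleep per night
  Weight     = rnorm(20, mean = 72,  sd = 8.0)    # body weight in kg
)

colnames(Dataframe)                               # as in code line [1]

# Code lines [2]-[3]: bind the two measurement columns into one matrix.
Names_NumberColumns = c("Sleep", "Weight")
y = as.matrix( Dataframe[,c(Names_NumberColumns)] )

# Code line [4]: both predictors plus their interaction (i.e. Example 4 above).
Formula = y ~ Gender * Profession

# Code line [5]: fit the model.
FittedModel = aov(formula=Formula, data=Dataframe, qr=TRUE, contrasts=NULL)

# Code line [6a]: print the Pillai-Bartlett test table (Df, Pillai, approx F, etc.).
summary.manova(object=FittedModel, test="Pillai")
```

# Note that the toy design is balanced (five rows in each Gender-by-Profession cell),
# matching the 'balanced designs' note at the top of this script; with the random
# values used here, none of the effects need come out significant.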