Descriptive Data Analysis
Frequency Tables
To generate frequency tables, use the table() function. For the frequencies of a single variable, pass the variable name as the only argument. To create a contingency (cross-tabulation) table, pass two variables separated by a comma: the first variable is placed in the rows and the second in the columns. The table() function produces tables of frequencies only. The functions margin.table() and prop.table() compute marginal frequencies and proportions from an existing contingency table. In both functions, the first argument is the name of a table object you have saved, and the second argument can be 1, 2, or omitted: 1 computes by row, 2 computes by column, and omitting it uses the total of the whole table. The function ftable() displays the results more legibly. Try the following examples.
Handy Tip: You can insert comments into your R syntax by typing a # (pound) sign. Any text after the # sign on the same line is not processed.
> attach(mydata)
> table1 <- table(Colony, Open.Market.Cat)
> ftable(table1)
> table2 <- margin.table(table1, 1)  # Frequencies summed over Open.Market.Cat
> table3 <- margin.table(table1, 2)  # Frequencies summed over Colony
> table4 <- prop.table(table1)
> ftable(table4)
       Open.Market.Cat          1          2          3
Colony
0                      0.22666667 0.01333333 0.33333333
1                      0.34666667 0.01333333 0.06666667
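To see how the second argument changes the result, the following minimal sketch builds a small table and applies each margin option. The data frame toy and its values are made up for illustration and are not taken from mydata; with() is used here only so the sketch runs without attach().

```r
# Hypothetical toy data, made up for illustration; not taken from mydata
toy <- data.frame(
  Colony          = c(0, 0, 0, 1, 1, 1, 1, 1),
  Open.Market.Cat = c(1, 3, 3, 1, 1, 2, 3, 1)
)

# Two-way frequency table: Colony in the rows, Open.Market.Cat in the columns
tab <- with(toy, table(Colony, Open.Market.Cat))

row_totals <- margin.table(tab, 1)  # one total per row (per Colony value)
col_totals <- margin.table(tab, 2)  # one total per column (per Open.Market.Cat value)
grand      <- margin.table(tab)     # no margin given: a single grand total

row_props  <- prop.table(tab, 1)    # proportions within each row; each row sums to 1
col_props  <- prop.table(tab, 2)    # proportions within each column; each column sums to 1
cell_props <- prop.table(tab)       # proportions of the grand total; all cells sum to 1

ftable(round(row_props, 2))
```

Row proportions answer a question such as "among former colonies, what share falls in each category?", whereas proportions of the grand total, as in table4, describe the whole sample at once.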
For two-way tables, you can use the chisq.test() function to test the independence of the row and column variables.
> chisq.test(table1)

        Pearson's Chi-squared test

data:  table1
X-squared = 11.7733, df = 2, p-value = 0.002776
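To make the reported numbers less of a black box, the sketch below reproduces the Pearson statistic by hand and compares it with what chisq.test() returns. The counts are hypothetical, chosen only for illustration; a 2-by-3 table is used because, for 2-by-2 tables, chisq.test() applies a continuity correction by default and the hand calculation would no longer match.

```r
# Hypothetical counts, made up for illustration (2 rows x 3 columns)
counts <- matrix(c(20, 15, 25,
                   30, 10, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Group = c("0", "1"),
                                 Category = c("1", "2", "3")))

result <- chisq.test(counts)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# p-value: upper-tail probability of the chi-squared distribution
p <- pchisq(x2, df = df, lower.tail = FALSE)

# The hand calculation and the built-in test should agree
c(by_hand = x2, from_test = unname(result$statistic))
c(by_hand = p,  from_test = result$p.value)
```

The expected counts are also available as result$expected, which is a convenient place to check whether any cell is too sparse for the chi-squared approximation to be trusted.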
Descriptive statistics
In addition to the summary() function discussed above, descriptive statistics for individual variables may be obtained with the following functions: mean(), median(), max(), min(), range(), var(), sd(), quantile(), fivenum(), length(), which.max(), which.min(). For example, to obtain the range and standard deviation of the variable Gov.Spend, type:
> range(Gov.Spend)
[1] 0.0057 0.3280
> sd(Gov.Spend)
[1] 0.06049456
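The remaining functions in the list work the same way. The sketch below runs each of them on a short vector so that it is self-contained; the vector x holds made-up values and is not taken from mydata.

```r
# Hypothetical values, made up for illustration; not taken from mydata
x <- c(0.05, 0.12, 0.08, 0.33, 0.21, 0.10, 0.15)

mean(x)        # arithmetic mean
median(x)      # middle value of the sorted data
var(x)         # sample variance (divides by n - 1)
sd(x)          # sample standard deviation, the square root of var(x)
range(x)       # smallest and largest value, returned as a length-2 vector
quantile(x)    # minimum, quartiles, and maximum
fivenum(x)     # Tukey's five-number summary
length(x)      # number of values
which.max(x)   # position (index) of the largest value, not the value itself
which.min(x)   # position (index) of the smallest value

# Most of these functions return NA if the data contain missing values;
# na.rm = TRUE drops the missing values before computing
mean(c(x, NA), na.rm = TRUE)
```

Note that which.max() and which.min() return positions, so x[which.max(x)] gives the same value as max(x).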
For bivariate descriptive statistics, use the cor() and cov() functions for correlation and covariance, respectively. To find the correlation between the variables Growth and Invest, type:
> cor(Growth, Invest)
[1] 0.4751891
To find the correlation and covariance matrices of mydata, select all variables except Country (because it is a character string) by asking R to include only the second through eighth columns:
> cor(mydata[,2:8])
> cov(mydata[,2:8])
(Output is omitted.)
We have used similar bracket notation before when selecting a specific observation in the data matrix, i.e., [2,3] represents row 2 and column 3. In this instance, by not specifying a row index, we are telling R to use all the rows. By specifying 2:8 in the column index, we are telling R to use columns 2, 3, 4, ..., 7, 8 of the dataset for this particular analysis.
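The bracket notation can be tried out on a small example. In the sketch below, the data frame toy is made up for illustration and is not taken from mydata. It also shows one possible alternative to counting column positions: selecting the numeric columns with sapply() and is.numeric().

```r
# Hypothetical toy data, made up for illustration; not taken from mydata
toy <- data.frame(
  Country = c("A", "B", "C", "D", "E"),
  Growth  = c(1.2, 2.5, 0.8, 3.1, 1.9),
  Invest  = c(0.15, 0.22, 0.10, 0.30, 0.18),
  stringsAsFactors = FALSE
)

toy[2, 3]      # a single cell: row 2, column 3
toy[, 2:3]     # row index left empty: all rows, columns 2 through 3

# Select the columns by position ...
cor_by_position <- cor(toy[, 2:3])

# ... or pick out the numeric columns without counting positions
numeric_cols <- sapply(toy, is.numeric)   # FALSE for Country, TRUE for the rest
cor_by_type  <- cor(toy[, numeric_cols])

cov(toy[, numeric_cols])                  # covariance matrix of the same columns
```

Selecting by type is less fragile if columns are later added or reordered, although selecting by position, as in the text, is shorter when the layout of the data is fixed.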
Visualizing data
To create a scatterplot of two variables, say, Invest and Institutions, use the following syntax:
> plot(Institutions, Invest, xlab="Quality of Institutions", ylab="Share of Investment", main="Investment by Quality of Institutions", col="blue")
> abline(lm(Invest~Institutions), col="red")
The code produces the scatter plot below. The first argument gives the variable on the horizontal axis and the second argument gives the variable on the vertical axis. The arguments main, xlab, and ylab specify the title of the plot and the labels of the x- and y-axes, respectively. The argument col changes the color of the points from the default, black. The abline() function adds a line to the scatterplot; here it takes the intercept and slope from the regression fitted by the lm() function, which is explained in Section 5.2 below.
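To show what abline() receives from lm(), the following sketch fits the line separately and inspects its coefficients before drawing. The two vectors reuse the variable names from the text but contain made-up values, and the object name fit is an assumption introduced for illustration.

```r
# Hypothetical values, made up for illustration; not taken from mydata
Institutions <- c(2.1, 3.4, 4.0, 5.5, 6.2, 7.8, 8.3)
Invest       <- c(0.08, 0.12, 0.11, 0.18, 0.20, 0.24, 0.27)

# Least-squares fit of Invest = a + b * Institutions
fit <- lm(Invest ~ Institutions)
coef(fit)   # named vector holding "(Intercept)" (a) and "Institutions" (b)

plot(Institutions, Invest, col = "blue")

# abline() reads the intercept and slope from the fitted model
abline(fit, col = "red")

# Equivalent call that passes the intercept (a) and slope (b) explicitly;
# the dashed line (lty = 2) is drawn on top of the solid one
abline(a = coef(fit)[1], b = coef(fit)[2], col = "red", lty = 2)
```

Note the order of the variables: plot() takes the horizontal variable first, while the formula in lm() puts the vertical (response) variable on the left of the ~.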
You may also wish to visualize your data using a histogram. The following function produces a histogram of the Institutions variable:
> hist(Institutions, col="gray", xlim=c(0,10), ylim=c(0,18), xlab="Quality of Institutions", ylab="Number of Countries", main="Quality of Institutions")
The argument col fills the bars with gray; xlim and ylim set the ranges on the x- and y-axes by specifying vectors (made with the c() function) representing the minimum and maximum values for each range. The xlab and ylab arguments label the x- and y-axes, respectively; main sets the title of the plot.
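One way to choose sensible values for ylim is to compute the bins first without drawing them. The sketch below does this with made-up scores; the vector scores, the object h, and the choice of bin edges are assumptions for illustration and are not taken from mydata.

```r
# Hypothetical scores, made up for illustration; not taken from mydata
scores <- c(1.5, 2.2, 3.8, 4.1, 4.6, 5.0, 5.3, 6.7, 7.2, 8.9)

# Compute the bins without drawing, to see how tall the bars will be
h <- hist(scores, breaks = seq(0, 10, by = 2), plot = FALSE)
h$breaks   # bin edges
h$counts   # number of values falling in each bin

# Use the largest count to choose a y-axis range that fits every bar
hist(scores, breaks = seq(0, 10, by = 2), col = "gray",
     xlim = c(0, 10), ylim = c(0, max(h$counts) + 1),
     xlab = "Score", ylab = "Count", main = "Toy histogram")
```

If ylim is set lower than the tallest bar, that bar is cut off at the top of the plot, so checking the counts first avoids a misleading picture.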



