Preparing Data for Analysis

Data types

There are several types of data structures in R: vectors, matrices, arrays, factors, time series, data frames and lists. This tutorial will focus on data frames because they are the most commonly used structures for statistical analysis. Data frames are two-dimensional objects that include variable names and information about the variables (for example, whether they are numerical or categorical). Data frames can contain missing values coded as NA; however, many statistical analyses will require you to remove or otherwise handle observations with missing values. Data frames can be created with the data.frame() function applied to matrices or lists, or by importing external data files.
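
As a minimal sketch of the data.frame() route described above, the following builds a small data frame by hand. The object name toy and its values are hypothetical and are not part of the tutorial's dataset.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C"),
  growth  = c(1.5, NA, -0.7),   # NA marks a missing value
  colony  = c(1, 0, 1)
)

str(toy)            # one line per variable, showing its type
nrow(toy)           # number of observations (rows)
ncol(toy)           # number of variables (columns)
is.na(toy$growth)   # TRUE where growth is missing
```

Each named argument to data.frame() becomes one variable, and all variables must have the same number of observations.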

Functions

Many of the statistical procedures and commands you use in R will be in the form of a function. In a general sense, a function is a procedure that takes input arguments and produces output. For instance, the c() function combines the arguments that are passed to it. So, in order to make a vector of 5 data points, I can type the following command:

> x=c(1,4,7,5,3)

and press enter. I have created a new vector named x with the five elements 1, 4, 7, 5, and 3. To view x and its values, type x at the prompt and press enter. Now, I can use other functions to calculate mathematical quantities from that vector. For instance, to compute the sum and the mean of the five values, I can pass the vector x as the argument to the sum() and mean() functions:

> sum(x)

[1] 20

> mean(x)

[1] 4

These examples show that functions may need one argument (such as supplying the vector name x to the sum or mean functions) or multiple arguments (such as the five values to put into the vector x). Throughout the rest of this tutorial you will see many functions that "take" various arguments including data frames and option parameters. You have also seen that the output from a function can either be displayed in the command window (such as we did with the sum and mean functions) or it can be assigned to another data element (such as when we assigned the 5 values to the vector named x). The vector x will continue to have those 5 values associated with it throughout the rest of the R session, unless we re-assign new data to it or clear it.
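
To illustrate the "option parameters" mentioned above, here is a small sketch using the na.rm argument that sum() and mean() both accept. The vector y is a hypothetical example and is not part of the tutorial's data.

```r
y <- c(1, 4, NA, 5)              # hypothetical vector with one missing value

mean(y)                          # NA, because one element is missing
mean(y, na.rm = TRUE)            # mean of the three non-missing values

total <- sum(y, na.rm = TRUE)    # assign the result instead of displaying it
total                            # display the stored result
```

Option arguments are usually supplied by name (na.rm = TRUE), which keeps the call readable when a function accepts many of them.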

To learn more about any function in R, pass its name as the argument to the help() function. For instance, to learn more about the c() function, typing:

> help(c)

will bring up a webpage with information about that function and how to use it.

For all examples in this tutorial, we will use a dataset from Jeffrey D. Sachs and Andrew M. Warner, “Sources of Slow Growth in African Economies,” which is available through the Center for International Development at Harvard University. A truncated version of this dataset can be downloaded here: AfricaData.txt. The function read.table() imports an external data file into R and creates a data frame for statistical analysis. Thus, in order to read the dataset into R, we use the following command:

> mydata <- read.table("…/AfricaData.txt", header=TRUE, sep=",")

The above command reads a text file from the user's temporary folder (you may need to specify a different directory) into an R data frame called mydata. Note the use of forward slashes in file names in R instead of backslashes. The argument header=TRUE specifies that the first line of the text file contains the names of the variables, and sep="," indicates that the values are separated by a comma. Typing:

> mydata

and pressing enter will allow you to look at the data you’ve just read in:

country growth govspend invest colony openmarket accesssea

1           ALGERIA  1.478   0.0632  27.14      1  0.0000000         0

2         ARGENTINA -0.688   0.0515  16.87      0  0.0000000         0

3         AUSTRALIA  1.152   0.0263  27.44      0  1.0000000         0

4           AUSTRIA  2.161   0.0753  25.89      0  1.0000000         1

5        BANGLADESH  0.141   0.2938   3.13      1  0.0000000         0

institutions

1        4.36458

2        4.28125

3        9.42968

4        9.44791

5        2.73958

(The rest of the output is omitted.)

The dataset contains 117 countries and 8 variables. The variable country is a character variable, while the rest of the variables are numeric. The variable country lists the country for each observation; growth indicates economic growth from 1970 to 1990; govspend is a measure of government spending; invest indicates investment share from 1970 to 1989; colony is a dummy variable indicating whether the country is a former colony; openmarket is the share of years between 1965 and 1990 for which the country had an open market economy; accesssea is a 0/1 indicator relating to access to the sea; and institutions is a measure of the quality of institutions. Notice (in the full output on your screen) that there are multiple missing values, labeled NA.
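
If you do not have the data file at hand, the import step can be tried on a small stand-in file. The sketch below writes a hypothetical three-row, comma-separated file to a temporary location and reads it back with the same header and sep arguments; the object name small and the values are made up for illustration.

```r
# Write a tiny comma-separated file to a temporary location (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("country,growth,colony",
             "A,1.5,1",
             "B,NA,0",
             "C,-0.7,1"), tmp)

# header = TRUE: the first line holds variable names; sep = ",": comma-delimited.
small <- read.table(tmp, header = TRUE, sep = ",")

dim(small)    # number of rows and columns
head(small)   # first few rows
unlink(tmp)   # remove the temporary file
```

By default read.table() treats the string NA in the file as a missing value, which is why the second row of growth is read as missing.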

The function class() will return the data type, and summary() will produce descriptive statistics for each variable:

> class(mydata)

[1] "data.frame"

> summary(mydata)

(Output is omitted.)
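
Beyond class() and summary(), a few other base R functions are useful for a first look at a data frame. This is a sketch on a hypothetical toy data frame rather than on mydata.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C"),
  growth  = c(1.5, NA, -0.7),
  colony  = c(1, 0, 1)
)

sapply(toy, class)     # the type of each variable
colSums(is.na(toy))    # number of missing values in each variable
summary(toy$growth)    # descriptive statistics for one variable, including the NA count
```

Counting missing values per variable before any analysis shows how many observations you stand to lose if incomplete rows are dropped.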

Manipulating data in R

Many statistical procedures in R cannot be run on data with missing values, or silently drop incomplete observations. To create a dataset without the missing values, ask R to omit all rows that contain a value labeled NA using the na.omit() function:

> mydata <- na.omit(mydata)
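
The effect of na.omit() is easy to check on a small example. The sketch below uses a hypothetical toy data frame, stores the cleaned data under a new name (complete), and uses complete.cases() to show which rows were dropped.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C"),
  growth  = c(1.5, NA, -0.7),
  colony  = c(1, 0, 1)
)

complete <- na.omit(toy)        # cleaned copy; toy itself is left unchanged
nrow(toy) - nrow(complete)      # number of rows that were dropped
toy[!complete.cases(toy), ]     # the rows that contain at least one NA
```

Keeping the original data frame around makes it possible to see what was removed, at the cost of holding two copies in memory.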

Note that the command mydata <- na.omit(mydata) overwrites the original data frame named mydata. You could specify a new data frame name for the non-missing data instead. To call a specific variable (e.g., country) from the dataset, type mydata$country or mydata[,"country"]. This is helpful if you have multiple data frames involved in your program. For instance, if you have a country variable in both mydata1 and mydata2, you would want to make sure you are referencing the correct data by stating either mydata1$country or mydata2$country. If you are working with only one data frame, or each data frame has different variable names, you can avoid typing mydata$ by using the attach() function. Attaching a data frame in R is like telling R, "From now on, look at this data frame as I reference variable names." The function detach() undoes attach(). The following commands attach the mydata data frame and then ask for a listing of the country variable values:

> attach(mydata)

> country

[1] ALGERIA ARGENTINA AUSTRALIA AUSTRIA BANGLADESH BELGIUM

[7] BENIN BOLIVIA BRAZIL BURKINA FASO BURUNDI CAMEROON

(The rest of the output is omitted.)
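
The different ways of referring to a variable can be compared side by side. The sketch below uses a hypothetical toy data frame and also shows with(), a base R function that evaluates a single expression inside a data frame without attaching it. Note that with() is not used elsewhere in this tutorial.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C"),
  growth  = c(1.5, NA, -0.7)
)

toy$growth                               # explicit reference with $
toy[, "growth"]                          # the same variable, selected by name
with(toy, mean(growth, na.rm = TRUE))    # bare variable name, no attach() needed
```

With with(), the variable names are looked up in the data frame only for that one expression, so nothing needs to be detached afterwards.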

Again, we could type country rather than mydata$country because we attached the data frame. The names() function will produce a list of variable names in the data frame specified:

> names(mydata)

[1] "country" "growth" "govspend" "invest" "colony" "openmarket" "accesssea"

[8] "institutions"

To rename variables or create variables names (if your data does not contain variable names), use the following syntax:

> names(mydata) <- c("Country", "Growth", "Gov.Spend", "Invest",

"Colony", "Open.Market", "Access.Sea", "Institutions")

Here we see another use of the c() function: this time it combines a set of string values (in quotation marks) into a vector, which is then assigned to the names of the mydata data frame. Now examine the variable names of mydata and see how they have changed:

> names(mydata)

[1] "Country" "Growth" "Gov.Spend" "Invest" "Colony" "Open.Market" "Access.Sea"

[8] "Institutions"

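To make sure that R associates the new variable names with the variables, call detach(mydata) and then attach(mydata) again after making changes to the dataset. An attached data frame is a copy, so it does not pick up later changes to mydata automatically.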
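
If only one variable needs a new name, you can replace a single element of names() instead of retyping the whole list. This is a sketch on a hypothetical toy data frame.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C"),
  growth  = c(1.5, NA, -0.7),
  colony  = c(1, 0, 1)
)

# Rename only the variable currently called "growth".
names(toy)[names(toy) == "growth"] <- "Growth"
names(toy)
```

Selecting the name by its current value, rather than by its position, keeps the command correct even if the column order changes.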

Handy Tip: in the last three commands, we typed names(mydata) three times. The R Console remembers the commands you have used, so instead of typing that command three times, we could have pressed the up arrow key on our keyboard to recall the last command used. The more times you press the up arrow key, the further back you can go to retrieve previously used commands. (Similarly, if you are reviewing previously used commands, you can press the down arrow key to scroll through commands in the opposite direction.)

Recoding variables

Suppose you want to create a dummy variable that is coded 1 if the observation meets a certain criterion and 0 otherwise. For example, say you want to create a dummy variable based on the variable Open.Market, which is the fraction of years (from 1965 to 1990) during which a country is rated as an open market. Open.Market.Dummy should be equal to 1 when Open.Market is greater than 0.5 and equal to 0 otherwise. By using the notation mydata$, you can create a new variable within the existing dataset mydata.

> mydata$Open.Market.Dummy <- as.numeric(Open.Market>0.5)

The as.numeric() function creates or coerces objects to be numeric. Since we specified a condition as the argument, the result of that condition is converted to a numeric value of 1 (if the condition is true) or 0 (if the condition is false). Suppose now that you want to create a new variable, Open.Market.Cat, which has three categories: 1 when Open.Market is less than 0.33, 2 when Open.Market is between 0.33 and 0.66, and 3 when Open.Market is greater than 0.66. The following lines assign the specified values (1, 2, or 3) to the new variable mydata$Open.Market.Cat depending on whether the condition in the brackets is true:

> mydata$Open.Market.Cat[Open.Market < 0.33] <- 1

> mydata$Open.Market.Cat[Open.Market >= 0.33 & Open.Market <= 0.66] <- 2

> mydata$Open.Market.Cat[Open.Market > 0.66] <- 3
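
The same three-way recode can also be written as a single nested ifelse() call. The sketch below applies both approaches to a hypothetical vector, open_share, and checks that they agree. The vector and its values are made up for illustration.

```r
# Hypothetical shares, including the boundary values 0.33 and 0.66.
open_share <- c(0, 0.2, 0.33, 0.5, 0.66, 0.9, 1)

# Step-by-step recode, as in the commands above.
cat_manual <- rep(NA_real_, length(open_share))
cat_manual[open_share < 0.33] <- 1
cat_manual[open_share >= 0.33 & open_share <= 0.66] <- 2
cat_manual[open_share > 0.66] <- 3

# Equivalent one-line recode with nested ifelse().
cat_ifelse <- ifelse(open_share < 0.33, 1,
                     ifelse(open_share <= 0.66, 2, 3))

all.equal(cat_manual, cat_ifelse)   # TRUE when both recodes agree
table(cat_ifelse)                   # number of observations in each category
```

Including the boundary values in a test vector is a quick way to confirm which category observations at exactly 0.33 or 0.66 fall into.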

In addition to creating new variables, you may also want to delete existing variables from your data set. To do so, assign that variable the value of NULL:

> mydata$Open.Market.Dummy <- NULL

Although R is not an ideal environment for entering and coding data manually, you may change the values of specific data cells by specifying the column and row numbers. For example, notice that Australia is not coded as a former colony (Colony has a value of 0). To change this, specify the column and row numbers of the observation you wish to change. Australia is the third country in the dataset, and Colony is the fifth variable. Then assign that cell a value of 1:

> mydata[3,5] <- 1
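
Row and column numbers can shift when rows are removed or variables are added, so it can be safer to select the cell by name. The sketch below does this on a hypothetical three-row extract named toy, using the Country and Colony values shown earlier for these three countries.

```r
# Hypothetical three-row extract, for illustration only.
toy <- data.frame(
  Country = c("ALGERIA", "ARGENTINA", "AUSTRALIA"),
  Colony  = c(1, 0, 0)
)

# Select the row by country name and the column by variable name.
toy[toy$Country == "AUSTRALIA", "Colony"] <- 1
toy
```

This form changes the intended cell regardless of where Australia sits in the data frame.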

To save your data as a text file, use the command write.table(). Specify the name of the data object you wish to export (in this case, mydata), the directory where you wish to save the file, and the separation method.
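
A minimal sketch of such an export is shown below. It uses a hypothetical toy data frame, a made-up file name (toy_export.txt), and R's temporary directory so that it runs anywhere; substitute your own data frame and path.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
  country = c("A", "B", "C"),
  growth  = c(1.5, NA, -0.7),
  colony  = c(1, 0, 1)
)

# Hypothetical output file in R's temporary directory.
out <- file.path(tempdir(), "toy_export.txt")

# sep = "," writes comma-separated values; row.names = FALSE omits the row numbers.
write.table(toy, file = out, sep = ",", row.names = FALSE)

# Read the file back to confirm the export worked.
back <- read.table(out, header = TRUE, sep = ",")
back
```

Setting row.names = FALSE keeps the file to the variables themselves, so that it can be read back with header = TRUE without an extra column of row numbers.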