Writing a SAS Program: The DATA Step

A SAS program consists of two kinds of steps: DATA steps and PROC steps.

In the DATA step you may include commands to create data sets and programming statements to perform data manipulations. The DATA step begins with a DATA statement.

In the PROC (Procedure) step you invoke SAS procedures from the library to run statistical analysis on a given data set. The PROC step begins with a PROC statement.

These steps contain SAS statements. An important feature of the SAS language is that every SAS statement ends with a semicolon (;). Without a semicolon a SAS statement is incomplete.
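As a minimal sketch of how the two kinds of steps fit together, the program below builds a small data set in a DATA step and then prints it in a PROC step. The data set name demo, the variable names, and the data values are made up for illustration.

```sas
* DATA step: create a data set named demo (hypothetical name);
DATA demo;
   INPUT id score;      * two numeric variables, values separated by blanks;
   CARDS;
1 85
2 90
;
RUN;

* PROC step: print the data set created above;
PROC PRINT DATA=demo;
RUN;
```

Every statement, including each comment that starts with an asterisk, ends with a semicolon.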

DATA Statement

DATA dataname;

The first word, DATA, tells SAS that you want to read a data file and store it in a SAS data set with a name you specify. Replace dataname with an appropriate SAS name (32 or fewer characters), e.g. trial, company, drug, behavior. In the example given below, "dataname" is replaced by the name anxiety. Note the semicolon at the end of the statement.

DATA anxiety;

INPUT Statement

INPUT var1 column# var2 column# var3 column# ...... varn column#;

The INPUT statement tells SAS the names of the variables and the columns in which each is found on the data line. In SAS Version 7 and later, variable names can contain from 1 to 32 characters (Version 6 limited them to eight). They may contain digits and underscores but must begin with a letter or an underscore. If your data contain more than one line per case (observation), indicate the line number, preceded by a # sign, before specifying the variables on that line.

INPUT id 1-3 company 8-10 #2 insal 6-10 finalsal 18-23 #3 retire 15-19;

The above INPUT statement informs SAS that there are three lines of data for each subject/observation. The lines are indicated by a # sign.
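A sketch of how such a multi-line INPUT statement might be used is given below. The data set name salary and all data values are invented for illustration; what matters is that each value sits in the columns named in the INPUT statement, so the leading blanks on the second and third lines of each observation are significant.

```sas
DATA salary;                       * hypothetical data set name;
   INPUT id 1-3 company 8-10
         #2 insal 6-10 finalsal 18-23
         #3 retire 15-19;
   CARDS;
001    112
     25000        48000
              65000
002    245
     31000        52500
              70250
;
RUN;

PROC PRINT DATA=salary;
RUN;
```

The six data lines produce two observations, because each observation consumes three lines.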

INPUT statements need not contain column numbers, provided there is at least one space between values on the data line. This is referred to as free format, as opposed to fixed format, where you specify the column numbers. If a variable contains character values, indicate it with a $ sign after the variable name. In free format, a character value should not exceed eight characters (by default, longer values are truncated) and must not include embedded blanks. Free format may not be a good idea if you have a large number of variables.
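A small free-format sketch follows. The data set name staff, the variables, and the values are hypothetical. The character values are kept to eight characters or fewer and contain no blanks.

```sas
DATA staff;                 * hypothetical data set name;
   INPUT name $ age dept $; * $ marks name and dept as character variables;
   CARDS;
Smith 34 sales
Jones 41 admin
;
RUN;

PROC PRINT DATA=staff;
RUN;
```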

If there are decimal points in your data, you may either enter them as they are, or omit them when entering the data and indicate in the INPUT statement that a given variable has a specified number of decimal places. Suppose you have a variable gpa in your study whose value has three digits, of which the last two are decimal places, e.g., 3.89. If you enter the decimal point in your data file, the value occupies four columns: INPUT gpa 1-4;. Another choice is to leave out the decimal point (389) and indicate in the INPUT statement that gpa has two decimal places: INPUT gpa 1-3 .2;. This means that gpa is in columns 1-3 and that the last two digits are decimal places.
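The implied-decimal approach can be sketched as follows, using the column-input form in which the number of decimal places is written as .2 after the column range. The data set name grades and the values are illustrative only.

```sas
DATA grades;               * hypothetical data set name;
   INPUT gpa 1-3 .2;       * columns 1-3, last two digits are decimal places;
   CARDS;
389
275
4.0
;
RUN;

PROC PRINT DATA=grades;
RUN;
```

The first two values are read as 3.89 and 2.75. The third field already contains a decimal point, so the .2 specification is ignored for it and the value is read as 4.0.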

INFILE Statement

INFILE 'path/filename';

If the data are stored in a separate file (for example, clas.dat), an INFILE statement tells SAS where to read the data from, e.g., INFILE '/pathname/clas.dat';. Replace /pathname with the name of the directory in which the data are stored. Data files in another directory, including another user's directory, can be read the same way, provided the file permissions allow you to read the source file. A single SAS program can also read several data files.

The INFILE statement is usually entered immediately after the DATA statement; it must precede the INPUT statement that reads the file.

DATA anxiety;
INFILE '/usr1/jdoe/clas.dat';

Replace /usr1/jdoe with an appropriate pathname.

Note: Unix is case sensitive. When you are referring to an external file, you must use the correct case. For example, "clas.dat" and "Clas.dat" are two different filenames in Unix and you must match the case correctly whenever referring to another file with single quotes.
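To try INFILE without depending on a particular directory, one option is to write a couple of records to a temporary file first and then read them back. This is only a sketch: the fileref rawdata, the two records, and the column layout are assumptions made for illustration, and the FILENAME, FILE, and PUT statements are used here solely to create the sample file.

```sas
* Write two sample records to a temporary file (stand-in for clas.dat);
FILENAME rawdata TEMP;
DATA _NULL_;
   FILE rawdata;
   PUT '0011993240';
   PUT '0022424548';
RUN;

* Read the external file: INFILE comes right after DATA, before INPUT;
DATA anxiety;
   INFILE rawdata;
   INPUT id 1-3 sex 4 test1 5-6 test2 7-8 test3 9-10;
RUN;

PROC PRINT DATA=anxiety;
RUN;
```

With a real data file, the fileref would be replaced by the quoted path, as in INFILE '/usr1/jdoe/clas.dat';.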

CARDS Statement

The CARDS statement tells SAS that data lines follow. The end of the data lines is indicated by a semicolon at the beginning of a new line, e.g.,

	CARDS;
	25 32 82 32 1
	22 42 .  36 2
	;

The CARDS statement must be the last statement in the DATA step; everything after it, up to the line containing the semicolon, is read as data.

	DATA anxiety;
		INPUT id 1-3 sex 4 test1 5-6 test2 7-8 test3 9-10;
		IF test1=99 THEN test1=.;
		avscore=(test1+test2+test3)/3;
	CARDS;
	0011993240
	0022424548
	;

Missing values in a data set can be represented either by a blank or by a period. In fixed format (column numbers specified in the INPUT statement), SAS treats a blank field or a single period as a missing value. In free format (values separated by spaces, no column numbers), blanks only separate values, so you must represent each missing value with a period.

You can also use a numeric code (e.g. 9, 99, 999, 000) to stand for a missing value and tell SAS which code is used for a given variable. Suppose that for a variable mathscor, 99 is the missing-value code. Immediately after the INPUT statement you may specify:

IF mathscor=99 THEN mathscor=.;

This statement will assign a missing value whenever it encounters a value of 99 within the variable mathscor.
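The recoding can be checked with a short sketch like the one below. The data set name scores and the three data lines are invented. The second observation uses the code 99 and the third uses a period.

```sas
DATA scores;                          * hypothetical data set name;
   INPUT id mathscor;
   IF mathscor=99 THEN mathscor=.;    * 99 is the code chosen for missing;
   CARDS;
1 87
2 99
3 .
;
RUN;

* Count nonmissing and missing values of mathscor;
PROC MEANS DATA=scores N NMISS MEAN;
   VAR mathscor;
RUN;
```

After the recode, both the 99 and the period count as missing, so PROC MEANS reports one nonmissing value and two missing values for mathscor.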

SAS Functions

In the DATA step you can use a number of SAS functions, e.g., MEAN (computes arithmetic mean), SUM (calculates sum of arguments), VAR (calculates the variance), ABS (returns absolute value), SIN (calculates sine), LOG (produces the natural logarithm), SQRT (calculates the square root).

For instance, to create a new variable final which will be the arithmetic mean (average) of the 3 scores (variables: test1, test2, and test3), you would use the following command:

final=MEAN(test1,test2,test3);
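One practical difference between the MEAN function and a hand-written average is how missing values are treated. The sketch below, with a made-up data set name and scores, computes both so the results can be compared.

```sas
DATA compare;                         * hypothetical data set name;
   INPUT test1 test2 test3;
   final=MEAN(test1,test2,test3);     * mean of the nonmissing scores;
   avscore=(test1+test2+test3)/3;     * missing if any score is missing;
   CARDS;
80 90 100
80 .  100
;
RUN;

PROC PRINT DATA=compare;
RUN;
```

For the second observation, final is 90 (the mean of 80 and 100), while avscore is missing because arithmetic on a missing value yields a missing value.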

There are a number of SAS operators that could be used in a DATA step, e.g.: ** (raise to a power), * (multiplication), / (division), + (addition), - (subtraction), = or EQ (equal to), >= or GE (greater than or equal to), AND, OR, NOT.

IF id <= 20 THEN group=1; ELSE group=2;

In the above example, a new variable group is created with 2 categories. Observations with id numbers 20 or lower form group 1 and observations with id numbers greater than 20 form group 2. For details see SAS Language: Reference Version 8, and SAS Language and Procedures: Usage Version 8.
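The following sketch puts the grouping statement together with a few of the operators listed above. The data set name grouped, the variables squared and highlow, and the data values are hypothetical.

```sas
DATA grouped;                        * hypothetical data set name;
   INPUT id score;
   IF id <= 20 THEN group=1;
   ELSE group=2;
   squared=score**2;                 * score raised to the power 2;
   highlow=(score GE 60 AND group EQ 1); * 1 if both conditions hold, else 0;
   CARDS;
5 70
20 55
21 90
;
RUN;

PROC PRINT DATA=grouped;
RUN;
```

A comparison in parentheses evaluates to 1 when true and 0 when false, which is a compact way to create an indicator variable.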

LABEL Statement

You can use LABEL statements either in DATA steps or in PROC steps to give labels to variables.

LABEL variable='variable label';

Replace variable with the name of the variable, and variable label with the label you want to assign to it. A SAS variable name is short (at most eight characters in Version 6, 32 in Version 7 and later), whereas a label can be much longer (up to 40 characters in Version 6, 256 in Version 7 and later), including blanks. Labels should be enclosed in quotes, and the LABEL statement is terminated by a semicolon, e.g.,

LABEL exp='years of computer experience';
LABEL mathscor='score in mathematics';
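As a sketch, the labels can be assigned in the DATA step and then displayed by adding the LABEL option to PROC PRINT. The data set name survey and the data values are made up for illustration.

```sas
DATA survey;                          * hypothetical data set name;
   INPUT exp mathscor;
   LABEL exp='years of computer experience'
         mathscor='score in mathematics';
   CARDS;
3 88
10 72
;
RUN;

* The LABEL option asks PROC PRINT to use labels as column headings;
PROC PRINT DATA=survey LABEL;
RUN;
```

A single LABEL statement can assign labels to several variables, as shown here, or each variable can have its own LABEL statement.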

FORMAT Statement

The FORMAT statement associates formats with variables. Suppose the variable sex has two values (M, F) and school has three values (1, 2, 3). To display these values with descriptive value labels, first define the formats with VALUE statements in PROC FORMAT, then attach them to the variables with a FORMAT statement. The FORMAT statement may be used in a DATA step or in a PROC step. PROC FORMAT, however, must appear before any step that uses the formats it defines, so it is usually placed at the very beginning of the program.

PROC FORMAT;
   value $sex 'M'='male' 'F'='female';
   value school 1='rural' 2='city' 3='suburban';

Once you have defined the formats as above, associate them with the variables through a FORMAT statement after the INPUT statement.

PROC FORMAT;
   value $sex 'M'='male' 'F'='female';
   value school 1='rural' 2='city' 3='suburban';
DATA anxiety;
   INPUT id 1-3 sex $ 4 school 5 test 6-7;
   FORMAT sex $sex. school school.;
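A complete, runnable version of this program might look like the sketch below. The two data lines are invented and follow the column layout in the INPUT statement (id in columns 1-3, sex in column 4, school in column 5, test in columns 6-7).

```sas
* Define the value labels first;
PROC FORMAT;
   VALUE $sex 'M'='male' 'F'='female';
   VALUE school 1='rural' 2='city' 3='suburban';
RUN;

DATA anxiety;
   INPUT id 1-3 sex $ 4 school 5 test 6-7;
   FORMAT sex $sex. school school.;   * attach the formats to the variables;
   CARDS;
001M185
002F372
;
RUN;

PROC PRINT DATA=anxiety;
RUN;
```

In the printed output, sex appears as male or female and school as rural, city, or suburban, while the stored values remain M, F, 1, 2, and 3.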
