Writing a SAS Program: the DATA Step
A SAS program is built from two kinds of steps: DATA steps and PROC steps. In a DATA step you include statements that create data sets and programming statements that manipulate data. A DATA step begins with a DATA statement. In a PROC (procedure) step you invoke a procedure from the SAS library to run a statistical analysis on a given data set. A PROC step begins with a PROC statement. Both kinds of steps are made up of SAS statements. An important feature of the SAS language is that every SAS statement ends with a semicolon (;). Without a semicolon, a SAS statement is incomplete.
Organizing your Data for Analysis
SAS uses data organized in rows and columns. Observations are represented in rows and variables are represented in columns. An observation contains information for one unit of analysis (e.g., a person, an animal, a machine). Variables are information collected for each observation, such as name, score, age, income, educational level, etc. When data in files are arranged in rows and columns, they are called observations-by-variables, or rectangular data files.
Suppose you have three test scores collected from a class of 10 students (five males and five females). Each student was assigned an identification number. For each student you have an identification number, gender, and the scores for test one, test two, and test three. Suppose the data layout is as follows:
01 f 83 85 91
02 f 65 72 68
03 f 90 94 90
04 f 87 80 82
05 f 78 86 80
06 m 60 74 64
07 m 88 96 92
08 m 84 79 82
09 m 90 87 93
10 m 76 73 70
Notice that in the above data layout at least one blank space is left after each value. In this instance you do not have to specify the columns in which a variable appears: SAS treats the blank after each value as a delimiter, so if the data are space delimited, SAS knows where each value begins and ends. Leaving a space between values is optional. For example, you may choose to enter the same data as follows:
01f838591
02f657268
03f909490
04f878082
05f788680
06m607464
07m889692
08m847982
09m908793
10m767370
In this instance you need to assign column numbers to variables when using SAS to read the data. This is called fixed format input (each variable occupies the same columns for every subject). Format styles are discussed later in this document. Whichever format you choose, it should have no impact on the analysis as long as you convey the format correctly to SAS. In the above layout each observation has only one line (record) of data. In other situations you may have multiple records per observation.
A DATA step begins with a DATA statement of the form DATA dataname;. The first word, DATA, tells SAS that you want to read a data file and store the data in a SAS data set you specify. Replace dataname with an appropriate SAS name (32 or fewer characters), e.g., trial, company, drug, or behavior. For example, to name the data set ANXIETY you would write DATA anxiety;. Note the semicolon at the end of the statement.
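As a minimal sketch, a DATA step that creates a data set named ANXIETY would be framed as shown here. The comment marks where the statements described in the rest of this section would go.

```sas
/* Begin a DATA step that creates a SAS data set named ANXIETY */
DATA anxiety;
   /* INFILE, INPUT, and other DATA step statements go here */
RUN;
```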
If the data are stored in a separate file, use an INFILE statement to read the ASCII text file into the SAS program. SAS can read several data files from within the same program file, so you can have multiple DATA steps in a single SAS program file.
The INFILE statement is entered immediately after the DATA statement and before the INPUT statement.
The statement has the form INFILE 'pathname\filename';. Replace pathname with the drive and path to your data file, and filename with the name of the file. For example, if you were reading a file called clas.dat from the c:\temp\ directory, the statement would be INFILE 'c:\temp\clas.dat';.
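Placed inside a DATA step, the INFILE statement for clas.dat might look like the following sketch. The data set name CLASS and the variable list on the INPUT statement are assumptions made for illustration. The text does not say how clas.dat is laid out.

```sas
DATA class;                              /* hypothetical data set name */
   INFILE 'c:\temp\clas.dat';            /* external file named in the text */
   INPUT id sex $ test1 test2 test3;     /* assumed layout: space-delimited values */
RUN;
```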
The INPUT statement tells SAS the names of the variables and, optionally, the columns from which each variable is read on a given line. Variable names in SAS can contain from 1 to 32 characters. They may contain digits but must begin with a letter or an underscore. If your data contain more than one line per observation, indicate the line number before specifying the variables on that line, for example:
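A sketch of such a statement is shown here, assuming three lines per observation. The data set name, the variable names, and the column positions are illustrative assumptions. The values come from the first two students in the class example, spread over three lines each.

```sas
DATA scores;                        /* hypothetical data set name */
   INPUT #1 id 1-2 sex $ 4          /* line 1: identification number and gender */
         #2 test1 1-2 test2 4-5     /* line 2: first two test scores */
         #3 test3 1-2;              /* line 3: third test score */
   DATALINES;
01 f
83 85
91
02 f
65 72
68
;
RUN;
```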
The above INPUT statement informs SAS that there are 3 lines of data for each observation. The lines are indicated by a # sign.
The INPUT statement does not have to contain column numbers, provided there is a space between values on the data line. This is referred to as free format, as opposed to fixed format, in which you specify the column numbers. If a variable contains character values, indicate this with a $ sign after the variable name. If you choose free format, a character value should not exceed 8 characters and should not contain embedded blanks. Free format may not be a good idea if you have a large number of variables.
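As an illustration, the space-delimited class data shown earlier could be read in free format roughly as follows. The data set name and the variable names are assumptions, and only the first three records are included to keep the sketch short.

```sas
DATA class;                              /* hypothetical data set name */
   INPUT id sex $ test1 test2 test3;     /* no column numbers; $ marks SEX as character */
   DATALINES;
01 f 83 85 91
02 f 65 72 68
03 f 90 94 90
;
RUN;
```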
If there are decimal points in your data, you may either enter them as they are, or leave them out of the data and indicate in the INPUT statement that a given variable has a specified number of decimal places. Suppose you have a variable GPA whose value has 3 digits, the last two of which are decimal places, e.g., 3.89. If you enter the decimal point in your data file, write the INPUT statement as INPUT GPA 1-4; (the value occupies four columns, including the decimal point). Alternatively, leave out the decimal point (389) and indicate in the INPUT statement that GPA has 2 decimal places: INPUT GPA 3.2;. This means that GPA is 3 columns wide (columns 1-3 when it is the first variable on the line) and that the last 2 digits are decimal places.
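The two choices can be compared side by side in a sketch like the one below. The data set names are hypothetical, and the second data value in each step is made up for illustration. Both steps should store the same GPA values.

```sas
/* Choice 1: the decimal point is typed in the data (4 columns wide) */
DATA gpa_explicit;       /* hypothetical data set name */
   INPUT gpa 1-4;
   DATALINES;
3.89
2.75
;
RUN;

/* Choice 2: the decimal point is omitted; 3.2 means width 3, 2 decimal places */
DATA gpa_implied;        /* hypothetical data set name */
   INPUT gpa 3.2;
   DATALINES;
389
275
;
RUN;
```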
The input format can also be written in a shorter form with a mixed style column and formatted input.
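A sketch of such a statement, reconstructed from the description that follows rather than taken from the original, is shown below. The data set name and the single data record are made up so that the step can run. Columns 16-25 are skipped with @26 because the text does not say what they hold.

```sas
DATA survey;                               /* hypothetical data set name */
   INPUT id 1-2                            /* column input: columns 1-2 */
         sex $ 3                           /* character variable in column 3 */
         (exp school) (1.)                 /* formatted input: columns 4 and 5 */
         (c1-c10) (1.)                     /* ten 1-column variables, columns 6-15 */
         @26 (mathscor compscor) (2.);     /* columns 26-27 and 28-29 */
   DATALINES;
01f321234512345xxxxxxxxxx8590
;
RUN;
```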
In this case the program will read the variable ID from columns 1-2 and SEX from column 3. The next two variables, EXP and SCHOOL, have a width of 1 column each and start at column 4. The variables C1 through C10 (10 variables in sequential order) have a width of 1 column each and start immediately after the variable SCHOOL. If you want to read only the variables ID and the last 2 variables (MATHSCOR, COMPSCOR) you could write the INPUT statement as:
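One way to write this, assuming MATHSCOR and COMPSCOR occupy columns 26-27 and 28-29, is sketched below. The data set name and the data record are illustrative only.

```sas
DATA scores_only;                                 /* hypothetical data set name */
   INPUT id 1-2 @26 (mathscor compscor) (2.);     /* skip from column 3 straight to column 26 */
   DATALINES;
01f321234512345xxxxxxxxxx8590
;
RUN;
```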
The @26 moves the pointer to column 26, and SAS then reads two variables with a width of 2 columns each.
The DATALINES statement is used only if the data are embedded within the SAS program file. It tells SAS that the data lines follow in the program file. Indicate the end of the data lines with a semicolon at the beginning of a new line. For example:
DATALINES;
25 32 82 32 1
22 42 . 36 2
;
The DATALINES statement must be the last statement in the DATA step. All other DATA step statements (INPUT, IF-THEN, LABEL, and so on) come before it.
IF-THEN and SAS Functions
Missing values in a data set can be represented either by a blank or by a period. If you choose free format (leaving a space after each value in the data set and not specifying column numbers in the INPUT statement), make sure you represent missing values with a period. When SAS encounters a blank or a period in place of a value, it regards the value as missing. You can also use a numeric code (e.g., 9, 99, 999, 000) to stand for a missing value and tell SAS which code marks a missing value for a given variable. Suppose, for a variable MATHSCOR, 99 is used as the missing value code. Immediately after the INPUT statement you may specify:
This statement will assign a missing value whenever it encounters a value of 99 in the variable MATHSCOR.
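A minimal sketch of such a statement, placed in a small DATA step, is shown here. The data set name, the ID variable, and the two data records are made up for illustration.

```sas
DATA class;                                /* hypothetical data set name */
   INPUT id mathscor;
   IF mathscor = 99 THEN mathscor = .;     /* recode the code 99 to a SAS missing value */
   DATALINES;
1 85
2 99
;
RUN;
```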
You can also use the IF-THEN statement to recode any variable. For example, if you wanted to collapse the test scores into a dichotomous variable with students who scored 90 points receiving a score of one and the rest being assigned a score of zero, the syntax would be:
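One possible version is sketched below. It assumes that the cutoff means 90 points or more, that the score variable is called test1, and that the new variable is called high. None of these names come from the original. The first IF keeps missing scores missing, because otherwise SAS would treat a missing value as smaller than 90 and assign it a zero.

```sas
DATA recoded;                            /* hypothetical data set name */
   INPUT id test1;
   IF test1 = . THEN high = .;           /* leave missing scores missing */
   ELSE IF test1 >= 90 THEN high = 1;    /* 90 or more: coded 1 */
   ELSE high = 0;                        /* everything else: coded 0 */
   DATALINES;
01 83
03 90
07 88
;
RUN;
```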
There are a number of SAS operators that can be used in a DATA step, e.g., ** (raise to a power), * (multiplication), / (division), + (addition), - (subtraction), = or EQ (equal to), >= or GE (greater than or equal to), AND, OR, NOT.
In the DATA step you can also use a number of SAS functions to transform existing variables or create new variables. There are too many SAS functions to list them all here, but some of the most commonly used ones are: MEAN (arithmetic mean), SUM (sum of arguments), VAR (variance), ABS (absolute value), SIN (sine), LOG (natural logarithm), SQRT (square root).
For instance, to create a new variable called final with the arithmetic mean (average) of the 3 scores, you would type:
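A sketch of such an assignment is shown here, assuming the three scores are stored in variables named test1, test2, and test3. The variable names and the data set name are assumptions. The data are the first two records of the class example.

```sas
DATA class;                                /* hypothetical data set name */
   INPUT id sex $ test1 test2 test3;
   final = MEAN(test1, test2, test3);      /* average of the three test scores */
   DATALINES;
01 f 83 85 91
02 f 65 72 68
;
RUN;
```

Note that MEAN ignores missing arguments, so a student with one missing score would get the average of the remaining two.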
For further details refer to SAS Language Reference: Concepts, Version 8.
You can use LABEL statements either in the DATA step or in the PROC step to give labels to variables. The statement has the form LABEL variable='variable label';.
Replace variable with the name of the variable, and variable label with the label you want to assign to the variable. A SAS variable name is limited to 32 characters, whereas a label assigned to a SAS variable can have up to 256 characters, including blanks. Labels should be enclosed in quotes, and the LABEL statement is terminated by a semicolon. For example,
LABEL exp='years of computer experience';
LABEL mathscor='score in mathematics';
PROC FORMAT Statement
FORMAT statements associate formats with variables. For example, in a data set, the variable SEX has two values (f, m) and the variable SCHOOL has 3 values (1, 2, 3). To associate these values with appropriate value labels, use the FORMAT statement. The FORMAT statement may be used in a DATA step or in a PROC step. However, when you define a format with PROC FORMAT, the PROC FORMAT step must come before any step that uses the format, so it is typically placed at the very start of the SAS program. Note that a dollar sign ($) is required in the name of a format for a character variable, and that character values are entered in single quotes.
PROC FORMAT;
value $sex 'm'='male' 'f'='female';
value school 1='rural' 2='city' 3='suburban';
RUN;
Once you have defined the formats as above, associate them with the variables through a FORMAT statement after the INPUT statement.
INPUT id 1-3 sex $ 4 school 5 test 6-7;
FORMAT sex $sex. school school.;
To execute any series of statements that read or transform data in any way, you must also include a RUN statement to execute those commands unless they are followed by a PROC step (explained below). For example:
INPUT id 1-3 sex $ 4 school 5 test 6-7;
FORMAT sex $sex. school school.;
RUN;
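Putting the pieces of this section together, a complete program might look like the following sketch. The data set name ONE and the two data records are made up for illustration. The INPUT and FORMAT statements are the ones shown above.

```sas
/* Define the value labels first, so the formats exist when the DATA step runs */
PROC FORMAT;
   VALUE $sex   'm'='male' 'f'='female';
   VALUE school 1='rural' 2='city' 3='suburban';
RUN;

DATA one;                                   /* hypothetical data set name */
   INPUT id 1-3 sex $ 4 school 5 test 6-7;
   FORMAT sex $sex. school school.;
   DATALINES;
001f183
002m265
;
RUN;                                        /* executes the DATA step */
```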
Comments are provided for documentation purposes. Text enclosed in /* ... */, or a statement that begins with an asterisk and ends with a semicolon (* ... ;), is ignored by SAS and is not used while executing the program.
* This is a comment;
* So is this;
/* This comment
spans several lines and ends with the asterisk-slash */