ODBC and Statistical Software

When using ODBC and statistical software, there are a few database management concepts that you should be aware of -- tables and the Structured Query Language (SQL). Familiarity with these concepts can help you understand the structure of the database files you may be using and how they are accessed.

Most current spreadsheet and database programs organize data in a multi-tiered or three-dimensional fashion such that the data are broken down into various components, usually referred to as tables. For example, most modern spreadsheet programs have a workbook which contains individual sheets or pages each of which can contain different facets of a single data file (e.g. an accounting workbook might have a sheet for income, and another for expenditures, etc.). The majority of database management software can also organize the data in a multidimensional fashion, with each data file containing a number of different tables, each of which contains different elements of the entire database. So, if you had a sales database, for example, the data file might contain one table for customers' addresses, another for prices and products, another for tracking shipping, etc. So, instead of a data file containing only a single data set, it is possible for a single database to contain many data sets, or many different permutations of the same data set. For example, a single data set could contain data along with various different views and query tables that are just variants of the original data. It is therefore vital that you know the exact structure of a data file before you attempt to read the data into another program. Each table in a database usually has an assigned name, and you must know the name of the table you wish to read using ODBC to be able to transfer the data properly.

SQL, a standardized database language designed to select data from one or more database tables, is another important database management concept. SQL is used by both SPSS and SAS to specify the range of variables and/or observations that you wish to read from an ODBC data source. This can save time in reforming a data file in SAS or SPSS by reading in only those variables and/or observations that you are interested in, saving you from having to transform the data after you have read the data. Both SPSS and SAS use standard SQL statements to read data from an ODBC source, by using the SELECT command to specify the variables you wish to access, and the WHERE command to specify the observations.

The discussions of SAS and SPSS below include examples of reading database files with multiple tables and SQL to familiarize you with these database management concepts.

SAS and ODBC

SAS uses the SAS/ACCESS module and the PROC SQL procedure to access external ODBC data sources. As with most SAS procedures, there is no point-and-click interface available and all of the steps will have to be accomplished using the SAS command language. Using PROC SQL, you can access any ODBC data source in SAS. SAS/ACCESS can be used without PROC SQL to access many popular PC-file formats (e.g. Lotus, dBASE, Excel, etc.) but it is considerably more difficult to configure and use (see SAS/ACCESS Interface to PC File Formats for more information). Remember, if you are importing data from a spreadsheet, you must first assign a name to the data you wish to import, as discussed in a previous section.

Imagine, for example, that we followed the example in the section "Creating an Excel Data Source" such that we have an Excel file called temp.xls in the C:\WORK subdirectory. You have already assigned the name data to a selection of data and created an ODBC data source called temp_xls for this file. To read this file, you would do the following:

1.      Launch SAS.

2.      In the SAS Program Editor, type the following commands:

PROC SQL;
CONNECT TO ODBC(DSN='temp_xls');
CREATE TABLE temp_sas AS
SELECT * FROM CONNECTION TO ODBC(SELECT * FROM data);
QUIT; 

3.      Highlight these commands and from the main menu select Run->Submit.

In this example, we use a file-specific ODBC data source (temp_xls) to read the data. You may also use a generic ODBC data source to read the same file by doing the following:

1.      In the SAS Program Editor, type the following commands:

PROC SQL;
CONNECT TO ODBC(PROMPT);
CREATE TABLE temp_sas AS
SELECT * FROM CONNECTION TO ODBC(SELECT * FROM data);
QUIT;

2.      Highlight these commands and from the main menu select Run->Submit.

3.      An ODBC dialogue box will appear and prompt you for the type of file you wish to access. From the list, select Excel Files and then click OK.

4.      A final dialogue box will then appear, asking for the name and location of the file you wish to retrieve. You can either use the mouse to choose the appropriate drive and subdirectory, or just type the name and location of the file in the top left hand corner of the dialogue box. In this example, you would type C:\WORK\TEMP.XLS and then click OK.

The PROC SQL procedure is generally used to issue SQL commands to SAS data sets but if you have the SAS/ACCESS module installed, you can use this procedure to issue queries to other types of data files including ODBC data sources. By default, an SQL query accesses a remote data file and then responds with a list of the results of your query. In these examples, we asked SAS to select all of the variables (SELECT *) from the ODBC data source temp_xls. The reason there are two SELECT * statements in this example is because the first describes what PROC SQL will select, and the second describes what the Excel ODBC driver will select. While SAS is fully SQL-compliant, many ODBC drivers can only understand rudimentary SQL commands (i.e. SELECT and WHERE), so both commands must be used to synchronize PROC SQL with the ODBC driver.

If you want to read the results of a PROC SQL into a SAS data set for analysis, you must also use a CREATE TABLE subcommand as above, followed by the name of the data set you wish to create. In this example, we access the Excel file called temp.xls and then select all of the variables and cases in that spreadsheet and store them in a SAS temporary data set called temp_sas.

Now let us try a more complicated example of using PROC SQL to read an ODBC data source. This time, imagine that you are reading a Microsoft Access database file. The file is called temp.mdb, it is located in the C:\WORK subdirectory, and you have already created an ODBC data source for the file called temp_mdb. The file has two tables of data (table1 and table2), each with five variables (table1 contains ID, V1, V2, V3, V4; table2 contains ID, V5, V6, V7, V8), where ID is a primary index variable (i.e. it is used to order the data), and there are 10 observations. The ID variable tracks the case number (ranges from 1 to 10) and the variables V1-V8 are numeric data. Furthermore, you want extract the last five cases from both tables, and variables V1 and V2 from Table1, and V7 and V8 from Table2. So the final data set would be composed of five cases with four variables: V1, V2, V7, and V8. The syntax you would use to complete this task would be:

PROC SQL;
CONNECT TO ODBC(DSN=temp_mdb);
CREATE TABLE temp_sas AS
SELECT * FROM CONNECTION TO ODBC(SELECT V1, V2, V7, V8
FROM TABLE1, TABLE2 WHERE TABLE1.ID=TABLE2.ID AND TABLE.ID > 5);
QUIT;

This example creates a new SAS data set called temp_sas using the CREATE TABLE subcommand. As you can see, the SELECT statement issued to the Access ODBC driver is much more complicated than in our previous examples. First, we use the SELECT command to list the variables we are interested in (V1, V2, V7, V8). Next, we use the FROM command to list the tables from which these variables are to be drawn (Table1 and Table2). A WHERE command is used to specify which records/observations we are interested in. In this example, the database file is indexed so we must specify that the index shared by each table is the same. Each variable/field can be referred to by including the appropriate table so TABLE1.ID=TABLE2.ID can be understood to mean that the ID variable in Table1 is set to be equal to the ID variable in Table2. Finally, an additional condition is set using the AND condition to include only those cases where ID is greater than five (i.e. the last five cases).

While this example is much more complicated than the most common uses of SAS and ODBC, it does demonstrate how PROC SQL can be used. Explaining all of the command choices available is beyond the scope of this discussion. For more information, consult SAS SQL Query Window User’s Guide, and SAS/ACCESS Software for PC File Formats.

SPSS and ODBC

SPSS has a very easy to use point-and-click interface for accessing ODBC data sources. Remember, however, that unlike SAS, SPSS can read only ODBC data sources and cannot serve as an ODBC data source. This means you can use ODBC to read data into SPSS, but you cannot use ODBC to read data from SPSS files. Also unlike SAS, SPSS has the ability to easily read many different types of files, and more often than not this would meet the needs of most users. If, however, you wished to access an ODBC data source, SPSS can do so.

Once again, imagine you have an Excel file called temp.xls in the C:\WORK subdirectory which has a named selection called data and you have created an ODBC data source for this file called temp_xls, as in the example discussed in "Creating an Excel ODBC Data Source." To read this file into SPSS, you would do the following:

1.      Launch SPSS.

2.      From the main menu, select Database ->New Query. The Database Wizard dialogue box will appear.

3.      In the Database Wizard dialogue box, select the temp_xls data source and then click OK.

4.      You see the Select Data dialogue box, shown below:

5.      In the box labeled Available Tables, SPSS will list all of the named selections in the spreadsheet. In this example, we have assigned a name to one selection in Excel, the one we called data. If there were more than one named selection, each one would be listed in the Available Tables box. To select a table, click on the table you wish to access and drag it to the right-hand box labeled Retrieve Field In This Order. In this example, you would click on the table data and drag the table to the right-hand box before releasing the left mouse button.

6.      The right hand box will then display a listing of all of the variables/fields contained in the table you selected. You can deselect variables by double-clicking on each variable name. By default, the ODBC driver will assign each column a name such as F1, F2, F3... if there were no variables names assigned to the original data in Excel. These field names correspond to the original column order (i.e. column A equals F1, column B is F2, etc.). Click NEXT. The Limit Retrieved Cases dialogue box will then appear

7.      The Limit Retrieved Cases dialogue box can be used to create a filter for the data, so that certain cases could be dropped during the transfer process. This function behaves much like the SELECT IF command in SPSS. In this example, we want to read all of the observations, so we do not have to use this option, so at this point you would simply click FINISH to import the data into SPSS.

SPSS will then show the results of this command. Having described a simple example, we can now explore some of the other options available. Below is a slightly more sophisticated example where a filter is applied to this process.

Imagine that you have a Microsoft Access database file called temp.mdb. The file is located in the C:\WORK subdirectory and you have already created an ODBC data source for the file called temp_mdb, as in the example in the section "Creating an Access ODBC Data Source." The file has two tables of data (Table1 and Table2), each with five variables (table1 contains ID, V1, V2, V3, V4; table2 contains ID, V5, V6, V7, V8), where ID is a primary index variable (i.e. it is used to order the data), and there are ten observations. The ID variable tracks the case number (ranges from 1 to 10) and the variables V1-V8 are numeric data. Furthermore, you want to extract only the last five cases for variables V7 and V8 from Table2. So the final data set would be composed of five cases with two variables. You would need to do the following:

1.      From the main menu, select New Query... The Database Wizard dialogue box will appear.

2.      In the Database Wizard dialogue box, select the temp_mdb data source by clicking on the name once, and then click NEXT. The Select Data dialogue box will appear.

3.      In the Select Data dialogue box, select the table you wish to draw the data from in the box in the upper left hand corner. In this example, you want to see a list of all of the variables/fields available in Table2. Click on the +next to the name of Table2 to see a listing of the tables.

4.      Next, select the fields V7 and V8. You would use the scroll bar to find these fields/variables and then select them by double-clicking on each variable.

5.      Now you need to set the WHERE condition. Click on the NEXT button. The Limit Retrieved Cases dialogue box will appear.

6.      In the Limit Retrieved Cases dialogue box, click the Select where case satisfies condition radio button in the top middle part of the dialogue box.

7.      Double click ID in the Fields box.  Table 2:ID appears in the Expression 1 column.  Click the cell right below the Relation column and type >.  Type 5 in the cell right below the Expression 2 column.

8.      Click the FINISH button when done setting the condition to import the data.

The biggest advantage of using SPSS to read data by using the windows interface is that it provides you with an easy to use interface. However, you can also use the GET CAPTURE command to perform the same tasks described above. The following discussion demonstrates how you can use the command to replicate the results from the previous interactive examples, as well as shows how you could implement more sophisticated queries. For more information on the GET CAPTURE command, consult the SPSS Base 7.0 Syntax Reference Guide.

Assume that you have an Excel spreadsheet file with an assigned ODBC data source called temp_xls with a named selection called data, as used in the first example in this section and in the section "Creating an Excel ODBC Data Source." You could read this file into SPSS interactively as described above, or you could open a Syntax window in SPSS and use the GET CAPTURE command instead:

1.      From the main menu, select File->New->Syntax.

2.      In the Syntax window, type the following commands:

GET CAPTURE ODBC
/CONNECT='DSN=temp_xls'
/Select * FROM `data`.
EXECUTE. 

3.      Highlight these commands by either using the mouse to click-and-drag over the text or choose Edit->Select All from the main menu.

4.      Click the RUN button (a right-pointing triangle on the button bar) or choose Run->Selection from the main menu.

The syntax for this example is similar to the SAS examples previously discussed. Both SAS and SPSS use SQL commands to query external ODBC data sources. In this example, we use the SELECT * command to select all of the variables from the table called data.

We could also use the WHERE command to set a condition to determine which observations are read into SPSS, just as we did in the second example above when we read the last five cases from a Microsoft Access database with two tables. Again, assume that you had an Access database with two Tables (Table1 and Table2), each table containing four variables (V1-V4 in Table1, V5-V8 in Table2), and each table shares an index variable called ID. And just as in the interactive example above, suppose you want to read the last five cases for V7 and V8 into SPSS. To do this using the GET CAPTURE command:

1.      From the main menu, select File->New->Syntax.

2.      In the Syntax window, type the following commands:

       GET CAPTURE ODBC
       /CONNECT='DSN=temp_mdb'
       /Select V7, V8 FROM Table2
       WHERE ID > 5.
       EXECUTE.

3.      Highlight these commands by either using the mouse to click-and-drag over the text or choose Edit->Select All from the main menu.

4.      Click the RUN button (a right-pointing triangle on the button bar) or choose Run->Selection from the main menu.

Now let us consider a more complicated variation of this example, taking advantage of the GET CAPTURE command's ability to read multiple tables. Assume that we are using the same Microsoft Access database above, but suppose you want to read two variables from Table1, V1 and V2, and two variables from Table2, V5 and V6. Furthermore, you are only interested in the first five cases for each table. This time you would use the following steps: 

1.      From the main menu, select File->New->Syntax.

2.      In the Syntax window, type the following commands:

       GET CAPTURE ODBC
       /CONNECT='DSN=temp_mdb'
       /SELECT Table1.V1, Table1.V2, Table2.V5, Table2.V6 FROM 
       Table1, Table2 WHERE Table1.ID=Table2.ID AND 
       Table1.ID < 5.
       EXECUTE.

3.      Highlight these commands by either using the mouse to click-and-drag over the text or choose Edit->Select All from the main menu.

4.      Click the RUN button (a right-pointing triangle on the button bar) or choose Run->Selection from the main menu.

The major difference between this and the previous GET CAPTURE examples is the use of the SELECT and WHERE SQL subcommands. Also, if you look at these subcommands, you will see that we refer to each variable in the multiple-table Access database by a two-level name composed of the table name and the variable name, so Table1.V1, for example, refers to the V1 in Table1. You also have to list all of the tables from which you are drawing data with the FROM parameter. Finally, the WHERE subcommand is used to synchronize the index/sort variable (ID) in each table and the AND conditional statement is used to specify which cases you wish to read. This relatively simple example of how the GET CAPTURE command can be used to read multidimensional data files demonstrates some of the many options you have when reading ODBC data sources into SPSS.


Next: Conclusion
Prev: Setting Up ODBC Data Sources
Up: Contents