J. Scott Long - Indiana University
Departments of Sociology and Statistics
Categorical Data Analysis at IU: Stat 503 and Soc 650

Fall 2011 - updated October 27, 2011

Stat503 and Soc650 is a second course in applied statistics. The prerequisite is a course (e.g., Soc 554, Stat 501) in regression models for continuous dependent variables such as linear regression, seemingly unrelated regressions, and systems of simultaneous equations. Categorical Data Analysis deals with regression models in which the dependent variable is categorical. Such models include probit and logit for binary outcomes, ordered logit and ordered probit for ordinal outcomes, multinomial logit for nominal outcomes, and Poisson regression and related models for count outcomes. This web page serves as the syllabus for the class.

Assignment due dates · Assignments · Workflow requirements · Codebooks ·

Help ·Stata stuff · Grades and getting an A+ · LAN and batch files · Computing · Books ·

Enrolling & time conflicts · FTP downloads ·

News and policies

  • Starting 2011-10-27, IU students can run Stata for free over the internet. For details go here. This is a preproduction service and there are a few tweaks you have to do to let it see your local machine. 27Oct2011
  • Assignments 5 and 6 on the BRM: These assignments are available on the FTP site. Before doing too much work on Assignment 6, make sure that you understand any comments made on your graded Assignment 5 (which should be returned during lab on October 6). 29Sep2011
  • Change to Thursday office hours: Office hours on Tuesday will be 10-12. Thursday office hours will be 11AM-12:45PM. While I will often be available at 10AM on Thursdays, check with me to be sure. My office is Ballantine 842B, which is directly across from the elevator. Enter 842 (no need to knock) and my office is at the end of the hall. If I'm talking with someone, please let me know you are waiting. 20Sep2011
  • Turning in assignments: Assignments are to be turned in 30 minutes before the end of lab on the due date. Pedagogically it is critical to complete assignments on time. Late assignments are penalized 25%. If there are special circumstances, you should get written approval for a late assignment from Professor Long; turn in e-mail approval along with your late assignment. 27Aug2011
  • LAN access: All students in the registrar's enrollment list as of 2011-09-01 08:32 have been given R/W permission to the course LAN. If you enrolled late, contact sochelp at indiana dot edu (cc jslong at indiana dot edu) and request R/W access to the LAN for Stat 503/Soc 650. Contact the TA for your lab for further help.
  • Lectures: Lectures are Tu/Th 1:00 to 2:15. Please arrive on time and be ready to start at 1:00. Thanks! 27Aug2011
  • Teaching Assistants: The AIs for Fall 2011 are Maria Kaylen (mkaylen at indiana dot edu), Nate Birkhead (nbirkhea at indiana dot edu), and Mike Vasseur (mvasseur at indiana dot edu) 27Aug2011
  • Computing labs: You have signed up for one of three computing lab sections. Each section meets twice a week for two hours. The lab instructors might provide a short presentation or discuss the assignments at the start of each lab. Teaching assistants will be available for the first 90 minutes each day; they will not be available the last 30 minutes of lab. Mike Vasseur's lab is Tu/Th 2:30-4:30 SB 230. Nate Birkhead's lab is Tu/Th 4:00-6:00 GY 226. Maria Kaylen's lab is Tu/Th 6:00-8:00 BH 107. 27Aug2011
  • SugarSync and Dropbox: Dropbox is a very handy cloud utility for moving my own files as well as for sharing with collaborators. While Dropbox is easier for some things, I find SugarSync to be more powerful and easier for many things; it is less expensive if you want to buy extra storage. 27Aug2011
  • CLASSPAK: If the bookstore doesn't have copies of the ClassPak, go to the textbook register at the IU Bookstore and purchase a voucher. They will have a copy of the ClassPak by 3PM the next day. If you have problems, e-mail Kathy Parker (cparker@indiana.edu). 27Aug2011

Due dates - assignments are due 30 minutes before lab ends

Assignments in the form of Word files are available here. You should add your answers to the assignment file and place the renamed, completed file in your LAN directory. Details are provided in lab.

  • Assignment 1: Math review. Due day 3, September 6, 2011.
  • Assignment 2: Picking your variables. Due day 4, September 8, 2011.
  • Assignment 3: Data cleaning. Due day 6, Thursday, September 15, 2011.
  • Assignment 4: LRM. Due day 9, Tuesday, September 27, 2011.
  • Assignment 5: BRM-1. Due day 11, October 4, 2011.
  • Assignment 6: BRM-2. Due day 14, October 13, 2011.
  • Assignment 7: Testing and Fit. Due day 17, October 25, 2011.
  • Assignment 8: margins command. Due day 20, November 3, 2011.
  • Assignment 9: ORM. Due day 25, November 22, 2011 (Tuesday before break).
  • Assignment 10: MNLM. Due day 28, December 6, 2011.
  • Assignment 11: Count models. Due by 5PM Tuesday of finals week, December 13, 2011.

Workflow requirements for assignments

An essential part of being an effective researcher is to develop an organized workflow that allows you to organize your efforts and later replicate the results you have already completed. Since this class is an applied course in data analysis, a portion of your grade is based on the workflow you use in completing your assignments. More general information and a detailed treatment of workflow is available at Long’s workflow page and his book The Workflow of Data Analysis Using Stata. For this class you are not required to implement the full workflow from the book, but you are encouraged to improve your workflow as your time and interest allow. Early in the fall Long will present an overview of workflow in a Workshop in Methods presentation.

Requirements for Stat 503/Soc 650

  • 1. Keep a research log: A research log is a record of your progress similar to a journal. In your research log you should record your work on each assignment. An example is posted to the course LAN to show you what your log might look like. (Note that the research log is distinct from the Stata log file that you create with log using dofilename, text replace.)
  • 2. Have an organized file structure: Organizing files so you can find your work and know what has been finished is essential for an effective workflow. The file structure you are to use in this class can be created using the batch file cda_workflow.bat. Copy this file to your course folder or directory and run it to create the directories for the course. Three types of directories are created: work directory, support directory, and assignments directory. The work directory is where your current work goes (e.g., assignments you have not finished). The support directory contains examples or other files that are not critical to completing your assignment but support your work. You are not graded on the organization of your support directory. Your assignments directory holds completed assignments. Your work on an assignment should be moved to this directory when it is completed (see #3 below).
  • 3. “Post” files that are done (and do not change them!): A fundamental task of workflow is keeping track of which work is finished and never changing completed work. You can do new work, but you cannot change work that is completed. Facilitating this is a process of “posting” files. When work is completed it should be moved from your work directory and posted to the appropriate assignment directory. Posted files should never be changed. If you need to change something that has been posted, create a new do-file that creates a new log file (see #4 below). Before you turn in an assignment, all associated files must be posted.
  • 4. Follow file naming conventions: Cheap storage makes accumulating a massive number of files too easy. To keep track of these files you must use standardized names. File names should use the following format, where you replace the bracketed values with the appropriate information:
            [your initials]-503-a[assignment #]-[step #]-[task].[extension]
    For example, my work for categorical data analysis might include:
            mrv-503-a2-01-create-variables.log
    This is the log file from Mike R. Vasseur (mrv), for Stat 503 (503), assignment 2 (a2), the first program for the assignment (01), in which you are creating variables (create-variables). Or,
            mrv-503-a4-02-lrm-analysis.do
    is the second do-file for assignment 4 where Mike is running an LRM. No two files should have the same name. If you post a file and need to make a change, the revised file should add a version number (e.g., mrv-503-a4-02v2-lrm-analysis.do). This way you have a full record of what you have done in the past and need not guess as to which file was the last one you used.
  • 5. Use legible and robust do-files: You should be able to run do-files on another computer without making changes. At a minimum this means that you should create a separate profile.do to avoid directly coding directory structures in your do-files. Your do-files need to be clear and easy to understand. Look at the examples on the LAN and model your do-files after those. Or use the examples from the lecture do-files or the lab guide do-files as examples. See the Workflow book for further details.
  • 6. Showing and printing Stata output: When you reproduce the results from a Stata log file you must show it in a fixed font (e.g., Courier not Times Roman), in a small enough font size so that the lines do not wrap. Courier 9 often works well.
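
As a rough illustration of requirements 4 and 5, a do-file that follows these conventions might look like the sketch below. It assumes that the working directory has already been set (for example by profile.do), uses the auto dataset that ships with Stata as a stand-in for a course dataset, and invents the variable highmpg purely for illustration. The version number is also an assumption; use the release you actually run.

```stata
capture log close
log using "mrv-503-a2-01-create-variables", text replace

// mrv-503-a2-01-create-variables.do
// Mike R. Vasseur - Stat 503, assignment 2, step 01: create variables

version 11          // assumption: change to match your Stata release
clear all
set more off

// #1: load data from the working directory (no hard-coded paths)
// stand-in dataset shipped with Stata, not a course dataset
sysuse auto, clear

// #2: create and label a binary variable (hypothetical example)
generate byte highmpg = mpg > 25 if !missing(mpg)
label variable highmpg "Mileage above 25 mpg?"

// #3: check the new variable against its source
tabulate highmpg, missing
summarize mpg if highmpg == 1

log close
exit
```

Because the log file takes its name from the do-file, the posted .do and .log files sort together in the assignment directory.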

Grades and getting an A+

  • Overview: Grades are based on computer assignments, your research log, non-computer assignments, attendance, and in-class assignments. Each of these assignments is given a number of points, for a total of about 900 points. The grade for this work is determined by adding up the points and dividing by the total possible, then: 94-100% = A; 91-93% = A-; 88-90% = B+; 84-87% = B; 81-83% = B-; etc.
  • Mistakes and inconsistencies: If a mistake is made in grading, I apologize. Return the assignment to me along with a cover page explaining the error. If I do not return the assignment documenting the change within two class periods, please remind me by e-mail. Multiple people are doing the grading and we try to be consistent, but there is bound to be some variation (as in submitting a paper to a journal for review!).
  • Getting an A+: To get an A+ you must do a project as well as receive an A on the required assignments (a really great project might lead to an A+ if your other work is an A-). You must get Professor Long's approval for your project and meet with him periodically. The final project is due the first day of finals. A careful write-up of your results along with the supporting Stata files and research log are required. Here are some possible projects:
    1. Find a published example of a regression model for categorical outcomes. Using the original data or similar data (e.g., something at ICPSR), reproduce all or part of the analysis. Show how the author obtained the results. Show what you would do to improve the analysis.

    2. Estimate and interpret a substantively interesting and reasonable model in which you incorporate complexities on the RHS of the model (e.g., interactions, nonlinearities, group differences).

    3. Compare a series of substantively interesting and reasonable nested models. Compare results across models using appropriate methods that deal with the identification issue.

    4. Compare results from a substantively interesting model using both margins (partial changes or discrete changes) computed at the mean or some appropriate value with those computed by taking the mean of the margins over the sample. How do the results compare? Which do you prefer? Why?

    5. Do the mathematical exercises in the Sage book.

    6. Consider the Stata blog for August 22, 2011 by Bill Gould. Compare the two models using the income data from John Fox used in the class lecture on regression. Did Professor Long compute the predictions in the log-linear model correctly? Does -margins- do it correctly? How do the results compare to those using -poisson-?

    7. Make a proposal for something you find interesting.
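
For project 4, the two kinds of marginal effects can be requested with the margins command roughly as follows. This is only a sketch: it uses the auto dataset that ships with Stata as a stand-in (it is not one of the course datasets), and the model is chosen for convenience rather than substantive interest. It requires Stata 11 or later, where margins is available.

```stata
// stand-in data shipped with Stata; replace with your own data and model
sysuse auto, clear

// binary logit: foreign (0/1) on mileage and weight
logit foreign mpg weight, nolog

// marginal effects evaluated at the means of the regressors
margins, dydx(*) atmeans

// average marginal effects: the effect is computed for each
// observation and then averaged over the estimation sample
margins, dydx(*)
```

The two sets of estimates generally differ in a nonlinear model because the effect of a variable depends on where in the data it is evaluated.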

Batch files - linking to the LAN and other things

  • A batch file to link to the class LAN: Click here.

Stata stuff

  • Getting Started with Stata: This document is great if you are unfamiliar with Stata or a bit rusty. Associated Stata files are here. 27Aug2011
  • CDA: Lab Guide for Stata: The guide is here. Associated do-files and exercises are here. I strongly recommend that you work through the section of the guide corresponding to the current assignment before you start the assignment, even if you are sure you don't need to! 27Aug2011
  • Lecture Examples: The do-files, log files, and datasets for examples in the lectures are here as zipped files corresponding to each chapter. To save space, the graph files are not included but you can create them yourself by running the do-files. 27Aug2011

Datasets

You can use the following datasets for assignments. You can ftp them from here or in Stata load them with the spex command (e.g., spex gss9098extract). Codebooks are here.

  • gss9098extract.dta: General Social Survey 1990-1998.
  • hsb.dta: High School and Beyond Study 1983.
  • nes3.dta: 1992 National Election Study.
  • science3.dta: data on the careers of biochemists.
  • wls.dta: Wisconsin Longitudinal Survey data on Wisconsin high school graduates.
  • addhealth3.dta: Add Health data.
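
For instance, a quick first look at one of these datasets might go as follows. This is a minimal sketch that assumes the SPost commands (which provide spex) are installed and that spex can locate the file; describe and codebook are standard Stata commands.

```stata
// load one of the course datasets by name
spex science3, clear

// number of observations and variables
describe, short

// one line per variable: obs, unique values, mean, min, max, label
codebook, compact
```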

Getting help

If you need help debugging a program, the best thing is to place relevant files in your directory on the LAN in a subdirectory called \helpme (e.g., \jslong\helpme). Include the do-file, the dataset, and log file in text format, not smcl. Please follow the guidelines below or it is much less likely that you will get a quick and helpful answer. For further details on getting help, check here.

1) The do-file must be self-contained. It must load the data, create needed variables (if any), generate the problem, and save a log file in text format. The do-file must have comments explaining what you are doing and what the problem is.

2) If a SPost command is causing a problem, include the command which command-name for the command causing the problem. This tells me which version of the command you are using.

3) Do not refer to specific directories (e.g., do not: use d:\mydata\science3.dta). Assume that your data is located in your Stata working directory.

Here is an example of what the do-file might look like:

capture log close
log using jslong_assgn1_problem, text replace


// Scott Long - 2011-08-31
// Assignment 4: binary regression
// ERROR: see #2 below.

// #1: load data and check data
spex science2, clear
tab y
sum x1 x2

// #2: estimate logit
// ERROR: variable xl not found
logit y xl x2, nolog

// #3: compute discrete change
which prchange
prchange, x(x1=1 x2=3)

log close
exit

Books

  1. ClassPak - be sure to bring this the first day of class. It includes lecture notes and reprints. Required.
  2. Long, J. Scott and Freese, Jeremy. 2005. Regression Models for Categorical Dependent Variables Using Stata, 2nd Edition. Stata Press: College Station, TX. If you have the “revised” edition, you do not need to buy the 2nd edition. Recommended but not required.
  3. Long, J.S. 2008, The Workflow of Data Analysis Using Stata. Stata Press: College Station, TX. If you plan to do a lot of data analysis, this book will save you a lot of time and make your work replicable. Recommended but not required.
  4. Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Required and especially useful for those who are interested in mathematical details. Recommended but not required

Files to download

Most materials other than the course notes (available at TIS or the Campus Bookstore) can be downloaded here. Files will be added throughout the semester.

  • ftp site for CDAiu

Computing

If you want to install the ado files needed for this class, follow this link. You will also find sample programs and data sets at that location. While you may freely use my ado files, they require Stata to run. Stata is installed in campus computing labs. Personal copies can be purchased from the IU Stat/Math Center.

Enrolling in Soc 650/Stat 503 and Time Conflicts

Often there are more students who want to take S650 than there are seats in the class. First priority is given to graduate students for whom this class is required for their degree program. Otherwise, authorizations are given on a first-come, first-served basis. If you are interested in taking the class, contact the graduate secretary in sociology to get on the list. The graduate secretary (socgrad@indiana.edu) will contact you regarding authorization for the class. If you are given an authorization, you need to sign up for the class during the normal enrollment period; if you do not, your authorization will be given to the next student on the wait list.

Time conflicts: If you have another class that overlaps with the lecture time, you will need to take the class another semester. If you have a time conflict with all of the lab times, you should take the class some other semester. If you can attend some of the labs each week and you are already familiar with Stata (or can learn it on your own), you will probably do fine, but might have to work harder. While most of the lab time is used for students doing independent work, the teaching assistant provides short lectures related to the assignments. For example, he/she might provide additional information about keeping a research log or how to format tables using Word.

Getting ready for Soc650/Stat503

There are several things you can do to get ready for the class.

  1. Review a book on the linear regression model.
  2. If you are rusty on mathematics, you can review the materials in this file.
  3. Feel free to start reading the main text, which is listed in the syllabus.
© 2010 J. Scott Long