J. Scott Long - Indiana University
Department of Sociology | Department of Statistics
Categorical Data Analysis at IU: Stat 503 and Soc 650

Fall 2012: August 21, 2012-December 14, 2012

Stat503 / Soc650 is a second course in applied statistics. The prerequisite is a course in regression models for continuous dependent variables such as Soc 554 or Stat 501. Categorical Data Analysis deals with regression models in which the dependent variable is categorical: binary, nominal, ordinal, or count. Models that are discussed include probit and logit for binary outcomes, ordered logit and ordered probit for ordinal outcomes, multinomial logit for nominal outcomes, and Poisson regression and related models for count outcomes. This web page serves as the syllabus for the class.

News and policies

  • Assignment 9: ORM. Due Nov 29.
  • Logistics: Class meets 1:00 - 2:15 TR in BH244. Please arrive on time and be ready to start at 1:00. Labs are held in SB230 2:30-4:30 TR and SB221 5-7 TR. My office is Ballantine 842B which is directly across from the elevator. Enter 842 (no need to knock) and my office is at the end of the hall. If I'm talking with someone, please let me know you are waiting. Office hours will be set the first week of the semester. Feel free to contact me by e-mail; during the week if you don't hear within 12 hours, try again. 2012-08-17
  • CLASSPAK: If the bookstore doesn't have copies of the ClassPak, go to the textbook register at the IU Bookstore and purchase a voucher. They will have a copy of the ClassPak by 3PM the next day. If you have problems, e-mail Kathy Parker cparker@indiana.edu. 2012-08-17
  • Turning in assignments: Assignments are due 30 minutes before the end of your lab on the due date. Pedagogically it is critical to complete assignments on time. Late assignments are penalized 25%. If there are special circumstances, get e-mail approval from Professor Long; turn in the e-mail approval along with your late assignment. 2012-08-17
  • LAN access: Students in the registrar's enrollment list will be given R/W permission to the course LAN. If you enrolled late, contact sochelp at indiana dot edu (cc jslong at indiana dot edu) and request R/W access to the LAN for Stat 503/Soc 650. Contact the TA for your lab for further help. The batch file to access the LAN is here. 2012-08-17
  • Teaching Assistants: The AIs are Mike Vasseur (mvasseur at indiana dot edu), Tom VanHeuvelen (tvanheuv at indiana dot edu), and Nichole Bauer (nmbauer at umail iu edu). 2012-08-17
  • Computing labs: You have signed up for one of two lab sections. Each section meets twice a week for two hours. The lab instructors might provide a short presentation or discuss the assignments at the start of each lab. Teaching assistants will be available for 90 minutes each day, but might not be available the last 30 minutes of lab. 2012-08-17
  • SugarSync and Dropbox: Dropbox is a handy cloud utility for saving and sharing files. While Dropbox is easier for some things, I find SugarSync to be more powerful and easier for many things; it is less expensive if you want to buy extra storage. Indiana University now provides free storage with Box. 2012-08-17
  • IU students can run Stata for free over the internet. For details go here. Opinions vary on how well this works. 2012-08-17

Due dates - assignments are due 30 minutes before lab ends

Assignments in the form of Word files will be available here. Add your answers to the assignment file and place the renamed, completed file in your LAN directory. Details are provided in lab.

  • Assignment 1: Math review. Due day 3, Aug 28.
  • Assignment 2: Picking your variables. Due day 4, Aug 30.
  • Assignment 3: Data cleaning. Due day 6, Sep 6.
  • Assignment 4: LRM. Due day 9, Sep 18. (corrected date)
  • Assignment 5: BRM-1. Due day 11, Sep 25.
  • Assignment 6: BRM-2. Due day 15, Oct 9. (CHANGED!)
  • Assignment 7: Testing and Fit. Due day 17, Oct 16.
  • Assignment 8: MNLM. Part 1: Due day 22, Nov 1. Part 2: Due Nov 13.
  • Assignment 9: ORM. Due Nov 29.
  • Assignment 10: Count models. Due 5pm on T of exam week, Dec 12.

Workflow requirements for assignments

An essential part of being an effective researcher is a workflow that allows you to organize your efforts and later replicate the results you have already completed. Since this class is an applied course in data analysis, a portion of your grade is based on the workflow you use in completing your assignments. More general information and a detailed treatment of workflow are available at Long’s workflow page and in his book The Workflow of Data Analysis Using Stata. For this class you are not required to implement the full workflow from the book, but you are encouraged to improve your workflow as your time and interest allow. Early in the fall Long will present an overview of workflow in a Workshop in Methods presentation. Here is additional information.

Requirements for Stat 503/Soc 650

  • 1. Keep a research log: A research log is a record of your progress similar to a journal. In your research log you should record your work on each assignment. An example is posted to the course LAN to show you what your log might look like. (Note that the research log is distinct from the Stata log file that you create with log using dofilename, text replace.)
  • 2. Have an organized file structure: Organizing files so you can find your work and know what has been finished is essential for an effective workflow. The file structure you are to use in this class can be created using the batch file cda_workflow.bat. Copy this file to your course folder or directory and run it to create the directories for the course. Three types of directories are created: work directory, support directory, and assignments directory. The work directory is where your current work goes (e.g., assignments you have not finished). The support directory contains examples or other files that are not critical to completing your assignment but support your work. You are not graded on the organization of your support directory. Your assignments directory holds completed assignments. Your work on an assignment should be moved to this directory when it is completed (see #3 below).
  • 3. “Post” files that are done (and do not change them!): A fundamental task of workflow is keeping track of which work is finished and never changing completed work. You can do new work, but you cannot change work that is completed. Facilitating this is a process of “posting” files. When work is completed it should be moved from your work directory and posted to the appropriate assignment directory. Posted files should never be changed. If you need to change something that has been posted, create a new do-file that creates a new log file (see #4 below). Before you turn in an assignment, all associated files must be posted.
  • 4. Follow file naming conventions: Cheap storage makes accumulating a massive number of files too easy. To keep track of these files you must use standardized names. File names should use the following format, where you replace the bracketed values with the appropriate information:
            [your initials]-503-a[assignment #]-[step #]-[task].[extension]
    For example, my work for categorical data analysis might include:
            mrv-503-a2-01-create-variables.log
    This is the log file from Mike R. Vasseur (mrv), for Stat 503 (503), assignment 2 (a2), the first program for the assignment (01), in which you are creating variables (create-variables). Or,
            mrv-503-a4-02-lrm-analysis.do
    is the second do-file for assignment 4, where Mike is running an LRM. No two files should have the same name. If you post a file and need to make a change, the revised file should add a version number (e.g., mrv-503-a4-02v2-lrm-analysis.do). This way you have a full record of what you have done in the past and need not guess as to which file was the last one you used.
  • 5. Use legible and robust do-files: You should be able to run do-files on another computer without making changes. At a minimum this means that you should create a separate profile.do to avoid directly coding directory structures in your do-files. Your do-files need to be clear and easy to understand. Model your do-files after the examples on the LAN, the lecture do-files, or the lab guide do-files. See the Workflow book for further details.
  • 6. Showing and printing Stata output: When you reproduce the results from a Stata log file you must show it in a fixed font (e.g., Courier not Times Roman), in a small enough font size so that the lines do not wrap. Courier 9 often works well.
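
Requirements 4 and 5 can be sketched together in one short do-file. This is a minimal illustration, not a required template: the file name reuses the hypothetical mrv example above, auto.dta (which ships with Stata) stands in for a course dataset, and the version number is an assumption to adjust for your installation. It assumes your working directory has already been set outside the do-file (for example by a cd command in profile.do), so no directory names appear in the code.

```stata
// mrv-503-a4-02-lrm-analysis.do
// Hypothetical file name following [initials]-503-a[#]-[step #]-[task]
capture log close
local pgm "mrv-503-a4-02-lrm-analysis"
// The log file takes the same name as the do-file, in text format
log using "`pgm'", text replace

// Assumption: Stata 12; change this to match your release
version 12
clear all
// Keep log lines short enough that they do not wrap when printed
set linesize 80

// #1: load data from the working directory (no hard-coded path)
// auto.dta ships with Stata and is only a stand-in for course data
sysuse auto, clear
summarize price mpg weight

// #2: estimate a linear regression model
regress price mpg weight

log close
exit
```

Because the log name is built from the same local macro as the do-file name, a revised version (for example 02v2) needs a change in only one place.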

Grades and getting an A+

  • Overview: Grades are based on computer assignments, your research log, non-computer assignments, attendance, and in-class assignments. Each of these is assigned a number of points, for a total of about 900 points. Your grade is determined by adding up your points and dividing by the total possible: A = 94-100%; A- = 91-93%; B+ = 88-90%; B = 84-87%; B- = 81-83%; etc.
  • Mistakes and inconsistencies: If a mistake is made in grading, I apologize. Return the assignment to me along with a cover page explaining the error. If I do not return the assignment documenting the change within two class periods, please remind me by e-mail. Multiple people are doing the grading and we try to be consistent, but there is bound to be some variation (as in submitting a paper to a journal for review!).
  • Getting an A+: To get an A+ you must do a project as well as receive an A on the required assignments (a really great project might lead to an A+ if your other work is an A-). You must get Professor Long's approval for your project and meet with him periodically. The final project is due the first day of finals. A careful write-up of your results along with the supporting Stata files and research log are required. Here are some possible projects:
    1. Find a published example of a regression model for categorical outcomes. Using the original data or similar data (e.g., something at ICPSR), reproduce all or part of the analysis. Show how the author obtained the results. Show what you would do to improve the analysis.
    2. Estimate and interpret a substantively interesting and reasonable model in which you incorporate complexities on the RHS of the model (e.g., interactions, nonlinearities, group differences).
    3. Compare a series of substantively interesting and reasonable nested models. Compare results across models using appropriate methods that deal with the identification issue.
    4. Compare results from a substantively interesting model using both margins (partial changes or discrete changes) computed at the mean or some appropriate value with those computed by taking the mean of the margins over the sample. How do the results compare? Which do you prefer? Why?
    5. Do the mathematical exercises in the Sage book.
    6. Reproduce your results from the assignments by using the -margins- command and factor variables.
    7. Make a proposal for something you find interesting.
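
The comparison described in project 4 can be sketched with the margins command (available in Stata 11 and later). The example below is only an illustration under stated assumptions: it uses auto.dta, which ships with Stata, rather than a course dataset, and the model has no substantive motivation.

```stata
// auto.dta ships with Stata; it is a stand-in, not a course dataset
sysuse auto, clear
logit foreign mpg weight, nolog

// Marginal effects with all regressors held at their sample means
margins, dydx(mpg weight) atmeans

// Average marginal effects: computed for each observation,
// then averaged over the estimation sample
margins, dydx(mpg weight)
```

The two sets of estimates generally differ in a nonlinear model such as logit, which is the point of the comparison.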

Batch files - linking to the LAN and other things

  • A batch file to link to the class LAN: Click here.

Stata stuff

  • Downloading files: With Stata running on a computer that is connected to the internet, type findit cda, then click the link to what you need.
  • Getting Started with Stata: This document is great if you are unfamiliar with Stata or a bit rusty.
  • CDA: Lab Guide for Stata: The guide is here. I strongly recommend that you work through the section of the guide corresponding to the current assignment before you start the assignment, even if you are sure you don't need to!
  • Lecture Examples: The do-files, log files, and datasets for examples in the lectures are available in the cdalectures package.

Datasets

You can use the following datasets for assignments. You can ftp them from here or in Stata load them with the spex command (e.g., spex gss9098extract). Codebooks are here or can be downloaded with the cdacodebooks package.

  • gss9098extract.dta: General Social Survey 1990-1998.
  • hsb.dta: High School and Beyond Study 1983.
  • nes3.dta: 1992 National Election Study.
  • science3.dta: data on the careers of biochemists.
  • wls.dta: Wisconsin Longitudinal Survey data on Wisconsin high school graduates.
  • addhealth3.dta: Add Health data.
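
As a quick check after loading one of these datasets, something like the following can be run. It is a sketch that assumes the SPost ado-files that provide spex are already installed; science3 is used only because it appears in the list above.

```stata
// Assumption: the SPost ado-files that provide spex are installed
spex science3, clear

// List variable names, storage types, and labels
describe

// One line per variable: observations, unique values, mean, min, max
codebook, compact
```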

Getting help

If you need help debugging a program, the best thing is to place relevant files in your directory on the LAN in a subdirectory called \helpme (e.g., \jslong\helpme). Include the do-file, the dataset, and log file in text format, not smcl. Please follow the guidelines below or it is much less likely that you will get a quick and helpful answer. For further details on getting help, check here.

1) The do-file must be self-contained. It must load the data, create needed variables (if any), generate the problem, and save a log file in text format. The do-file must have comments explaining what you are doing and what the problem is.

2) If an SPost command is causing a problem, include the line which command-name in your do-file (for example, which prchange). This tells me which version of the command you are using.

3) Do not refer to specific directories (e.g., do not: use d:\mydata\science3.dta). Assume that your data is located in your Stata working directory.

Here is an example of what the do-file might look like:

capture log close
log using jslong_assgn1_problem, text replace


// Scott Long - 2011-08-31
// Assignment 4: binary regression
// ERROR: see #3 below.

// #1: load data and check data
spex science2, clear
tab y
sum x1 x2

// #2: estimate logit
logit y xl x2, nolog

// #3: compute discrete change
// ERROR: variable xl not found
which prchange
prchange, x(x1=1 x2=3)

log close
exit

Books

  1. ClassPak - be sure to bring this the first day of class. It includes lecture notes and reprints. Required.
  2. Long, J. Scott and Freese, Jeremy. 2005. Regression Models for Categorical Dependent Variables Using Stata, 2nd Edition. Stata Press: College Station, TX. If you have the “revised” edition, you do not need to buy the 2nd edition. Recommended but not required.
  3. Long, J. Scott. 2008. The Workflow of Data Analysis Using Stata. Stata Press: College Station, TX. If you plan to do a lot of data analysis, this book will save you a lot of time and make your work replicable. Recommended but not required.
  4. Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Especially useful for those who are interested in mathematical details. Recommended but not required.

Files to download

Most materials other than the course notes (available at TIS or the Campus Bookstore) can be downloaded here. Files will be added throughout the semester.

  • ftp site for CDAiu

Computing

If you want to install the ado files needed for this class, follow this link. You will also find sample programs and data sets at that location. While you may freely use my ado files, they require Stata to run. Stata is installed in campus computing labs. Personal copies can be purchased from the IU Stat/Math Center.

Enrolling in Soc 650/Stat 503 and Time Conflicts

Often there are more students who want to take S650 than there are seats in the class. First priority is given to graduate students for whom this class is required for their degree program. Otherwise, authorizations are given on a first-come, first-served basis. If you are interested in taking the class, contact the graduate secretary in sociology to get on the list. The graduate secretary (socgrad@indiana.edu) will contact you regarding authorization for the class. If you are given an authorization, you need to sign up for the class during the normal enrollment period; if you do not, your authorization will be given to the next student on the wait list.

Time conflicts: If you have another class that overlaps with the lecture time, you will need to take the class another semester. If you have a time conflict with all of the lab times, you should also take the class some other semester. If you can attend some of the labs each week and you are already familiar with Stata (or can learn it on your own), you will probably do fine, but might have to work harder. While most of the lab time is used for students doing independent work, the teaching assistants provide short lectures related to the assignments. For example, they might provide additional information about keeping a research log or how to format tables using Word.

Getting ready for Soc650/Stat503

There are several things you can do to get ready for the class.

  1. Review a book on the linear regression model.
  2. If you are rusty on mathematics, you can review the materials in this file.
  3. Feel free to start reading the books listed above.
© 2013 J. Scott Long