Y750 | 5508 | Dr. Joanne Peng

Course Comments

1. The objectives of the course are (1) to gain an understanding of some basic features of logistic regression modeling, and (2) to develop a coherent approach for evaluating the applications of logistic regression modeling in your field of study.

2. Instruction will consist of lectures, discussions, computer exercises, and student presentations.

3. All students will be expected to complete three projects. The projects are assigned by the instructor with the aim of facilitating students' understanding of various issues surrounding logistic regression. The third and final project will be an in-depth investigation of one or more issues about logistic regression which the student finds intriguing. For this third project, each student will be given an opportunity to orally present his/her design and findings in class.

4. The primary prerequisite for this course is the equivalent of a second course in applied statistics which covered ordinary least squares regression models. An aptitude for mathematical analysis and SAS programming is beneficial. Deficiencies in particular areas will be remedied in tutorial sessions.

5. Required texts for the course are (a) SAS Institute's Logistic Regression: Examples using the SAS system (SAS Institute, 1995), and (b) Hosmer and Lemeshow's Applied Logistic Regression (John Wiley and Sons, 2001). Suggested references are (i) Kleinbaum's Logistic Regression (Springer, 1994), and (ii) Menard's Applied Logistic Regression Analysis (Sage Publication #106, 1992). Additional readings are found in a course packet (required) or will be distributed in class.

Schedule

Lecture / Topic / From text (b)
1 / Introduction to logistic modeling / Section 1.1
2 / Fitting the simplest logistic regression model / Ch. 1
3 / Fitting multiple logistic regression model / Ch. 2
4, 5, 6, 7 / Interpreting coefficients of logistic regression models / Ch. 3
8 & 9 / Model building strategies and methods / Ch. 4
10 & 11 / Assessing the fit of the model / Ch. 5
12, 13, 14 / Logistic regression for matched case-control studies / Ch. 7
Advanced topic: Polytomous logistic regression / Ch. 8.1, or Choosing between logistic regression and discriminant analysis
15 / Student presentations (12/10, 8:30 to 12:15 pm) / The 3rd (final) paper is due in class.

Note: In order for you to devote sufficient time to the final project, it is best that the formal lectures/classes end on November 19th (a week before the Thanksgiving recess). Therefore, I'd like to suggest that we meet twice a week for the next two weeks, say on Monday or Tuesday evenings. This arrangement will ensure that we cover essential materials in time for you to carry out a meaningful final project.

Grading System

The final course grade will be a composite of the grades assigned to three papers which students are expected to complete throughout the semester. Equal weights will be applied to the three paper grades in determining the composite. Instructions on writing each paper, as well as the criteria for determining paper grades, will be announced in class. Incompletes will be given only for a legitimate reason as outlined in the university's Academic Guide, and only after a conference between the instructor and the student.

Throughout the course, you may contest any grade awarded to your papers or to your overall course performance within 48 hours of receiving that grade. Once this "statute of limitations" has passed, it is assumed that you willingly accept the grade(s) assigned without further dispute.

Labor Sharing System

Each participant is encouraged to contribute to the overall learning of the class by taking responsibility for 3-4 tasks throughout the semester. These tasks may include (1) solving an assigned problem from the text, (2) performing computer analysis of data, (3) leading discussions of assigned articles, (4) surfing the internet for information or resources on logistic regression, etc. The sharing of these responsibilities is on a voluntary basis.
Academic Honesty and Intellectual Integrity

According to p. 72 of the Academic Handbook (June 1992 edition), each faculty member has "a responsibility to foster the intellectual honesty as well as the intellectual development of his/her students." In order to achieve these goals, each student enrolled in this course is prohibited from engaging in any form of "cheating" or "plagiarism."

Cheating is defined as "dishonesty of any kind with respect to examination, course assignments, alteration of records, or illegal possession of examinations" (p. 72 of the Academic Handbook). "It is the responsibility of the student not only to abstain from cheating but, in addition, to avoid the appearance of cheating and to guard against making it possible for others to cheat. Any student who helps another student to cheat is as guilty of cheating as the student he or she assists. The student also should do everything possible to induce respect for the examining process and for honesty in the performance of assigned tasks in or out of class." (p. 72 of the Academic Handbook).

Plagiarism is defined as "offering the work of someone else as one's own" (p. 72 of the Academic Handbook). "The language or ideas thus taken from another may range from isolated formulas, sentences, or paragraphs to entire articles copied from books, periodicals, speeches, or the writings of other students. The offering of materials assembled or collected by others in the form of projects or collections without acknowledgment also is considered plagiarism. Any student who fails to give credit for ideas or materials taken from another source is guilty of plagiarism." (p. 72 of the Academic Handbook).

Evidence of student academic misconduct will result in (a) a lowered course grade, (b) transfer out of this course, (c) dismissal from the student's academic unit, or (d) other disciplinary actions in accordance with the guidelines outlined on p. 73 of the Academic Handbook.
Example #1 of First Take-Home Project in Y750 for XXXXX (SS#)
Deadline: XXX, 2001

PROJECT EXERCISES:
1. Given the data below, fit a logistic regression model so that a county's ability to compete for at least one construction grant is predicted from its share of the state population, population density, population change (in percent), median household income, its share of state minorities, and percent of residents living in poverty.
2. Discuss the overall significance of the model, the significance of each predictor (or covariate), and the interpretation of any design variable contained in the model.
3. Can you improve upon this full regression model by removing any predictor that is not significantly related to a county's record of receiving at least one construction grant? If so, what is the new, more efficient regression model?
4. Compare the two models obtained from (1) and (3) above in terms of the overall fit, the significance of predictors, the interpretability of the model, the predictive power of each model, etc.

DATA DESCRIPTION:
The data set is composed of two data sources: one is the county-wide demographic data and the other is the Consolidated Federal Funds Report (Bureau of the Census, 1993). The population under study included all counties in the 48 contiguous states during the period 1983-1992 (N=3110). The objective of the data collection was to determine whether disadvantaged communities have the same access to environmental infrastructure as other communities. Disadvantaged communities include small, rural, low income, and minority counties, though rural counties will not be distinguished from urban counties for the purpose of this first take-home project. The environmental infrastructure under examination is the Construction Grants Program administered pursuant to the Clean Water Act.
The environmental justice literature suggests that the county share of state population, percent population change, population density, county share of state minorities, and median household income are important explanatory variables. These will be used to account for and help identify those counties which received at least one grant during the period 1983-1992 (a total of 9,854 grants were awarded).

CODING SHEET FOR VARIABLES:
Dependent variable: whether a county received at least one construction grant during the period mentioned above.
Independent variables: county share of state population, percent population change, population density, county share of state minorities, and median household income.

Variable name / Definition and coding
GAWARD / Grant received (0 = no grant received, 1 = at least one grant received)
AMTTOT / Total amount of grants awarded (in dollars)
POP_RAT / County share of state population
DEN92 / Population density (population per square mile)
PC_CHAN / Population change
INCOME / Median household income (in dollars)
MIN_RAPC_POPPV / Percent of population in poverty (%)

REFERENCE:
Bureau of the Census, U.S. Department of Commerce (1993). Consolidated Federal Funds Report (CFFR) on ROM, Fiscal Year 1983-1992. Washington, DC: Bureau of the Census, U.S. Department of Commerce.

Example #2 of First Take-Home Project in Y750 for XXXXX (SS#)
Deadline: XXX, 2001

PROJECT EXERCISES:
1. Given the data below, fit a logistic regression model so that a company's overall image is predicted from its promotion of current products, promotion of medications under development, quality of sales representatives, sponsorship of scientific/clinical symposia and educational programs, quality of corporate hospitality and entertainment, quality of its exhibition presence, and quality of its promotional items.
2. Discuss the overall significance of the model, the significance of each predictor (or covariate), and the interpretation of any design variable contained in the model.
3. Can you improve upon this full regression model by removing any predictor that is not significantly related to a company's overall image? If so, what is the new, more efficient regression model?
4. Compare the two models obtained from (1) and (3) above in terms of the overall fit, the significance of predictors, the interpretability of the model, the predictive power of each model, etc.

DATA DESCRIPTION:
A number of past studies have reported that an organization's image plays an important role in the organization's financial performance. In this context, corporate image refers to an established reputation, symbol, or market presence that contributes to a business's changing financial power. Many pharmaceutical companies spend millions of dollars each year promoting their image and products to physicians in convention settings. These conventions are important to pharmaceutical companies since they provide access to physicians who are otherwise difficult to communicate with or inaccessible outside of conventions. Understanding physicians' perceptions of image and the factors that influence these perceptions can enable companies to optimize promotional strategies, and thus improve corporate image and performance. The purpose of this study is to identify the antecedents of pharmaceutical corporate image in a convention setting, and to estimate the effects of these antecedents on the overall image of a company.

Sample: 200 high-prescribing psychiatrists attending the 149th Annual American Psychiatric Association Convention. These psychiatrists are identified as doctors who usually prescribe medicine as a part of their treatment. They usually have various options from which to choose in their prescriptions, and they influence companies' performance through prescription activities that may in turn be influenced by a company's image.

CODING SHEET FOR VARIABLES:
Dependent variable: A company's overall image (0 = bad, 1 = good)
Independent variables:
1. promotion of current products (0 = bad, 1 = good)
2. promotion of medications under development (0 = bad, 1 = good)
3. sales representatives (0 = bad, 1 = good)
4. sponsorship of scientific/clinical symposia and educational programs (0 = bad, 1 = good)
5. corporate hospitality and entertainment (0 = bad, 1 = good)
6. exhibition presence (0 = bad, 1 = good)
7. promotional items (0 = bad, 1 = good)

Feedback on the first project paper
Y750, Spring 2001
Instructor: Joanne Peng

1. The letter grades on 7 papers ranged from XXX to XXX. The grades were assigned based on three criteria: a. accuracy of results, b. completeness of answers, and c. writing style of the paper. Feedback from the second readers, as I discovered, was very thorough and insightful. I wish to publicly thank the second readers in our midst: Dr. Jack Schmit, Dr. Deborah Carter, Mika Omori, Shimon Sarraf, and Harry So (Math/Stat. center consultant).
2. You all demonstrated a more than adequate level of knowledge about LR modeling and the computer programming necessary to apply this model to a data set.
3. Due to the data structure (which is beyond your control), some had more exciting modeling processes/expressions to report than others. This factor, however, did not impact your grade in any way, shape, or form.
4. Below are some observations of my own which would have improved your first paper:
* Descriptive analyses of covariates and their relationships with the outcome variables should have been included. Only two papers presented any information describing the data set on all variables involved.
* The event category of the outcome variable should be correctly and clearly identified. Only one paper discussed this. You cannot assume that SAS or SPSS knows the category of the DV that you tried to model using LR.
* Coding/recoding/dummy coding of all covariates should be thoroughly and clearly discussed and presented.
* When dummy coding such as 1 for yes and 2 for no was used in the original data set, you need to make a judgment as to how reasonable this is for explaining the outcome of the event. It may be better to recode such a variable (1 for yes, 0 for no).
* SC and AIC need to be included, especially when you are comparing models.
* The difference in -2LL between competing models (full vs. simple) needs to be presented and discussed as evidence to support your choice of the final model.
* When a chi-square test is already significant, say at the .001 level, there is no need to compare the magnitudes of such test statistics, since each is associated with a different df. The same can be said about the chi-square tests of slope coefficients, unless they are associated with the same variable or with variables of comparable psychometric scales.
* A bit of introduction (or background information) and a review of relevant literature should help.
* If a pre-designated alpha level is used to determine the significance of various tests, then that level should be made known from the outset.
* When a covariate is forced into a model for practical significance, you need to tell your readers why this is the case and what you mean by "practical" significance.
* Interpretations of the output and results should include the model you finally chose, the meanings of the slope coefficients, the predictive power of the model, and any evidence to support your choice of this final model over other competing model(s).

Example #1 of Second Take-Home Project in Y750 for XXXXXXX
Deadline: XXXX, 2001

PROJECT EXERCISES:
1. Given the data below, fit a full logistic regression model so that an oak tree's sprouting presence, one year following harvesting, is predicted from (1) the tree species, (2) the diameter of the tree measured (dbh), (3) the age of the tree at harvesting, and (4) the site index, together with any meaningful two-way interaction among these covariates.
This step should include properly scaled transformations of continuous covariates and identification of "meaningful" interaction terms.
2. Assess the fit of the full model in terms of its goodness-of-fit statistics and ill-fitted data points.
3. Improve upon the full model by model-building strategies of your choice which are implemented in SPSS, SAS, BMDP, or Minitab.
4. Interpret the reduced model derived from (3) above in terms of odds ratios associated with subgroups, any covariate acting either as an effect-modifier or confounder, suggestions for removing certain data points from the data in order to better fit the model, goodness-of-fit statistics, diagnostic statistics, etc.

DATA DESCRIPTION:
The data set contains information collected from 9 timber stands located in southern Indiana on the Hoosier National Forest. A total of 2,184 oak trees were measured. Pre-harvest field work began in June of 1987 and was completed in November 1987. At that time, species identification, dbh, and site index were obtained. Immediately following harvest, age was obtained. One year following harvest, sprouting presence was obtained. The harvesting was completed over a two-year period, so roughly half the stands were measured one year and the other half the following year; in every case, however, sprouting presence was noted one year following the harvest. The data were collected to help researchers model and estimate the contribution of oak stump sprouts to future stand stocking.

Dependent variable: sprouts
Independent variables: species, dbh, age, and site index.

Variable name / Definition and coding
TREE_NO / ID number
SPECIES / Species of each tree: 1 = white oak, 3 = black oak, 4 = scarlet oak, 5 = red oak, 18 = chestnut oak
DBH / Diameter of tree measured 4.5 feet above ground level (in cm)
AGE / Age of tree at time of harvest, measured approximately 1 foot above ground level (years)
SI / Site index, an indication of site quality; the higher the number, the better the site.
It is the height that a tree is expected to attain in 50 years.
SPROUTS / Presence of sprouts 1 year after harvest: 0 = absent, 1 = sprout or sprouts present

Example #2 of Second Take-Home Project in Y750 for XXXXXXX
Deadline: XXXX, 2001

PROJECT EXERCISES:
1. Given the data below, fit a logistic regression model so that an IU undergraduate student's decision to enroll at IU is predicted from the remaining 8 variables, including students' demographics, ability, and IU contacts prior to admission, and any meaningful two-way interaction among these covariates. This step should include properly scaled transformations of continuous covariates and identification of "meaningful" interaction terms.
2. Assess the fit of the full model in terms of its goodness-of-fit statistics and ill-fitted data points.
3. Improve upon the full model by model-building strategies of your choice which are implemented in SPSS, SAS, BMDP, or Minitab.
4. Interpret the reduced model derived from (3) above in terms of odds ratios associated with subgroups, any covariate acting either as an effect-modifier or confounder, suggestions for removing certain data points from the data in order to better fit the model, goodness-of-fit statistics, diagnostic statistics, etc.

Outcome Variable: Enrollment of non-resident IU students who started their programs in the fall of 1998.

Explanatory Variables:
Demographic Indicators
Ability Score: continuous, T score weighted 50% on both HS rank and ACT Composite Score
Gender: dichotomous
Ethnicity: white (referent), black, Hispanic, other
Contact Indicators
Interview: dichotomous, interview with IU alum or admissions counselor
Preview: dichotomous, received Preview document
HS Visit: dichotomous
SIEmail: dichotomous, student-initiated email
Mail: dichotomous, received IU mailing

Feedback on the second project paper
Y750, Spring 2001
Instructor: Joanne Peng

1. The letter grades on 6 papers ranged from XXX to XXX. The grades were assigned based on three criteria: d.
accuracy of results/interpretations, e. completeness of answers, and f. writing style of the paper. Feedback from the second readers, as I discovered, was very thorough and insightful. I wish to publicly thank the second readers in our midst: Dr. Jack Schmit, Mika Omori, Shimon Sarraf, and Harry So (Math/Stat. center consultant).
2. You clearly demonstrated excellent knowledge of LR model building strategies and the computer programming skills necessary to complete the task on a real-world data set.
3. Due to the data structure (which is beyond your control), some had more complex steps in the model building process to report than others. This factor, however, did not impact your grade in any way, shape, or form.
4. Below are some observations of my own which would have improved your second paper:
* Continuous variables are best left as continuous, unless you have strong reasons to believe that a scale transformation such as log, square, quartiles, dichotomy, etc., is needed.
* If a scale transformation was performed, you need to report the rationale or empirical evidence as to why a certain type of transformation is needed or desired over other types.
* P-levels for each stage of model building need to be clearly stated. In class some weeks ago, I outlined an action plan for the model building steps. In that plan, I recommended different p-levels for different stages in the process. For example, the 10% level is considered acceptable when each covariate in a multivariate model is evaluated. Please review the various criteria for p-levels when writing up a report such as your second project paper.
* You need to clearly state which goodness-of-fit statistics were examined and what criteria were used to judge/interpret those values.
* As model building progressed, you undoubtedly were saddled with many possibilities. When these possibilities are presented in print, it is very helpful to give a code to each model, such as (A), (B), (C), etc., or (I), (II), (III), etc.
* A table comparing all possible models will be helpful.
* Inclusion of the ROC curve, i.e., the graph of sensitivity against (1 - specificity), is desirable.
* Whether individual observations or covariate patterns were used in the diagnostic statistics should be made clear.
* Cut-off points used in evaluating the diagnostic values should be made clear.
* Interpretation of the chosen model should include: (a) the mathematical functional form of the model, (b) the significance of the + or - sign of each covariate, (c) odds ratios, (d) certain profiles of cases, such as sex=male, age=17, SES=low, etc., and their predicted logits or probabilities.

Possible Topics For The Final Paper

* Choosing between logistic regression and discriminant analysis. References included:
a. Flury's chapter on LR and DA
b. Efron (1975). The efficiency of logistic regression compared to normal discriminant analysis. JASA, 70, 892-898.
c. Press and Wilson (1978). Choosing between logistic regression and discriminant analysis. JASA, 73, 699-705.
* Comparison of loglinear modeling and logistic regression
* Detecting differential item functioning (DIF) using logistic regression procedures
* The use and evaluation of polytomous logistic regression models; start from Chapter 8 of Hosmer and Lemeshow.
* Comparison of the logit model and the probit model
* The use of logistic regression reported in two refereed journals of your field of interest from 1988-98. Read the guidelines as follows:

Outline for a Review of Papers Using LR in Refereed Journals of Your Field [in c:\u\y750\projects\review.outline.doc, 3/19/99]

Introduction: reasons for the review
* There is an increasing trend of using LR in XXX articles due to a. complex data structure, b. massive amounts of data, c. dichotomous outcomes such as persist vs. non-persist, drop-out vs. stayer, admit vs. no-admit, private vs. public, etc.
* helpful review articles such as the one by Cabrera (1994, #2 in the reading packet) and another by Austin and Hinkle (1992, title "LR in Higher Ed.").
* increasing use of LR in AERA annual programs, e.g. Division J (?) on higher ed.

1. Journals, period (1988-1998), articles reviewed
2. Research questions addressed: link this section to (0. Introduction) and make the following issues concrete using the articles reviewed:
a. complex data structure: mixture of categorical variables and continuous variables
b. massive amounts of data
c. dichotomous outcomes such as persist vs. non-persist, drop-out vs. stayer, admit vs. no-admit, private vs. public, etc.
3. Types of modeling (mentioned in analysis section?): forced entry, sequential (not stepwise, more like block), interaction terms (not hierarchical models), stepwise
4. Issues of concern in design
a. over-sampling (Ott, #12) - solution, recommendation (in English)
b. small sample size for ML estimators, relative to the number of covariates (Wilson and Hardgrove, 1995 article, #21)
c. extremely small or large p (baseline) in the data - results in (a) no diagnostic, (b) needs to overlap between 2 categories; this can be ok if dealing with a large sample size.
d. over-dispersion problem (i.e., the proportion of event outcomes is not even across clusters, households, families, or neighborhoods. Thus, the baseline p varies widely across these clusters.) Solution: (a) subset analysis, (b) include higher order terms, and (c) SAS correction formula.
5. Issues of concern in analysis (LR) and reporting/presentation of results:
(1) which packages used (SAS, SPSS...)
(2) dummy coding, input formats...
(3) models used (forced entry, sequential, stepwise, hierarchical)
(4) goodness-of-fit statistics (lacking or incorrect)...
Inferential: a. score test b. chi-square (overall) c. chi-square (betas) d. chi-square (H/L) e. Brown statistics
Descriptive: a. AIC b. SC c. R square and rescaled R square d. concordant pairs e. discordant pairs f. % of correct classification g. sensitivity h. specificity i. false positive j. false negative k. Somers' D l. Gamma m.
c-statistics
(5) interaction and confounding between a categorical variable and a continuous variable.
(6) Interpretation: [deviance and (Stage) mixed logit with prob.]
a. significant coefficient (delta p..)
b. prediction (Stage): logit vs. probability, cutoff points specified
c. Odds ratio: odds ratio = exp(slope coefficient). As for the meaning of these odds ratios, take the age variable as an example: holding all other variables constant, the odds ratio is the multiplicative change in the odds (p/(1-p)) of y for each one-unit change in the corresponding X. p and the odds are positively related, but the relationship between them is not linear. Three conditions must be met before odds ratios can be interpreted: (a) the explanatory variable does not interact with any other variable, (b) the explanatory variable is represented by a single term in the model, and (c) a one-unit change in the explanatory variable is meaningful and relevant. SAS options can give you the change in odds for a 10- or 100-unit change in X, and also the 95% CI on the odds ratios using either the likelihood ratio method or the Wald method. These can be obtained from SAS using the appropriate keywords (refer to my solution printout for the ICU.SAS exercises from Ch. 2).
d. Kappa = (p - pc)/(1 - pc), where pc = the baseline p
7. Diagnostic statistics calculated and performed, and actions taken.
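
The odds-ratio points in item (6)c can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions, not the SAS procedure referred to above: the slope (0.03) and standard error (0.01) for age are hypothetical values, the function names are invented for this example, and only the Wald interval is shown (the likelihood ratio interval requires refitting the model). For the Kappa formula, reading p as the observed proportion of correct classification is an assumption, since the outline defines only pc.

```python
import math


def odds_ratio(beta, units=1.0):
    """Odds ratio for a change of `units` in X, holding other covariates fixed."""
    return math.exp(units * beta)


def wald_ci_odds_ratio(beta, se, units=1.0, z=1.959964):
    """Wald interval for the odds ratio: exp(units * (beta -/+ z * se))."""
    ends = (math.exp(units * (beta - z * se)), math.exp(units * (beta + z * se)))
    return min(ends), max(ends)  # min/max keeps the order right if units < 0


def shift_probability(p, ratio):
    """Multiply the odds p/(1-p) by an odds ratio and convert back to a probability."""
    odds = p / (1.0 - p) * ratio
    return odds / (1.0 + odds)


def kappa(p, pc):
    """Kappa = (p - pc) / (1 - pc), with pc the baseline p."""
    return (p - pc) / (1.0 - pc)


if __name__ == "__main__":
    beta_age, se_age = 0.03, 0.01  # hypothetical slope and standard error for age
    print("OR per 1 year:  ", odds_ratio(beta_age))
    print("OR per 10 years:", odds_ratio(beta_age, units=10))
    print("95% Wald CI, 10 years:", wald_ci_odds_ratio(beta_age, se_age, units=10))
    # The same odds ratio moves p by different amounts at different baselines,
    # which is why p and the odds are not linearly related.
    for p0 in (0.1, 0.5, 0.9):
        print("baseline p =", p0, "-> new p =", shift_probability(p0, 2.0))
    print("kappa:", kappa(0.8, 0.5))
```

Because the same odds ratio is applied at baseline probabilities of 0.1, 0.5, and 0.9, the loop shows directly that equal changes in the odds do not produce equal changes in p.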