J. Scott Long - Indiana University
Departments of Sociology and Statistics
Soc 751: Managing statistical research: the workflow of data analysis (Instructor S. Long)

Intensive Summer Session I, May 10-26, 2011, 9AM-5PM weekdays

 

The workflow of data analysis is not a class about a specific statistical technique. Instead, it teaches you how to plan, organize, document, and execute sophisticated quantitative analyses regardless of the statistical methods used. The goal is to help you develop a workflow that lets you work efficiently and produce replicable results. Topics include: 1) Planning your research. 2) Documenting your work. 3) Organizing, backing up, and archiving files. 4) Writing robust, effective do-files. 5) Using automation (basic programming methods) to work more accurately and efficiently. 6) Preparing data for analysis. 7) Systematically conducting statistical and graphical analyses. 8) Incorporating results into papers and presentations while maintaining their provenance. 9) Backing up your files. Lectures, exercises, and applications are designed to help you build that workflow for your own research. Further information is given below.

For more information on the workflow of data analysis, check here. If you need further encouragement to take the class, check here.


News

  • Directories: I put examples of directory structures in Dropbox (13May2011).
  • Software (13May2011): For folder and file comparison on the Mac, DiffMerge was recommended.
  • help me! Download the file named me.sthlp, then modify it. 11May2011
  • Information on mapping drives for WF class is here. 11May2011
  • MacFlow by JD Smith for those of you using Mac OS X. 11May2011
  • Questions about folders and directories? Maybe this will help: INCEPTION_FOLDER from Chris Baker. 10May2011
  • 1. IRB approval: If you do not already have IRB approval, go here and read carefully. Note that it says "Under Federal regulations and University policy, all researchers who conduct research that involves human subjects or materials of human origin must submit an application to the IRB. Approval of the research protocol must be in place BEFORE the research begins." If you are using secondary data, please check with the Office of Research Administration (see link above) whether you need approval. If you need it but do not obtain it, you cannot use the results of your research. 17Apr2011
  • 2. Using Stata: If you do not have experience with Stata or would like a refresher, I suggest that you work through "Getting Started Using Stata" before the course begins. If you need additional help with Stata, please contact Shawna Rohrman (she was cc'd on e-mails to the class). 17Apr2011
  • 3. Accounts before class begins: Get an MDSS (same as SDA) account for mass storage and a MyPage account. Details are here. Please apply immediately since delays in getting an account might occur. 17Apr2011
  • 4. Books: Click here for ordering books. 17Apr2011
  • 5. External drive: I encourage everyone to have an external drive to use for backup and for testing methods of preserving data. You can buy these on sale for as little as $50. I recommend a drive that does NOT require external power but instead gets power from the USB or FireWire port. The smallest size drive (usually 320GB) is fine. 17Apr2011
  • 6. Setting up your computer: I believe that everyone indicated that they can bring a laptop to class. This will be very helpful. Before class begins, I encourage you to (17Apr2011):
    a) Install Dropbox
    b) Link to MDSS and your MyPage account from your computer
    c) Install a file manager and text editor (suggestions are here)
    d) Use IUware to install Acrobat X; to find it, Google "IUware Windows" or "IUware Mac".
    e) Having Stata installed on your computer is helpful but not required.
    f) Make sure your computer will connect to the IU wireless system.
  • 7. Dropbox: We will use Dropbox to exchange materials. Feel free to start using it now. 16Jan2010
  • 8. CLASSPAK: This should be available in early May 2011. If the bookstore doesn't have copies of the ClassPak, go to the textbook register at the IU Bookstore and purchase a voucher. They will have a copy of the ClassPak by 3PM the next day. If you have problems, talk to Keith Waits or e-mail Kathy Parker (cparker@indiana.edu). 17Apr2011
  • There might be one more seat available in the class. Tell a friend!

Schedule

  • To be completed...

Detailed description of class

This intensive workshop helps you develop a workflow for conducting complex statistical research. Workflow in data analysis is the framework for the entire research process: planning, organizing, and documenting your work; importing data; naming, labeling, documenting, creating, and verifying variables; conducting and presenting statistical and graphical analyses; and preserving your work. Each step is guided by the demands of producing replicable and accurate results while needing to work as quickly and efficiently as possible. While traditional classes in statistics deal with estimating and interpreting models, in "real world" research statistical analyses often involve less than ten percent of the total work. This class focuses on the other ninety percent. Developing an efficient workflow saves time, improves accuracy, and leads to replicable results. This two-week workshop explores the following topics.

1. General principles that guide your research: replicability, accuracy, and efficiency.

2. Efficient methods for planning, organizing, documenting, executing, and preserving your work.

3. Tools that enhance and simplify your work: software, programming methods, organizational structure, and cyberinfrastructure.

4. Real world examples of what works and what does not in each stage of the process.

  • Planning and organizing research
  • Preparing data for analysis: importing data; developing consistent names and labels; documenting the sample and variables; and cleaning the data.
  • Conducting sophisticated data analysis that is replicable and efficient.
  • Accurately and quickly incorporating complex statistical results into your writing and presentations while maintaining the provenance of each result.
  • Methods to speed up the inevitable process of revising your work.
  • How to prevent catastrophic loss of files during the project and to ensure long term preservation of your materials.
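One way to guard against the file loss mentioned in the last item is to check a backup copy against the working copy. The following is a minimal Python sketch, not taken from the course materials; the function names `file_digest` and `compare_trees` are hypothetical, and it assumes the backup is a plain folder that mirrors the working folder.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(source, backup):
    """Compare every file under `source` with its counterpart under `backup`.

    Returns (missing, changed): relative paths that are absent from the
    backup, and relative paths whose contents differ from the backup.
    """
    source, backup = Path(source), Path(backup)
    missing, changed = [], []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = backup / rel
        if not dst.is_file():
            missing.append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            changed.append(rel.as_posix())
    return missing, changed
```

Calling `compare_trees` with the working folder and the folder on the external drive returns two lists. Both should be empty right after a complete backup. The check only looks for files that exist in the working folder, so extra files in the backup are not reported.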

While many software tools are illustrated, Stata is the primary package for data management and analysis, and the course will use Long (2009), The Workflow of Data Analysis Using Stata. While the methods apply readily to other statistical packages, students must have some familiarity with Stata. If you are not familiar with Stata, you can participate in the Stata tutorial offered by the Workshop in Methods, take the Stata NetCourse NC101, or study an introduction to Stata (e.g., chapters 3 and 4 of my Workflow book). Students must have a blank, USB-powered external drive (cost about $50).

Requirements: Each student is expected to develop her own workflow and apply it to her project. The class is too short to complete a project, but students will develop critical skills to help them complete future research more quickly and accurately while ensuring replicability and maintaining the provenance of results. Students must participate in class, complete exercises applying the lectures to their project, develop a mock file structure with backup procedures, create a simple research web page, maintain a research log, and present their "final" workflow to the class. While some lab time for independent work will be available during the day, students will also need to work on their project outside of class.

Books

  1. ClassPak - be sure to bring this the first day of class. It includes lecture notes and reprints. Required.
  2. Long, J.S. 2009. The Workflow of Data Analysis Using Stata. College Station, TX: Stata Press. The book is cheaper from StataCorp than at amazon.com, but if you have free shipping, amazon.com might be cheaper. Some copies of this book will be at the bookstore.
  3. Wong, Dona M. 2010. The Wall Street Journal Guide to Information Graphics: The Dos and Don'ts of Presenting Data, Facts, and Figures. Highly recommended for tables and graphs.

Computing

Windows

Notepad++: A highly recommended, freeware editor. Information on enhancing Notepad++ for work with Stata is here.

muCommander: A very nice, freeware file manager. Highly recommended.

AutoHotkey: A freeware macro program.

Mac OS X

TextWrangler: A highly recommended, freeware editor.

muCommander: A very nice, freeware file manager. Highly recommended.

Enrolling in Soc 751

Preauthorization is required for enrollment. This is to make sure that all students have had sufficient quantitative experience to benefit from the course content and that each student has a significant project to use in the class. The class will require substantial effort both in class and on work days, when you will apply the methods shown in class.

Getting ready

There are several things you can do to get ready for the class.

1. Spend some time working with Stata.

2. Get the materials ready for your project, such as datasets and codebooks. Begin planning your project and try loading your data in Stata.

3. Think about how you want to organize your files.

4. Gather all of the materials that you used for a quantitative research project. Try to replicate the results from the project.
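If you want to experiment with file organization (item 3) before class, a small script can create a trial folder layout that you can discard and rebuild as your ideas change. This is only a sketch: the folder names in `SUBFOLDERS` and the function `make_project` are hypothetical illustrations, not the structure used in the course.

```python
from pathlib import Path

# Hypothetical folder names for illustration; not the structure used in class.
SUBFOLDERS = [
    "data/source",
    "data/derived",
    "do-files",
    "documentation",
    "results",
    "writing",
]


def make_project(root, subfolders=SUBFOLDERS):
    """Ensure each subfolder exists under `root`; return their paths."""
    root = Path(root)
    paths = []
    for name in subfolders:
        folder = root / name
        # exist_ok=True makes the function safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
        paths.append(folder)
    return paths
```

Running `make_project` on a new folder name creates the whole tree in one step. Running it again on the same folder changes nothing, so you can rerun it after editing the list.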

Why take the class?

Gabriel Rossman of UCLA recently wrote on his blog: “I recently read Scott Long’s new book The Workflow of Data Analysis Using Stata and I highly recommend it. One of the ironies of graduate education in the social sciences is that we spend quite a bit of time trying to explain things like standard error but largely ignore that on a modal day quantitative research is all about data management and programming. Although Long is too charitable to mention it, one of the reasons to emphasize these issues is that many of the notorious horror stories of quantitative research do not involve modeling but data management. ... By focusing on these largely neglected but critical data management issues, Long has done a service to the discipline. The publication of it may even reduce Indiana’s comparative advantage of producing hotshot quant PhDs now that grad students elsewhere can vicariously benefit from this important aspect of the training there.”

To apply for admission to this class, complete this application and send it to jslong@indiana.edu by February 15.
© 2010 J. Scott Long