Soc 751: Managing statistical research: the workflow of data analysis (Instructor Scott Long)
Intensive Summer Session I, May 12-June 5, 2015, 9AM-4PM weekdays
Managing Statistical Research teaches how to plan, organize, document, and execute sophisticated quantitative analyses regardless of the statistical methods used. The goal is to help you develop a workflow that allows you to work efficiently and accurately while creating reproducible statistical results. Topics include: 1) Planning your research. 2) Documenting your work. 3) Organizing, backing up, and archiving files. 4) Writing robust, effective programs for data analysis. 5) Using automation (basic programming methods) to work more accurately and efficiently. 6) Preparing data for analysis. 7) Systematically conducting statistical and graphical analyses. 8) Incorporating results into papers and presentations while maintaining their provenance. 9) Collaboration and data analysis. Lectures, exercises, and applications are designed to help you develop a workflow for your own research.
The class assumes that you are planning to do quantitative data analysis and that you have completed at least one graduate class in statistics. Students starting their dissertation have found this class to be an ideal way to get their work organized, plan new analyses, and conduct analyses that are efficient and replicable. Students who are earlier in their graduate career develop a workflow that they will grow into as they undertake larger research projects. To complete exercises in the class you must have access to a dataset that you want to work with. Details are given below.
While Stata is used to illustrate some of the ideas, the strategies and concepts apply to any statistical package and you are welcome to use programs such as SAS or R for your work. To do this, you will need to have that package installed on your laptop and know how to use the software since the instructor might not. If you are planning to use a program other than Stata, please let me know which program by May 1.
In the past, the course has been over-enrolled. Admittance to the class requires authorization from the instructor (contact Scott Long firstname.lastname@example.org).
For more information on the workflow of data analysis, check here. If you need further encouragement to take the class, check here.
- LAN access in Win connected to IU Secure: 1) Press WIN+e to open "This PC" window; 2) Click "Map network drive"; 3) Change drive to S; 4) Set the folder name to: \\bl-soc-theseus\s751$ (2015-04-28)
- Win off campus or on campus not connected to IU Secure: 1) Download and install Pulse Secure from iuware.iu.edu. It is in the "Network & Printing" software; 2) Select "Add connection" with the + button; 3) Connect to IU's Secure Sockets Layer (SSL) Virtual Private Network (VPN) using the instructions at kb.iu.edu/d/aygt; 4) Follow steps 1 to 4 for IU Secure above; 5) Enter your user name and passphrase. These will be the same as your umail user name and passphrase. You likely will need to put "ads\" before your username. (2015-04-28)
- Mac connected to IU Secure: 1) In the Finder, choose Go > "Connect to Server."; 2) In the Server Address, type "smb://bl-soc-theseus/s751$"; 3) If you are prompted to provide your name and password, use your umail login information. (2015-04-28)
- Mac off campus or on campus without IU Secure: 1) Download and install Pulse Secure from iuware.iu.edu. It is in the "Network & Printing" software; 2) Select "Add connection" with the + button; 3) Connect to IU's Secure Sockets Layer (SSL) Virtual Private Network (VPN) using the instructions at iuware.iu.edu/Mac/Title/2029; 4) Follow the Mac steps for IU Secure above; 5) Enter your user name and passphrase. These will be the same as your umail user name and passphrase. You likely will need to put "ads/" before your username. (2015-04-28)
- IRB approval: Please determine if you need IRB approval for using your data. For details, check here.
- Using Stata: If you do not have experience with Stata or would like a refresher, I suggest that you work through Getting Started Using Stata (auxiliary files are located on the LAN) or take a look at the introductory videos on Stata's YouTube channel before the course begins. Even if you are planning to use another package, the examples in class use Stata, so you will benefit from knowing the basics.
- Using other statistical software: If you are planning to use a program other than Stata, please let me know which program by May 1.
- Accounts before class begins: Get an MDSS (same as SDA) account for mass storage. Details are here. Please apply immediately since delays in getting an account might occur.
- Books: Click here for ordering books.
- External drive: I encourage you to have an external drive to use for backup and for testing methods of preserving data. I recommend a drive that does NOT require external power but instead gets its power from the USB or FireWire port. Depending on the size of your dataset, a flash drive might work.
- Backing up your data: Make sure your files have been backed up before the class begins! Indeed, back them up twice.
- Setting up your computer: Before class begins, I encourage you to:
a) Install Dropbox or some other cloud service;
b) Link to MDSS from your computer;
c) Install a file manager and text editor (suggestions are here);
d) Use IUware to install Acrobat XI; to find it, Google "IUware Windows" or "IUware Mac";
e) Make sure your computer will connect to the IU wireless system.
- Class notes: A PDF of lecture notes will be made available. I no longer have the bookstore print copies, since so few people want printed copies.
- In 2015, the class meets daily from May 12 through June 2. Class begins at 9AM and you should be available until 4PM each day. There will be time during the day, possibly one or two entire days, for working on your project. Afternoons include lab, but also some lectures and demonstrations. A semester's work is compressed into three weeks, which is why some fondly refer to the class as "workflow boot camp." I have found that working more intensely over a shorter period of time is the best way to integrate these methods into your work.
Detailed description of class
This intensive class helps you develop a workflow for conducting complex statistical research. Workflow in data analysis is a framework for the entire research process: planning, organizing, and documenting your work; importing data; naming, labeling, documenting, creating, and verifying variables; conducting and presenting statistical and graphical analyses; and preserving your work. Each step is guided by the demands of producing replicable and accurate results while working as quickly and efficiently as possible. While traditional classes in statistics deal with estimating and interpreting models, in "real world" research statistical analyses often involve less than ten percent of the total work. This class focuses on the other ninety percent. Developing an efficient workflow saves time, improves accuracy, and leads to replicable results. We will explore the following topics.
1. General principles that guide your research: replicability, accuracy, and efficiency.
2. Efficient methods for planning, organizing, documenting, executing, and preserving your work.
3. Tools that enhance and simplify your work: software, programming methods, organizational structure, and infrastructure.
4. Real world examples of what works and what does not in each stage of the process.
- Planning and organizing research
- Preparing data for analysis: importing data; developing consistent names and labels; documenting the sample and variables; and cleaning the data.
- Conducting sophisticated data analysis that is replicable and efficient.
- Accurately and quickly incorporating complex statistical results into your writing and presentations while maintaining the provenance of each result.
- Methods to speed up the inevitable process of revising your work.
- How to prevent catastrophic loss of files during the project and to ensure long term preservation of your materials.
While many software tools are illustrated, Stata is used to illustrate an effective workflow, and the course will refer to my book The Workflow of Data Analysis Using Stata.
Requirements: Each student is expected to develop her own workflow and apply it to a research project using real data (i.e., not simulations). The class is too short to complete your project, but you can develop critical skills to help you complete future research more quickly and accurately while ensuring replicability and maintaining the provenance of results. Students must participate in class, complete exercises applying the lectures to their project, develop a mock file structure with backup procedures, maintain a research diary, and present their "final" workflow to the class. A group project will help you develop skills for working in collaborations. While some lab time for independent work will be available during the day, students will also need to work on their project outside of class.
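As a rough illustration of what a mock file structure and a research diary could look like, here is a minimal sketch. It is not taken from the course materials: the folder names, the diary file name, and the helper functions create_project and log_entry are all hypothetical, and Python is used only because it runs on any laptop. The same idea can be carried out in Stata, SAS, R, or by hand in a file manager.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import List, Optional

# Hypothetical folder names; adapt them to the needs of your own project.
PROJECT_FOLDERS = [
    "Admin",
    "Data/Source",
    "Data/Derived",
    "Documentation",
    "Programs",
    "Results",
    "Text",
]


def create_project(root: Path) -> List[Path]:
    """Create the folder skeleton under root and return the folders."""
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def log_entry(root: Path, note: str, when: Optional[date] = None) -> Path:
    """Append a dated note to the research diary and return its path."""
    when = when or date.today()
    diary = root / "Documentation" / "research-diary.txt"
    diary.parent.mkdir(parents=True, exist_ok=True)
    with diary.open("a", encoding="utf-8") as handle:
        handle.write(f"{when.isoformat()}  {note}\n")
    return diary


if __name__ == "__main__":
    # Work in a temporary folder so the demonstration leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "my-project"
        for folder in create_project(project):
            print(folder.relative_to(project).as_posix())
        diary = log_entry(project, "Created the project folders.")
        print(diary.read_text(encoding="utf-8"), end="")
```

Because the diary is opened in append mode, each call adds one dated line and never overwrites earlier entries, which is the property that matters for tracing what was done and when.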
- REQUIRED: Lecture notes will be provided.
- REQUIRED: Long, J.S. 2008. The Workflow of Data Analysis Using Stata. Stata Press. The book is cheaper from Stata Corp than at amazon.com, but if you have free shipping, amazon.com might be cheaper. Some copies of this book will be at the bookstore.
- RECOMMENDED: Wong, Dona M. 2010. The Wall Street Journal Guide to Information Graphics: The Dos and Don'ts of Presenting Data, Facts, and Figures. Highly recommended for tables and graphs.
Notepad++: A highly recommended, freeware editor. Information on enhancing Notepad++ for work with Stata is here. Instructions for syntax highlighting in Stata.
Total Commander: My favorite file manager. muCommander is not nearly as good, but is free.
AutoHotkey: A freeware macro program.
TextWrangler: A highly recommended, freeware editor.
Path Finder: The recommended OSX file manager. muCommander is not nearly as good, but is freeware.
Enrolling in Soc 751
Authorization is required for enrollment. Contact Scott Long email@example.com.
There are several things you can do to get ready for the class.
1. Spend time working with Stata or your statistical package of choice.
2. Get materials ready for your project, such as datasets and codebooks. Begin planning your project and try loading your data in Stata.
3. Think about how you want to organize your files.
4. Gather all of the materials that you used for a quantitative research project. Try to replicate the results from the project.
5. Back up your files!
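One way to make the last step concrete is to copy the project folder and then confirm that every copied file matches the original. The sketch below is only an illustration under stated assumptions: the helper names file_digest, tree_digests, and back_up are hypothetical, the folder names in the demonstration are made up, and Python is used only for convenience. It copies a folder to a new location and compares SHA-256 checksums of the two copies.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def back_up(source: Path, destination: Path) -> bool:
    """Copy source to a new destination folder and verify the copy.

    shutil.copytree raises FileExistsError if destination already exists,
    so an earlier backup is never silently overwritten.
    """
    shutil.copytree(source, destination)
    return tree_digests(source) == tree_digests(destination)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A tiny made-up project standing in for your real files.
        source = Path(tmp) / "my-project"
        (source / "Data").mkdir(parents=True)
        (source / "Data" / "survey.csv").write_text("id,age\n1,34\n", encoding="utf-8")

        # Two separate backup locations, each verified after copying.
        for name in ("backup-1", "backup-2"):
            verified = back_up(source, Path(tmp) / name)
            print(name, "verified" if verified else "MISMATCH")
```

Comparing checksums rather than file sizes or dates catches silent corruption during the copy. In practice the two destinations would be on different devices, such as an external drive and a cloud folder.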