During my first couple of weeks working in the Research, Evaluation, and Assessment department of a school district, I discovered how much pressure educators and school district leaders are under to use data. “Can you tell me what percentage of African American male students were proficient in reading over the last four years, by school?” “How about that same thing, but also by free/reduced-price lunch?” The questions were endless. I was regularly bombarded by data requests from multiple people across multiple departments, as well as from people external to the school district, and everyone wanted their data yesterday.

My experience in the school district was a far cry from academia, where I had the advantage of working on a single research project at a time, the datasets I dealt with were relatively small, and the only people asking questions were the members of the research team. I soon discovered that academia had done a poor job of preparing me to respond quickly to data requests in a school district setting. If you or your department processes data requests in a school district, and your educational background is a graduate program in the social sciences, chances are you struggle to process them efficiently and accurately as well. Here’s what I had to learn on my own to simplify data requests and save time while maximizing the quality of what I was producing.

The Problem

Before I write about the ways I have found to work smarter, not harder, while processing data requests, I think it’s worth describing why graduate programs in the social sciences do a poor job of preparing people to work with educational data in school districts.

Statistics classes provide unrealistically ideal data

In statistics courses, students are presented with unrealistically ideal data that require minimal transformations or cleaning before the desired statistical analyses can be performed. In contrast, real educational data that school districts deal with can come from multiple sources, including the student information system, third-party vendors who provide assessments to the school district, and survey software. Putting these data together for the purpose of responding to data requests often involves a lot of cleaning and restructuring of data files.

Managing data is not part of the graduate curriculum

Perhaps because academics work on research projects, which often span only a few years at most, the long-term management of data collected over the course of a given project is an afterthought. As a result, graduate students are not taught about data management principles, such as the principles of relational databases, that save time, reduce errors, and increase efficiency.

Issues with reproducibility

Imagine if someone in your school district asked you for a list of the schools that showed the greatest growth in reading among fourth graders receiving ELL services over the last three years (say, 2017, 2018, and 2019). You produce the list, and your colleague is happy. Next year, she asks you for an update of the same list. You produce the list again, updated for 2018, 2019, and 2020, but this time she is not so happy. The 2018 and 2019 lists are not the same as they were when you ran the same report, for the same dates, last year. How could that happen? There are two reasons: the software and the data.

First, many social science statistics courses, and therefore many school district data departments, use software such as SPSS, which allows users to point and click their way through the data cleaning, transformation, and analysis steps. But this point-and-click approach makes it difficult to track all the actions that took place (and the order in which they took place) to generate a specific report. That, in turn, makes it difficult to produce the same report in exactly the same way again at a later time, for an apples-to-apples comparison. It also makes it difficult to detect any errors or incorrect assumptions that may have been made during the point-and-click process.

Second, within the database that supports a district’s student information system, many attributes are likely to change with a certain degree of frequency (e.g., a student’s ethnicity might get updated, free/reduced lunch status may change within a school year, a student may move within or out of the district, etc.). Routine changes to student records interfere with the reproducibility of reports from year to year.

The Solution

Organize your data

In order to quickly, efficiently, and accurately execute data requests, the first and most important step is to organize the data. A good way to do this is to create a set of historical “master” tables that you continue to add to and treat as the “true” records, frozen in time, that you can refer to as needed without going into the district’s main student information system. This is not the same as creating multiple copies of the same data files with slight modifications each time you query the data (please don’t do this!). Best practice is to maintain a single set of master tables that holds the authoritative records and reflects how they change over time. The following principles will help you create these master tables and use them effectively.

Organizing data from the student information system

To organize data from your district’s student information system for use in processing data requests, I recommend creating one master table for each table from the student information system that contains information you might need to include in a report or list (e.g., the “student demographics” table, the “student attendance” table, etc.). Here’s how:

  1. On a consistent schedule, query the student information system for the data points of interest.
  2. To the resulting table, add a column reflecting the time point (e.g., “Fall”, “Winter”, “Spring”) and an additional column reflecting the school year (e.g., “2018”, “2019”).
  3. Once you have multiple tables, each created from a discrete “snapshot” in time, combine the rows across all of the tables so that you have one table containing all the historically captured rows.

Here are two example rows from a student demographics table, illustrating one kind of data change that might occur from one “snapshot” to the next:
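(Shown here as a minimal R sketch; the column names student_id, term, school_year, and ethnicity are hypothetical stand-ins for whatever your student information system actually exports.)

```r
library(tibble)

# Two snapshot rows for the same student: her ethnicity was recorded as
# "Asian" in the Fall 2018 snapshot and as "Two or more races" by the
# Winter 2018 snapshot.
demographics_master <- tribble(
  ~student_id, ~term,    ~school_year, ~ethnicity,
  1,           "Fall",   "2018",       "Asian",
  1,           "Winter", "2018",       "Two or more races"
)
```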

In Fall 2018, student 1 was recorded as Asian. By Winter 2018, however, she or her family may have updated her ethnicity to “Two or more races”. If we processed a data request for average attendance by ethnicity in the Fall of 2018, student 1’s record would be aggregated with the attendance data of other Asian students. So far, so good. Without this historical demographics table, however, running the same report at a later date, after student 1’s ethnicity was updated, would count her among students with two or more races, even though we are asking a retrospective question about Fall 2018 attendance, because the report would use her current ethnicity classification at the time it is generated. The historical demographics table avoids this pitfall and ensures that results are reproducible.
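Here is a minimal sketch of what the snapshot-and-append step might look like in R, assuming each query is exported to a CSV file; the file names and column names are hypothetical:

```r
library(dplyr)
library(readr)

# Read the latest export from the student information system and stamp it
# with the time point and school year it represents (hypothetical file name).
snapshot <- read_csv("demographics_winter_2018.csv") %>%
  mutate(term = "Winter", school_year = "2018")

# Append the stamped snapshot to the historical master table and save it back.
demographics_master <- read_csv("demographics_master.csv") %>%
  bind_rows(snapshot)

write_csv(demographics_master, "demographics_master.csv")
```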

Organizing data from external systems

For student data originating from external sources (e.g., state standardized assessment data, third-party assessment vendor data, etc.), create one table for each family of assessments/instruments, where a family means a particular vendor’s instrument suite (e.g., the MCA assessments, the MAP assessments, etc.). Vendors often provide these data as flat files containing one row per student per assessment record (e.g., a file for the Fall 2018 assessment period, a separate file for the Spring 2019 assessment period, etc.), so some transformation is required to build a single table for each family of assessments. As with the historical tables built from the student information system, add a column reflecting the relevant time point (e.g., “Fall”, “Spring”) and a column reflecting the school year (e.g., “2018”, “2019”). For these tables, though, also add a column reflecting the subject covered by the assessment. Here’s an example of what the first few records from an MCA assessment table might look like:
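As a stand-in, here is a minimal R sketch of such a table; the column names and score values are purely illustrative and will differ depending on what the vendor actually delivers:

```r
library(tibble)

# First few records of a hypothetical MCA master table. Note the added
# term, school_year, and subject columns alongside the vendor-supplied score.
mca_master <- tribble(
  ~student_id, ~term,    ~school_year, ~subject,  ~scale_score,
  1,           "Spring", "2018",       "Reading", 450,
  1,           "Spring", "2018",       "Math",    447,
  2,           "Spring", "2018",       "Reading", 462
)
```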

Use code

Rather than using point-and-click software to analyze the master data files for processing data requests (leaving no record of what was done to the data), use a software program that lets you write code to perform the necessary data cleaning, transformations, and analyses. I use and recommend the statistical programming language R (https://cran.r-project.org/) and RStudio software (https://www.rstudio.com/). There are at least three key advantages to writing code over using point-and-click software to execute data requests:

  1. Using code provides for 100% reproducible results, unlike having to remember the manual steps used in point-and-click software.
  2. It’s easier to identify the source of an error when using code because each step is documented in the exact sequence in which it was performed.
  3. Saving the code as a file takes up a trivial amount of computer space, compared with saving multiple copies of data files.
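To make this concrete, here is a minimal sketch of what a reproducible data-request script might look like in R, using the hypothetical historical tables described above (the file names, column names, and attendance_rate field are assumptions):

```r
library(dplyr)
library(readr)

# Reproducible request: average Fall 2018 attendance rate by ethnicity.
# Because the script reads from the frozen historical tables and records
# every step, re-running it later yields exactly the same numbers.
demographics <- read_csv("demographics_master.csv")
attendance   <- read_csv("attendance_master.csv")

fall_2018_report <- demographics %>%
  filter(term == "Fall", school_year == "2018") %>%
  inner_join(
    attendance %>% filter(term == "Fall", school_year == "2018"),
    by = c("student_id", "term", "school_year")
  ) %>%
  group_by(ethnicity) %>%
  summarise(
    mean_attendance_rate = mean(attendance_rate, na.rm = TRUE),
    n_students = n()
  )

write_csv(fall_2018_report, "fall_2018_attendance_by_ethnicity.csv")
```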

See it in action

To get a sense of how quickly you can process data requests when you follow the above principles, check out this short video example: https://youtu.be/SYtQexW5Hjk

Find this blog post helpful? Then sign up for our newsletter (if you haven’t already) so you can get notified of new posts. Just click on the Subscribe button below and enter your email address.

Amanuel is a small business owner who helps school districts quickly, accurately, and affordably calculate the impact of curricula, programs, or policies on important student outcomes through quantitative impact evaluations. Go to www.parsimonyinc.com to learn more.