UMass Chan Medical School, Lamar Soutter Library. Education. Research. Health Care. Empowering the future. Preserving the past.

Analysis-Ready Data Sets Tutorial

Step-by-step tutorial on how to make your data analysis-ready using Excel or REDCap.

Best Practices for Creating and Formatting Two-Dimensional Data Files

With Excel Examples!

1. All data should be labeled and have unique field names

When you use Excel, you have to do this yourself.

Example:

Be sure to put field names at the top of each column in an Excel spreadsheet. The letters at the top of each column and the numbers at the start of each row are not interpreted by statistical programs. In addition, they do not tell statisticians anything about the type of data contained in cells.

Excel spreadsheet with columns numbered A, B, C along the top and the rows numbered 1, 2, 3, 4 on the left-hand side. The column names Gender, Baseline MTOP, and Ethnicity/family country of origin have been added to the first row in columns A, B, and C respectively
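
If the sheet is exported to a text file, the same rule can be checked programmatically. The following is a minimal sketch in Python, not part of the original tutorial; the function name `check_field_names` is hypothetical, and treating names that differ only by letter case as duplicates is an assumption.

```python
def check_field_names(header):
    """Return a list of problems found in a header row (a list of strings)."""
    problems = []
    seen = set()
    for position, name in enumerate(header, start=1):
        cleaned = name.strip()
        if not cleaned:
            problems.append(f"column {position} has no field name")
            continue
        # Assumption: "Gender" and "gender" count as the same field name.
        key = cleaned.lower()
        if key in seen:
            problems.append(f"column {position} repeats the field name {cleaned!r}")
        seen.add(key)
    return problems


# A header row in which several field names are used twice.
header = [
    "gender", "Baseline MTOP", "Ethnicity/family country of origin", "visit date",
    "gender", "follow-up MTOP", "Ethnicity/family country of origin", "visit date",
]
for problem in check_field_names(header):
    print(problem)
```

An empty result means every column is labeled and every label is unique.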

2. Assign each experimental subject a unique ID

When you use Excel, be sure to add a study ID of some sort.

Example:

Be sure to add a study ID to each subject about which you are collecting data; consider these the names of rows. Study IDs must be unique so that rows are not confused with one another.

The data below lack unique IDs for each subject. The field names at the top of the file make it clear that the file holds data for four subjects from the study baseline (columns A-D) and three subjects from the follow-up (columns E-H). What is not clear is which subject did not show up for the follow-up. Without knowing that, we cannot match baseline rows to follow-up rows, so we cannot tell whether any subject’s MTOP score (whatever that is) increased or decreased.

Excel spreadsheet with columns labeled: gender, Baseline MTOP, Ethnicity/family country of origin, visit date, gender, follow-up MTOP, Ethnicity/family country of origin, visit date. The first set of 4 columns have 4 rows of data and the second set of 4 columns has 3 rows of data

How to check that each subject ID is unique

When you use Excel to store study data, you should periodically check your subject IDs and ensure that they are unique.

For a small study, you might sort the data by the subject IDs and carefully examine them for duplicates. For a larger study, you can use MS Excel’s Advanced Filtering capability to identify duplicate IDs:

  1. Copy the ID column to a fresh tab.
  2. Select the column, go to the Data tab, and click “Advanced” under “Sort & Filter”.
    pid column selected and hovering over advanced in the filter section under the data tab
  3. Check the “Unique records only” box and click the “OK” button.
    Advanced filter options screen with filter the list, in-place and unique records only box selected
    If the row numbers turn blue, you have duplicate pids.
    pid row with a single duplicated pid in row 9. The row numbers are blue.
  4. To find these duplicates in a large data set:
    • Start a new column, let’s call it “unique”, and put the number “1” in the first cell.
    • Copy the “1” down to all of the remaining rows.
      excel spreadsheet with a pid column and a unique column. The unique column has been filled in with 1 in every cell
    • Select all of this data and clear the filter.
      clearing filter under the data tab in excel
    • Blank cells will appear wherever a pid has been inadvertently duplicated, as is the case for pid 41 in the example below.
      Excel spreadsheet with a pid column and a unique column. In the unique column, there is a blank cell at row nine where the duplicate pid is located.
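
Outside Excel, the same check takes only a few lines. The sketch below is an illustration in Python rather than part of the Excel procedure; the function name `find_duplicate_ids` and the list of pids are made up for the example, apart from the duplicated pid 41 mentioned above.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return {id: number of occurrences} for every ID that appears more than once."""
    counts = Counter(ids)
    return {pid: n for pid, n in counts.items() if n > 1}


# Illustrative pid column; pid 41 has been entered twice.
pids = [12, 27, 33, 41, 58, 41, 64]
duplicates = find_duplicate_ids(pids)
print(duplicates)  # {41: 2}
```

An empty dictionary means every subject ID is unique.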

3. Data should be in rectangular format

Rectangular format is also known as flat files, tidy data, pivot ready data, or long data.

When you use Excel, it is possible to enter data in a non-rectangular fashion.

Example:

Rectangular format means that no field is repeated within a single row: each column name appears only once, and each row holds one value per column.

Original

Excel spreadsheet with columns labeled pid, Gender, Baseline MTOP, Ethnicity/family country of origin, Visit Date, pid, Gender, Follow-up MTOP, Ethnicity/family country of origin, Visit Date. The 2 pid columns are circled.

Non-rectangular formats like this may look like a reasonable way to store data, but the original file above cannot be read in that form by most statistical packages. Non-rectangular data is difficult to analyze, even in MS Excel. It cannot be sorted easily, cannot be used in a pivot table, cannot be examined by using filters, and can be easily misunderstood.

Even though there are very few data points, figuring out whether MTOP scores increased or decreased from baseline to follow-up would be a difficult task.

The file from the above image is much easier to analyze, both with software and visually, when recorded rectangularly, such that each cell is meaningful. Each cell is meaningful because it contains an attribute (designated by the name of the column) for a record (designated by the ID of the row, or pid in this example).

Version 2 below is stored longitudinally, and uses a phase variable to delineate the different visits.

Version 2

excel spreadsheet with column labeled: pid, Gender, MTOP, Visit_Date, Phase. The phase column has some rows as b phase and some as f phase.

Alternatively, version 3 below uses different fields for baseline and follow-up information, so it doesn’t need a phase variable to be understood. This is often termed “wide.”

Version 3

excel spreadsheet with columns labeled: pid, Gender, Baseline_MTOP, Baseline_Visit_Date, FollowUp_MTOP, FollowUp_Visit_Date

Both of these forms are acceptable, and REDCap can store data either way. They each present analytical problems that depend on the type of analysis being conducted. If your study design and analysis are complicated, you might consider discussing these different options with a statistician prior to structuring your study database.
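
To make the relationship between the two layouts concrete, here is a minimal Python sketch that reshapes rows in the Version 3 (wide) layout into the Version 2 (long) layout. The field names and the phase codes "b" and "f" follow the examples above; the function name `wide_to_long`, the sample values, and the rule for skipping a phase with no data are assumptions made for illustration.

```python
def wide_to_long(wide_rows):
    """Turn one wide row per subject into one long row per subject per phase."""
    long_rows = []
    for row in wide_rows:
        for prefix, phase in (("Baseline", "b"), ("FollowUp", "f")):
            mtop = row.get(f"{prefix}_MTOP")
            visit_date = row.get(f"{prefix}_Visit_Date")
            # Assumption: a phase with neither a score nor a date did not happen.
            if mtop in (None, "") and visit_date in (None, ""):
                continue
            long_rows.append({
                "pid": row["pid"],
                "Gender": row["Gender"],
                "MTOP": mtop,
                "Visit_Date": visit_date,
                "Phase": phase,
            })
    return long_rows


# Made-up data: subject 2 missed the follow-up visit.
wide = [
    {"pid": 1, "Gender": "F", "Baseline_MTOP": 10, "Baseline_Visit_Date": "2020-01-06",
     "FollowUp_MTOP": 14, "FollowUp_Visit_Date": "2020-07-06"},
    {"pid": 2, "Gender": "M", "Baseline_MTOP": 12, "Baseline_Visit_Date": "2020-01-07",
     "FollowUp_MTOP": "", "FollowUp_Visit_Date": ""},
]
long_rows = wide_to_long(wide)
for long_row in long_rows:
    print(long_row)
```

Because every long row keeps its pid, the reshaping loses no information about which subject each visit belongs to.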

4. Rows should represent the appropriate unit of analysis and columns should represent the unique attributes of the rows

When you use Excel, it is possible to transpose this arrangement by mistake, putting units of analysis in columns and attributes in rows.

Example:

If you are storing data in MS Excel, it is easy to create a dataset using columns to represent units of analysis and using rows to store the attributes of those units, as in the example below.

excel spreadsheet with rows labeled pid, Gender, MTOP, Visit Date, Phase

Some software packages allow the creation and/or analysis of data files like the above, but most do not. MS Excel allows the creation of such files, but it cannot analyze the data thus created (for example, the dataset above cannot be sorted by gender, or filtered, or used to create a pivot table showing cross-tabulations or frequencies). Plus, it’s difficult to read.

If you have already started, however, it is easy to use MS Excel to transpose the data, so that attributes are in columns and records are in rows. Simply:

  1. Select the rectangle, including the field names, and copy it
  2. Click an empty cell that has enough space to the right and down to fit the transposed data
  3. Then click the dropdown paste menu
    Paste special feature in the home tab of the excel ribbon
  4. Then click the “transpose” icon.
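
The same operation can be sketched outside Excel. The Python example below is an illustration only: the function name `transpose` and the sample values are invented, and the check for equal row lengths is added because Python's `zip` would otherwise silently drop values from longer rows.

```python
def transpose(table):
    """Swap the rows and columns of a table given as a list of equal-length lists."""
    widths = {len(row) for row in table}
    if len(widths) > 1:
        raise ValueError("rows have different lengths; fix them before transposing")
    return [list(column) for column in zip(*table)]


# Made-up data stored the wrong way round: attributes in rows, subjects in columns.
sideways = [
    ["pid", 1, 2],
    ["Gender", "F", "M"],
    ["MTOP", 10, 12],
]
upright = transpose(sideways)
for row in upright:
    print(row)
```

After transposing, the first row holds the field names and each following row holds one subject.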

5. Data files should contain the same number of columns in each row

When you use Excel, problems arise when data are missing in the middle of a row.

Example: 

Problems often arise when data are missing in the middle of a row such as in the example below, where data row 4 is missing the FollowUp_MTOP value and the subsequent values have shifted over.

excel spreadsheet with columns labeled pid, Gender, Baseline_MTOP, Baseline_Visit_Date, FollowUp_MTOP, FollowUp_Visit_Date, Hospital_City, Hospital_Name. In the fourth row there is a date in the FollowUp_MTOP column, a city name in the FollowUp_Visit_Date column, and a hospital name in the Hospital_City column, and nothing in the Hospital_Name column

To fix a problem like this in Excel, simply insert a cell where the missing value should be and shift the values over to their proper place.

  1. Right click on the cell and select “insert” 
    excel spreadsheet where a row appears shifted. The cell that is left-most in the shifted section of the row is selected and in the menu Insert.. is highlighted
  2. In the dialog box, select shift cells right 
    excel insert dialogue box with shift cells right selected
  3. Here are the results of inserting the extra cell. Each row now has the appropriate number of columns, and the data in row 4 is no longer shifted to the left.
    excel spreadsheet with columns labeled pid, Gender, Baseline_MTOP, Baseline_Visit_Date, FollowUp_MTOP, FollowUp_Visit_Date, Hospital_City, Hospital_Name. In the fourth row, the blank cell is now in the FollowUp_MTOP column and the type of data in each column is consistent.
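
Once the data has been saved as a delimited text file, rows with the wrong number of columns can be located automatically. The following Python sketch is one possible check, not part of the Excel procedure; the function name `rows_with_wrong_width` and the sample rows are hypothetical. Note its limit: it only catches rows whose length differs from the header, so a shifted row that was padded with a trailing empty cell still needs a visual check of the data types in each column.

```python
import csv
import io


def rows_with_wrong_width(csv_text):
    """Return (expected_width, [(line_number, actual_width), ...]) for CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    bad_rows = []
    # The header is line 1, so the data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad_rows.append((line_number, len(row)))
    return len(header), bad_rows


# Made-up file: line 3 skips FollowUp_MTOP entirely, while line 4 correctly
# leaves it empty (two consecutive commas) and keeps the row rectangular.
sample = (
    "pid,Gender,Baseline_MTOP,FollowUp_MTOP,Hospital_City\n"
    "1,F,10,12,Worcester\n"
    "2,M,11,Boston\n"
    "3,F,9,,Worcester\n"
)
expected_width, bad_rows = rows_with_wrong_width(sample)
print(expected_width, bad_rows)  # 5 [(3, 4)]
```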

6. Data should be atomic within each column

When you use Excel, discrete data should not be combined in a single column.

Example:

In the example below, column G (Name), column H (Hospital) and column I (Site/Doctor) all contain data that should be separated out into discrete columns.

excel spreadsheet with columns labeled pid, Gender, Baseline_MTOP, Baseline_Visit_Date, FollowUp_MTOP, FollowUp_Visit_Date, Name, Hospital, Site/Doctor

If you are working in Excel, the Text to Columns feature will help to fix this problem.

First, add enough empty columns to the right of the column being split to hold the pieces it will be split into. Then select the data that contains the values to be separated (in the example, column G, called Name, is selected) and click Text to Columns on the Data tab, as shown in the example below.

Name column highlighted and there is an empty column to the right. Text to columns is highlighted in the ribbon under the data tab

Excel will launch a dialog box that will walk you through the process of separating your data.

Excel convert text to columns wizard step 1 of 3 with delimited selected

Excel convert text to columns wizard step 2 of 3 with comma selected

Excel convert text to columns wizard step 3 of 3 with General selected

The result of the Text to Columns procedure is shown below. Don’t forget to rename the split columns and give the new column an appropriate name.

Results of Text to Columns

excel spreadsheet with columns labeled pid, Gender, Baseline_MTOP, Baseline_Visit_Date, FollowUp_MTOP, FollowUp_Visit_Date, Name, [blank], Hospital, Site/Doctor

With renamed columns

excel spreadsheet with columns labeled pid, Gender, Baseline_MTOP, Baseline_Visit_Date, FollowUp_MTOP, FollowUp_Visit_Date, First_Name, Last_Name, Hospital, Site/Doctor
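
For comparison, the same split can be sketched in Python. This is an illustration under stated assumptions: the Name values are taken to be two parts separated by a comma, in first-name, last-name order, which the tutorial does not spell out. The function name `split_name` and the sample record are hypothetical.

```python
def split_name(row, source="Name", targets=("First_Name", "Last_Name"), delimiter=","):
    """Return a copy of row with the combined source field split into atomic fields."""
    parts = [part.strip() for part in row[source].split(delimiter)]
    if len(parts) != len(targets):
        raise ValueError(
            f"expected {len(targets)} parts in {row[source]!r}, found {len(parts)}"
        )
    # Drop the combined field; the new fields are added at the end of the record.
    new_row = {key: value for key, value in row.items() if key != source}
    new_row.update(zip(targets, parts))
    return new_row


# Made-up record with a combined Name field.
record = {"pid": 1, "Gender": "F", "Name": "Jane, Doe"}
print(split_name(record))
```

Raising an error when a value does not split into the expected number of parts makes irregular entries visible instead of silently misplacing them.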