Resource Guides: Analysis-Ready Data Sets Tutorial: Excel

Best Practices for Creating and Formatting Two-Dimensional Data Files

With Excel Examples!

1. All data should be labeled and have unique field names

When you use Excel, you have to do this yourself.

Example:

Be sure to put field names at the top of each column in an Excel spreadsheet. The letters at the top of each column and the numbers at the start of each row are not interpreted by statistical programs. In addition, they do not tell statisticians anything about the type of data contained in cells.
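If the spreadsheet is later exported and read by a script, the same rule can be checked automatically. The following is a minimal Python sketch, not part of the original guide: the function name and the sample header are hypothetical, and treating names that differ only by letter case as duplicates is an assumption.

```python
from collections import Counter


def find_field_name_problems(header):
    """Return (positions of blank names, names used more than once)."""
    cleaned = [name.strip() for name in header]
    # Columns with no field name at all (0-based positions).
    blank_positions = [i for i, name in enumerate(cleaned) if not name]
    # Assumption: "MTOP" and "mtop" count as the same field name.
    counts = Counter(name.lower() for name in cleaned if name)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return blank_positions, duplicated


# Hypothetical header row with one blank name and one repeated name.
header = ["pid", "gender", "MTOP", "", "mtop"]
blank_positions, duplicated = find_field_name_problems(header)
print(blank_positions, duplicated)
```

An empty result for both lists means every column has a name and no name is reused.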

2. Assign each experimental subject a unique ID

When you use Excel, be sure to add a study ID of some sort.

Example:

Be sure to add a study ID to each subject about which you are collecting data; consider these the names of rows. Study IDs must be unique so that rows are not confused with one another.

The data below lacks unique IDs for each subject. The field names at the top of the file make it clear that the file contains data for four subjects from the study baseline (columns A-D) and three subjects from the follow-up (columns E-H). What is not clear is which subject did not show up for the follow-up. Without knowing that, we don’t know whether any given subject’s MTOP score (whatever that is) increased or decreased.

How to check that each subject ID is unique

When you use Excel to store study data, you should periodically check your subject IDs and ensure that they are unique.

For a small study, you might sort the data by the subject IDs and carefully examine them for duplicates. For a larger study, you can use MS Excel’s Advanced Filtering capability to identify duplicate IDs:

Copy the ID column to a fresh tab.
Select the column, go to the Data tab, and click “Advanced” under “Sort & Filter”.
Check the “Unique records only” box and click the “OK” button.

If the row numbers turn blue, you have duplicate pids.
To find these duplicates in a large data set:
- Start a new column, let’s call it “unique”, and put the number “1” in the first cell.
- Copy the “1” down to all of the remaining rows.
- Select all of this data and clear the filter.
- Blank cells will appear wherever a pid has been inadvertently duplicated, as is the case on the right for pid 41.
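Outside Excel, the same check takes a few lines of code. This is a minimal Python sketch that assumes the ID column has already been read into a list. The function name is hypothetical, and the sample values are invented except for the duplicated pid 41 mentioned above.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return a dict mapping each repeated ID to how many times it appears."""
    counts = Counter(ids)
    return {pid: n for pid, n in counts.items() if n > 1}


# Hypothetical ID column in which pid 41 was entered twice.
pids = [38, 39, 40, 41, 41, 42]
duplicates = find_duplicate_ids(pids)
print(duplicates)
```

An empty dictionary means every subject ID is unique.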

3. Data should be in rectangular format

Rectangular format is also known as flat files, tidy data, pivot ready data, or long data.

When you use Excel, it is possible to enter data in a non-rectangular fashion.

Example:

Rectangular format means that there are no duplicate values in any single row.

Original

Non-rectangular formats like this may look like a reasonable way to store data, but the original file above cannot be read in that form by most statistical packages. Non-rectangular data is difficult to analyze, even in MS Excel. It cannot be sorted easily, cannot be used in a pivot table, cannot be examined by using filters, and can be easily misunderstood.

Even though there are very few data points, figuring out whether MTOP scores increased or decreased from baseline to follow-up would be a difficult task.

The file from the above image is much easier to analyze, both with software and visually, when recorded rectangularly, so that each cell is meaningful. Each cell is meaningful because it contains an attribute (designated by the name of the column) for a record (designated by the ID of the row, or pid in this example).

Version 2 below is stored longitudinally, and uses a phase variable to delineate the different visits.

Version 2

Alternatively, version 3 below uses different fields for baseline and follow-up information, so it doesn’t need a phase variable to be understood. This is often termed “wide.”

Version 3

Both of these forms are acceptable, and REDCap can store data either way. They each present analytical problems that depend on the type of analysis being conducted. If your study design and analysis are complicated, you might consider discussing these different options with a statistician prior to structuring your study database.
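To make the relationship between the two layouts concrete, here is a minimal Python sketch that reshapes long rows (one row per subject per phase) into wide rows (one row per subject). It is an illustration under assumptions, not the guide's own method: the function name, the sample values, and the phase labels "Baseline" and "FollowUp" are hypothetical.

```python
def long_to_wide(rows, id_field="pid", phase_field="phase", value_field="MTOP"):
    """Collapse one-row-per-visit records into one row per subject."""
    wide = {}
    for row in rows:
        # Create the subject's wide record the first time the ID is seen.
        record = wide.setdefault(row[id_field], {id_field: row[id_field]})
        # Build a field name such as "Baseline_MTOP" from phase and measure.
        record[f"{row[phase_field]}_{value_field}"] = row[value_field]
    return list(wide.values())


# Hypothetical long data: subject 2 missed the follow-up visit.
long_rows = [
    {"pid": 1, "phase": "Baseline", "MTOP": 10},
    {"pid": 1, "phase": "FollowUp", "MTOP": 12},
    {"pid": 2, "phase": "Baseline", "MTOP": 7},
]
wide_rows = long_to_wide(long_rows)
print(wide_rows)
```

In this sketch a subject who missed a visit simply lacks that field in the wide record. Because every row carries the subject ID, it is still clear who missed the follow-up.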

4. Rows should represent the appropriate unit of analysis and columns should represent the unique attributes of the rows

When you use Excel, it is possible to transpose this data.

Example:

If you are storing data in MS Excel, it is easy to create a dataset using columns to represent units of analysis and using rows to store the attributes of those units, as in the example below.

Some software packages allow the creation and/or analysis of data files like the above, but most do not. MS Excel allows the creation of such files, but it cannot analyze the data thus created (for example, the dataset above cannot be sorted by gender, or filtered, or used to create a pivot table showing cross-tabulations or frequencies). Plus, it’s difficult to read.

If you have already started, however, it is easy to use MS Excel to transpose the data, so that attributes are in columns and records are in rows. Simply:

Select the rectangle, including the field names, and copy it.
Click an empty cell that has enough space to the right and down to fit the transposed data.
Then click the dropdown paste menu.
Then click the “transpose” icon.
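The same transposition can be sketched in Python for data that has already left Excel. The table contents below are hypothetical, and the sketch assumes every row has the same number of cells, because `zip` silently stops at the shortest row.

```python
def transpose(table):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*table)]


# Hypothetical data stored the wrong way round: one attribute per row.
attributes_in_rows = [
    ["pid", "1", "2", "3"],
    ["gender", "F", "M", "F"],
    ["age", "34", "51", "29"],
]
records_in_rows = transpose(attributes_in_rows)
for row in records_in_rows:
    print(row)
```

After transposing, the first row holds the field names and each later row holds one subject.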

5. Data files should contain the same number of columns in each row

When you use Excel, problems arise when data are missing in the middle of a row.

Example:

Problems often arise when data are missing in the middle of a row such as in the example below, where data row 4 is missing the FollowUp_MTOP value and the subsequent values have shifted over.

To fix a problem like this in Excel, simply insert a cell where the missing value should be and shift the values over to their proper place.

Right-click the cell and select “Insert”.
In the dialog box, select “Shift cells right”.
Here are the results of inserting the extra cell. Each row now has the appropriate number of columns, and the data in row 4 is no longer shifted to the left.
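Rows that have shifted like this can be flagged automatically before anyone repairs them by hand. The following minimal Python sketch assumes the sheet has been read into a list of rows with the field names first. The function name, the field names other than FollowUp_MTOP, and all values are hypothetical.

```python
def find_suspect_rows(table):
    """Return (data row number, filled width) for rows that do not fill every column."""
    header, *data_rows = table
    expected = len(header)
    suspects = []
    for number, row in enumerate(data_rows, start=1):
        trimmed = list(row)
        # Ignore blank cells at the end of the row before counting.
        while trimmed and str(trimmed[-1]).strip() == "":
            trimmed.pop()
        if len(trimmed) != expected:
            suspects.append((number, len(trimmed)))
    return suspects


# Hypothetical sheet: data row 4 lost its FollowUp_MTOP value and shifted left.
table = [
    ["pid", "Baseline_MTOP", "FollowUp_MTOP", "gender"],
    ["1", "10", "12", "F"],
    ["2", "7", "9", "M"],
    ["3", "11", "8", "F"],
    ["4", "6", "F", ""],
]
suspect_rows = find_suspect_rows(table)
print(suspect_rows)
```

A flagged row is only a candidate for review. A row whose last value is genuinely missing will be flagged too, so the check narrows the search rather than deciding where the missing cell belongs.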

6. Data should be atomic within each column

When you use Excel, discrete data should not be combined in a single column.

Example:

In the example below, column G (Name), column H (Hospital) and column I (Site/Doctor) all contain data that should be separated out into discrete columns.

If you are working in Excel, the Text to Columns feature will help to fix this problem.

First, add enough empty columns to the right of the column being split to hold the new columns that will be created. Then select the data that contains the values to be separated (in the example, column G, called Name, is selected) and click Text to Columns on the Data tab, as shown in the example below.

Excel will launch a dialog box that will walk you through the process of separating your data.

The result of the Text to Columns procedure is shown below. Don’t forget to rename the original column and give the new column an appropriate name.

Results of Text to Columns

With renamed columns
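For data handled outside Excel, the same split can be sketched in Python. This is an illustration under assumptions: the function name is hypothetical, the sample names are invented, and the comma delimiter and the new field names LastName and FirstName are guesses about how a combined Name column might be laid out.

```python
def split_column(rows, field, new_fields, delimiter=","):
    """Replace one combined field with several atomic fields."""
    result = []
    for row in rows:
        # Split into at most len(new_fields) parts and trim stray spaces.
        parts = [p.strip() for p in row[field].split(delimiter, len(new_fields) - 1)]
        # Pad with empty strings when a value has fewer parts than expected.
        parts += [""] * (len(new_fields) - len(parts))
        new_row = {key: value for key, value in row.items() if key != field}
        new_row.update(zip(new_fields, parts))
        result.append(new_row)
    return result


# Hypothetical rows in which Name holds "Last, First" in a single cell.
rows = [
    {"pid": 1, "Name": "Smith, Jane"},
    {"pid": 2, "Name": "Jones"},
]
split_rows = split_column(rows, "Name", ["LastName", "FirstName"])
print(split_rows)
```

Values that do not contain the delimiter are kept in the first new field, and the remaining fields are left blank so that every row still has the same columns.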