Dates as data

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • What are good approaches for handling dates in spreadsheets?

Objectives
  • Describe how dates are stored and formatted in spreadsheets.

  • Describe the advantages of alternative date formatting in spreadsheets.

  • Demonstrate best practices for entering dates in spreadsheets.

Dates are the scourge of spreadsheets! They are the cause of a huge number of problems with misinterpreted data.

When working with data, the goal is to remove, as much as humanly possible, any ambiguity. But ambiguity can creep into your dates when there are regional variations (e.g. UK vs. US date formats), or if different software products (e.g., LibreOffice, Microsoft Excel, Gnumeric) are used. Ambiguity also arises when spreadsheet programs attempt to “help” you, for example by filling in what it believes is the missing year in the date you expected to only include a day and a month.

Additionally, Excel can turn things that are not dates into dates, for example names or identifiers like MAR1, DEC1, OCT4. Recently, scientists had to rename human genes to stop Microsoft Excel from misreading them as dates, as it was estimated that 20% genetics research papers contained errors!

The image below demonstrates some of the many date formatting options available in Excel.

Many formats, many ambiguities

With such a bewildering array of date formats, combined with the fact that many real-world data sets are built up over time by different people, using different programs, all of whom may be making different assumptions, it becomes difficult to believe that the dates in any sizeable dataset can be accurate! Fortunately, there is a solution. The best route to unambiguously storing dates is to use separate columns for the day, month and year. Before we get to this solution, let’s first investigate some of the major issues with how dates are stored in spreadsheets.

Dates stored as integers

We have seen that Excel displays dates in many different formats, but it stores dates for its own purposes in only one format: a number. This number represents how many days have elapsed since 31 December 1899 (the reason for this is disconcertingly long-winded and quirky). It means 1 January 1900 is stored as number 1, 2 January 1900 is stored as number 2, and so on. In the example above, 2 July 2014 is stored as the number 41822.

Storing dates as a number has some advantages: it’s easy to add days to a given date using simple arithmetic. If you had a plan to conduct interviews every ninety days, you could type:

=B2+90

And Excel would return:

30-Sep

In most cases, it retains the format of the cell that is being operated upon. Month and year rollovers are internally tracked and applied.

However, there is also a significant problem with this system, which is that Excel refuses to believe that dates existed before 31 December 1899. Open a spreadsheet and add the following to the A1 cell:

 1 January 1900

Now move to a new cell and enter:

=A1+2

Excel will provide the answer 3 January 1900 (in some format) because the result of this arithmetic returns a date 2 days after 1 January 1900. Now try removing 2 days. Move to another cell and enter:

=A1-2

Excel returns an error, because it is incapable of understanding that dates existed prior to 1 January 1900. This poses a huge problem for storing historical data!

Note on adding and subtracting dates and times

Adding months and years to a date is slightly trickier than adding days because we need to make sure that we are adding the amount to the correct entity (i.e. month or year). To do so,

  • first we need to extract single entities (day, month or year) from the date (how to do that is explained below),
  • then, we can add values to each entity (e.g. month or year),
  • finally, we need to reconstruct the date from the updated day, month, year entities using the DATE() function.

Times are handled in a similar way; seconds can be directly added, but to add hours and minutes we need to make sure that we are adding the quantities to the correct entities. Extracting seconds, minutes and hours from a time is explained in one of the exercises below.

Excel’s date systems on Macs

In the past, Excel used to entertain a second date system, the 1904 date system, as the default in Excel for Macs. This system assigns a different serial number to dates than the 1900 date system (i.e. it starts counting time from 1 January 1904). Because of this, there were accuracy issues with dates when exporting data from Excel between Macs and PCs (dates were off by 4 years and 1 day). Luckily, Excel for Macs now defaults to the 1900 date system and supports dates as early as January 1, 1900.

Pre-formatting cells

One approach to stop Excel from automatically changing things to dates is to pre-format cells with the type of data that will be entered into those cells. Doing this tells Excel exactly the type of data you are inputting, stopping it from being clever and interpreting your input as a date. This is especially helpful if some of your data is, for example, gene names or something similar which Excel could mistake as a date. To set the format of cells, highlight the cells you want to format and right click to bring up the context menu, selecting the “Format Cells…” option.

This will bring up the “Format Cells” window, where we can control how the contents of a cell are displayed. The type of data can be set in the “Number” tab, where there are several formats of data you can choose. To stop something like a gene name from being changed to a date, select the “Text” type data format as shown below.

After we’ve done this, any text input into the cells will stay as text and won’t be converted to a date! If there was already data in the cells before formatting, anything which was converted to a date won’t be converted back to the original input. Unfortunately when something is converted to a date, it’s changed to the internal format Excel uses to store dates, which is a number! So when we set the format of the cell to something like text, the converted dates are just numbers.

Preferred date formats

Entering a date in a single cell is quick and easy, but it is open to so many potential problems that a new solution as needed if we are to remove ambiguity from spreadsheet dates. That solution is to separate dates into their component parts. Let’s have a look at some good options for storing dates.

Storing dates as YEAR, MONTH, DAY

As previously mentioned, storing dates in YEAR, MONTH, DAY format is one good alternative for storing dates and reduces the risk of ambiguity.

For instance, the following is a spreadsheet that represents insect counts that were taken every few days over the summer:

So, so ambiguous, it is even confusing Excel

According to Excel, this person had been collecting bugs over a number of years, including some dates in the future!

Exercise

Challenge: pulling month, day and year out of dates.

  1. Look at the tab ‘dates’ in our messy data spreadsheet. Extract month, day and year from the dates in the “Date collected” column into three new columns called ‘month’, ‘day’, ‘year’. To do so, you can use the following built-in Excel functions:

    `YEAR()`
    `MONTH()`
    `DAY()`
    
  2. Apply the formulas on each of the row of table ‘plot 3’ in the ‘dates’ tab. To do so, drag the formula for the first row down to the last row.
  3. Make sure the new columns are formatted as a number and not as a date.

Solution

  1. Create column ‘day’ for the first row of the table (note that this is not the first row of the spreadsheet) and insert formula =DAY(A3) in cell E3. This applies the function DAY() on the value in cell A3, which is the date of the first observation.
  2. Drag this formula to the last row of the table - you will see the formulas and calculations for the month appearing in the cells below. drag_dates
  3. To format the column ‘day’ as a whole number, right click on the column ‘E’ (for ‘day’) and select Format cells... option. format_dates Select ‘Number’ and set decimal places to 0. dates_as_number Repeat the process for month and year. You should end up with a table like the one below. dates, exersize 1 Again, as in the previous exercise but this time using the YEAR() function, we can see that Excel saved the year for data as 2015 (the year the data is entered) instead of 2014 (the year the data was collected). This was entirely the mistake of the field assistant but by being helpful Excel managed to “mask” this error.

Storing dates as a single string

Another alternative is to convert the date string into a single string using the YYYYMMDDhhmmss format. For example the date March 24, 2015 17:25:35 would become 20150324172535, where:

Such strings will be correctly sorted in ascending or descending order and can be correctly parsed by the receiving data analysis software. Make sure your column containing such data is formatted as ‘text’.

Storing dates as YEAR, DAY-OF-YEAR

Storing dates as year and day-of-year (DOY) is an alternative method of storing dates with little ambiguity. Statistical models often incorporate year as a factor, or a categorical variable, rather than a numeric variable, to account for year-to-year variation, and DOY can be used to measure the passage of time within a year.

To convert all your dates into DOY format, here is a useful guide:

Converting dates to DOY format

Common issues with dates

In some cases, we’ll still find it helpful to use date data types in our spreadsheets. If this is the case, then you should be aware of a number of common issues with using dates.

Regional dates

Different countries write dates differently. If you are in the UK, you will interpret 7/12/88 as the 7th of December 1988. A researcher in the US will interpret the same entry as the 12th of July 1988. This regional variation is handled automatically by your spreadsheet program through a setting in the preferences. Errors are easily added to your spreadsheet if your preferences are set incorrectly or, more likely, if you share a spreadsheet with a person from a country that uses a different date format. It is especially devastating if only some of the date formats are changed, which leaves you with a spreadsheet full of ambiguous dates and no way to repair them.

Exercise

Look at the tab ‘dates’ in the messy data spreadsheet you used in previous exercises. It contains a copy of the ‘plot 3’ table from the ‘2014’ tab (this is the table that contains the problematic dates). You will notice that there are years missing from the “Date collected” column. Just by looking at the entered dates, can you figure out the date format that was used?

Solution

The date format is ‘MM/DD/YYYY’ because we have entries like ‘2/19’ and ‘5/18’, which means that the person who entered these dates most probably used the US version of Excel where month comes before day. If we did not have such entries, we would not be able to tell what date was entered! For example, is ‘7/11’ ‘7th November’ or ‘11th July’?

Missing bits of dates

If you miss out some part of a date - spreadsheet programmes such as Excel will do their best to guess and form a valid date (so they can store it properly internally) from your partial information. This will happen automatically and often without you even noticing, but can cause you some headache down the line.

Dates with no year

If no year is specified, the spreadsheet program will assume you mean the current year and will insert that value. This may be incorrect if you are working with historical data so be very cautious when working with data that does not have a year specified within its date variable.

Exercise

Look at the tab ‘dates’ in the messy data spreadsheet. It contains a copy of the ‘plot 3’ table from the ‘2014’ tab (this is the table that contains the problematic dates). Select one of the cells in “Date collected” column, for example cell ‘A3’. What is the date Excel thinks is entered? What is the year of the date?

Solution

Select any cell in “Date collected” column. In the field for the cell formula at the top, Excel will display the value that is stored > > internally for this cell. drag_dates From the cell that is selected in the figure above, we can see that Excel has stored the date “08/01/2015” (i.e. 8 January 2015 using the DD/MM/YYYY format). This data was meant to relate to 2014 but Excel has added the year 2015 instead.

The cause of this change is that the field assistant who collected the data initially forgot to add their data for ‘plot 3’ in the dataset. They came back in 2015 to add the missing data into the dataset and entered the dates for ‘plot 3’ as they had been recorded - without the year. Excel automatically interpreted the year as 2015 the year in which the data was entered into the spreadsheet. This exemplifies how easy it is to introduce errors in dates.

Dates with 2-digit year

When you type a date using a 2-digit year number (e.g. “88”), spreadsheet programmes use specific rules to determine which century to use for the date. For example, Excel determines the century by using a cutoff year of “29”, which means that 2-digit years 00-29 will be treated to be in the 21st century (i.e. after 2000), and 2-digit year 30-99 will be treated to be in the 20st century (i.e. after 1900).

Exercise

Type the follwing dates in an Excel spreadsheet:

  • 1/1/00
  • 1/1/01
  • 1/1/29
  • 1/1/30
  • 1/1/31
  • 1/1/99

What year does Excel assumes for each of these dates?

Solution

Excel is using the years 29/30 as the cutoff for the century by default - dates before and including the year “29” will be treated as in the 2000s and year “30” and above in 1900s:

  • 1/1/00 = 2000-Jan-1
  • 1/1/01 = 2001-Jan-1
  • 1/1/29 = 2029-Jan-1
  • 1/1/30 = 1930-Jan-1
  • 1/1/31 = 1931-Jan-1
  • 1/1/99 = 1999-Jan-1

Note on missing bits of dates

Make sure you remove any ambiguity when working with dates (and data in general), so as to avoid situations where spreadsheet programmes are making guesses on what your data should be.

Exporting data containing dates

Exercise

What happens to the dates in the dates tab of the messy spreadsheet if we export this sheet to .csv (Comma–Separated Value format) and then open the file in a plain text editor (like TextEdit or Notepad)? What happens to the dates if we then open the .csv file back in Excel?

Solution

  1. Click to the dates tab of the messy data spreadsheet.
  2. Select File -> Save As in Excel and in the drop down menu for file format select CSV UTF-8 (Comma delimited) (.csv) (or Text CSV (.csv) in LibreOffice). Enter a file name, e.g. dates-export.csv and click Save.
  3. You will see a pop-up that says “This workbook cannot be saved in the selected file format because it contains multiple sheets.” Choose Save Active Sheet or OK depending on your Excel version (In LibreOffice you will see a Confirm File Format dialogue: select Use Text CSV format. You will then be presented with a Export Text File dialogue: select OK. You will then be presented with a Warning dialogue: select OK.)
  4. Close your current messy data spreadsheet (you can reopen it later).
  5. Open the dates-export.csv file (or whatever you named it). Right click and select Open With. Choose a plain text editor (like TextEdit or Notepad) and view the file. Notice that the dates display as month/day without any year information.
  6. Now right click on the file again and open with Excel (or LibreOffice). In Excel, notice that the dates display with the current year, not 2015 as previously assumed by Excel. In LibreOffice, no year is added, but the date is no longer treated as a date. As you can see, exporting data from a spreadsheet program and then importing it back again can fundamentally change the data!

Note on exporting

Some versions of Excel, when exporting into a text-based format (such as CSV), will export its internal date integer representation of the date instead of the date’s value. This can potentially lead to problems if you use other software to manipulate the data as they may not understand Excel’s date encodings.

Key Points

  • Use extreme caution when working with date data.

  • Splitting dates into their component values can make them easier to handle.