Dates as data
Overview
Teaching: 10 min
Exercises: 15 minQuestions
What are good approaches for handling dates in spreadsheets?
Objectives
Describe how dates are stored and formatted in spreadsheets.
Describe the advantages of alternative date formatting in spreadsheets.
Demonstrate best practices for entering dates in spreadsheets.
Dates are the scourge of spreadsheets! They are the cause of a huge number of problems with misinterpreted data.
When working with data, the goal is to remove, as much as humanly possible, any ambiguity. But ambiguity can creep into your dates when there are regional variations (e.g. UK vs. US data formats), or if different software products (e.g., LibreOffice, Microsoft Excel, Gnumeric) are used. Ambiguity also arises when spreadsheet programs attempt to “help” you, for example by filling in what it believes is the missing year in the date you expected to only include a day and a month.
Additionally, Excel can turn things that are not dates into dates, for example names or identifiers like MAR1, DEC1, OCT4. Recently, scientists had to rename human genes to stop Microsoft Excel from misreading them as dates, as it was estimated that 20% genetics research papers contained errors!
The image below demonstrates some of the many date formatting options available in Excel.
With such a bewildering array of date formats, combined with the fact that many real-world data sets are built up over time by different people, using different programs, all of whom may be making different assumptions, it becomes difficult to believe that the dates in any sizeable dataset can be accurate! Fortunately, there is a solution. The best route to unambiguously storing dates is to use separate columns for the day, month and year. Before we get to this solution, let’s first investigate some of the major issues with how dates are stored in spreadsheets.
Dates stored as integers
We have seen that Excel displays dates in many different formats, but it stores dates for its own purposes in only one format: a number. This number represents how many days have elapsed since 31 December 1899 (the reason for this is disconcertingly long-winded and quirky). It means 1 January 1900 is stored as number 1, 2 January 1900 is stored as number 2, and so on. In the example above, 2 July 2014 is stored as the number 41822.
Storing dates as a number has some advantages: it’s easy to add days to a given date using simple arithmetic. If you had a plan to conduct interviews every ninety days, you could type:
=B2+90
And Excel would return:
30-Sep
In most cases, it retains the format of the cell that is being operated upon. Month and year rollovers are internally tracked and applied.
However, there is also a significant problem with this system, which is that Excel refuses to believe that dates existed before 31 December 1899. Open a spreadsheet and add the following to the A1 cell:
1 January 1900
Now move to a new cell and enter:
=A1+2
Excel will provide the answer 3 January 1900 (in some format) because the result of this arithmetic returns a date 2 days after 1 January 1900. Now try removing 2 days. Move to another cell and enter:
=A1-2
Excel returns an error, because it is incapable of understanding that dates existed prior to 1 January 1900. This poses a huge problem for storing historical data!
Note on adding and subtracting dates and times
Adding months and years to a date is slightly trickier than adding days because we need to make sure that we are adding the amount to the correct entity (i.e. month or year). To do so,
- first we need to extract single entities (day, month or year) from the date (how to do that is explained below),
- then, we can add values to each entity (e.g. month or year),
- finally, we need to reconstruct the date from the updated day, month, year entities using the
DATE()
function.Times are handled in a similar way; seconds can be directly added, but to add hours and minutes we need to make sure that we are adding the quantities to the correct entities. Extracting seconds, minutes and hours from a time is explained in one of the exercises below.
Excel’s date systems on Macs
In the past, Excel used to entertain a second date system, the 1904 date system, as the default in Excel for Macs. This system assigns a different serial number to dates than the 1900 date system (i.e. it starts counting time from 1 January 1904). Because of this, there were accuracy issues with dates when exporting data from Excel between Macs and PCs (dates were off by 4 years and 1 day). Luckily, Excel for Macs now defaults to the 1900 date system and supports dates as early as January 1, 1900.
Regional date formatting issues
Different countries write dates differently. If you are in the UK, you will interpret 7/12/88 as the 7th of December 1988. A researcher in the US will interpret the same entry as the 12th of July 1988. This regional variation is handled automatically by your spreadsheet program through a setting in the preferences. Errors are easily added to your spreadsheet if your preferences are set incorrectly or, more likely, if you share a spreadsheet with a person from a country that uses a different date format. It is especially devastating if only some of the date formats are changed, which leaves you with a spreadsheet full of ambiguous dates and no way to repair them.
Exercise
Issues with missing bits of dates
If you miss out some part of a date - spreadsheet programmes such as Excel will do their best to guess and form a valid date (so they can store it properly internally) from your partial information. This will happen automatically and often without you even noticing, but can cause you some headache down the line.
Dates with no year
If no year is specified, the spreadsheet program will assume you mean the current year and will insert that value. This may be incorrect if you are working with historical data so be very cautious when working with data that does not have a year specified within its date variable.
Exercise
Dates with 2-digit year
When you type a date using a 2-digit year number (e.g. “88”), spreadsheet programmes use specific rules to determine which century to use for the date. For example, Excel determines the century by using a cutoff year of “29”, which means that 2-digit years 00-29 will be treated to be in the 21st century (i.e. after 2000), and 2-digit year 30-99 will be treated to be in the 20st century (i.e. after 1900).
Exercise
Note on missing bits of dates
Make sure you remove any ambiguity when working with dates (and data in general), so as to avoid situations where spreadsheet programmes are making guesses on what your data should be.
Issues when exporting data containing dates
Exercise
Note on exporting
Some versions of Excel, when exporting into a text-based format (such as CSV), will export its internal date integer representation of the date instead of the date’s value. This can potentially lead to problems if you use other software to manipulate the data as they may not understand Excel’s date encodings.
Historical data
As far as Excel is concerned, time began on 31 December 1899 and any date before this time is not real. If you are mixing historic data from before and after this date, Excel will translate only the post-1900 dates into its internal format, resulting in mixed data. If you are working with historic data, be extremely careful with your dates!
Preferred date formats
Entering a date in a single cell is quick and easy, but it is open to so many potential problems that a new solution is needed if we are to remove ambiguity from spreadsheet dates. That solution is to separate dates into their component parts. Let’s have a look at some good options for storing dates.
Storing dates as YEAR, MONTH, DAY
As previously mentioned, storing dates in YEAR, MONTH, DAY format is one good alternative for storing dates and reduces the risk of ambiguity.
For instance, the following is a spreadsheet that represents insect counts that were taken every few days over the summer:
According to Excel, this person had been collecting bugs over a number of years, including some dates in the future!
Exercise
As for dates, times are handled in a similar way and there are functions to extract hours, minutes and seconds.
(Optional) Exercise
Storing dates as a single string
Another alternative is to convert the date string into a single string using the YYYYMMDDhhmmss
format.
For example the date March 24, 2015 17:25:35
would become 20150324172535
, where:
- YYYY: the full year, i.e. 2015
- MM: the month, i.e. 03
- DD: the day of month, i.e. 24
- hh: hour of day, i.e. 17
- mm: minutes, i.e. 25
- ss: seconds, i.e. 35
Such strings will be correctly sorted in ascending or descending order and can be correctly parsed by the receiving data analysis software. Make sure your column containing such data is formatted as ‘text’.
Key Points
Use extreme caution when working with date data.
Splitting dates into their component values can make them easier to handle.