Filtering and Sorting Data
Overview
Teaching: 20 min
Exercises: 10 minQuestions
How can we select only a subset of our data to work with?
How can we sort our data?
Objectives
Employ text filters or sub-setting options in facets to filter to a subset of rows.
Sort data by one or multiple columns.
Filtering data
Sometimes you want to view and work only with a subset of data or apply an operation only to a subset. You can do this by applying various filters to your data.
Including/excluding data entries on facets
One way to filter down our data is to use the include
or exclude
buttons on the entries in a text facet.
If you still
have your text facet for scientificName
, you can use it. If you’ve closed that facet, recreate it by selecting Facet
>
Text facet
on the scientificName
column.
- In the text facet, hover over one of the names, e.g.
Baiomys taylori
. Notice that when you hover over it, there are buttons to the right foredit
andinclude
. - Whilst hovering over
Baiomys taylori
, move to the right and click theinclude
option. This will include this species, as signified by the name of the species changing from blue to orange, and new options ofedit
andexclude
will be presented. Note that in the top of the page, “46 matching rows” is now displayed instead of “35549 rows”. - You can include other species in your current filter - e.g. click on
Chaetodipus baileyi
in the same way to include it in the filter. - Alternatively, you can click the name of the species to include it in the filter instead of clicking the
include
/exclude
buttons. This will include the selected species and exclude all others options in a single step, which can be useful. - Click
include
andexclude
on the other species and notice how the entries appear and disappear from the data table to the right.
You can also filter data using other types of facets - let’s do it as an exercise.
Exercise
Text filters
Another way to filter data is to create a text filter on a column. Close all facets you may have created previously
and reinstate the text facet on the scientificName
column.
- Click the down arrow next to
scientificName
>Text filter
. AscientificName
filter will appear on the left margin below the text facet. -
Type in
bai
into the text box in the filter and press return. At the top of the page it will report that, out of the 35549 rows in the raw data, there are 48 rows in which the text has been found within thescientificName
column (and these rows will be selected for the subsequent steps). - Near the top of the screen, change
Show:
to 50. This way, you will see all the matching rows in a single page.
Exercise
Important: Make sure both species are included in your filtered dataset before continuing with the rest of the exercises.
Filters vs. facets
Faceting and filtering look very similar. A good distinction is that faceting gives you an overview of all the data that is currently selected, while filtering allows you to select a subset of your data for further inspection and analysis.
Sorting data
Sorting data is a useful practice for detecting outliers in data - potential errors and blanks will sort to the top or the bottom of your data.
To sort the data in a column, click the arrow next to your chosen column name and select Sort...
. This will open a pop-up
that will present you with different options, e.g. whether you wish to sort by text
, numbers
, dates
or booleans
(i.e. TRUE
or FALSE
values). Additional options will appear to allow you to fine-tune your sorting - e.g. you can
specify where to place Blanks and Errors in the sorted results.
Exercise
The first time you sort a column, the first option will present as Sort...
. If you re-sort a column that you
have already sorted, the drop-down menu changes slightly: the first option will be Sort
and the following sub-options
will be presented:
Sort...
- This option enables you to choose a new sort.Reverse
- This option allows you to reverse the order of the sort.Remove sort
- This option allows you to undo your sort.
It may not always be that obvious in OpenRefine that you performed a sort. Once you complete the sort for the first time
a Sort
button will appear at the top of the page as an indicator that the data has been sorted. It will disappear if
you remove the sort.
Exercise
Sorting by multiple columns
Once you have completed one sort, any further sort will be performed by default as an additional sort. For example, if
you take the data that has been sorted by plot
and the sort by year
, OpenRefine will present the data sorted by
plot
and then by year
. If you wish to restart the sorting process, check the sort by this column
alone
box in the Sort
pop-up menu.
Exercise
If you go back to one of the already sorted columns and select Sort
> Remove sort
, that column is removed from
your multiple sort. If it is the only column sorted, then the data revert to their original order.
Exercise
Sort by
year
,month
andday
in some order. Be creative: try sorting asnumbers
ortext
, and in reverse order (largest to smallest
orz to a
).Use
Sort
>Remove sort
to remove the sort on the second of three columns. Notice how that changes the order.
Sorting does not change data
You may have noticed that after any of the sorting steps there was nothing to undo/redo. This is because you have only reordered and not modified you data. If you want to revert to the original order of the data - make sure you remove all “sorts” using the
Sort
>Remove sort
option.
Key Points
OpenRefine provides various ways to sort and filter data without affecting the raw data.