9  Filters and selections

More often than not, we have more data than we want. Sometimes we need to be rid of that data. In dplyr, there’s two ways to go about this: filtering and selecting.

Filtering creates a subset of the data based on criteria. All records where the count is greater than 10. All records that match “Nebraska”. Something like that. Filtering works with rows – when we filter, we get fewer rows back than we start with.

Selecting simply returns only the fields named. So if you only want to see Year and County, you select those fields. When you look at your data again, you’ll have two columns. If you try to use one of your columns that you had before you used select, you’ll get an error. Selecting works with columns. You will have the same number of records when you are done, but fewer columns of data to work with.

Let’s work with the salaries data from the University of Nebraska. It has data from all NU campuses, but only one of them is our campus, so let’s filter out everyone else.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
salaries <- read_csv("data/nusalaries1819.csv")
Rows: 13039 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Employee, Position, Campus, Department
num (3): Budgeted Annual Salary, Salary from State Aided Funds, Salary from ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data we want to filter on is in Campus. So we’re going to use filter and something called a comparison operator. We need to filter all records equal to UNL. The comparison operators in R, like most programming languages, are == for equal to, != for not equal to, > for greater than, >= for greater than or equal to and so on.

Be careful: = is not == and = is not “equal to”. = is an assignment operator in most languages – how things get named.

unl <- salaries |> filter(Campus == "UNL")

head(unl)
# A tibble: 6 × 7
  Employee             Position         Campus Department Budgeted Annual Sala…¹
  <chr>                <chr>            <chr>  <chr>                       <dbl>
1 Abbott, Frances M    Staff Secy III   UNL    FM&P Faci…                  37318
2 Abdel-Monem, Tarik L Research Specia… UNL    Public Po…                  58502
3 Abel, Marco          Chairperson      UNL    English                     64470
4 Abel, Marco          Professor        UNL    English                     39647
5 Abel, Rick A         Control Systems… UNL    FM&P Buil…                  57178
6 Abendroth, Curtis L  Asst Dir Facili… UNL    Housing F…                  79037
# ℹ abbreviated name: ¹​`Budgeted Annual Salary`
# ℹ 2 more variables: `Salary from State Aided Funds` <dbl>,
#   `Salary from Other Funds` <dbl>

And just like that, we have just UNL, which we can verify looking at the head, the first six rows.

We also have more data than we might want. For example, the salary data is only in the Budgeted Annual Salary column. The other two salary fields are useless detail.

To simplify our dataset, we can use select.

selected_unl <- unl |> select(Employee, Position, Campus, `Budgeted Annual Salary`)

head(selected_unl)
# A tibble: 6 × 4
  Employee             Position                    Campus Budgeted Annual Sala…¹
  <chr>                <chr>                       <chr>                   <dbl>
1 Abbott, Frances M    Staff Secy III              UNL                     37318
2 Abdel-Monem, Tarik L Research Specialist         UNL                     58502
3 Abel, Marco          Chairperson                 UNL                     64470
4 Abel, Marco          Professor                   UNL                     39647
5 Abel, Rick A         Control Systems Tech/Alarm… UNL                     57178
6 Abendroth, Curtis L  Asst Dir Facilities Mgt/Ma… UNL                     79037
# ℹ abbreviated name: ¹​`Budgeted Annual Salary`

And now we only have four columns of data for whatever salary analysis we might want to do.

9.1 Combining filters

So let’s say we wanted to know how many full professors make more than $100,000. We can do this a number of ways. The first is we can chain together a whole lot of filters.

profs <- salaries |> filter(Campus == "UNL") |> filter(Position == "Professor") |> filter(`Budgeted Annual Salary` > 100000)

nrow(profs)
[1] 312

So that gives us 312 full professors – that’s the top rank of tenured professors – who make more than $100,000. But that’s silly and repetitive, no? We can do better using boolean operators – AND and OR. In this case, AND is & and OR is |.

The difference? With AND, all three things must be true to be included. With OR, any of those three things can be true and it will be included. An assistant professor making $100k at UNO will get included because they make $100k. One of the conditions is true.

Here’s the difference.

andprofs <- salaries |> filter(Campus == "UNL" & Position == "Professor" & `Budgeted Annual Salary` > 100000)

nrow(andprofs)
[1] 312

So AND gives us the same answer we got before. What does OR give us?

orprofs <- salaries |> filter(Campus == "UNL" | Position == "Professor" | `Budgeted Annual Salary` > 100000)

nrow(orprofs)
[1] 7248

So there’s 7,248 unique people in the NU system who are at UNL (6,079 to be exact), are full Professors (1,086 of them), or make more than $100,000 (1,604) of them. Included in that list? Football coach Scott Frost, who makes … ahem … more than $100,000. A smidge more.