6 Filters and selections

More often than not, we have more data than we want. Sometimes we need to be rid of that data. In dplyr, there’s two ways to go about this: filtering and selecting.

Filtering creates a subset of the data based on criteria. All records where the count is greater than 10. All records that match “Nebraska”. Something like that.

Selecting simply returns only the fields named. So if you only want to see School and Attendance, you select those fields. When you look at your data again, you’ll have two columns. If you try to use one of your columns that you had before you used select, you’ll get an error.

Let’s work with our football attendance data to show some examples.

For this walkthrough:

First we’ll need the tidyverse.

library(tidyverse)

Now import the data.

attendance <- read_csv('data/attendance.csv')

Rows: 150 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Institution, Conference
dbl (6): 2013, 2014, 2015, 2016, 2017, 2018

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

So, first things first, let’s say we don’t care about all this Air Force, Akron, Alabama crap and just want to see Dear Old Nebraska U. We do that with filter and then we pass it a condition.

Before we do that, a note about conditions. Most of the conditional operators you’ll understand – greater than and less than are > and <. The tough one to remember is equal to. In conditional statements, equal to is == not =. If you haven’t noticed, = is a variable assignment operator, not a conditional statement. So equal is == and NOT equal is !=.

So if you want to see Institutions equal to Nebraska, you do this:

attendance |> filter(Institution == "Nebraska")

# A tibble: 1 × 8
  Institution Conference `2013` `2014` `2015` `2016` `2017` `2018`
  <chr>       <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Nebraska    Big Ten    727466 638744 629983 631402 628583 623240

Or if we want to see schools that had more than half a million people buy tickets to a football game in a season, we do the following. NOTE THE BACKTICKS.

attendance |> filter(`2018` >= 500000)

# A tibble: 17 × 8
   Institution    Conference `2013` `2014` `2015` `2016` `2017` `2018`
   <chr>          <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama        SEC        710538 710736 707786 712747 712053 710931
 2 Auburn         SEC        685252 612157 612157 695498 605120 591236
 3 Clemson        ACC        574333 572262 588266 566787 565412 562799
 4 Florida        SEC        524638 515001 630457 439229 520290 576299
 5 Georgia        SEC        556476 649222 649222 556476 556476 649222
 6 LSU            SEC        639927 712063 654084 708618 591034 705733
 7 Michigan       Big Ten    781144 734364 771174 883741 669534 775156
 8 Michigan St.   Big Ten    506294 522765 522628 522666 507398 508088
 9 Nebraska       Big Ten    727466 638744 629983 631402 628583 623240
10 Ohio St.       Big Ten    734528 744075 750705 750944 752464 713630
11 Oklahoma       Big 12     508334 510972 512139 521142 519119 607146
12 Penn St.       Big Ten    676112 711358 698590 701800 746946 738396
13 South Carolina SEC        576805 569664 472934 538441 550099 515396
14 Tennessee      SEC        669087 698276 704088 706776 670454 650887
15 Texas          Big 12     593857 564618 540210 587283 556667 586277
16 Texas A&M      SEC        697003 630735 725354 713418 691612 698908
17 Wisconsin      Big Ten    552378 556642 546099 476144 551766 540072

But what if we want to see all of the Power Five conferences? We could use conditional logic in our filter. The conditional logic operators are | for OR and & for AND. NOTE: AND means all conditions have to be met. OR means any of the conditions work. So be careful about boolean logic.

attendance |> filter(Conference == "Big 10" | Conference == "SEC" | Conference == "Pac-12" | Conference == "ACC" | Conference == "Big 12")

# A tibble: 51 × 8
   Institution    Conference `2013` `2014` `2015` `2016` `2017` `2018`
   <chr>          <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama        SEC        710538 710736 707786 712747 712053 710931
 2 Arizona        Pac-12     285713 354973 308355 338017 255791 318051
 3 Arizona St.    Pac-12     501509 343073 368985 286417 359660 291091
 4 Arkansas       SEC        431174 399124 471279 487067 442569 367748
 5 Auburn         SEC        685252 612157 612157 695498 605120 591236
 6 Baylor         Big 12     321639 280257 276960 275029 262978 248017
 7 Boston College ACC        198035 239893 211433 192942 215546 263363
 8 California     Pac-12     345303 286051 292797 279769 219290 300061
 9 Clemson        ACC        574333 572262 588266 566787 565412 562799
10 Colorado       Pac-12     230778 226670 236331 279652 282335 274852
# ℹ 41 more rows

But that’s a lot of repetitive code. And a lot of typing. And typing is the devil. So what if we could create a list and pass it into the filter? It’s pretty simple.

We can create a new variable – remember variables can represent just about anything – and create a list. To do that we use the c operator, which stands for concatenate. That just means take all the stuff in the parenthesis after the c and bunch it into a list.

Note here: text is in quotes. If they were numbers, we wouldn’t need the quotes.

powerfive <- c("SEC", "Big Ten", "Pac-12", "Big 12", "ACC")

Now with a list, we can use the %in% operator. It does what you think it does – it gives you data that matches things IN the list you give it.

attendance |> filter(Conference %in% powerfive)

# A tibble: 65 × 8
   Institution    Conference `2013` `2014` `2015` `2016` `2017` `2018`
   <chr>          <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Alabama        SEC        710538 710736 707786 712747 712053 710931
 2 Arizona        Pac-12     285713 354973 308355 338017 255791 318051
 3 Arizona St.    Pac-12     501509 343073 368985 286417 359660 291091
 4 Arkansas       SEC        431174 399124 471279 487067 442569 367748
 5 Auburn         SEC        685252 612157 612157 695498 605120 591236
 6 Baylor         Big 12     321639 280257 276960 275029 262978 248017
 7 Boston College ACC        198035 239893 211433 192942 215546 263363
 8 California     Pac-12     345303 286051 292797 279769 219290 300061
 9 Clemson        ACC        574333 572262 588266 566787 565412 562799
10 Colorado       Pac-12     230778 226670 236331 279652 282335 274852
# ℹ 55 more rows

6.1 Selecting data to make it easier to read

So now we have our Power Five list. What if we just wanted to see attendance from the most recent season and ignore all the rest? Select to the rescue.

attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2018`)

# A tibble: 65 × 3
   Institution    Conference `2018`
   <chr>          <chr>       <dbl>
 1 Alabama        SEC        710931
 2 Arizona        Pac-12     318051
 3 Arizona St.    Pac-12     291091
 4 Arkansas       SEC        367748
 5 Auburn         SEC        591236
 6 Baylor         Big 12     248017
 7 Boston College ACC        263363
 8 California     Pac-12     300061
 9 Clemson        ACC        562799
10 Colorado       Pac-12     274852
# ℹ 55 more rows

If you have truly massive data, Select has tools to help you select fields that start_with the same things or ends with a certain word. The documentation will guide you if you need those someday. For 90 plus percent of what we do, just naming the fields will be sufficient.

6.2 Using conditional filters to set limits

Let’s return to the problem of one-hit wonders in basketball mucking up our true shooting analysis. How can we set limits in something like a question of who had the best season? Let’s grab every player from last season.

For this walkthrough:

Let’s get set up similar to the previous chapter.

players <- read_csv("data/players21.csv")

Rows: 5410 Columns: 58
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): Team, Player, Class, Pos, Height, Hometown, High School, Summary, ...
dbl (49): #, Weight, Rk.x, G, GS, MP, FG, FGA, FG%, 2P, 2PA, 2P%, 3P, 3PA, 3...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

players |>
  mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
  arrange(desc(trueshooting))

# A tibble: 5,410 × 59
   Team    Player   `#` Class Pos   Height Weight Hometown `High School` Summary
   <chr>   <chr>  <dbl> <chr> <chr> <chr>   <dbl> <chr>    <chr>         <chr>  
 1 Weber … Jake …    25 SO    F     6-7       185 Star, ID Eagle (ID)    3.0 Pt…
 2 Utah S… Karso…    21 FR    G     6-4       175 Celina,… Celina HS     0.4 Pt…
 3 UNLV R… Trey …     2 JR    G     5-11      175 Henders… Coronado (NV) 1.0 Pt…
 4 Missis… Justi…    21 JR    G     6-5       215 Birming… Hoover (AL)   3.0 Pt…
 5 Louisv… Hogan…     3 SO    C     6-8       235 Zionsvi… Zionsville HS 1.0 Pt…
 6 Gonzag… Matth…    23 JR    G     6-3       185 Portlan… Jesuit HS     0.5 Pt…
 7 Fresno… Alec …    20 JR    G     6-4       190 Riversi… Junipero Ser… 3.0 Pt…
 8 ETSU B… Morre…    22 FR    G     6-2       210 Clevela… Cleveland HS  0.6 Pt…
 9 UConn … Andre…    20 FR    G     6-1       175 Glaston… St. Anthony … 0.8 Pt…
10 Bellar… Hunte…    23 <NA>  G     6-4       185 Mount W… Bullitt East… 0.8 Pt…
# ℹ 5,400 more rows
# ℹ 49 more variables: Rk.x <dbl>, G <dbl>, GS <dbl>, MP <dbl>, FG <dbl>,
#   FGA <dbl>, `FG%` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `3P` <dbl>,
#   `3PA` <dbl>, `3P%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>, ORB <dbl>,
#   DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>,
#   PTS <dbl>, Rk.y <dbl>, PER <dbl>, `TS%` <dbl>, `eFG%` <dbl>, `3PAr` <dbl>,
#   FTr <dbl>, PProd <dbl>, `ORB%` <dbl>, `DRB%` <dbl>, `TRB%` <dbl>, …

And ine the COVID season, Weber State’s Jake Furgerson is your One Shot One Three True Shooting champion.

In most contests, like the batting title in Major League Baseball, there’s a minimum number of X to qualify. In baseball, it’s at bats. In basketball, it attempts. So let’s set a floor and see how it changes. What if we said you had to have played 100 minutes in a season? The top players in college basketball play more than 1000 minutes in a season. So 100 is not that much. Let’s try it and see.

players |>
  mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
  arrange(desc(trueshooting)) |>
  filter(MP > 100)

# A tibble: 3,490 × 59
   Team    Player   `#` Class Pos   Height Weight Hometown `High School` Summary
   <chr>   <chr>  <dbl> <chr> <chr> <chr>   <dbl> <chr>    <chr>         <chr>  
 1 South … Natha…    14 JR    F     6-6       189 Murfree… Blackman HS   1.7 Pt…
 2 Idaho … Emmit…     4 JR    G     6-4       195 Lapwai,… Lapwai (ID)   3.8 Pt…
 3 Sacram… Spenc…    34 SR    F     6-6       195 Woodbri… Lodi (CA)     2.2 Pt…
 4 LSU Fi… Mwani…     0 FR    F     6-5       215 Las Veg… Bishop Gorma… 3.6 Pt…
 5 Sacred… Mike …     3 FR    G     6-0       170 Hicksvi… Holy Trinity… 8.0 Pt…
 6 Northe… Bryce…    21 FR    F     6-7       205 Allen, … Allen HS      2.9 Pt…
 7 Quinni… Seth …     4 SO    C     7-1       200 Warmins… Archbishop W… 8.0 Pt…
 8 Utah U… Lahat…    32 SO    F     6-10      243 Dakar, … Florida Prep  1.4 Pt…
 9 Colgat… Olive…    33 JR    G     6-2       190 Chapel … Oak Hill Aca… 3.7 Pt…
10 Akron … Enriq…    25 SO    F     6-7       205 Clevela… St Martin De… 7.9 Pt…
# ℹ 3,480 more rows
# ℹ 49 more variables: Rk.x <dbl>, G <dbl>, GS <dbl>, MP <dbl>, FG <dbl>,
#   FGA <dbl>, `FG%` <dbl>, `2P` <dbl>, `2PA` <dbl>, `2P%` <dbl>, `3P` <dbl>,
#   `3PA` <dbl>, `3P%` <dbl>, FT <dbl>, FTA <dbl>, `FT%` <dbl>, ORB <dbl>,
#   DRB <dbl>, TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>,
#   PTS <dbl>, Rk.y <dbl>, PER <dbl>, `TS%` <dbl>, `eFG%` <dbl>, `3PAr` <dbl>,
#   FTr <dbl>, PProd <dbl>, `ORB%` <dbl>, `DRB%` <dbl>, `TRB%` <dbl>, …

Now you get South Carolina’s Nathan Nelson, who played in 15 games and was on the floor for 113 minutes. So he played regularly, but not a lot. But in that time, he only attempted 14 shots, and made 64 percent of them. In other words, when he shot, he probably scored. He just rarely shot.

So is 100 minutes our level? Here’s the truth – there’s not really an answer here. We’re picking a cutoff. If you can cite a reason for it and defend it, then it probably works.

6.3 Top list

One last little dplyr trick that’s nice to have in the toolbox is a shortcut for selecting only the top values for your dataset. Want to make a Top 10 List? Or Top 25? Or Top Whatever You Want? It’s easy.

So what are the top 10 Power Five schools by season attendance. All we’re doing here is chaining commands together with what we’ve already got. We’re filtering by our list of Power Five conferences, we’re selecting the three fields we need, now we’re going to arrange it by total attendance and then we’ll introduce the new function: top_n. The top_n function just takes a number. So we want a top 10 list? We do it like this:

attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2018`) |> arrange(desc(`2018`)) |> top_n(10)

Selecting by 2018

# A tibble: 10 × 3
   Institution Conference `2018`
   <chr>       <chr>       <dbl>
 1 Michigan    Big Ten    775156
 2 Penn St.    Big Ten    738396
 3 Ohio St.    Big Ten    713630
 4 Alabama     SEC        710931
 5 LSU         SEC        705733
 6 Texas A&M   SEC        698908
 7 Tennessee   SEC        650887
 8 Georgia     SEC        649222
 9 Nebraska    Big Ten    623240
10 Oklahoma    Big 12     607146

That’s all there is to it. Just remember – for it to work correctly, you need to sort your data BEFORE you run top_n. Otherwise, you’re just getting the first 10 values in the list. The function doesn’t know what field you want the top values of. You have to do it.