As I’ve written before, I am at best an enthusiastic amateur when it comes to AI, LLMs and R. But I’m braver/dumber than most, so for a talk I’m giving to NE-RUG – the Nebraska R Users Group – and to the NICAR data journalism conference, I’ve got some resources and some code to share.
R libraries
ellmer: From the folks who brought you the tidyverse comes ellmer, a library that supports talking to a large number of LLMs. To talk to the big commercial LLMs, you’ll need API keys, and that usually means having an API budget. But what I like about ellmer is that it also talks to locally hosted models. More about that later.
chores: Built on top of ellmer, chores is a neat way to build tools inside RStudio that use LLMs to accomplish tasks – for example, helping with certain kinds of coding tasks, or explaining what an error message means.
External resources
Ollama: A cross-platform system of downloading, managing and running open-source LLMs on your local machine. With Ollama, you can run Meta’s Llama3 or DeepSeek’s R1 locally, using them to accomplish tasks without incurring costs. A rough rule of thumb is that you can run models slightly smaller than the amount of RAM you have. For example, my computer has 16GB of RAM, so I can run 14 billion parameter models (albeit somewhat slowly).
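As a quick, hedged sketch of what that local setup looks like with ellmer: the package ships a `chat_ollama()` function, so once you’ve pulled a model (say, with `ollama pull llama3` at the command line) and Ollama is running, you can talk to it the same way you’d talk to a commercial model – no API key, no bill.

```r
library(ellmer)

# Talk to a locally hosted model instead of a commercial API.
# The model name must match one you have already pulled with Ollama.
chat <- chat_ollama(model = "llama3")
chat$chat("Summarize the First Amendment in one sentence.")
```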
Basics of ellmer
ellmer at first is very simple. You can just start a chat and … chat.
library(ellmer)
library(tidyverse)
library(glue)

chat <- chat_gemini()
chat$chat("Tell me three jokes about journalists")
1. Why did the journalist get fired from the newspaper? Because he kept
writing his own obituaries.
2. A journalist walks into a bar and orders a drink. The bartender says, "Hey,
aren't you the guy who wrote that article about me?" The journalist replies,
"I don't remember. What was it about?"
3. What's the difference between a journalist and a pizza? A pizza can feed a
family.
Structured data output
Most models can be used to extract structured data from sentences.
chat <- chat_gemini()

data <- chat$extract_data(
  "My name is Matt and I'm a 49 year old Journalism major",
  type = type_object(
    age = type_number(),
    name = type_string(),
    major = type_string()
  )
)

as.data.frame(data)
age name major
1 49 Matt Journalism
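This scales past a single record, too. A hedged sketch – ellmer’s `type_array()` can wrap the same `type_object()` so one call pulls several people out of a passage (the field names here are the same as above; the second sentence is my own example):

```r
library(ellmer)

chat <- chat_gemini()

# One call, multiple structured records.
people <- chat$extract_data(
  "Matt is a 49 year old Journalism major. Sarah is a 22 year old Biology major.",
  type = type_array(
    items = type_object(
      age = type_number(),
      name = type_string(),
      major = type_string()
    )
  )
)

people
```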
Something more like data journalism
What if we wanted to normalize some messy, messy data? In Nebraska, the Department of Corrections publishes a live dataset of incarcerated people that you can download. You can ask and answer all kinds of questions from it – demographics, etc. But what you can’t do is figure out which charges are holding the most people, or how many people are there at least in part because of drug charges. Why? Because the charges they are serving time for are not normalized. Here’s an example of what they look like:
POS CNTRL SUB-METHAMPHETAMINE
MURDER 1ST DEGREE
POSSESSION OF METHAMPHETAMINE
POS CNTRL SUB (METH)
POSSESS CONTR SUBSTANCE-METH
DELIVERY OF METHAMPHETAMINE
POSS W/ INTENT DIST METH
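The original code isn’t reproduced here, but a sketch of how you could ask the model to classify and clean up one of those charges – using ellmer’s structured extraction with boolean fields matching the output shown below:

```r
library(ellmer)

chat <- chat_gemini()

# Classify one messy charge string and spell it out in full.
chat$extract_data(
  "POSS W/ INTENT DIST METH",
  type = type_object(
    methamphetamine_related = type_boolean(),
    drug_possession_related = type_boolean(),
    drug_distribution_related = type_boolean(),
    fully_spelled_out_no_abbreviations = type_string()
  )
)
```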
$methamphetamine_related
[1] TRUE
$drug_possession_related
[1] TRUE
$drug_distribution_related
[1] TRUE
$fully_spelled_out_no_abbreviations
[1] "Possession with Intent to Distribute Methamphetamine"
* You have to ruthlessly check this output. It’s good, but it is not perfect.
Why am I only doing one here? Because the free tier of Gemini limits you to about 5 queries a minute. But it wouldn’t be that hard to write a function that runs through your list of charges and inserts a pause after each one to ensure you stay in the free tier.
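A sketch of that pause-and-loop idea: `Sys.sleep(15)` keeps you under roughly five queries a minute. The function name and structure here are mine, not from the original post.

```r
library(ellmer)
library(purrr)

# Run each charge through the model, pausing between calls
# to stay inside the free tier's rate limit.
normalize_charges <- function(charges) {
  chat <- chat_gemini()
  map(charges, function(charge) {
    result <- chat$extract_data(
      charge,
      type = type_object(
        methamphetamine_related = type_boolean(),
        fully_spelled_out_no_abbreviations = type_string()
      )
    )
    Sys.sleep(15)  # ~4 requests per minute, safely under the cap
    result
  })
}
```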
What else can be done?
Nuclear powered population maps
Here is a basic population map of Nebraska with one-year change values in it. We’ve all made this chart before. It’s simple, but it gets the job done.
I’ve written a lot more about this process here, but what if we could take the basic data from our population table and create narratives out of it? Instead of getting the information in a boring tabular way, readers get a human narrative of a few sentences.
First, let’s create the base narrative. That is the words we will feed to Gemini.
countypopchange <- read_csv("https://the-art-of-data-journalism.github.io/tutorial-data/census-estimates/nebraska.csv")

statenarrative <- countypopchange |>
  select(COUNTY, STATE, CTYNAME, STNAME, POPESTIMATE2023, POPESTIMATE2022, NPOPCHG2023, NATURALCHG2023, NETMIG2023) |>
  mutate(POPPERCENTCHANGE = ((POPESTIMATE2023 - POPESTIMATE2022) / POPESTIMATE2022) * 100) |>
  mutate(GEOID = paste0(COUNTY, STATE)) |>
  arrange(desc(POPPERCENTCHANGE)) |>
  mutate(PCTCHANGERANK = row_number()) |>
  mutate(base_narrative = glue("County: {CTYNAME}, Population in 2023: {POPESTIMATE2023}, Population in 2022: {POPESTIMATE2022}, Population change: {NPOPCHG2023}, Percent change: {POPPERCENTCHANGE}, Percent change rank in {STNAME}: {PCTCHANGERANK}, Natural change (births vs deaths): {NATURALCHG2023}, Net migration: {NETMIG2023}"))

statenarrative$base_narrative[[1]]
County: Banner County, Population in 2023: 674, Population in 2022: 657, Population change: 17, Percent change: 2.58751902587519, Percent change rank in Nebraska: 1, Natural change (births vs deaths): 0, Net migration: 17
Now, we’re going to add a system prompt to our chat to give Gemini some guidance. It’s not hard to do and you should do it.
chat <- chat_gemini(
  system_prompt = "You are a demographics journalist from Nebraska. Your job today is to write short -- 2-3 sentence -- summaries of population estimates from the Census Bureau for each county in the state. I will provide you the name of the county and a series of population numbers for the county. Your job is to turn it into a concise but approachable summary of what happened in that county. Here is the data you have to work with: "
)

chat$chat(statenarrative$base_narrative[[1]])
Banner County experienced a healthy growth spurt in 2023, adding 17 residents
for a total population of 674. This 2.6% increase, the highest in Nebraska, was
driven entirely by net migration, as births and deaths balanced each other out.
Earth-shattering? Hardly. But what if, instead of world-changing uses of technology, the right aim for AI is to offload tasks we would do because the result would be better, but that aren’t worth spending the time on because we all have a limited time on this earth?
If you were to run this 93 times – or, you know, write a function to do that – and added a headline-writing bot, here’s what your map now looks like.
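A sketch of what that 93-county loop might look like, again with a pause to respect the free tier. The function name, the shortened system prompt, and the `narrative` column are my own additions, not from the original post:

```r
library(ellmer)
library(purrr)

# Turn one county's base_narrative string into a short written summary.
write_narrative <- function(base_narrative) {
  chat <- chat_gemini(
    system_prompt = "You are a demographics journalist from Nebraska. Write a 2-3 sentence summary of the county population data you are given."
  )
  narrative <- chat$chat(base_narrative)
  Sys.sleep(15)  # stay under the free tier's rate limit
  narrative
}

# One narrative per county, ready to join onto the map data.
statenarrative$narrative <- map_chr(statenarrative$base_narrative, write_narrative)
```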