A simple example of AI agents(?) doing journalism(?) work

Categories

code, analysis, AI
Author

Matt Waite

Published

October 23, 2024

Let’s start this with some confessions:

Something I’ve been coming around to, though, is maybe there isn’t a Manhattan Project level world changing use case for AI in journalism. Maybe Chris Albon has the right of it, that the real value is AI saving a human an hour of work … millions of times a day.

For months I’ve been trying to think of some moon-shot idea to use AI to do … something, anything … big. And everything I came up with would be terrible, destructive, or flat-out insane to deploy without massive human investment, and then what would the point be?

And then, randomly, one day a confluence of ideas popped into my head and what fell out is an example of AI agents(?) doing the work of journalism(?) that actually works.

The ideas that collided in my brain were:

  1. This Francesco Marconi tweet from April that I think did the best job of laying out a vision for AI agents in journalism. At least it’s the one that made the most sense to me.
  2. This R Package wrapping the Google Gemini API.
  3. Gemini having a free tier to try some stuff out. You get 15 requests a minute – one every 4 seconds – and 1,500 a day.
  4. The “aim small, miss small” mantra in teaching marksmanship.

Why R and not Python?

The honest truth is there is no good reason why I’m using R to do this vs. Python, which most of the AI world is using. The reason is that I teach Data Journalism to undergrads using R at the Harvard of the Plains and believe strongly that I can take absolute beginners – people who can’t spell code – from zero to data analysis faster with R and the Tidyverse than I can with Python and pandas. So I have a hammer, this here looks like a nail, and so we’re doing this with R. But there’s absolutely nothing special about this code that you couldn’t match in Python.

When I teach Data Journalism, I use population estimates from the US Census Bureau to teach students how to calculate percent change. That way, we can see which counties in the state grew the fastest and which shrank the fastest. It’s a story as old as time in the Great State of Nebraska. Rural areas are shrinking, urban areas are growing.

An extremely common thing to do with this data is to make a map. Here’s the 2022-2023 change map in Datawrapper for Nebraska.

If you click on a county, you’ll get a pop-up box that gives you the county name and then the data: population in 2023, population in 2022 and the percent change. It’s been done a million times before. It does the job.

But what if we could make it better? What if we had small narrative summaries for each county? And a headline for each one? Can I assign a human to do this? 100 percent I can. I have an army of undergrads and a gradebook to hold over them. I assure you, if I was cruel enough, we could do this for all 3,400 counties in the US.

But why do that when we can have AI do this in minutes instead of making humans miserable for hours?

Feeding Gemini numbers, getting back narrative

The gemini.R package couldn’t make sending something to Gemini any simpler. The hardest part – which is not hard – is getting an API key from Google. How do you do that? Go here: https://makersuite.google.com/app/apikey. So that I always have mine and don’t lose it, I set it as an environment variable. To do that, run usethis::edit_r_environ() and add GOOGLE_GEMINI_KEY="your key here" to your environment, save that file and restart RStudio (or whatever IDE you use).

You can test it out with something like this:

library(tidyverse)
library(gemini.R)
library(glue)

setAPI(Sys.getenv("GOOGLE_GEMINI_KEY"))

gemini("Write me a haiku about the joy and sadness of being a Nebraska football fan")

What do you get back?

Red and white, we cheer,
Hope springs eternal, then fades,
Another close loss.

Ouch.

But that’s really it. Just gemini("Words here") and off it goes to an AI, and back come the results in plain text. So the first hard part is turning data into a text block we can send to Gemini. I pull a dataset of Nebraska county population estimates, thin the number of columns I’m working with, create the percent change column, create a GEOID column made up of the state and county FIPS numbers so I can join it to my map later, rank the counties by percent change, and then mash it all together into a single text blob called base_narrative. It’s literally just Column Name: Number, Column Name: Number repeated over and over.

countypopchange <- read_csv("https://the-art-of-data-journalism.github.io/tutorial-data/census-estimates/nebraska.csv")

statenarrative <- countypopchange |> 
  select(COUNTY, STATE, CTYNAME, STNAME, POPESTIMATE2023, POPESTIMATE2022, NPOPCHG2023, NATURALCHG2023, NETMIG2023) |>
  mutate(POPPERCENTCHANGE = ((POPESTIMATE2023-POPESTIMATE2022)/POPESTIMATE2022)*100) |> 
  mutate(GEOID = paste0(STATE, COUNTY)) |> 
  arrange(desc(POPPERCENTCHANGE)) |> 
  mutate(PCTCHANGERANK = row_number()) |> 
  mutate(base_narrative = glue(
  "County: {CTYNAME}, Population in 2023: {POPESTIMATE2023}, Population in 2022: {POPESTIMATE2022}, Population change: {NPOPCHG2023}, Percent change: {POPPERCENTCHANGE}, Percent change rank in {STNAME}: {PCTCHANGERANK}, Natural change (births vs deaths): {NATURALCHG2023}, Net migration: {NETMIG2023}")) 

What does a base_narrative look like?

County: Banner County, Population in 2023: 674, Population in 2022: 657, Population change: 17, Percent change: 2.58751902587519, Percent change rank in Nebraska: 1, Natural change (births vs deaths): 0, Net migration: 17

A real exciting read, no?

Now we need to make an agent. It helped me to think of journalism as an assembly line. We have data as raw materials, and now we need to assign a worker to a process to convert raw material into something new. In any news process, the first worker is the journalist creating the thing. A reporter going to city hall. A photographer going to a breaking news event. So it makes sense that our first agent is the author of the narratives for each county.

Like any good LLM prompt, we’re going to start by giving our author agent a role – a job to do. Our agent is a demographics journalist from Nebraska. We give that agent a task with some details – keep it short but approachable. And then we give it the data.

Below the role is a function that takes in a county name, finds the base_narrative for that county, and merges the author_role and the base_narrative together when it sends them to Gemini. We store the results in a variable called … results and do a little cleanup on it (Gemini likes to cram newline characters into the results). I’ve added a five-second sleep between every county to keep from running afoul of Google’s free-tier API limits. Theoretically I should be able to set it to four seconds, but I don’t need my account banned over a second. With those pauses, our author takes about eight minutes to write small narratives about each of the 93 counties.

author_role <- "You are a demographics journalist from Nebraska. Your job today is to write short -- 2-3 sentence -- summaries of population estimates from the Census Bureau for each county in the state. I will provide you the name of the county and a series of population numbers for the county. Your job is to turn it into a concise but approachable summary of what happened in that county. Here is the data you have to work with: "

author_agent <- function(county) {
  county_narrative <- statenarrative |> filter(CTYNAME == county)
  results <- gemini(paste(author_role, county_narrative$base_narrative))
  results <- gsub("\n", "", results)
  Sys.sleep(5)
  print(paste("Processed", county_narrative$CTYNAME))
  
  # Return a single-row tibble
  tibble(county = county_narrative$CTYNAME, base_narrative = county_narrative$base_narrative, capsule_narrative = results)
}

# Use map_df to directly create the final dataframe
author_agent_results <- purrr::map_df(statenarrative$CTYNAME, author_agent)

When it’s done, we end up with a dataframe called author_agent_results which has the county name and the new narrative.

What does it look like? Remember, we gave it this:

County: Banner County, Population in 2023: 674, Population in 2022: 657, Population change: 17, Percent change: 2.58751902587519, Percent change rank in Nebraska: 1, Natural change (births vs deaths): 0, Net migration: 17

And we got back:

Banner County saw a significant population increase in 2023, growing by 17 people for a 2.6% jump, the highest rate of growth in the state. This growth was entirely due to an influx of new residents, as the county experienced no natural population change.

We could stop here, but why do that? Good journalism is often a layered process involving multiple sets of eyes reviewing the work along the way, and other skilled people adding to the product. So who would normally get this next? How about we fact check the work?

Exact same pattern, just a different role for the AI.

fact_check_role <- "You are a fact-checking editor. Your job today is to compare the information in the base narrative to the capsule narrative in the data provided to you. You will compare each number in the base narrative to the capsule narrative to make sure they are the same, and then you will check the context of how each number was used in comparison to the original base narrative. To be clear: the base narrative is the correct information. When you are finished, return just a single word. If everything is correct, respond Correct. If any part of it is not correct, respond Incorrect. "

fact_check_agent <- function(county_input) {
  county_narrative <- author_agent_results |> filter(county == county_input)
  input_string <- paste(
    fact_check_role,
    "Base narrative:", county_narrative$base_narrative,
    "Capsule narrative:", county_narrative$capsule_narrative)
  results <- gemini(input_string)
  results <- gsub("\n", "", results)
  Sys.sleep(5)
  print(paste("Processed", county_input))

  # Return a single-row tibble
  tibble(county = county_narrative$county, base_narrative = county_narrative$base_narrative, capsule_narrative = county_narrative$capsule_narrative, fact_check_results = results)
}

# Use map_df to directly create the final dataframe
fact_check_agent_results <- purrr::map_df(author_agent_results$county, fact_check_agent)

Honestly, this part needs work. It flags about 10 percent of the results as incorrect, but they aren’t. The reason it flags them is that the author agent was a little glib with the numbers – saying there was “a slight increase in natural growth because of more births” without giving the numbers themselves. The fact-checking editor bot here does not like that. So it might need a little tweaking to see if we can get the fact checker to lighten up a bit.
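One tweak I’d try first – and I haven’t tested this wording, so treat it as a sketch, not a fix – is loosening the role itself so the fact checker knows that omitting a number isn’t an error, only misstating one is:

# A looser fact-checking role. The only real change from fact_check_role above is
# the instruction that omitted numbers are fine; only wrong or misused numbers get flagged.
fact_check_role_loose <- "You are a fact-checking editor. Your job today is to compare the information in the base narrative to the capsule narrative in the data provided to you. Check every number that appears in the capsule narrative against the base narrative, and check the context of how each number was used. To be clear: the base narrative is the correct information. The capsule narrative is allowed to omit numbers or describe them in general terms; only flag it if a number it does state is wrong or used in a misleading way. When you are finished, return just a single word. If everything is correct, respond Correct. If any part of it is not correct, respond Incorrect. "

Then you’d swap that into fact_check_agent() and re-run the loop to see if the false flags go away.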

Quit now? Nah. How about we add a little local flair to each capsule? Google knows a lot of stuff, so why not add some kind of geographic context to each county? To do that, I created a rewrite editor and commanded it to add a single clause to the original narrative with a detail like the region of the state, the county seat, or the largest city.

rewrite_role <- "You are a re-write editor. Your job is to add a little local geographic context to a demographic capsule about a county in Nebraska. You'll do this by adding a clause to the paragraph I'll provide you. That clause should tell you something about that county. Maybe the county seat, or the largest city, or the region of the state it is in. We just need a clause added to one of the sentences, and do not do anything to show where you added it. Here is the capsule: "

rewrite_agent <- function(county_input) {
  county_narrative <- fact_check_agent_results |> filter(county == county_input)
  results <- gemini(paste(rewrite_role, county_narrative$capsule_narrative))
  results <- gsub("**", "", results, fixed = TRUE)
  Sys.sleep(5)
  print(paste("Processed", county_narrative$county))
  
  # Return a single-row tibble
  tibble(county = county_narrative$county, base_narrative = county_narrative$base_narrative, capsule_narrative = county_narrative$capsule_narrative, rewrite_county_narrative = results)
}

# Use map_df to directly create the final dataframe
rewrite_agent_results <- purrr::map_df(fact_check_agent_results$county, rewrite_agent)

What did it give us for Banner County?

Banner County saw a significant population increase in 2023, growing by 17 people for a 2.6% jump, the highest rate of growth in the state. This growth was entirely due to an influx of new residents, as the county experienced no natural population change, likely due to its location in the sparsely populated northwestern corner of the state.

Hmmm. Is the fact that there were the same number of births and deaths caused by its location in the northwestern corner of the state? Debatable. And, honestly, this rewrite bot has been the source of the most questions I have about this whole enterprise. We’ll talk more about that below.

One last thing: Why not give every county a headline instead of simply having the county name in the map? Done and done.

headline_writer_role <- "You are a headline writer. Your job is to write a short headline based on the summary given to you. This headline should be short -- it has to fit into a small space -- so bear that in mind. Here is the capsule: "

headline_agent <- function(county_input) {
  county_narrative <- rewrite_agent_results |> filter(county == county_input)
  results <- gemini(paste(headline_writer_role, county_narrative$rewrite_county_narrative))
  results <- gsub("\n", "", results, fixed = TRUE)
  Sys.sleep(5)
  print(paste("Processed", county_narrative$county))
  
  # Return a single-row tibble
  tibble(county = county_narrative$county, base_narrative = county_narrative$base_narrative, capsule_narrative = county_narrative$capsule_narrative, rewrite_county_narrative = county_narrative$rewrite_county_narrative, headline = results)
}

# Use map_df to directly create the final dataframe
headline_agent_results <- purrr::map_df(rewrite_agent_results$county, headline_agent)

And what does that look like in Banner County?

Banner County Booms: 2.6% Growth Fueled by New Residents

Here is the exact same map, same data, but now with headlines and narratives written for each county. Click on one and you’ll get a human-friendly narrative about that county, and the numbers below that if you want them.

Is Christopher Nolan going to make a movie about this moment? Not hardly. Have I “Saved Journalism”? Nope. Not even close. Did I save a human a few minutes of drudgery 93 + 93 + 93 + 93 times? Yes. Yes I did. Is this idea extensible and repeatable? It sure is. I think that’s good enough for now.

Where to go from here

From here, what this needs is a block of code that pushes the results of the headline agent to Google Sheets, where a human can go through each one and make sure everything is fine. I’ve tried to do that in RStudio, which is not great and limits the number of people who could do this. And to be sure, it needs to be done. Nebraska geography nerds – there are tens of us! – will be able to find Arthur County, one of the smallest counties in the US by population. For those who can’t, the rewrite bot describes Arthur as “located in the northwestern corner of the state and home to the county seat of Arthur.” Half of that is objectively true. Arthur is much more arguably in the west-central part of the state, but it really depends on where you draw the lines. It’s arguable enough that a good local editor would drop that part and leave the county seat part, which is correct. Arthur County’s county seat is … Arthur.
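As a down payment on that, here’s roughly what the hand-off could look like with the googlesheets4 package. This is a minimal sketch, assuming interactive Google authentication and a brand-new spreadsheet; the sheet name is just for illustration:

library(googlesheets4)

# Authenticate with the Google account that should own the review sheet
gs4_auth()

# Create a new spreadsheet with one tab holding the headline agent's output,
# so a human editor can read -- and flag -- every county's headline and narratives
review_sheet <- gs4_create(
  "nebraska-county-capsules-review",
  sheets = list(counties = headline_agent_results)
)

From there, the review is just reading, which is the point: the human stays in the loop without having to touch R at all.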

Then, as a last step, I would automate the process of creating the Datawrapper chart with the DatawRapper library, which accesses the service’s API. Imagine the Census Bureau publishing the data, and then in a matter of minutes you have that map done and online. To do that, you’d have to have a little budget for API calls so you aren’t waiting 32 minutes for the whole thing to run – 8 + 8 + 8 + 8 minutes for each step. But even then, is there really a competitive contest over who can publish Census maps in Nebraska fast enough? No.
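For what it’s worth, here’s the shape I imagine that automation taking. I haven’t run this, the chart ID is made up, and the function names are from my reading of that package’s documentation, so check them before you trust any of it:

library(DatawRappr)

# Authenticate once with a Datawrapper API token, stored the same way as the Gemini key
datawrapper_auth(api_key = Sys.getenv("DATAWRAPPER_KEY"))

# Join the headlines and narratives back to the numbers and the GEOID the map keys on
map_data <- statenarrative |> 
  select(GEOID, CTYNAME, POPESTIMATE2023, POPESTIMATE2022, POPPERCENTCHANGE) |> 
  inner_join(headline_agent_results, by = c("CTYNAME" = "county"))

# Push the data to the existing choropleth map and republish it
dw_data_to_chart(map_data, chart_id = "abc123")
dw_publish_chart(chart_id = "abc123")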

But if you, like me, are thinking about how we’re going to incorporate AI into journalism in ways that make sense, don’t risk the reputation of the organization you work for, and don’t horrify your audience to the point that they turn away from you for feeding them AI slop, this is an example of just how to do that.

Aim small, miss small.