This is an article digest for “Long-Form Factuality in Large Language Models”.

Summary

  1. They created the LongFact dataset by picking 38 categories, and asking GPT-4 to generate a few dozen questions for each. For example, in Biology, one generated question is “What is the role of telomeres in cellular aging and how may they influence the aging process of an organism?”
  2. They then ask various state-of-the-art LLMs to answer the LongFact questions and score the responses with SAFE. In effect, this doubles as a factuality benchmark for modern LLMs.
  3. The SAFE (Search-Augmented Factuality Evaluator) algorithm scores how well-supported a long-form answer to a given question is (a rough code sketch follows this list):
    1. Split the text into sentences using simple Python code.
    2. Split each sentence into atomic facts by prompting GPT-3.5-T.
    3. Pass these atomic facts to GPT-3.5-T along with a few-shot prompt that makes them “self-contained” and understandable if read alone. Mostly this means replacing pronouns with the entities they refer to in the surrounding response.
    4. Pass these self-contained atomic facts to GPT-3.5-T inside a few-shot prompt to determine whether each fact is relevant to the question.
    5. For each relevant atomic fact, ask GPT-3.5-T to generate 5 Google Search queries. Execute these and read the top three hits for each. (I believe it only reads the Google result snippet, and doesn’t do a full page load.)
    6. Feed the relevant atomic fact and the Google hits into a final GPT-3.5-T prompt. The prompt uses simple chain-of-thought reasoning and asks the model to decide whether the atomic fact is “supported” or “not supported” by the Google hits.

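To make the steps concrete, here is a minimal Python sketch of the pipeline. It is my paraphrase, not the paper’s code: it assumes the OpenAI Python client, invents a `google_search()` helper, and compresses the few-shot prompts into one-liners (the real prompts are in the paper’s appendix).

```python
# Rough sketch of the SAFE pipeline (steps 1-6 above).
# Assumptions, not from the paper's released code: the OpenAI Python client,
# a hypothetical google_search() helper, and heavily paraphrased prompts.
import re
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"

def ask(prompt: str) -> str:
    """Single-turn call to GPT-3.5-Turbo."""
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def split_sentences(text: str) -> list[str]:
    """Step 1: plain-Python sentence split (no LLM involved)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def atomic_facts(sentence: str) -> list[str]:
    """Step 2: ask the model to break a sentence into atomic facts."""
    out = ask(f"List every atomic fact in this sentence, one per line:\n{sentence}")
    return [line.lstrip("-* ").strip() for line in out.splitlines() if line.strip()]

def make_self_contained(fact: str, context: str) -> str:
    """Step 3: resolve pronouns etc. so the fact reads on its own."""
    return ask("Rewrite the fact so it is self-contained, using the context "
               f"to resolve pronouns.\nContext: {context}\nFact: {fact}")

def is_relevant(fact: str, question: str) -> bool:
    """Step 4: keep only facts that are relevant to the original question."""
    reply = ask("Is this fact relevant to answering the question? Answer YES or NO.\n"
                f"Question: {question}\nFact: {fact}")
    return reply.upper().startswith("YES")

def rate_fact(fact: str) -> str:
    """Steps 5-6: generate search queries, fetch snippets, ask for a verdict."""
    queries = ask(f"Write 5 Google search queries to fact-check this claim:\n{fact}")
    snippets = []
    for q in queries.splitlines()[:5]:
        snippets += google_search(q, top_k=3)  # hypothetical search wrapper
    return ask("Reason step by step, then answer SUPPORTED or NOT SUPPORTED.\n"
               f"Claim: {fact}\nSearch results:\n" + "\n".join(snippets))

def safe_score(question: str, answer: str) -> dict:
    """Run the whole pipeline over one long-form answer and tally verdicts."""
    tally = {"supported": 0, "not_supported": 0, "irrelevant": 0}
    for sentence in split_sentences(answer):
        for fact in atomic_facts(sentence):
            fact = make_self_contained(fact, answer)
            if not is_relevant(fact, question):
                tally["irrelevant"] += 1
            elif "NOT SUPPORTED" in rate_fact(fact).upper():
                tally["not_supported"] += 1
            else:
                tally["supported"] += 1
    return tally
```
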
The reason they lean on GPT-3.5-Turbo for SAFE is cost: it’s about 19 cents to evaluate one long-form reply, and GPT-4-Turbo would cost roughly 20x more (around $4). The paper has all the prompts in the appendix, and I found it interesting to see which prompting techniques insiders at DeepMind endorse.

As for the performance of each LLM at answering the LongFact questions with supported facts (that is, answering with statements that SAFE decides are supported by Google Search)… GPT-4-T wins. Gemini-Ultra is next, and PaLM-2 after that. Claude-3 trails in fourth.

What’s much more interesting to me is the performance of SAFE itself. They claim it agrees with human annotators 72% of the time, and on the cases where they disagree, expert judges find SAFE to be right 76% of the time. Neither is right in 5% of those cases.

If I eyeball this (0.72 + 0.28 × 0.76 ≈ 0.93), I conclude SAFE agrees with expert judgment on a given factoid roughly 93% of the time. Not bad! I can imagine an expensive LLM agent which first answers a question and then runs SAFE over its own result, letting it amend unsupported facts and annotate the supported ones with footnote URLs (a toy sketch of that loop follows).
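
That agent would mostly be a thin loop around the sketch above. Something like this, again entirely hypothetical and reusing the helpers from the earlier sketch:

```python
# Hypothetical answer-then-verify loop built on the SAFE sketch above.
def answer_with_citations(question: str) -> str:
    draft = ask(question)
    kept = []
    for sentence in split_sentences(draft):
        for fact in atomic_facts(sentence):
            fact = make_self_contained(fact, draft)
            if not is_relevant(fact, question):
                continue
            verdict = rate_fact(fact)
            if "NOT SUPPORTED" in verdict.upper():
                # Amend or drop claims that search could not back up.
                fact = ask(f"Soften or correct this unsupported claim: {fact}")
            kept.append(fact)  # a real agent would also attach the supporting URL here
    return ask("Rewrite these checked facts as a coherent answer.\n"
               f"Question: {question}\nFacts:\n" + "\n".join(kept))
```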

Personally, I would use this when researching legal or compliance issues: tax deductions, international shipping rules, how hypothetical situations might play out with a given insurance contract, etc. For these kinds of questions, I want cited sources, and I wouldn’t mind paying 19 cents to have them.

I’m less sure it could be relied on to check news articles, given that news articles themselves could show up in Google Search results and act as their own support. And I believe that when the news lies to you, it is usually a framing issue rather than a factual error, but that is another can of worms.