Quoting Nicholas Carlini

Because when the people training these models justify why they’re worth it, they appeal to pretty extreme outcomes. When Dario Amodei wrote his essay Machines of Loving Grace, he wrote that he sees the benefits as being extraordinary: “Reliable prevention and treatment of nearly all natural infectious disease … Elimination of most cancer … Prevention of Alzheimer’s … Improved treatment of most other ailments … Doubling of the human lifespan.” These are the benefits that the CEO of Anthropic uses to justify his belief that LLMs are worth it. If you think that these risks sound fanciful, then I might encourage you to consider what benefits you see LLMs as bringing, and then consider if you think the risks are worth it.

From Carlini’s recent talk/article on Are large language models worth it?

The entire article is well worth reading, but I was struck by this bit near the end. LLM researchers often dismiss (some of) the risks of these models as fanciful. But many of the benefits touted by the labs sound just as fanciful!

When we’re evaluating the worth of this research, it’s a good idea to be consistent about how realistic — or how “galaxy brain” — you want to be, with both risks and benefits.

“My Cousin Vinny” as an LLM benchmark

Mike Caulfield wrote a very thorough and quite entertaining article about posing the following question to ChatGPT:

What were Marisa Tomei’s most famous quotes from My Cousin Vinny and what was the context?

Depending on the model selected, the answers to this varied from hilariously wrong, to plausible-but-flawed, to accurate.

Interestingly, substantial test-time compute (“thinking time”) seems to be necessary to do a good job here, despite the easy availability online of famous quotes, plot summaries, and even the script. The fast-response models available for free, meanwhile, were prone to hallucinate.

At the same time I was struck by just how much reasoning time needed to be expended to get this task right. It’s possible that My Cousin Vinny is uniquely hard to parse, but I don’t think that is the case: I’ve tried this with a half dozen other films and the pattern seems to hold. If it’s true that a significant number of similar film contextualization tasks are solvable with test-time compute but require extensive compute to get right, it seems to me this could be the basis of a number of useful benchmarks.
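
To make the benchmark idea concrete, here is a minimal sketch of what such a harness might look like. It assumes the OpenAI Python client; the model IDs, reference quotes, and substring-matching score are all illustrative assumptions of mine, not anything from Caulfield’s article, which graded answers by hand.

```python
# Hypothetical sketch: ask several models the My Cousin Vinny question and
# score each answer by how many canonical quotes it mentions.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "What were Marisa Tomei's most famous quotes from My Cousin Vinny "
    "and what was the context?"
)

# Assumed reference answers: quotes we expect a correct response to include.
REFERENCE_QUOTES = [
    "my biological clock is ticking like this",
    "no self-respecting southerner uses instant grits",
]

# Assumed model IDs, ordered roughly from fast responders to heavier reasoning models.
MODELS = ["gpt-4o-mini", "gpt-4o", "o3-mini"]


def score(answer: str) -> float:
    """Fraction of reference quotes that appear (case-insensitively) in the answer."""
    answer = answer.lower()
    hits = sum(1 for quote in REFERENCE_QUOTES if quote in answer)
    return hits / len(REFERENCE_QUOTES)


for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    answer = response.choices[0].message.content or ""
    print(f"{model}: score={score(answer):.2f}")
```

A real benchmark would need many films, human-verified reference quotes, and a less brittle scoring rule than substring matching, but the shape of the task is the same.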

The full article is well worth reading, and not only because it discusses My Cousin Vinny in substantial detail (great movie).