Mike Caulfield wrote a very thorough and quite entertaining article about posing the following question to ChatGPT:
What were Marisa Tomei’s most famous quotes from My Cousin Vinny and what was the context?
Depending on the model selected, the answers to this varied from hilariously wrong, to plausible-but-flawed, to accurate.
Interestingly, substantial test-time compute (“thinking time”) seems to be necessary to do a good job here, despite the easy availability online of famous quotes, plot summaries, and even the script: the fast-response models available for free were prone to hallucinate.
At the same time I was struck by just how much reasoning time needed to be expended to get this task right. It’s possible that My Cousin Vinny is uniquely hard to parse, but I don’t think that is the case: I’ve tried this with a half dozen other films and the pattern seems to hold. If a significant number of similar film contextualization tasks are solvable with test-time compute but require extensive compute to get right, it seems to me this could be the basis of a number of useful benchmarks.
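To make that idea a bit more concrete, here’s a minimal sketch of what a film-quote contextualization harness might look like. Everything in it is hypothetical: ask_model stands in for whatever model API is under test, the reference quotes would need to be hand-curated, and verbatim string matching is a crude proxy. Judging whether the context for each quote is accurately described would probably need a human reviewer or an LLM-as-judge pass on top.

```python
def score_answer(answer: str, known_quotes: list[str]) -> float:
    """Fraction of the reference quotes that appear verbatim in the answer."""
    found = sum(1 for quote in known_quotes if quote.lower() in answer.lower())
    return found / len(known_quotes) if known_quotes else 0.0


def run_benchmark(ask_model, reference: dict[str, list[str]]) -> dict[str, float]:
    """Ask the model about each film and score its answer against reference quotes.

    ask_model: a callable that takes a prompt string and returns the model's answer.
    reference: film title -> hand-curated list of that film's famous quotes.
    """
    prompt_template = (
        "What were the most famous quotes from {film} "
        "and what was the context for each?"
    )
    results = {}
    for film, quotes in reference.items():
        answer = ask_model(prompt_template.format(film=film))
        results[film] = score_answer(answer, quotes)
    return results
```

Even a rough harness like this would make it easy to compare how the same question fares with and without extended thinking time across a set of films.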
The full article is well worth reading, and not only because it discusses My Cousin Vinny in substantial detail (great movie).