This is a new artifact: a collection of prompts that challenge the reasoning abilities of large language models. It tests whether LLMs pick up on slight changes to very well-known problems, and probes the effect of those problems’ high occurrence in training data.

People in the comments recommend additional questions they have experimented with, similar in style to what the OP is testing. Others report that their models, running with custom system prompts, managed to answer the questions correctly.

People also speculate in the comments about why this happens, and the biggest driver of the conversation is, of course, that the model is overfitting. Others push back on this argument, saying that drawing conclusions from overfit samples is an unfair assessment of the LLM, and that such overgeneralizations are dangerous.

Still another super interesting speculation is that the LLM isn’t incapable of solving the problem, but rather assumes that the user made a typo or misremembered the scenario. One commenter tested the prefix “interpret this question 100% literally (there are no mistakes in it)”, and the model answered correctly.
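If you want to try that trick yourself, a minimal sketch could look like the following. It assumes the OpenAI Python SDK; the model name and the altered riddle are placeholders for illustration, not necessarily what the commenter actually used.

```python
# Minimal sketch of the "interpret literally" test (placeholder model and riddle,
# not necessarily what the commenter used).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder: a slightly altered version of a classic riddle, the kind of
# "commonly known problem with a small change" the post is about.
RIDDLE = (
    "A boy is in a car accident and rushed to the hospital. The surgeon, "
    "who is the boy's father, says: 'I can't operate on him, he's my son.' "
    "How is this possible?"
)

LITERAL_PREFIX = "Interpret this question 100% literally (there are no mistakes in it).\n\n"

def ask(question: str, literal: bool = False) -> str:
    """Send the question to the model, optionally with the literal-reading prefix."""
    content = (LITERAL_PREFIX if literal else "") + question
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Compare the default answer with the "literal" one.
print(ask(RIDDLE, literal=False))
print(ask(RIDDLE, literal=True))
```

Seeing the two answers side by side is what makes the “the model assumes you made a typo” theory testable at all.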

What we are seeing here are debates, which isn’t something I’ve seen before with these posts. They revolve around the architecture of the models, which has nothing to do with localness, but is still interesting.