Thanks @Rajiv, this is fantastic.
My concern with the current generation of models is whether they have been overfit to these benchmark datasets, effectively learning to respond to the patterns in the benchmark without generalising.
Whenever I craft novel reasoning tasks for ChatGPT, it fails miserably. I should try them on these other models.