I have not been using chain-of-thought prompting. I am interested in exploring it, but to my mind it is a kind of cheating: you are getting the questioner to do most of the reasoning.
In fact, I see a lot of naive examples of people doing prompt engineering with modified variations of the reasoning chain: tinkering until the model gives the right answer, and then declaring 'look, the LLM can reason!'
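To make the point concrete, here is a rough sketch of what I mean (the prompts are made up, nothing model-specific): the chain-of-thought version already contains a fully worked example from the questioner, which the model then imitates.

```python
# Plain prompt: the model is given only the question.
plain_prompt = (
    "Q: A bat and a ball cost $1.10 together. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?\nA:"
)

# Chain-of-thought prompt: the questioner spells out the worked steps
# in an in-context example, supplying the reasoning scaffold themselves.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    + plain_prompt
)

print(cot_prompt)
```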
I think we really need an independent evaluation on a novel dataset, one the model has not seen during training or prompt tuning, to settle this.
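Very roughly, something like the sketch below is what I have in mind: one frozen prompt, a held-out set of questions, no per-item tinkering. The `ask_model` function is just a placeholder for whatever LLM call you use, not a real API.

```python
def ask_model(prompt: str) -> str:
    # Placeholder: plug in your actual LLM call here.
    raise NotImplementedError

def evaluate(items, prompt_template: str) -> float:
    """Score accuracy on held-out items using a single frozen prompt."""
    correct = 0
    for question, expected in items:
        answer = ask_model(prompt_template.format(question=question))
        # Crude grading: does the expected answer appear in the output?
        correct += int(expected.strip().lower() in answer.strip().lower())
    return correct / len(items)

# held_out = load_items_the_model_has_never_seen()  # the hard part
# print(evaluate(held_out, "Q: {question}\nA: Let's think step by step."))
```

The hard part, of course, is the held-out data itself: if the questions (or close paraphrases) are in the training set, the evaluation tells you nothing.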