Pop science indeed. Nothing new here. The Turing Test was the product of a much earlier era. Our machines today can easily fake a conversation, but there's been little progress in defining what intelligence is, let alone consciousness. Whatever they are, it's clear that LLMs don't have them, and aren't on track to produce them.
For many of us a better Turing test is contextual to a topic we CARE about. Lots of LLMs sound better than a randomly sampled human on a topic I don't know too much about (e.g. opinions on new movies). They're decent on engineering topics I only vaguely know about, but still below the bar (though getting better!) on topics I really care about.
Jones and his team performed this experiment with four LLMs. ChatGPT 4.5 was by far the most successful: 73% of participants identified it as the real human. Another model that goes by the unwieldy name LLaMa-3.1-405B was identified as human 56% of the time. (The other two models—ELIZA and GPT-4o—achieved 23% and 21% success rates, respectively, and will not be spoken of again.)
By ELIZA, are they referring to the classic 1960s ELIZA? I'm not aware of any newer, current system with the same name.
If the old ELIZA succeeded 23% of the time while GPT-4o managed only 21%, then in the context of the other numbers ... that seems ... odd.