Comparing Automatic Metrics with Human Perceptions: the Ethical Challenges of Evaluating the Human-likeness of Chatbots

Chatbots have become increasingly human-like, making it difficult for people to distinguish their outputs from those of humans.
Recently, an automated metric has been proposed to measure the level of anthropomorphism in the outputs of large language
models (LLMs). However, it is not clear how well suited this metric is to assessing the human-likeness of conversational systems
such as chatbots. We build three LLM-based chatbots and conduct a pilot user study to assess participants' perceptions of
their personification, comparing the results with the automated scores. We find that the human and automated measures are
not well aligned, and that the levels of anthropomorphism of the three systems vary.


Rights

Use and reproduction:
This work may be used under a Creative Commons Attribution 4.0 License (CC BY 4.0).