A Reflective View on Text Similarity
While the concept of similarity is well grounded in psychology, text similarity is less well defined. We therefore analyze text similarity with respect to both its definition and the datasets used to evaluate it. We formalize text similarity based on the geometric model of conceptual spaces along three dimensions inherent to texts: structure, style, and content. We empirically ground these dimensions in a set of annotation studies and categorize applications according to them. Furthermore, we analyze the characteristics of existing evaluation datasets and use those datasets to assess the performance of common text similarity measures.