
A study of the evaluation metrics for generative images containing combinational creativity

Published online by Cambridge University Press: 23 March 2023

Boheng Wang
Dyson School of Design Engineering, Imperial College London, London, UK
Yunhuai Zhu
Zhejiang–Singapore Innovation and AI Joint Research Lab, Zhejiang University, Hangzhou, China
Liuqing Chen*
International Design Institute, Zhejiang University, Hangzhou, China
Jingcheng Liu*
International Campus, Zhejiang University, Hangzhou, China
Lingyun Sun
International Design Institute, Zhejiang University, Hangzhou, China
Peter Childs
Dyson School of Design Engineering, Imperial College London, London, UK
Author for correspondence: Liuqing Chen, E-mail:


In the field of machine content generation, the state-of-the-art text-to-image model DALL⋅E has advanced and diverse capacities for generating combinational images from specific textual prompts. The images generated by DALL⋅E seem to exhibit an appreciable level of combinational creativity, close to that of humans, in terms of visualizing a combinational idea. Several common metrics, such as IS, FID, GIQA, and CLIP, can be applied to assess the quality of images produced by generative models, but it is unclear whether these metrics are equally applicable to images containing combinational creativity. In this study, we collected image data generated by a machine (DALL⋅E) and by human designers. The group rankings obtained from the Consensual Assessment Technique (CAT) and the Turing Test (TT) were used as the benchmarks for assessing combinational creativity. Because the metrics rest on different mathematical principles and evaluate image quality from different starting points, we introduced two comparable spaces, coincident rate (CR) and average rank variation (ARV). We then conducted an experiment to calculate how consistent each metric's group ranking was with the benchmarks. By comparing the CR and ARV consistency results on group ranking, we summarized the applicability of the existing evaluation metrics to generative images containing combinational creativity. Of the four metrics, GIQA was the most consistent with the CAT and TT. It therefore shows potential as an automated assessment for images containing combinational creativity, and could be used to evaluate such images in relevant design and engineering tasks such as conceptual sketches, digital design images, and prototyping images.
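The abstract names CR and ARV but does not give their formulas. As a rough illustration only, the sketch below assumes that CR is the fraction of rank positions at which a metric's group ranking matches the benchmark ranking, and that ARV is the mean absolute difference between each group's position in the two rankings. Both definitions, the function names, and the example group labels are assumptions made for this sketch, not the authors' stated method.

```python
from typing import Sequence


def _check_rankings(benchmark: Sequence[str], candidate: Sequence[str]) -> None:
    """Both rankings must be non-empty orderings of the same distinct groups."""
    if not benchmark:
        raise ValueError("rankings must not be empty")
    if len(set(benchmark)) != len(benchmark):
        raise ValueError("a ranking must not contain duplicate groups")
    if sorted(benchmark) != sorted(candidate):
        raise ValueError("rankings must contain the same groups")


def coincident_rate(benchmark: Sequence[str], candidate: Sequence[str]) -> float:
    """ASSUMED definition: share of positions where both rankings agree."""
    _check_rankings(benchmark, candidate)
    matches = sum(b == c for b, c in zip(benchmark, candidate))
    return matches / len(benchmark)


def average_rank_variation(benchmark: Sequence[str], candidate: Sequence[str]) -> float:
    """ASSUMED definition: mean absolute shift in rank position per group."""
    _check_rankings(benchmark, candidate)
    position = {group: index for index, group in enumerate(candidate)}
    total_shift = sum(abs(index - position[group]) for index, group in enumerate(benchmark))
    return total_shift / len(benchmark)


# Hypothetical group rankings: a benchmark (e.g. from CAT) and one metric's ranking.
benchmark_ranking = ["A", "B", "C", "D"]
metric_ranking = ["A", "C", "B", "D"]

cr = coincident_rate(benchmark_ranking, metric_ranking)
arv = average_rank_variation(benchmark_ranking, metric_ranking)
print(f"CR = {cr:.2f}, ARV = {arv:.2f}")
```

Under these assumed definitions, a higher CR and a lower ARV both indicate closer agreement between a metric and the benchmark, which is how the two quantities would let differently scaled metrics be compared on common ground.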

Research Article
Copyright © The Author(s), 2023. Published by Cambridge University Press


