Can smaller large language models evaluate research quality?
DOI: https://doi.org/10.22452/mjlis.vol30no2.4

Keywords: Scientometrics, Large Language Models, Gemma, Open weights LLMs

Abstract
Academic librarians often construct bibliometric indicators to support research evaluation. Traditionally, these have been citation-based, but AI alternatives have recently emerged. Although both Google Gemini (1.5 Flash) and ChatGPT (4o and 4o-mini) provide research quality evaluation scores that correlate positively with expert scores in nearly all fields, and more strongly than citations in most, it is not known whether this holds for smaller Large Language Models (LLMs). In response, this article assesses Google’s Gemma-3-27b-it, a downloadable LLM (60 GB). Results for 104,187 articles show that Gemma-3-27b-it scores correlate positively with an expert research quality score proxy for all 34 Units of Assessment (broad fields) from the UK Research Excellence Framework 2021. The Gemma-3-27b-it correlations have 83.8% of the strength of ChatGPT 4o and 94.7% of the strength of ChatGPT 4o-mini correlations. Unlike the two larger LLMs, the Gemma-3-27b-it correlations do not increase substantially when scores are averaged across five repetitions, its scores tend to be lower, and its reports are relatively uniform in style. Overall, the results show that research quality score estimation can be conducted by offline LLMs, so this capability is not an emergent property of only the largest LLMs. Moreover, score improvement through repetition is not a universal feature of LLMs. In conclusion, although the largest LLMs still have the highest research evaluation score estimation capability, smaller ones can also be used for this task, which can be helpful for cost saving or when secure offline processing is required.
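The abstract's core analysis can be illustrated with a small sketch: scoring each article several times, averaging the repetitions, and correlating single-run versus averaged scores against an expert quality proxy with a Spearman rank correlation. This is a minimal, stdlib-only illustration with simulated data; the `rank`/`spearman` helpers, the noise model, and all numbers are illustrative assumptions, not the paper's actual code or data (the study used 104,187 REF 2021 articles scored by Gemma-3-27b-it).

```python
import random

def rank(values):
    # Assign 1-based ranks, averaging ranks across tied values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the rank vectors.
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 500
# Hypothetical expert quality proxy on a REF-style 1*-4* scale.
expert = [random.randint(1, 4) for _ in range(n)]
# Five simulated LLM scoring repetitions: expert score plus noise.
reps = [[e + random.gauss(0, 1.5) for e in expert] for _ in range(5)]
single = reps[0]
averaged = [sum(r[i] for r in reps) / 5 for i in range(n)]
print(round(spearman(expert, single), 3), round(spearman(expert, averaged), 3))
```

Under this simple independent-noise model, averaging repetitions raises the correlation, which is the pattern the abstract reports for the larger LLMs; the finding that Gemma-3-27b-it does not benefit much from averaging suggests its repeated scores are less independent than this sketch assumes.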
License
It is a condition of publication that manuscripts submitted to the journal have not been published, accepted for publication, nor simultaneously submitted for publication elsewhere. By submitting a manuscript, the author(s) agree that copyright for the article is transferred to the publisher, if and when the manuscript is accepted for publication.