Exploring NPL: Generating Automatic Control Keywords


  • Óscar Bernardes Porto Accounting and Business School, Polytechnic Institute of Porto (PORTUGAL)
  • Vanessa Amorim Porto Accounting and Business School, Polytechnic Institute of Porto (PORTUGAL)




keyword indexation, NLP keywords, keyword extraction


Keywords are a tool to help indexers and search engines find relevant papers. Unfortunately, authors misuse them, either unintentionally or to mislead readers toward an unrelated topic, promoting their articles with non-representative keywords. Previous scholars (Ansari, 2005; Voorbij, 1998) exposed a lack of consistency between abstracts, full texts and keywords. Author keywording is nonetheless an old but effective practice. An early investigation conducted by Schultz, Schultz and Orr in 1965 compared author keywords with document titles and with indexing terms assigned by subject matter experts, and found that the author-supplied keywords matched the experts' terms more closely than the title terms did (as cited in Kipp, 2011, p. 249). Fifty-five years later, Terra et al. (2020) suggested seven improvements to keyword parameterization. Even so, author keywords have received relatively little attention in the literature, according to Kipp (2007). Moreover, with the ever-increasing volume of academic data available, finding relevant documents has become more challenging for regular users and library specialists.

The purpose of this article is to generate keywords for theses using natural language processing (NLP) techniques. NLP is a subfield of linguistics, computer science, and artificial intelligence that takes advantage of big data to index documents while removing human errors and costs (Moskovitch, Martins, Behiri, Weiss, & Shahar, 2007).

Design/methodology/approach: A sample of 48,501 records, 95% of the population of 51,010 master's theses in the institutional repository of the University of São Paulo, was extracted and selected. A thematic dictionary was then created based on the major area of each thesis, and the theses' keywords were subsequently generated from that dictionary.
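The dictionary-based step described above can be illustrated with a minimal sketch. The abstract does not specify how the thematic dictionary is built or how terms are matched, so everything here is an assumption made for illustration: the `THEMATIC_DICTIONARY` contents, the `generate_keywords` function, whole-phrase matching, and ranking by frequency are all hypothetical and are not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical thematic dictionary: major area -> controlled terms.
# The real dictionary in the study is derived from the theses' major areas;
# these entries are invented for illustration only.
THEMATIC_DICTIONARY = {
    "computer science": {
        "machine learning",
        "information retrieval",
        "natural language processing",
    },
    "medicine": {"clinical trial", "public health", "epidemiology"},
}


def generate_keywords(text, major_area, dictionary=THEMATIC_DICTIONARY, top_n=5):
    """Return up to top_n dictionary terms of the given major area found in text."""
    # Lowercase and collapse whitespace so multi-word terms match across line breaks.
    normalized = re.sub(r"\s+", " ", text.lower())
    counts = Counter()
    for term in dictionary.get(major_area.lower(), ()):
        # Count whole-phrase occurrences on word boundaries.
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", normalized))
        if hits:
            counts[term] = hits
    # Most frequent first; ties broken alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


abstract = (
    "We apply machine learning to information retrieval. "
    "Machine learning models rank the retrieved theses."
)
print(generate_keywords(abstract, "Computer Science"))
# prints ['machine learning', 'information retrieval']
```

Restricting candidates to the terms of a thesis's own major area is what makes the generated keywords controlled: a term outside the dictionary can never be emitted, whatever the author wrote.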

Research limitations/implications: The effectiveness of information retrieval is highly dependent on the accurate and complete representation of document content and major area of the theses.

Originality/value: Author keywords have received relatively little attention in the literature (as cited in Kipp, 2011). This is not due to a lack of importance for stakeholders, but to the complexity of the task and publishers' lack of control. This paper highlights a new method to generate and control author keywords.


Ansari, M. (2005). Matching between assigned descriptors and title keywords in medical theses. Library Review, 54(7), 410-414. https://www.emeraldinsight.com/doi/abs/10.1108/00242530510611901. DOI: 10.1108/00242530510611901

Beliga, S., Meštrović, A., & Martincic-Ipsic, S. (2015). An Overview of Graph-Based Keyword Extraction Methods and Approaches. Journal of Information and Organizational Sciences, 39, 1-20.

Cavnar, W. & Trenkle, J. (2001). N-Gram-Based Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval. Retrieved from https://www.let.rug.nl/~vannoord/TextCat/textcat.pdf

Li, D., Li, S., Li, W., Wang, W., & Qu, W. (2010). A semi-supervised key phrase extraction approach: learning from title phrases through a document semantic network. In Proceedings of the ACL 2010 Conference Short Papers (pp. 296–300).

Gil-Leiva, I. & Alonso-Arroyo, A. (2007). Keywords given by authors of scientific articles in database descriptors. Journal of the American Society for Information Science and Technology, 58. DOI: 10.1002/asi.20595

Glez-Peña, D., Lourenço, A., López-Fernández, H., Reboiro-Jato, M., & Fdez-Riverola, F. (2013). Web scraping technologies in an API world. Briefings in Bioinformatics, 15(5), 788–797.

International Organization for Standardization (ISO). (1985). Documentation. Methods for examining documents, determining their subjects, and selecting indexing terms (ISO 5963:1985). Geneva, Switzerland.

Igal, Z. (2016). Bot Traffic Report. Retrieved from https://www.incapsula.com/blog/bot-traffic-report-2016.html

Kaur, J. & Gupta, V. (2010). Effective Approaches for Extraction of Keywords. International Journal of Computer Science, 7.

Kim, Y., Lee, J., & Choi, S. (2020). Validation of deep learning natural language processing algorithm for keyword extraction from pathology reports in electronic health records. Scientific Reports, 10, Article 20265. https://doi.org/10.1038/s41598-020-77258-w

Kipp, M. (2007). Tagging Practices on Research Oriented Social Bookmarking Sites. Proceedings of the Annual Conference of CAIS / Actes du congrès annuel de l'ACSI, 30. DOI: 10.29173/cais223

Kipp, M. (2011). Tagging of Biomedical Articles on CiteULike: A Comparison of User, Author and Professional Indexing. Knowledge Organization, 38(3). DOI: 10.5771/0943-7444-2011-3-245

Liddy, E. (2001). Natural language processing. In Encyclopedia of library and information science (2nd ed., pp. 2126–2136). New York, NY: Marcel Dekker.

Litvak, M., Last, M., Aizenman, H., Gobits, I., & Kandel, A. (2011). DegExt - A Language Independent Graph-Based Keyphrase Extractor. Advances in Intelligent Web Mastering - 3, AISC, 86, 121-130.

Moskovitch, R., Martins, S. B., Behiri, E., Weiss, A., & Shahar, Y. (2007). A comparative evaluation of full-text, concept-based, and context-sensitive search. Journal of the American Medical Informatics Association, 14(2), 164-174. DOI: 10.1197/jamia.m1953

Papagiannopoulou, E. & Tsoumakas, G. (2019). A review of keyphrase extraction. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10. DOI: 10.1002/widm.1339

Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. In Text mining: applications and theory, 1 (pp. 1–20).

Strader, C. (2009). Author-Assigned Keywords versus Library of Congress Subject Headings: Implications for the Cataloging of Electronic Theses and Dissertations. Library Resources and Technical Services, 53, 243-250. DOI: 10.5860/lrts.53n4.243

Terra, A., Lacruz, C., Bernardes, O., Fujita, M., & Fuente, G. (2020). Subject-access metadata on ETD supplied by authors: A case study about keywords, titles and abstracts in a Brazilian academic repository. The Journal of Academic Librarianship, 47(1). https://doi.org/10.1016/j.acalib.2020.102268

Turney, P. (2000). Learning algorithms for keyphrase extraction. Information Retrieval, 2, 303–336.

Voorbij, H. J. (1998). Title keywords and subject descriptors: a comparison of subject search entries of books in the humanities and social sciences. Journal of Documentation, 54(4), 466–476. DOI: 10.1108/EUM0000000007178

Witten, I., Paynter, G., Frank, E., Gutwin, C., & Nevill-Manning, C. (1999). Practical Automatic Keyphrase Extraction. In Proceedings of the 4th ACM Conference of the Digital Libraries, DL ’99 (pp. 254-255), Berkeley, CA, USA.

Ying, Y., Qingping, T., Qinzheng, X., Ping, Z., & Panpan, L. (2017). A Graph-based Approach of Automatic Keyphrase Extraction. Procedia Computer Science, 107, 248-255. DOI: 10.1016/j.procs.2017.03.087

Yousefi, Z., Sotudeh, H., Mirzabeigi, M., Nikseresht, A., & Mohammadi, M. (2019). Investigating text power in predicting semantic similarity. International Journal of Information Science and Management, 17(1), 17-31.

Zhou, Z., Zou, X., Lv, X., & Hu, J. (2013). Research on Weighted Complex Network Based Keywords Extraction. Chinese Lexical Semantics, LNCS 8229, 442-452.




How to Cite

Bernardes, Ó., & Amorim, V. (2023). Exploring NPL: Generating Automatic Control Keywords. Bobcatsss, 130–139. https://doi.org/10.34630/bobcatsss.vi.4970