Exploring NLP: Generating Automatic Control Keywords
keyword indexation, NLP keywords, keyword extraction
Abstract
Keywords help indexers and search engines find relevant papers. Unfortunately, authors often misuse them, either unintentionally or to promote their articles by choosing non-representative keywords that mislead readers toward unrelated topics. Previous scholars (Ansari, 2005; Voorbij, 1998) exposed a lack of consistency between abstracts, full texts and keywords. Author keywording is nonetheless an old but effective practice. An early investigation conducted by Schultz, Schultz and Orr in 1965 compared author keywords with document titles and with indexing terms assigned by subject matter experts, and found that the author-supplied keywords matched the experts' terms more closely than the title terms did (as cited in Kipp, 2011, p. 249). Fifty-five years later, Terra et al. (2020) suggested seven improvements to keyword parameterization. Even so, author keywords have received relatively little attention in the literature, according to Kipp (2007). Moreover, with the ever-increasing volume of academic data available, finding relevant documents has become more challenging for regular users and library specialists alike.
The purpose of this article is to generate thesis keywords using natural language processing (NLP) techniques. NLP is a subfield of linguistics, computer science, and artificial intelligence that takes advantage of big data, indexing data while removing human errors and costs (Moskovitch, Martins, Behiri, Weiss & Shahar, 2007).
Design/methodology/approach: A sample of 48,501 records, about 95% of the population of 51,010 master's theses in the institutional repository of the University of São Paulo, was extracted and selected. A thematic dictionary was then created based on the major area of each thesis, and the theses' keywords were subsequently generated from that dictionary.
Research limitations/implications: The effectiveness of information retrieval is highly dependent on the accurate and complete representation of document content and major area of the theses.
Originality/value: Author keywords have received relatively little attention in the literature (as cited in Kipp, 2011). This is not because they lack importance for stakeholders, but because of the complexity of the task and publishers' lack of control over it. This paper presents a new method to generate and control author keywords.
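
The dictionary-based approach summarized under Design/methodology/approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the thematic dictionary holds the most frequent terms per major area and that a thesis receives the dictionary terms of its own area that occur in its abstract. The function names, the stopword list, and the sample records are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_terms(counter, n):
    """Return the n most frequent terms, breaking ties alphabetically."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def build_thematic_dictionary(records, terms_per_area=5):
    """Map each major area to the most frequent terms in its abstracts.

    `records` is an iterable of (major_area, abstract) pairs.
    Frequency-based selection is an assumption, not the paper's stated rule.
    """
    counts = defaultdict(Counter)
    for area, abstract in records:
        counts[area].update(tokenize(abstract))
    return {area: set(top_terms(c, terms_per_area)) for area, c in counts.items()}


def generate_keywords(area, abstract, dictionary, max_keywords=3):
    """Return dictionary terms of the thesis's area found in its abstract."""
    vocabulary = dictionary.get(area, set())
    hits = Counter(t for t in tokenize(abstract) if t in vocabulary)
    return top_terms(hits, max_keywords)


# Hypothetical sample data, not taken from the repository.
records = [
    ("Medicine", "cancer treatment cancer therapy"),
    ("Medicine", "cancer patients therapy"),
    ("Engineering", "bridge design bridge load"),
]
dictionary = build_thematic_dictionary(records, terms_per_area=2)
keywords = generate_keywords(
    "Medicine", "A new therapy for cancer and cancer patients", dictionary
)
print(keywords)
```

Because keywords can only come from the dictionary of the thesis's own major area, the vocabulary stays controlled. The same constraint means a wrongly assigned major area yields poor keywords, which matches the limitation stated above.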
