Artificial intelligence (AI) technologies can be of great help for archiving, as they can support the indexing of large collections and offer new ways of making holdings searchable. However, are we ready to let AI decide how to index an archive's or library's holdings? Can we afford errors or inaccurate indexing in order to save time, or would that damage the prestige of our institutions?
This post argues that it is possible to make use of AI without losing control over the relevance of the results, while saving significant time compared to manual indexing.
And a further point: this is not so much about how accurate AI is, but rather about how the workflow is conceived and at which points automation and human control take place.
The task of assigning keywords to documents, whether they are textual, audio, video, images, or something else, can be quite repetitive and mechanical, but at the same time it requires a high level of understanding of the domain. Those who do this task must understand very well not only the content but also the context of a document, and must also have a deep knowledge of the indexing categories.
So is this something that can be automated with AI? Nowadays, it definitely is. Recent innovations in AI models for text processing have made them capable of handling the context of a task, improving the relevance of the results enormously.
Depending on the case, out-of-the-box technologies might not be enough and some customization will be needed. But as we explain in this post, current natural language processing technologies, such as so-called Large Language Models, are more than suited for this task.
First of all, we need to understand what kind of tasks we humans perform when doing indexing work, and what kind of task this would be for a machine. We will limit ourselves here to the scenario of indexing documents with textual content, whether it is originally text (such as books, magazines, newspapers, or government documents) or text generated from a different medium (such as an audio or video transcript, or text extracted from an image through OCR).
There are a few different scenarios in which we might be indexing: a new collection of which we know something about the context but little about the content; a collection whose content we know well enough to be quite sure about what we will find; or new objects being added to an existing collection. We could also be indexing with an existing controlled vocabulary, or have the task of extending it or even creating a new one.
In order to know whether these tasks can be automated with AI, we need to isolate the cognitive processes involved. To put it in simple terms, and at the risk of some simplification, we are mainly performing three tasks: extraction, generation and classification.
In some cases, we are extracting terms from the document. For example, when we identify names of people or places and use them as keywords, we are picking a term verbatim from the content of the document.
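As a minimal sketch of what this extraction step can look like in practice, here is a named entity recognition example using spaCy. The model name, entity labels and sample text are illustrative assumptions, not a prescription; any pipeline with an NER component for the collection's language would do.

```python
import spacy

# Illustrative model choice: a small pretrained English pipeline with NER.
# It must be downloaded first, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_entity_keywords(text):
    """Pick person and place names out of a document, verbatim."""
    doc = nlp(text)
    # PERSON, GPE (countries/cities) and LOC are illustrative label choices.
    keywords = {ent.text for ent in doc.ents if ent.label_ in ("PERSON", "GPE", "LOC")}
    return sorted(keywords)

print(extract_entity_keywords(
    "In 1923, Clara Zetkin addressed the congress in Moscow."
))
# e.g. ['Clara Zetkin', 'Moscow'] (exact output depends on the model)
```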
Another situation is when we want keywords that describe the subject of a document. If we have a thesaurus and need to match documents or fragments against it, we are performing a classification task: finding the closest term in the vocabulary.
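One common way to implement this matching is with text embeddings: encode both the fragment and the vocabulary terms as vectors and pick the nearest terms by cosine similarity. The sketch below uses the sentence-transformers library; the model name and the toy thesaurus are assumptions for illustration, and a real setup would load the institution's full vocabulary.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence-embedding model can be swapped in.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy controlled vocabulary; in practice this would come from the
# institution's thesaurus, e.g. a SKOS export.
thesaurus = ["labour movement", "women's suffrage", "agriculture", "maritime trade"]
thesaurus_embeddings = model.encode(thesaurus, convert_to_tensor=True)

def classify_fragment(fragment, top_k=2):
    """Return the closest thesaurus terms for a document fragment."""
    fragment_embedding = model.encode(fragment, convert_to_tensor=True)
    scores = util.cos_sim(fragment_embedding, thesaurus_embeddings)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [(thesaurus[int(i)], float(scores[int(i)])) for i in ranked]

print(classify_fragment("The strike committee negotiated with the dock owners."))
```

A similarity threshold on the returned scores can be used to flag fragments for which no vocabulary term is a good fit, which leads directly to the next case.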
If, on the other hand, we have no vocabulary from which to choose terms, or we expect that the document addresses topics for which there are no terms in the thesaurus, we need to generate new keywords. For this, we might take terms that are mentioned explicitly in the text and are descriptive of the subject. We might also need to come up with concepts that do not appear in the text at all, for example because they are more abstract.
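This generation step is where large language models fit naturally. The following is a sketch using the OpenAI Python client; the model name and prompt wording are illustrative assumptions, and any capable chat model, including a locally hosted one, could take its place.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are indexing an archival document. Suggest 3 to 5 subject keywords. "
    "Include broader, more abstract concepts even if they are not mentioned "
    "verbatim in the text. Return one keyword per line.\n\nDocument:\n{text}"
)

def generate_keywords(text):
    """Ask a language model to propose subject keywords, including
    abstract concepts that do not appear literally in the document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of model
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    content = response.choices[0].message.content
    return [line.strip("-• ") for line in content.splitlines() if line.strip()]
```

In line with the workflow argument above, generated keywords would not enter the vocabulary automatically: they are candidates for a human indexer to review, which is where the human control in the process takes place.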
Luckily for us, extraction, generation and classification are jobs that language models are trained to do. There are many different models and several approaches to these tasks, and deciding on one requires a detailed analysis of the data to be indexed and of the workflow of the organization doing the indexing. In our experience building systems for AI-assisted indexing, we have been using a combination of the methods described above: extraction of terms that appear verbatim in the text, classification against a controlled vocabulary, and generation of new keywords.
The new generation of language models is very well suited to the tasks needed to support the indexing of archival collections. However, there are many different methods and approaches, and the decision on which ones to use depends on the collection and workflow of each institution.