Artificial intelligence (AI) technologies can be of great help for archiving, as they can support the indexing of large collections and offer new ways of making holdings searchable. However, are we ready to let AI decide how to index an archive's or library's holdings? Can we afford errors or inaccurate indexing in order to save time, or would that damage the prestige of our institutions?
This post argues that it is possible to make use of AI without losing control over the relevance of the results, while saving significant time compared to manual indexing.
And a further point: this is not so much about how accurate AI is, but rather about how the workflow is conceived and at which points automation and human control take place.
The task of assigning keywords to documents, whether they are textual, audio, video, images, or something else, can be quite repetitive and mechanical, but at the same time it requires a high level of understanding of the domain. Those who do this task must understand very well not only the content but also the context of a document, and must also have a deep knowledge of the indexing categories.
So is this something that can be automated with AI? Nowadays, it definitely is. Recent innovations in AI models for text processing have made them capable of handling the context of a task, improving the relevance of the results enormously.
Depending on the case, out-of-the-box technologies might not be enough and some customization will be needed. But as we explain in this post, current natural language processing technologies, such as so-called Large Language Models, are more than suited for this task.
First of all, we need to understand what kind of tasks we humans perform when doing indexing work, and what kind of task this would be for a machine. We will limit ourselves here to the scenario of indexing documents with textual content, whether it is originally text (such as books, magazines, newspapers, or government documents) or text generated from a different medium (such as an audio or video transcript, or text extracted from an image through OCR).
There are a few different scenarios in which we might be indexing: a new collection of which we know something about the context but little about the content; a collection whose content we know well enough to be quite sure about what we will find; or new objects being added to an existing collection. We could also be indexing with an existing controlled vocabulary, or have the task of extending it or even creating a new one.
In order to know whether these tasks can be automated with AI, we need to isolate the cognitive processes involved. To put it in simple terms, and at the risk of some simplification, we are mainly performing three tasks: extraction, generation and classification.
In some cases, we are extracting terms from the document. For example, when we identify names of people or places and use them as keywords, we are picking a term verbatim from the content of the document.
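As a minimal sketch of what this extraction step can look like in practice, here is a named entity recognition example using spaCy. The model name, entity labels and sample text are illustrative assumptions, not a prescription; any pipeline with an NER component for the collection's language would do.

```python
import spacy

# Illustrative model choice: a small pretrained English pipeline with NER.
# It must be downloaded first, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_entity_keywords(text):
    """Pick person and place names out of a document, verbatim."""
    doc = nlp(text)
    # PERSON, GPE (countries/cities) and LOC are illustrative label choices.
    keywords = {ent.text for ent in doc.ents if ent.label_ in ("PERSON", "GPE", "LOC")}
    return sorted(keywords)

print(extract_entity_keywords(
    "In 1923, Clara Zetkin addressed the congress in Moscow."
))
# e.g. ['Clara Zetkin', 'Moscow'] (exact output depends on the model)
```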
Another situation is when we want keywords that describe the subject of a document. If we have a thesaurus and need to match documents or fragments against it, we are performing a classification task: finding the closest term in the vocabulary.
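One common way to implement this matching is with text embeddings: encode both the fragment and the vocabulary terms as vectors and pick the nearest terms by cosine similarity. The sketch below uses the sentence-transformers library; the model name and the toy thesaurus are assumptions for illustration, and a real setup would load the institution's full vocabulary.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence-embedding model can be swapped in.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy controlled vocabulary; in practice this would come from the
# institution's thesaurus, e.g. a SKOS export.
thesaurus = ["labour movement", "women's suffrage", "agriculture", "maritime trade"]
thesaurus_embeddings = model.encode(thesaurus, convert_to_tensor=True)

def classify_fragment(fragment, top_k=2):
    """Return the closest thesaurus terms for a document fragment."""
    fragment_embedding = model.encode(fragment, convert_to_tensor=True)
    scores = util.cos_sim(fragment_embedding, thesaurus_embeddings)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [(thesaurus[int(i)], float(scores[int(i)])) for i in ranked]

print(classify_fragment("The strike committee negotiated with the dock owners."))
```

A similarity threshold on the returned scores can be used to flag fragments for which no vocabulary term is a good fit, which leads directly to the next case.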
If, on the other hand, we have no vocabulary from which to choose terms, or we expect that the document addresses topics for which there are no terms in the thesaurus, we need to generate new keywords. For this, we might take terms that are mentioned explicitly in the text and are descriptive of the subject. We might also need to come up with concepts that do not appear in the text at all, for example because they are more abstract.
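This generation step is where large language models fit naturally. The following is a sketch using the OpenAI Python client; the model name and prompt wording are illustrative assumptions, and any capable chat model, including a locally hosted one, could take its place.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are indexing an archival document. Suggest 3 to 5 subject keywords. "
    "Include broader, more abstract concepts even if they are not mentioned "
    "verbatim in the text. Return one keyword per line.\n\nDocument:\n{text}"
)

def generate_keywords(text):
    """Ask a language model to propose subject keywords, including
    abstract concepts that do not appear literally in the document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of model
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    content = response.choices[0].message.content
    return [line.strip("-• ") for line in content.splitlines() if line.strip()]
```

In line with the workflow argument above, generated keywords would not enter the vocabulary automatically: they are candidates for a human indexer to review, which is where the human control in the process takes place.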
Luckily for us, extraction, generation and classification are jobs that language models are trained to do. There are many different models and several approaches to these tasks, and deciding on one requires a detailed analysis of the data to be indexed and of the workflow of the organization doing the indexing. In our experience building systems for AI-assisted indexing, we have been using a combination of the methods described above: extraction of terms that appear verbatim in the text, classification against a controlled vocabulary, and generation of new keywords.
The new generation of language models is very well suited to the tasks needed to support the indexing of archival collections. However, there are many different methods and approaches, and the decision on which ones to use depends on the collection and workflow of each institution.