Structured Data From Dictionary Text: Applying LLMs For Low-Resource Cross-Lingual Information Extraction

Document Type

Conference Proceeding

Publication Date

7-3-2025

Published In

Analysis of Images, Social Networks and Texts

Abstract

The development of machine-readable lexical resources for low-resource languages such as Kyrgyz is hampered by limited NLP tools and poorly structured linguistic data. In this paper, we introduce a method for extracting structured lexical information from Yudakhin’s Russian-Kyrgyz dictionary, a bilingual resource with inconsistent entry formatting. Our approach uses GPT-4o to bootstrap a dataset and explores both few-shot learning and fine-tuning to convert dictionary entries into a structured JSON schema. We assess how the number of few-shot examples affects model performance and compare few-shot learning against fine-tuning across several models, including an open-source option. The highest-performing model achieves 92.70% accuracy, 95.60% precision, 91.64% recall, and a 93.56% F1 score. This study highlights the potential of large language models in cross-lingual information extraction for low-resource languages and offers a scalable, cost-effective solution for digitizing complex bilingual dictionaries.

Keywords

cross-lingual information extraction, lexicography, low-resource languages, large language models, LLM evaluation, Kyrgyz

Published By

Springer

Editor(s)

A. Panchenko, D. Gubanov, M. Khachay, A. Kutuzov, N. Loukachevitch, A. Kuznetsov, I. Nikishina, M. Panov, P. M. Pardalos, A. V. Savchenko, E. Tsymbalov, E. Tutubalina, A. Kasieva, and D. I. Ignatov

Conference

12th International Conference, AIST 2024

Conference Dates

October 17-19, 2024

Conference Location

Bishkek, Kyrgyzstan
