Structured Data From Dictionary Text: Applying LLMs For Low-Resource Cross-Lingual Information Extraction
Document Type
Conference Proceeding
Publication Date
7-3-2025
Published In
Analysis of Images, Social Networks and Texts
Abstract
The development of machine-readable lexical resources for low-resource languages, such as Kyrgyz, faces significant challenges due to limited NLP tools and poorly structured linguistic data. In this paper, we introduce an innovative method for extracting structured lexical information from Yudakhin’s Russian-Kyrgyz dictionary, a bilingual resource with inconsistent entry formatting. Our approach utilizes GPT-4o to bootstrap a dataset and explores both few-shot learning and fine-tuning techniques to convert dictionary entries into a structured JSON schema. We assess the impact of varying few-shot example sizes on model performance and compare the effectiveness of few-shot learning against fine-tuning across several models, including an open-source option. Our results demonstrate notable success, with the highest-performing model achieving 92.70% accuracy, 95.60% precision, 91.64% recall, and a 93.56% F1 score. This study highlights the potential of large language models in cross-lingual information extraction for low-resource languages and offers a scalable, cost-effective solution for digitizing complex bilingual dictionaries.
Keywords
cross-lingual information extraction, lexicography, low-resource languages, large language models, LLM evaluation, Kyrgyz
Published By
Springer
Editor(s)
A. Panchenko, D. Gubanov, M. Khachay, A. Kutuzov, N. Loukachevitch, A. Kuznetsov, I. Nikishina, M. Panov, P. M. Pardalos, A. V. Savchenko, E. Tsymbalov, E. Tutubalina, A. Kasieva, and D. I. Ignatov
Conference
12th International Conference, AIST 2024
Conference Dates
October 17-19, 2024
Conference Location
Bishkek, Kyrgyzstan
Recommended Citation
M. Jumashev, A. Kasieva, G. Dzhumalieva, A. Tursunova, M. Ryspakova, and Jonathan North Washington. (2025). "Structured Data From Dictionary Text: Applying LLMs For Low-Resource Cross-Lingual Information Extraction". Analysis of Images, Social Networks and Texts. Volume 2364, 18-32. DOI: 10.1007/978-3-031-97019-1_2
https://works.swarthmore.edu/fac-linguistics/275