PFSA-ID: An Annotated Indonesian Corpus and Baseline Model of Public Figures Statements Attributions
This is my second publication as a PhD student at Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka. The article was published in Global Knowledge, Memory and Communication, a journal from Emerald Group Publishing Ltd.
Authors: Yohanes Sigit Purnomo W.P., Yogan Jaya Kumar, Nur Zareen Zulkarnain
Language: English
Abstract:
Purpose
By far, the corpus for the quotation extraction and quotation attribution tasks in Indonesian is still limited in quantity and depth. This study aims to develop an Indonesian corpus of public figure statements attributions and a baseline model for attribution extraction, so it will contribute to fostering research in information extraction for the Indonesian language.
Design/methodology/approach
The methodology is divided into corpus development and extraction model development. During corpus development, data were collected and annotated. The development of the extraction model entails feature extraction, the definition of the model architecture, parameter selection and configuration, model training and evaluation, as well as model selection.
Findings
The Indonesian corpus of public figure statements attribution achieved 90.06% agreement level between the annotator and experts and could serve as a gold standard corpus. Furthermore, the baseline model predicted most labels and achieved 82.026% F-score.
Originality/value
To the best of the authors’ knowledge, the resulting corpus is the first corpus for attribution of public figures’ statements in the Indonesian language, which makes it a significant step for research on attribution extraction in the language. The resulting corpus and the baseline model can be used as a benchmark for further research. Other researchers could follow the methods presented in this paper to develop a new corpus and baseline model for other languages.
Keywords: Indonesian corpus, Public figures, Statement attribution, News article, Baseline model, Named entity recognition
DOI: 10.1108/GKMC-04-2022-0091
GITHUB REPOSITORY: https://github.com/sigit-purnomo/pfsa-id
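For readers unfamiliar with the two figures reported under Findings, the sketch below shows one common way to compute an agreement level and an F-score over token-level labels. It is not the authors' code. It assumes simple percentage agreement and a micro-averaged F-score that ignores an "outside" label, and the abstract does not say which variants the paper uses. The label names (PERSON, STATEMENT, O) and the toy sequences are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Share of positions (in percent) where two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def micro_f1(gold, pred, outside="O"):
    """Micro-averaged F-score over all labels except the outside label.

    Assumption: tokens are scored one by one; the paper may instead
    score whole entities or spans.
    """
    if len(gold) != len(pred):
        raise ValueError("label sequences must have the same length")
    true_pos = sum(g == p and g != outside for g, p in zip(gold, pred))
    pred_pos = sum(p != outside for p in pred)
    gold_pos = sum(g != outside for g in gold)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: five tokens labelled by an expert and by an annotator.
expert = ["PERSON", "O", "STATEMENT", "O", "O"]
annotator = ["PERSON", "O", "STATEMENT", "STATEMENT", "O"]

print(f"agreement: {percent_agreement(expert, annotator):.2f}%")
print(f"F-score:   {micro_f1(expert, annotator):.3f}")
```

Percentage agreement does not correct for chance, so a chance-corrected measure such as Cohen's kappa would give a different number on the same data.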
How to Cite
If you extend or use this work, please cite the paper where it was introduced:
@article{PURNOMOWP2022,
title = {PFSA-ID: an annotated Indonesian corpus and baseline model of public figures statements attributions},
journal = {Global Knowledge, Memory and Communication},
volume = {ahead-of-print},
pages = {ahead-of-print},
year = {2022},
issn = {2514-9342},
doi = {10.1108/GKMC-04-2022-0091},
url = {https://www.emerald.com/insight/content/doi/10.1108/GKMC-04-2022-0091/full/html},
author = {Yohanes Sigit {Purnomo W.P.} and Yogan Jaya Kumar and Nur Zareen Zulkarnain},
keywords = {Indonesian corpus, Public figures, Statement attribution, News article, Baseline model, Named entity recognition},
abstract = {Purpose By far, the corpus for the quotation extraction and quotation attribution tasks in Indonesian is still limited in quantity and depth. This study aims to develop an Indonesian corpus of public figure statements attributions and a baseline model for attribution extraction, so it will contribute to fostering research in information extraction for the Indonesian language. Design/methodology/approach The methodology is divided into corpus development and extraction model development. During corpus development, data were collected and annotated. The development of the extraction model entails feature extraction, the definition of the model architecture, parameter selection and configuration, model training and evaluation, as well as model selection. Findings The Indonesian corpus of public figure statements attribution achieved 90.06% agreement level between the annotator and experts and could serve as a gold standard corpus. Furthermore, the baseline model predicted most labels and achieved 82.026% F-score. Originality/value To the best of the authors’ knowledge, the resulting corpus is the first corpus for attribution of public figures’ statements in the Indonesian language, which makes it a significant step for research on attribution extraction in the language. The resulting corpus and the baseline model can be used as a benchmark for further research. Other researchers could follow the methods presented in this paper to develop a new corpus and baseline model for other languages.}
}