PFSA-ID: An Annotated Indonesian Corpus and Baseline Model of Public Figures Statements Attributions

3 minute read

This is my second publication as a PhD student at Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka. The article was published in Global Knowledge, Memory and Communication, Vol. 73 No. 6/7, pp. 853-870 (2024), by Emerald Publishing Limited.

Authors: Yohanes Sigit Purnomo W.P., Yogan Jaya Kumar, Nur Zareen Zulkarnain

Language: English

Abstract:
Purpose
By far, the corpus for the quotation extraction and quotation attribution tasks in Indonesian is still limited in quantity and depth. This study aims to develop an Indonesian corpus of public figure statements attributions and a baseline model for attribution extraction, so it will contribute to fostering research in information extraction for the Indonesian language.

Design/methodology/approach
The methodology is divided into corpus development and extraction model development. During corpus development, data were collected and annotated. The development of the extraction model entails feature extraction, the definition of the model architecture, parameter selection and configuration, model training and evaluation, as well as model selection.

Findings
The Indonesian corpus of public figure statements attribution achieved 90.06% agreement level between the annotator and experts and could serve as a gold standard corpus. Furthermore, the baseline model predicted most labels and achieved 82.026% F-score.

Originality/value
To the best of the authors’ knowledge, the resulting corpus is the first corpus for attribution of public figures’ statements in the Indonesian language, which makes it a significant step for research on attribution extraction in the language. The resulting corpus and the baseline model can be used as a benchmark for further research. Other researchers could follow the methods presented in this paper to develop a new corpus and baseline model for other languages.
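
For readers less familiar with the two figures reported in the Findings, here is a minimal sketch of how a percent agreement level and an F-score can be computed over token-level labels. It is an illustration only: the label names (`PERSON`, `STATEMENT`, `O`), the function names, and the micro-averaged, token-level scoring are all assumptions made for this example, and the evaluation protocol actually used in the paper may differ.

```python
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of positions (in percent) where two label sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def micro_f_score(gold: Sequence[str], pred: Sequence[str], outside: str = "O") -> float:
    """Token-level micro F-score, ignoring the 'outside' label (an assumption)."""
    if len(gold) != len(pred):
        raise ValueError("sequences must be of equal length")
    # A true positive is a non-outside token whose predicted label equals the gold label.
    tp = sum(g == p and g != outside for g, p in zip(gold, pred))
    predicted = sum(p != outside for p in pred)
    actual = sum(g != outside for g in gold)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for a five-token sentence (not taken from the corpus).
expert = ["PERSON", "O", "STATEMENT", "STATEMENT", "O"]
annotator = ["PERSON", "O", "STATEMENT", "O", "O"]

agreement = percent_agreement(expert, annotator)
f_score = micro_f_score(expert, annotator)
print(f"agreement: {agreement:.2f}%  F-score: {f_score:.3f}")
```

In this toy case the two sequences agree on four of five tokens. The F-score treats the expert labels as the reference and scores only the non-`O` tokens.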

Keywords: Indonesian corpus, Public figures, Statement attribution, News article, Baseline model, Named entity recognition

Publication Details: Global Knowledge, Memory and Communication, Vol. 73 No. 6/7, pp. 853-870, 2024

DOI: 10.1108/GKMC-04-2022-0091

GitHub Repository: https://github.com/sigit-purnomo/pfsa-id

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@article{PurnomoWP2024PFSAID,
	title = {PFSA-ID: an annotated Indonesian corpus and baseline model of public figures statements attributions},
	journal = {Global Knowledge, Memory and Communication},
	volume = {73},
	number = {6/7},
	pages = {853--870},
	year = {2024},
	issn = {2514-9342},
	doi = {10.1108/GKMC-04-2022-0091},
	url = {https://www.emerald.com/gkmc/article-abstract/73/6-7/853/1222985/PFSA-ID-an-annotated-Indonesian-corpus-and},
	author = {Yohanes Sigit {Purnomo W.P.} and Yogan Jaya Kumar and Nur Zareen Zulkarnain},
	keywords = {Indonesian corpus, Public figures, Statement attribution, News article, Baseline model, Named entity recognition},
	abstract = {Purpose By far, the corpus for the quotation extraction and quotation attribution tasks in Indonesian is still limited in quantity and depth. This study aims to develop an Indonesian corpus of public figure statements attributions and a baseline model for attribution extraction, so it will contribute to fostering research in information extraction for the Indonesian language. Design/methodology/approach The methodology is divided into corpus development and extraction model development. During corpus development, data were collected and annotated. The development of the extraction model entails feature extraction, the definition of the model architecture, parameter selection and configuration, model training and evaluation, as well as model selection. Findings The Indonesian corpus of public figure statements attribution achieved 90.06% agreement level between the annotator and experts and could serve as a gold standard corpus. Furthermore, the baseline model predicted most labels and achieved 82.026% F-score. Originality/value To the best of the authors' knowledge, the resulting corpus is the first corpus for attribution of public figures' statements in the Indonesian language, which makes it a significant step for research on attribution extraction in the language. The resulting corpus and the baseline model can be used as a benchmark for further research. Other researchers could follow the methods presented in this paper to develop a new corpus and baseline model for other languages.}
}


Mixed Approach Speech-to-Text Translation for Endangered Language

2 minute read

Published: July 20, 2026

This study addresses the technological marginalization of endangered regional languages by evaluating speech-to-text translation for Dayak Ma’anyan, an extremely low-resource Austronesian language. It examines whether cascaded multilingual automatic speech recognition and machine translation models can provide effective Ma’anyan–Indonesian translation despite severe data scarcity.

Understanding Social-Cognitive-Norm Mechanisms Driving Disinformation Verification among Indonesian Young Adults on Social Media

2 minute read

Published: July 14, 2026

This research aimed to develop an integrated theoretical model to explain the factors influencing verification behavior regarding social media disinformation among young adults in Indonesia. The model combined the Stimulus–Organism–Response (SOR) framework with the Norm Activation Model (NAM) and the Social Identity Theory (SIT) to examine the collective effects of social, cognitive, and moral processes in shaping responsible information behavior. An online cross-sectional survey was conducted with 746 respondents, who actively used social networking sites to obtain and share information. The results showed that verification behavior was primarily driven by information skepticism and personal norms, emphasizing the importance of critical thinking and moral obligation for responsible engagement. In the Organism stage, awareness of fake information, perceived deception, and critical consumption enhanced moral sensitivity and analytical reasoning. At the social level, factors such as collective memory, parasocial interaction, and status-seeking were reported to be significant identity-based stimuli in shaping cognitive and moral responses, with gender found to moderate these effects. Thematic analysis suggested that most young adults verified information through cross-checking and peer consultation, but were influenced by social validation. Theoretically, this research contributed to the disinformation literature by framing verification as a cognitive-normative process rather than a reactive behavior. Initiatives by educational institutions, governmental bodies, and community organizations were recommended to strengthen moral reasoning, digital literacy, and civic responsibility in combating disinformation.

Automated Rubric-Based Classification of Student Peer Code Review Feedback

2 minute read

Published: May 15, 2026

This study examines automated approaches for classifying student peer code review feedback in Bahasa Indonesia according to rubric-based code quality categories. The study used 2,281 student feedback items collected from peer code review activities in an introductory programming course. The dataset was annotated into seven labels and validated with strong inter-rater agreement. Three approaches were compared: classical machine learning, deep learning, and few-shot prompting using large language models. Random Forest with count vectorization produced the best performance with an F1-score of 0.9430, outperforming recurrent convolutional neural networks with FastText embeddings and few-shot prompting. The findings indicate that classical machine learning with token-based features can be an effective and interpretable baseline for supporting rubric alignment in peer code review tools, especially in computing education contexts using Bahasa Indonesia.

Leveraging Machine Learning in Student Peer Review: A Systematic Literature Review

2 minute read

Published: April 30, 2026

Our study examines how machine learning techniques are integrated into student peer review processes, focusing on the challenges that motivate their adoption and the methods used to address them. Using Kitchenham’s systematic literature review framework, 328 articles were screened, and 25 empirical studies on machine learning applications in student peer review were selected. The findings show that machine learning is mainly used to manage large volumes of reviews, support automated grading, and improve feedback quality. Common techniques include classification, prediction, ranking, and clustering, which help improve the fairness, efficiency, and objectivity of peer review. This study provides a rigorous synthesis of machine learning adoption in student peer review and highlights its potential to enhance assessment accuracy, support learning outcomes, and guide future research and broader implementation in educational contexts.
