Understanding quotation extraction and attribution: towards automatic extraction of public figure’s statements for journalism in Indonesia

This is my first publication as a PhD student at Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka. The article was published in Global Knowledge, Memory and Communication, a journal from Emerald Group Publishing Ltd.

Authors: Yohanes Sigit Purnomo W.P., Yogan Jaya Kumar, Nur Zareen Zulkarnain

Language: English

Abstract:
Purpose
Extracting information from unstructured data becomes a challenging task for computational linguistics. Public figure’s statement attributed by journalists in a story is one type of information that can be processed into structured data. Therefore, having the knowledge base about this data will be very beneficial for further use, such as for opinion mining, claim detection and fact-checking. This study aims to understand statement extraction tasks and the models that have already been applied to formulate a framework for further study.

Design/methodology/approach
This paper presents a literature review from selected previous research that specifically addresses the topics of quotation extraction and quotation attribution. Research works that discuss corpus development related to quotation extraction and quotation attribution are also considered. The findings of the review will be used as a basis for proposing a framework to direct further research.

Findings
There are three findings in this study. Firstly, the extraction process still consists of two main tasks, namely, the extraction of quotations and the attribution of quotations. Secondly, most extraction algorithms rely on a rule-based algorithm or traditional machine learning. And last, the availability of corpus, which is limited in quantity and depth. Based on these findings, a statement extraction framework for Indonesian language corpus and model development is proposed.

Originality/value
The paper serves as a guideline to formulate a framework for statement extraction based on the findings from the literature study. The proposed framework includes a corpus development in the Indonesian language and a model for public figure statement extraction. Furthermore, this study could be used as a reference to produce a similar framework for other languages.

Keywords: Journalism, Online News, Corpus Development, Indonesian Language, Quotation Extraction, Quotation Attribution, Statement Extraction

DOI: 10.1108/GKMC-07-2020-0098

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@article{PURNOMOWP2020,
	title = {Understanding quotation extraction and attribution: towards automatic extraction of public figure’s statements for journalism in {Indonesia}},
	journal = {Global Knowledge, Memory and Communication},
	volume = {70},
	pages = {655--671},
	year = {2020},
	issn = {2514-9342},
	doi = {10.1108/GKMC-07-2020-0098},
	url = {https://www.emerald.com/insight/content/doi/10.1108/GKMC-07-2020-0098/full/html},
	author = {Yohanes Sigit {Purnomo W.P.} and Yogan Jaya Kumar and Nur Zareen Zulkarnain},
	keywords = {Journalism, Online News, Corpus Development, Indonesian Language, Quotation Extraction, Quotation Attribution, Statement Extraction},
	abstract = {Purpose. Extracting information from unstructured data becomes a challenging task for computational linguistics. Public figure’s statement attributed by journalists in a story is one type of information that can be processed into structured data. Therefore, having the knowledge base about this data will be very beneficial for further use, such as for opinion mining, claim detection and fact-checking. This study aims to understand statement extraction tasks and the models that have already been applied to formulate a framework for further study. Design/methodology/approach. This paper presents a literature review from selected previous research that specifically addresses the topics of quotation extraction and quotation attribution. Research works that discuss corpus development related to quotation extraction and quotation attribution are also considered. The findings of the review will be used as a basis for proposing a framework to direct further research. Findings. There are three findings in this study. Firstly, the extraction process still consists of two main tasks, namely, the extraction of quotations and the attribution of quotations. Secondly, most extraction algorithms rely on a rule-based algorithm or traditional machine learning. And last, the availability of corpus, which is limited in quantity and depth. Based on these findings, a statement extraction framework for Indonesian language corpus and model development is proposed. Originality/value. The paper serves as a guideline to formulate a framework for statement extraction based on the findings from the literature study. The proposed framework includes a corpus development in the Indonesian language and a model for public figure statement extraction. Furthermore, this study could be used as a reference to produce a similar framework for other languages.}
}
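
For readers who write in LaTeX, the entry above can be cited as in the minimal sketch below. It assumes the entry has been saved in a file named references.bib (a hypothetical file name) next to the document, and it uses the standard plain bibliography style; any other style should work as well.

```latex
\documentclass{article}

\begin{document}

% PURNOMOWP2020 is the citation key of the BibTeX entry above.
Quotation extraction and attribution for Indonesian news
are reviewed in~\cite{PURNOMOWP2020}.

% "references" is an assumed file name (references.bib).
\bibliographystyle{plain}
\bibliography{references}

\end{document}
```

Running latex, then bibtex, then latex twice more resolves the citation and prints the reference list.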