Open Access

Improving POS Tagging for Singlish via Data Weighting

DOI: 10.23977/langl.2026.090104

Author(s)

Chaojie Lin 1, Xiaoxi Luo 2

Affiliation(s)

1 Trinity College School, Cambridge, Ontario, Canada
2 University of Waterloo, Waterloo, Ontario, Canada

Corresponding Author

Xiaoxi Luo

ABSTRACT

Singlish, or Colloquial Singapore English, is an English-based contact language influenced by multiple substrate languages, including Malay, Tamil, and Southern Chinese varieties. Its mixed grammatical patterns and vocabulary pose significant challenges for standard NLP tools, particularly part-of-speech (POS) tagging. In this study, we investigate whether a simple data-centric strategy—up-weighting Singlish training data while including Standard English UD examples—can improve POS tagging performance without complex architectures. Using an averaged perceptron tagger, we show that the weighted training setup achieves higher accuracy than the Singlish-only baseline, reduces error variance, and successfully captures Chinese-derived grammatical structures. Error analysis indicates that most tagging errors arise from POS polysemy rather than code-mixing, highlighting the effectiveness of data weighting in low-resource settings. Our results suggest that careful data design alone can yield meaningful improvements for processing creole and contact languages.
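
The abstract does not say how the up-weighting is applied, so the following is only a minimal sketch of one common approach: repeating each Singlish sentence an integer number of times before mixing it with the Standard English UD sentences. The function name `build_weighted_corpus`, the `singlish_weight` parameter, and the toy sentences and tags are all hypothetical and are not taken from the paper.

```python
import random
from typing import List, Tuple

# A tagged sentence is a list of (token, POS tag) pairs.
Sentence = List[Tuple[str, str]]


def build_weighted_corpus(
    singlish: List[Sentence],
    english: List[Sentence],
    singlish_weight: int = 3,  # assumed integer up-weighting factor
    seed: int = 0,
) -> List[Sentence]:
    """Mix English and Singlish sentences, repeating each Singlish
    sentence `singlish_weight` times (up-weighting by oversampling)."""
    if singlish_weight < 1:
        raise ValueError("singlish_weight must be at least 1")
    corpus = list(english)
    for sentence in singlish:
        corpus.extend([sentence] * singlish_weight)
    # Shuffle so an online learner such as a perceptron does not see
    # all copies of a sentence back to back; seeded for repeatability.
    random.Random(seed).shuffle(corpus)
    return corpus


# Toy data for illustration only; the tags are not from any treebank.
toy_singlish: List[Sentence] = [
    [("He", "PRON"), ("go", "VERB"), ("already", "ADV"), ("lah", "INTJ")],
]
toy_english: List[Sentence] = [
    [("He", "PRON"), ("has", "AUX"), ("left", "VERB")],
    [("She", "PRON"), ("is", "AUX"), ("here", "ADV")],
]

corpus = build_weighted_corpus(toy_singlish, toy_english, singlish_weight=3)
print(len(corpus))
```

The resulting list could then be passed to any tagger's training routine. Oversampling is only one option: a per-example weight applied to the perceptron update would have a similar effect without duplicating data.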

KEYWORDS

Singlish, Part-of-Speech Tagging, Low-Resource Languages, Perceptron Tagger, Data Weighting

CITE THIS PAPER

Chaojie Lin, Xiaoxi Luo. Improving POS Tagging for Singlish via Data Weighting. Lecture Notes on Language and Literature (2026). Vol. 9, No. 1, 27-32. DOI: http://dx.doi.org/10.23977/langl.2026.090104.

REFERENCES

[1] Bao, Z. (2005). The aspectual system of Singapore English and the systemic substratist explanation. Journal of Linguistics, 41(2), 237–267. https://doi.org/10.1017/S0022226705003269
[2] Gupta, A. F. (1992). The pragmatic particles of Singapore Colloquial English. Journal of Pragmatics, 18(1), 31–57. https://doi.org/10.1016/0378-2166(92)90106-L
[3] Honnibal, M., & Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1373–1378). Association for Computational Linguistics.
[4] Jurafsky, D., & Martin, J. H. (2025). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition with language models (3rd ed.). Online manuscript. Stanford University.
[5] Lim, L. (2004). Singapore English. John Benjamins Publishing Company. https://doi.org/10.1075/veaw.g33
[6] Wang, H., Yang, J., & Zhang, Y. (2019). From genesis to creole language. ACM Transactions on Asian and Low-Resource Language Information Processing, 19(1), 1–29. https://doi.org/10.1145/3321128
[7] Wang, H., Zhang, Y., Chan, G. L., Yang, J., & Chieu, H. L. (2017). Universal dependencies parsing for colloquial Singaporean English. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1732–1744). Association for Computational Linguistics.

All published work is licensed under a Creative Commons Attribution 4.0 International License.