NLP Implementation For AI Generated Text Detection (ChatGPT) Using Naive Bayes Method

Authors

  • Rafel Fernando Universitas Duta Bangsa Surakarta
  • Yuliana Dewi Proboningrum Universitas Duta Bangsa Surakarta
  • Septi Dwi Supriati Universitas Duta Bangsa Surakarta
  • Nurmalitasari Nurmalitasari Universitas Duta Bangsa Surakarta

DOI:

https://doi.org/10.32664/j-intech.v13i02.2026

Keywords:

NLP, Multinomial Naive Bayes, Detection, Artificial Intelligence, ChatGPT

Abstract

The development of artificial intelligence (AI) technology, especially large language models such as ChatGPT, raises challenges for the authenticity and validity of digital content. AI's ability to produce human-like text opens the door to misuse, such as plagiarism and information manipulation. This study aims to develop an AI text detection system using the Multinomial Naive Bayes algorithm, chosen because its ease of use and effectiveness have made it a popular choice for text classification. The dataset is the Human ChatGPT Comparison Corpus (HC3), drawn from the ELI5 subreddit on Reddit and consisting of 800 entries of questions with answers from both humans and AI. Labeling combines the answers into a single column and assigns each a label based on its source. Preprocessing consists of case folding, removal of digits and punctuation, tokenization, stopword removal, normalization, and text finalization. Text features are extracted with TF-IDF, limited to the top 1000 features. The model is trained on 80% of the data and tested on the remaining 20%, where it reaches an accuracy of 93%. These findings suggest that the Naive Bayes method is effective in distinguishing AI-generated from human-written text and has potential as an automatic AI content detection tool.
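The pipeline described in the abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword list, the toy sentences, the smoothed IDF formula, the choice of vocabulary by corpus frequency, and all names (`preprocess`, `TfidfMultinomialNB`, `train_test_split`) are hypothetical, and the normalization and finalization steps are omitted because the abstract does not specify them.

```python
import math
import random
import re
import string
from collections import Counter

# Tiny illustrative stopword list (assumption; the study's list is not given).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that"}
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Case folding, digit and punctuation removal, tokenization, stopword removal."""
    text = re.sub(r"\d+", " ", text.lower()).translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def train_test_split(docs, labels, test_size=0.2, seed=42):
    """Shuffle indices and hold out `test_size` of the data (80/20 by default)."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) - round(len(idx) * test_size)
    train, test = idx[:cut], idx[cut:]
    return ([docs[i] for i in train], [docs[i] for i in test],
            [labels[i] for i in train], [labels[i] for i in test])


class TfidfMultinomialNB:
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""

    def __init__(self, max_features=1000, alpha=1.0):
        self.max_features = max_features
        self.alpha = alpha

    def _tfidf(self, tokens):
        counts = Counter(t for t in tokens if t in self.idf)
        return {t: n * self.idf[t] for t, n in counts.items()}

    def fit(self, docs, labels):
        tokenized = [preprocess(d) for d in docs]
        n_docs = len(tokenized)
        # Vocabulary: the most frequent terms in the corpus (assumed criterion).
        term_counts = Counter(tok for toks in tokenized for tok in toks)
        self.vocab = [t for t, _ in term_counts.most_common(self.max_features)]
        doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
        # Smoothed inverse document frequency (one common variant).
        self.idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0
                    for t in self.vocab}
        self.classes = sorted(set(labels))
        self.log_prior, self.log_likelihood = {}, {}
        for c in self.classes:
            class_docs = [toks for toks, y in zip(tokenized, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / n_docs)
            weights = Counter()
            for toks in class_docs:
                weights.update(self._tfidf(toks))  # adds the TF-IDF weights
            total = sum(weights.values()) + self.alpha * len(self.vocab)
            self.log_likelihood[c] = {
                t: math.log((weights[t] + self.alpha) / total) for t in self.vocab}
        return self

    def predict(self, docs):
        preds = []
        for d in docs:
            feats = self._tfidf(preprocess(d))
            scores = {c: self.log_prior[c]
                      + sum(w * self.log_likelihood[c][t] for t, w in feats.items())
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return preds


# Invented toy examples standing in for the real question-and-answer data.
docs = [
    "lol i think magnets just work because tiny spins line up, idk",
    "honestly my dad told me rain is just clouds getting heavy lol",
    "idk but i guess planes fly because wings push air down",
    "Certainly! Magnets work because electron spins align. In summary, domains create a field.",
    "Certainly! Rain forms when water vapor condenses. In summary, heavy droplets fall.",
    "Certainly! Planes fly because wings generate lift. In summary, pressure keeps them airborne.",
]
labels = ["human", "human", "human", "ai", "ai", "ai"]
model = TfidfMultinomialNB(max_features=1000).fit(docs, labels)
```

With the real data, one would call `train_test_split` first, fit on the training portion, and compare `model.predict` on the held-out portion against its labels to compute accuracy. The toy set here is far too small for a meaningful split, so the model is fit on all six sentences purely to show the mechanics.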

Published

2025-12-19