DOI: https://doi.org/10.64539/sjcs.v2i1.2026.380

Keywords: Hallucination reduction, Knowledge graph, Large language models, Legal question answering, Retrieval-augmented generation, Self-reflection, Vietnamese natural language processing

Abstract
Legal question answering is essential for compliance, dispute resolution, and everyday HR decision-making, yet large language models may produce persuasive but incorrect legal statements when supporting evidence is incomplete. While Retrieval-Augmented Generation and graph-based retrieval can ground responses in statutes and structured relations, Vietnamese legal QA often lacks an explicit, automated quality-control step that scores an answer, decides whether it should be refined, and checks that citations are actually supported. In this paper, we propose a self-reflection mechanism that adds an iterative generate–evaluate–refine loop to a Graph-RAG pipeline for Vietnamese labor-law questions. Each draft is evaluated with a hybrid score that combines how closely the answer matches retrieved legal context with a model-derived confidence estimate, and the system iterates until it reaches a quality threshold or a stopping limit. On a Vietnamese Labor Law benchmark, the approach improves accuracy from 81.5% to 86.7% and reduces hallucination from 18.7% to 9.3%, with only a modest increase in end-to-end latency in typical use. We also examine component contributions and remaining failure cases, finding that pairing contextual alignment with confidence produces more stable answers than relying on a single signal. These results indicate that self-reflection can serve as a lightweight, deployment-friendly safety layer for high-stakes legal QA without requiring additional labeled data or model fine-tuning, and it can be adapted to other Vietnamese legal domains that demand transparent, article- and clause-level evidence.
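The generate–evaluate–refine loop described above can be sketched in a few lines. This is a minimal illustration only: the function names, scoring weights, threshold, and iteration limit below are assumptions for exposition, not values or interfaces from the paper, and the alignment score is a toy stand-in for the paper's contextual-alignment measure.

```python
# Illustrative sketch of a self-reflection loop: draft an answer, score it
# with a hybrid of contextual alignment and model confidence, and refine
# until a quality threshold or an iteration limit is reached.
# All names, weights, and thresholds are hypothetical.

def context_alignment(answer: str, context: list[str]) -> float:
    """Toy alignment score: fraction of retrieved snippets echoed in the answer."""
    if not context:
        return 0.0
    hits = sum(1 for snippet in context if snippet.lower() in answer.lower())
    return hits / len(context)

def hybrid_score(answer: str, context: list[str], confidence: float,
                 w_align: float = 0.6, w_conf: float = 0.4) -> float:
    """Combine contextual alignment with a model-derived confidence estimate."""
    return w_align * context_alignment(answer, context) + w_conf * confidence

def self_reflect(generate, confidence_of, context,
                 threshold: float = 0.8, max_iters: int = 3) -> str:
    """Iterate generate -> evaluate -> refine until threshold or stop limit."""
    answer = generate(None)  # initial draft, no feedback yet
    for _ in range(max_iters):
        score = hybrid_score(answer, context, confidence_of(answer))
        if score >= threshold:
            break  # draft is good enough; stop refining
        feedback = f"score={score:.2f}: ground the answer in the cited articles"
        answer = generate(feedback)  # refine using the critique
    return answer
```

In a real pipeline, `generate` would call the LLM with the retrieved graph context plus the critique, and `confidence_of` would come from the model's own uncertainty signal; here both are left abstract so the control flow of the loop is the focus.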