💼 Room 715, ECE Building, BUET
+880 255167100 (ext: 6536) (office)
+880 1740 430873 (mobile)
📧 jarez@iict.buet.ac.bd

Dr. Md. Jarez Miah

Assistant Professor

Institute of Information and Communication Technology (IICT)

Bangladesh University of Engineering and Technology (BUET)


Kawsar Ahmed

Thesis title: Abstractive Bangla Text Summarization using Text-to-Text Transfer Transformer

Abstract—Text summarization provides readers with a concise overview of long texts, making it easier to grasp the main points quickly while saving both time and effort. Modern transformer models can summarize text efficiently in many languages. However, applying them to Bangla (also known as Bengali) remains challenging because of a shortage of training data and the scarcity of pre-trained models built explicitly for Bangla. This research addresses these challenges by developing Bangla summarization models based on the Text-to-Text Transfer Transformer (T5), trained entirely on Bangla text data. The process covers everything from scratch, from pretraining to fine-tuning, including careful data collection from web crawls and open-source datasets, along with custom summary datasets for fine-tuning.

To achieve this, we first trained a custom tokenizer on both our crawled datasets and open-source Bangla datasets to ensure that it is well suited to the Bangla language. Using this tokenizer, we pre-trained T5-based models on all the collected data. After pretraining, we fine-tuned the models on two datasets: a 24k-example dataset that we created specifically for the summarization task, and the Bangla portion of the publicly available XLSUM dataset.

In this thesis, we developed and tested two models, bnT5-32k and bnT5-64k, using tokenizers with vocabulary sizes of 32,000 and 64,000, respectively. The models were evaluated using ROUGE scores. The bnT5-64k model delivered the best performance, suggesting that a larger vocabulary helps produce more accurate and meaningful summaries. The bnT5-32k model also performed well, though its performance was slightly below that of bnT5-64k. We also compared our models with two other available models: mT5, which is multilingual, and BanglaT5. We compared the performance of all the models on the same dataset, accounting for slight differences in architecture and vocabulary size. We observed that mT5-base and BanglaT5 struggled, especially in capturing key phrases and preserving the overall meaning of the text. These findings show that our bnT5 models can substantially improve Bangla text summarization and provide a foundation for future research on the Bangla language, especially in exploring different model designs.