University of Mysore · September 2023
The original thesis pages.
Abstract
A summarization system for Pashto.
This project proposes the development of a text summarization system for the Pashto language using natural-language-processing techniques and machine-learning algorithms. The system extracts the main points of an article or long passage using the NLTK package adapted to Pashto, with tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named-entity recognition all tuned for the language.
The front-end is built with HTML, CSS, Bootstrap and JavaScript. MySQL serves as the database, and the Flask framework of Python powers the back-end. The system is expected to produce efficient, accurate summaries of Pashto texts, making it a valuable tool for content analysis and decision-making in Pashto-speaking regions, with a back-end designed to handle large volumes of text efficiently.
Chapter 1 · Introduction
Why Pashto. Why now.
1.1 Preamble
Pashto is spoken by tens of millions of people, chiefly in Afghanistan and Pakistan, and yet the tools for processing it computationally are scarce. The absence of efficient summarization systems for Pashto prevents journalists, researchers, students, and ordinary readers from extracting value from long-form Pashto writing. This project takes that absence as its starting point.
The proposed system combines NLTK with BERT (Bidirectional Encoder Representations from Transformers) to produce contextualised word embeddings and improve summary quality. The front-end is HTML/CSS/Bootstrap/JavaScript; the back-end is Flask + MySQL.
1.2 Existing systems
Pashto summarization is an under-served area of NLP. Most published work targets English, Arabic, or major European languages. Where Pashto tools exist, they tend to be brittle, narrow, or unmaintained.
1.3 Motivation
Sixty million Pashto speakers deserve software that treats their language as a first-class citizen, not a translation afterthought. The motivation is straightforward: if the tools don't exist, build them.
1.4 Proposed system
A web application where a user pastes Pashto text and receives a clean summary, generated by a BERT-based pipeline that understands the language's grammar and morphology. Reading time before vs. after is shown on the result page.
1.5 Objectives
- Build an NLP pipeline that works natively in Pashto (not transliterated).
- Integrate BERT to produce summaries that capture context, not just keyword frequency.
- Ship a web interface anyone can use without a developer in the loop.
- Lay the groundwork for future Pashto-language tools.
1.6 Applications
News rooms processing wire feeds. Researchers reading long Pashto documents. Students summarising long-form sources. Government and humanitarian organisations digesting reports. Educators producing teaching material.
Chapter 2 · Literature review
Standing on the work that came before.
The proposal builds on three landmark papers:
- Vaswani et al. (2017), "Attention Is All You Need." The original Transformer paper. It replaces recurrence with self-attention, and most modern NLP architectures descend from it.
- Devlin et al. (2018), "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Establishes the bidirectional pre-training that this thesis adapts to Pashto.
- Paulus et al. (2017), "A Deep Reinforced Model for Abstractive Summarization." Sets the abstractive-summarization baseline that the project aspires to.
Adjacent work on extractive summarization (Nallapati et al., Erkan & Radev's LexRank) informs the extractive fallbacks used during early experimentation.
Chapter 3 · Software requirements
What the system has to do.
Functional requirements
- Accept Pashto text input through a web interface.
- Validate the input is well-formed Pashto.
- Produce a summary as Pashto output.
- Report reading-time before and after summarization.
- Allow the user to clear, edit, and re-summarize.
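The reading-time requirement above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the reading speed of 200 words per minute and the whitespace-based word count are both assumptions, and the function names are hypothetical.

```python
import math

# Assumed average reading speed; the thesis does not state the value it used.
WORDS_PER_MINUTE = 200


def reading_time_seconds(text: str, wpm: int = WORDS_PER_MINUTE) -> int:
    """Estimate reading time by counting whitespace-separated words."""
    words = len(text.split())
    return math.ceil(words * 60 / wpm)


def reading_time_report(original: str, summary: str) -> dict:
    """Return reading time before and after summarization, in seconds."""
    before = reading_time_seconds(original)
    after = reading_time_seconds(summary)
    return {"before_s": before, "after_s": after, "saved_s": before - after}
```

Counting whitespace-separated tokens is a rough proxy for word count in Pashto, but it is cheap enough to run on every request.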
Non-functional requirements
- Summarization must complete in seconds on commodity hardware.
- The interface must be usable on any modern browser.
- The system must scale to very long input documents.
- The back-end must be portable across operating systems.
Feasibility
Technical, economic, operational, and social feasibility were assessed in the original thesis; the conclusion was that the project was implementable with off-the-shelf tooling on a single-developer budget, which proved correct.
Chapter 4 · Methodology
How the pipeline works.
The summarization pipeline is a series of stages, each adapted to Pashto.
- Tokenization. Splitting Pashto text into words and sentences, respecting the language's punctuation conventions.
- Stop-word removal. Filtering out high-frequency Pashto function words that contribute little to meaning.
- Stemming. Reducing words to root forms while preserving Pashto's morphological richness.
- BERT embedding. Generating contextual vectors for each sentence using a BERT model fine-tuned on Pashto data.
- Scoring & ranking. Each sentence is scored on its semantic centrality to the document.
- Summary assembly. The top-ranked sentences are returned in their original document order.
The approach is extractive at its core: the model selects the most important sentences from the source rather than generating new ones. What it takes from abstractive work is the use of contextual embeddings, rather than keyword frequency, to decide what "important" means.
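The scoring and assembly stages can be sketched as follows. The thesis does not specify the scoring function, so this sketch assumes that "semantic centrality" means a sentence's mean cosine similarity to every other sentence. The function names are hypothetical, and the embeddings are plain lists of floats standing in for BERT sentence vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centrality_scores(embeddings):
    """Score each sentence by its mean similarity to all other sentences."""
    n = len(embeddings)
    if n < 2:
        return [1.0] * n
    scores = []
    for i, u in enumerate(embeddings):
        total = sum(cosine(u, v) for j, v in enumerate(embeddings) if j != i)
        scores.append(total / (n - 1))
    return scores


def summarize(sentences, embeddings, k=3):
    """Pick the k most central sentences and return them in document order."""
    scores = centrality_scores(embeddings)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Sorting the selected indices before assembling the summary is what keeps the output in original document order, regardless of how the sentences ranked.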
Chapter 5 · System design
Three layers, cleanly separated.
The system is a classic three-tier web application.
- Presentation tier. HTML, CSS, Bootstrap and vanilla JavaScript. A single input page and a result page. No framework, no build step.
- Application tier. Flask (Python) exposes a single endpoint that accepts Pashto text and returns a summary. All NLP logic lives behind this endpoint.
- Data tier. MySQL stores user inputs, summaries, and metadata for evaluation and future model improvements.
The thesis includes a full architecture diagram, data-flow diagram, use-case diagram, and entity-relationship diagram. The summary: input flows from the user to Flask, through the NLP pipeline, and back to the user as a rendered HTML response.
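A minimal sketch of the application tier, assuming a single POST route. The route name `/summarize`, the JSON request and response shape, and the `run_pipeline` stand-in are all hypothetical. The thesis describes a rendered HTML response; JSON is used here only to keep the sketch short.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def run_pipeline(text: str) -> str:
    """Hypothetical stand-in for the NLP pipeline: returns the first sentence."""
    return text.split(".")[0].strip()


@app.route("/summarize", methods=["POST"])  # route name is an assumption
def summarize_endpoint():
    # silent=True yields None instead of raising on a missing or malformed body.
    payload = request.get_json(silent=True)
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return jsonify(error="No text supplied."), 400
    return jsonify(summary=run_pipeline(text))
```

Keeping the pipeline behind one function call means the presentation tier never needs to know which NLP libraries sit underneath it.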
Chapter 6 · Implementation
Tools, modules, and what got built.
Tools and technologies
- Python 3 as the implementation language.
- NLTK for tokenization, stop-word lists, and stemming primitives.
- Hugging Face Transformers for BERT model loading and inference.
- Flask as the web framework.
- MySQL for persistence.
- HTML / CSS / Bootstrap / JavaScript for the front-end.
Modules
- Input module. Captures raw Pashto text and validates it.
- Preprocessing module. Tokenization, stop-word removal, stemming.
- Embedding module. Generates BERT embeddings per sentence.
- Ranking module. Scores and orders sentences by centrality.
- Output module. Renders the summary with reading-time metrics.
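The preprocessing module might look roughly like the sketch below. It uses plain regular expressions rather than NLTK, omits stemming entirely, and its stop-word set is a small illustrative sample, not the list used in the thesis. All names are hypothetical.

```python
import re

# Illustrative sample of Pashto function words; not the thesis's actual list.
PASHTO_STOP_WORDS = {"د", "په", "او", "چې", "ته", "له", "هم"}

# Split after a full stop, exclamation mark, or Arabic question mark.
_SENTENCE_END = re.compile(r"(?<=[.!؟۔])\s+")

# A word is a run of Arabic-script letters and combining marks.
_WORD = re.compile(r"[\u0621-\u065F\u066E-\u06D3\u06D5]+")


def split_sentences(text):
    """Split text into sentences on sentence-final punctuation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Extract Arabic-script word tokens, dropping punctuation and digits."""
    return _WORD.findall(sentence)


def remove_stop_words(tokens):
    """Filter out tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in PASHTO_STOP_WORDS]


def preprocess(text):
    """Return one list of content tokens per sentence."""
    return [remove_stop_words(tokenize(s)) for s in split_sentences(text)]
```

The preprocessed tokens would feed any frequency-based fallback, while the embedding module would more likely receive the untouched sentences from `split_sentences`, since BERT models apply their own subword tokenization.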
Chapter 7 · System testing
How we knew it worked.
The system was exercised end-to-end across multiple test categories.
- Unit testing on each preprocessing step.
- Integration testing across the full pipeline.
- Interface testing to verify the web UI is usable by people who didn't build it.
- Regression testing to catch breakage when models or dependencies changed.
- Acceptance testing with end-users.
- Usability, UI, performance, security — all assessed.
Two key test cases:
- Input: a valid Pashto paragraph. Expected: a Pashto summary. Result: successful.
- Input: invalid (non-Pashto / malformed) text. Expected: rejected with an error. Result: successful.
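The two cases above can be expressed as a small validation check. This sketch assumes that "well-formed Pashto" is approximated by the share of letters that fall in the Unicode Arabic block; the thresholds and names are hypothetical. A script check of this kind cannot tell Pashto apart from Dari, Urdu, or Arabic, so it is only a first filter.

```python
class InvalidInputError(ValueError):
    """Raised when the input does not look like Pashto text."""


def arabic_script_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Unicode Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return in_block / len(letters)


def validate_pashto(text: str, min_ratio: float = 0.8, min_letters: int = 10) -> str:
    """Return the stripped text, or raise InvalidInputError if it fails the checks."""
    stripped = text.strip()
    if not stripped:
        raise InvalidInputError("Input is empty.")
    if sum(ch.isalpha() for ch in stripped) < min_letters:
        raise InvalidInputError("Input is too short to summarize.")
    if arabic_script_ratio(stripped) < min_ratio:
        raise InvalidInputError("Input does not appear to be Pashto text.")
    return stripped
```

A valid Pashto paragraph passes through unchanged apart from surrounding whitespace, while English or empty input raises `InvalidInputError`, which the web layer can turn into an error message.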
Chapter 8 · Results and discussions
The system shipped. Pashto got summarized.
The system generated concise, contextually relevant summaries from lengthy Pashto texts. BERT's contextual embeddings let the ranking stage weigh each sentence against the rest of the document instead of relying on keyword frequency alone. The user-friendly front-end simplified interaction. The Flask back-end handled substantial datasets without faltering.
The introduction of BERT into Pashto text summarization is, in the language's tooling history, a significant advancement. BERT excels at capturing nuanced context, which matters disproportionately in Pashto due to its morphological complexity and dialectal variation. The system's potential impact extends to journalism, academic research, and education — anywhere Pashto speakers and writers need to extract meaning from long documents quickly.
The thesis was submitted, defended, and approved. The system is on file with the university and on this site.
Chapter 9 · From thesis to LEIK
Where the work went next.
The thesis closed with four future directions: better contextual understanding, more regional languages, learning from user feedback, voice interfaces. LEIK is that list, made real. What began as a summarizer for written Pashto is now a captioning tool for the spoken language, where every Pashto voice is paired with English and both are burned into the video in minutes. The scope grew from text to voice, then voice to video, but the motivation didn't move. Sixty million Pashto speakers, almost no software that respects their language. The thesis was a small move against that. LEIK is the larger one.
Chapter 10 · References
The shoulders we stood on.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS, 5998–6008.
- Rush, A. M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. arXiv:1509.00685.
- Nallapati, R., Zhai, F., & Zhou, B. (2016). SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. AAAI.
- Zhang, X., Zhao, J., & LeCun, Y. (2018). Self-attentive Embeddings for Natural Language Processing. NeurIPS.
- Wu, F., Wang, T., Liu, Z., & Gao, X. (2019). A Study of Extractive Summarization using Multi-task Learning and Saliency-based Focus. IPM 56(2), 346–358.
- Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdinov, R., & Le, Q. V. (2019). XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS.
- Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. JAIR 22, 457–479.
- Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. EMNLP.