Background
Developing chatbots for specialized domains is often hindered by data scarcity and informal linguistic variation. In such low-resource domains, both factors significantly degrade model relevance and performance.
In the Indonesian context, data scarcity is compounded by vast linguistic diversity: formal and informal registers plus numerous regional dialects. Traditional systems struggle to capture these semantic nuances and frequently fail to recognize user intent when phrasing deviates from standard linguistic patterns.
The Low-Resource Challenge
Data Scarcity
Specialized institutional domains, such as academic services at FILKOM UB, lack massive, diverse training datasets needed to train generative models from scratch.
Morphological Complexity
Indonesian dialectal variations and informal structures mean the same intent can be expressed in vastly different syntactic forms.
Overfitting Risks
Full fine-tuning updates every model parameter, which is computationally expensive and prone to overfitting in low-data regimes.
Architectural Gap
Standard Transformer decoders struggle to maintain generation quality and context alignment without rigorous guidance in zero-shot or few-shot scenarios.
The Proposed Solution
I designed a Parameter-Efficient Fine-Tuning (PEFT) strategy built on a novel prefix-tuned encoder-decoder architecture. The framework couples a Semantic-Based Model (the encoder) with a Generative-Based Model (the decoder), so it can be adapted to the target domain without full retraining.
Semantic Encoder (IndoSBERT-Large) → MLP Adapter (projects vector to continuous prefix) → Generative Decoder (GPT-2 Medium) → Output: Contextually Generated Answer
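To give a feel for why training only the adapter counts as parameter-efficient, the sketch below counts the weights of a hypothetical two-layer MLP adapter and compares them with the size of the decoder. Several values are assumptions rather than details from this project: the encoder output width (1024, as in BERT-large-style models), the adapter's hidden width (512), and the decoder size (roughly 355 million parameters for the original GPT-2 Medium; the Indonesian checkpoint may differ). Only the 3 prefix tokens come from the configuration reported here.

```python
def linear_params(in_dim: int, out_dim: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim


def adapter_params(enc_dim: int, hidden_dim: int, dec_dim: int, num_prefix: int) -> int:
    """Parameter count of a two-layer MLP that maps one sentence vector
    to num_prefix decoder-sized prefix vectors."""
    return (linear_params(enc_dim, hidden_dim)
            + linear_params(hidden_dim, num_prefix * dec_dim))


ENC_DIM = 1024                # assumption: BERT-large-style sentence vector
HIDDEN_DIM = 512              # assumption: hypothetical adapter hidden width
DEC_DIM = 1024                # hidden size of GPT-2 Medium
NUM_PREFIX = 3                # prefix length of the best reported configuration
DECODER_PARAMS = 355_000_000  # approximate; original GPT-2 Medium

trainable = adapter_params(ENC_DIM, HIDDEN_DIM, DEC_DIM, NUM_PREFIX)
share = trainable / DECODER_PARAMS

print(f"adapter parameters: {trainable:,}")
print(f"share of decoder size: {share:.2%}")
```

Under these assumptions the adapter holds about 2.1 million weights, well under 1% of the decoder. Whether the encoder is also updated during training is not stated here, so the count covers the adapter alone.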
Semantic Encoder (IndoSBERT-Large)
A pretrained encoder processes structurally diverse user inputs into a holistic, fixed-dimensional semantic vector, clustering dialectal variations that share the same intent.
MLP Adapter Mapping
A Multi-Layer Perceptron (MLP) adapter projects the semantic vector into a short sequence of continuous prefix vectors, which serve as soft prompts for the decoder.
Generative Decoder (GPT-2 Medium)
These continuous semantic prefixes are prepended to the input of the frozen GPT-2 Medium decoder, steering auto-regressive text generation toward the encoded semantic intent.
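The three components above can be sketched end to end at the level of tensor shapes. The following is a minimal, dependency-free illustration and not the project's implementation: the class `PrefixAdapter`, the `tanh` activation, the toy dimensions, and the choice to prepend prefixes at the embedding level (prompt-tuning style) are all assumptions. A real system would use a deep learning framework and the actual encoder and decoder.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Randomly initialised weight rows and zero biases for one layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, weights, bias):
    """Compute W x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class PrefixAdapter:
    """Hypothetical two-layer MLP: one sentence vector in,
    num_prefix decoder-sized prefix vectors out."""

    def __init__(self, enc_dim, hidden_dim, dec_dim, num_prefix, seed=0):
        rng = random.Random(seed)
        self.dec_dim = dec_dim
        self.num_prefix = num_prefix
        self.w1, self.b1 = init_linear(enc_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, num_prefix * dec_dim, rng)

    def __call__(self, sentence_vec):
        hidden = [math.tanh(h) for h in linear(sentence_vec, self.w1, self.b1)]
        flat = linear(hidden, self.w2, self.b2)
        # Split the flat output into num_prefix vectors of width dec_dim.
        return [flat[i * self.dec_dim:(i + 1) * self.dec_dim]
                for i in range(self.num_prefix)]


def prepend_prefix(prefix, token_embeddings):
    """Place the prefix vectors in front of the token embeddings."""
    return prefix + token_embeddings


# Toy sizes so the example runs instantly; real widths are far larger.
adapter = PrefixAdapter(enc_dim=8, hidden_dim=6, dec_dim=4, num_prefix=3)
sentence_vec = [0.1 * i for i in range(8)]           # stand-in for the encoder output
token_embeddings = [[0.0] * 4 for _ in range(5)]     # stand-in for 5 embedded tokens

prefix = adapter(sentence_vec)
decoder_input = prepend_prefix(prefix, token_embeddings)
print(len(prefix), len(decoder_input), len(decoder_input[0]))  # 3 8 4
```

In this sketch only the adapter's weights would receive gradient updates, while the decoder stays frozen and simply sees three extra positions ahead of the token sequence.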
Semantic Encoding & Clustering
The core advantage of IndoSBERT-Large in this architecture is its ability to cluster semantically identical queries, such as a formal question and its regional-dialect counterpart, into tight, distinct groups. Token-by-token processing tends to fail here because informal variants share few surface tokens with their formal equivalents.
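Clustering of this kind is usually measured with cosine similarity between sentence vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; they are not IndoSBERT-Large outputs, and the variable names are hypothetical. It shows that two phrasings of one intent score close to 1, while an unrelated intent scores much lower.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values, for illustration only).
formal_query = [0.90, 0.10, 0.20]     # formal phrasing of an intent
dialect_query = [0.85, 0.15, 0.25]    # informal or dialect phrasing, same intent
other_intent = [0.10, 0.90, 0.30]     # a different intent

same_intent = cosine_similarity(formal_query, dialect_query)
different_intent = cosine_similarity(formal_query, other_intent)

print(f"same intent:      {same_intent:.3f}")
print(f"different intent: {different_intent:.3f}")
```

With real sentence embeddings the same comparison applies unchanged; only the vectors would come from the encoder instead of being written by hand.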
Figure panels: Traditional Approach vs. Proposed Approach (IndoSBERT)
Ablation Study & Hyperparameter Optimization
Hyperparameter tuning and ablation studies were conducted to determine the optimal configuration. The results indicate that both explicit semantic guidance and the prefix length affect output quality: in the grid search, the 3-token configuration scores highest, while the 5-token and 0-token configurations score lower on both BLEU and chrF.
Model A: Full Benchmark
IndoSBERT-Large + GPT-2 Medium (3 Prefix Tokens).
Model B: Semantic Ablation
Encoder removed. Prefixes are randomly initialized trainable vectors.
Model C: Pretraining Ablation
Encoder architecture retained, but trained entirely from scratch instead of starting from pretrained weights.
Grid Search Highlights
| Encoder | Decoder | Prefix Tokens | BLEU (0-1) | chrF (0-100) |
|---|---|---|---|---|
| IndoSBERT-Large | GPT2-Medium (Indo) | 3 | 0.782 | 89.672 |
| IndoSBERT-Large | GPT2-Large (Indo) | 3 | 0.757 | 86.971 |
| IndoSBERT-Large | GPT2-Medium (Indo) | 5 | 0.724 | 85.872 |
| Transformer (Baseline) | Transformer | 0 | 0.653 | 77.654 |
| IndoBERT-Large | GPT2-Medium (Indo) | 0 | 0.591 | 70.941 |
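The two metrics in the table sit on different scales, which is easier to see with chrF written out. The function below is a simplified sketch of a character n-gram F-score (n-gram orders 1 to 6, recall weighted with beta = 2, whitespace ignored). It is not the evaluation code used for the table, whose tooling is not stated, so its scores are not comparable to the reported values. The function names and the example sentences are illustrative assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Counts of character n-grams, ignoring whitespace."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale: average n-gram precision and
    recall over orders 1..max_n, then combine them with an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "bagaimana cara mengisi KRS"
hypothesis = "gimana cara isi KRS"
score = simple_chrf(hypothesis, reference)
print(f"simplified chrF: {score:.1f}")
```

Because the score is built from character overlaps, partially matching word forms still earn credit, which suits text with informal spelling and affix variation.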
Results Achieved
Evaluated against a standard baseline Transformer model trained from scratch, the best prefix-tuned configuration (IndoSBERT-Large + GPT-2 Medium, 3 prefix tokens) raised BLEU from 0.653 to 0.782 and chrF from 77.654 to 89.672.
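The size of the improvement can be read off the grid-search table by comparing the top row with the baseline Transformer row. The short calculation below does only that arithmetic; the scores are copied from the table, and the helper name `relative_gain` is illustrative.

```python
def relative_gain(new: float, old: float) -> float:
    """Percentage improvement of new over old."""
    return (new - old) / old * 100.0


# Scores taken from the grid-search table.
baseline = {"bleu": 0.653, "chrf": 77.654}   # Transformer (Baseline), 0 prefix tokens
proposed = {"bleu": 0.782, "chrf": 89.672}   # IndoSBERT-Large + GPT2-Medium, 3 prefix tokens

gains = {metric: relative_gain(proposed[metric], baseline[metric])
         for metric in baseline}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.1f}% relative to baseline")
```

Relative gains are used because BLEU is reported on a 0-1 scale and chrF on a 0-100 scale, so absolute differences are not directly comparable.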
Want to dive deeper into the research?
The full academic paper is available, detailing the ablation studies, cosine similarity embeddings, and hyperparameter tuning phases.