ENGtoID — David Long

← back

ENGtoID

Seq2Seq · LSTM · PyTorch · NLP · Machine Translation · Data Preprocessing · Tokenization

Overview

ENGtoID is a sequence-to-sequence neural machine translation system that translates English sentences into Bahasa Indonesia using an LSTM encoder-decoder architecture.

The plots show the training and validation performance of the model, demonstrating dips in accuracy caused by experimental teacher forcing

The project was built as an end-to-end ML pipeline, covering dataset ingestion, cleaning, vocabulary construction, training, evaluation, and error analysis at scale (~1M sentence pairs).

Pipeline Design

Dataset download from HuggingFace (opus-Indonesian-to-English)
Regex-based cleaning and filtering of noisy sentence pairs
Vocabulary construction with ~250k combined tokens
Token → index encoding with fixed special tokens
PyTorch Dataset + DataLoader pipeline for training

Model Architecture

Encoder: LSTM encodes English sentence into fixed context vector
Decoder: autoregressive LSTM generates Indonesian tokens
Loss: CrossEntropy with padding ignored
Embedding size: 512 hidden dimensions

Training Strategy

Trained on ~1M sentence pairs using RTX 4090 (cloud)
Multi-phase learning rate schedule (from 1e-3 → 1e-5)
Teacher forcing ratio gradually reduced to improve inference robustness
Batch size adjustments to stabilise later-stage training

Key Engineering Decisions

Adam optimizer for stable convergence on large-scale dataset
Single-layer LSTM chosen over deeper stacked variants for capacity efficiency
Explicit vocabulary control to ensure reproducibility
Manual dataset pipeline for full training transparency

Evaluation

Token-level accuracy used for training monitoring
BLEU scoring used for validation evaluation
Observed plateau after early training epochs (~50%+ accuracy)
Validation set significantly smaller than training set (~0.2%)

Error Analysis

Punctuation and sentence termination errors (., ,, <EOS>)
Word order inconsistencies in generated output
Difficulty with pronouns and function words
Frequent <UNK> substitutions for rare tokens

Limitations

Whitespace tokenisation (no BPE / subword modelling)
Vocab coupling to training artifacts (non-portable checkpoints)
Weak validation split (~2k samples)
LSTM architecture limits long-range dependency modelling

Future Improvements

Replace tokenizer with BPE or SentencePiece
Upgrade to Transformer architecture with attention
Improve dataset splitting for proper generalisation metrics
Add model checkpoint migration tooling