Building a Topic Model Pipeline for Turkish Social Media Text
Working with Turkish text for topic modeling surfaces a few problems that don't show up as often in English-language NLP work. This post walks through the preprocessing decisions that mattered most and how three topic modeling approaches compared on the same dataset.
Why preprocessing takes longer than modeling
Turkish is agglutinative — a single root can carry many suffixes for case, tense, and possession. Without proper lemmatization, a topic model sees teknoloji, teknolojiyi, and teknolojisinin as three unrelated tokens instead of one concept. That fragments topics and makes them harder to interpret.
Zeyrek, a morphological analyzer for Turkish, handles most of this, but ambiguous roots sometimes resolve to the wrong lemma, especially in informal social media text full of abbreviations and slang.
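To make the fragmentation concrete, here is a minimal sketch of how the vocabulary shrinks once surface forms are mapped to lemmas. The lemma table and the names `TOY_LEMMAS`, `toy_lemmatize`, and `vocabulary_counts` are hypothetical stand-ins for illustration; a real pipeline would pass in a function that wraps the morphological analyzer instead.

```python
from collections import Counter

# Hypothetical lemma table standing in for a real morphological analyzer.
TOY_LEMMAS = {
    "teknoloji": "teknoloji",
    "teknolojiyi": "teknoloji",
    "teknolojisinin": "teknoloji",
    "telefonu": "telefon",
    "telefonlar": "telefon",
}


def toy_lemmatize(token):
    """Return the lemma from the toy table, or the token itself if unknown."""
    return TOY_LEMMAS.get(token, token)


def vocabulary_counts(tokens, lemmatize=None):
    """Count vocabulary entries, optionally after mapping tokens to lemmas."""
    if lemmatize is None:
        return Counter(tokens)
    return Counter(lemmatize(t) for t in tokens)


tokens = ["teknoloji", "teknolojiyi", "teknolojisinin", "telefonu", "telefonlar"]
raw = vocabulary_counts(tokens)
lemmatized = vocabulary_counts(tokens, toy_lemmatize)

# Five distinct surface forms collapse into two lemmas.
print(len(raw), "surface forms ->", len(lemmatized), "lemmas")
```

Comparing the size of the two vocabularies on a real sample is a quick sanity check that lemmatization is actually merging inflected forms rather than passing them through unchanged.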
Comparing LDA, NMF, and BERTopic
Three approaches were tried on the same cleaned corpus:
- LDA produced coherent topics but required careful tuning of the number of topics; too few merged distinct themes, too many split a single theme across several topics.
- NMF converged faster and gave slightly more interpretable topics for this dataset, likely because the term-document matrix was already fairly sparse after aggressive stopword removal.
- BERTopic handled short, informal posts better than either classical method, since it clusters on sentence embeddings rather than raw term frequency — but it's considerably more expensive to run at scale.
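The comparison above is qualitative. One way to put a number on "coherent" when comparing models on the same corpus is a UMass-style coherence score, which rewards topics whose top words tend to appear in the same documents. The sketch below is an assumption about how such a check could look, not the scoring used for this post; the function name `umass_coherence` and the toy documents are made up for illustration.

```python
import math


def umass_coherence(top_words, documents, eps=1.0):
    """UMass-style coherence for one topic.

    top_words: topic words ordered from most to least probable.
    documents: tokenized (and lemmatized) documents, each a list of strings.
    Higher (closer to zero) means the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            d_ij = doc_freq(top_words[i], top_words[j])
            score += math.log((d_ij + eps) / d_j)
    return score


# Toy corpus: two posts about phones, two about football.
docs = [
    ["telefon", "ekran", "batarya"],
    ["telefon", "ekran"],
    ["maç", "gol"],
    ["maç", "gol", "takım"],
]

coherent_score = umass_coherence(["telefon", "ekran"], docs)
mixed_score = umass_coherence(["telefon", "gol"], docs)
print(coherent_score > mixed_score)
```

Because the score only needs each model's top words and the tokenized corpus, the same function can be applied to the output of LDA, NMF, and BERTopic alike.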
What I'd do differently next time
I'd spend more time validating the lemmatizer's output on a sample before running it across the full corpus. A handful of systematic lemmatization errors early on quietly degraded topic quality in ways that only became obvious after several modeling iterations.
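A sample check like that can be fairly lightweight. The sketch below draws a reproducible random sample of distinct tokens, pairs each with its lemma, and flags pairs where the lemma does not share its first few characters with the surface form. The helper `sample_for_review`, its parameters, and the demo lemma table are hypothetical. The prefix rule is only a rough heuristic that rests on Turkish being suffixing; it will over-flag forms with consonant changes in the stem, so the flags are a prompt for manual review rather than a verdict.

```python
import random


def sample_for_review(tokens, lemmatize, sample_size=200, prefix_len=3, seed=0):
    """Return (token, lemma, suspicious) rows for a random sample of tokens.

    lemmatize: any callable mapping a token to its lemma (assumed interface).
    suspicious: True when the lemma is empty or its first `prefix_len`
    characters are not a prefix of the token.
    """
    distinct = sorted(set(tokens))
    rng = random.Random(seed)  # fixed seed so the review sheet is reproducible
    picked = rng.sample(distinct, min(sample_size, len(distinct)))

    rows = []
    for token in sorted(picked):
        lemma = lemmatize(token)
        suspicious = (not lemma) or not token.startswith(lemma[:prefix_len])
        rows.append((token, lemma, suspicious))
    return rows


# Demo with a hypothetical lemma table containing one deliberate error.
demo_lemmas = {
    "teknolojiyi": "teknoloji",
    "geldim": "gel",
    "kitaplar": "kalem",  # wrong on purpose
}
rows = sample_for_review(list(demo_lemmas), demo_lemmas.get, sample_size=10)
flagged = [token for token, lemma, suspicious in rows if suspicious]
print(flagged)
```

Sorting the flagged rows by how often each token occurs in the corpus would help separate one-off mistakes from the systematic errors that do the most damage.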