An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection
arXiv SecurityArchived Jun 08, 2026✓ Full text saved
arXiv:2606.06879v1 Announce Type: cross Abstract: Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5\% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset
Full text archived locally
✦ AI Summary· Claude Sonnet
Computer Science > Computation and Language
[Submitted on 5 Jun 2026]
An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection
Carl Lochstampfor, Ayan Roy
Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5\% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline addressing contamination, label mismatch, stage-direction bleed, and prompt-design failures from the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71\% accuracy and 0.7786 macro F1 compared with 78.43\% and 0.7563 for XGBoost. This directly confirms that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality life-cycle including a 12.7\times improvement in label correction rate, from 49.8\% to 3.9\%, an architectural intervention reducing virtual-kidnapping artifact rates from 67.1\% to 46.5\%, and a per-scam-type outcome analysis showing that scam categories modulate results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as: arXiv:2606.06879 [cs.CL]
(or arXiv:2606.06879v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2606.06879
Focus to learn more
Submission history
From: Ayan Roy [view email]
[v1] Fri, 5 Jun 2026 03:46:36 UTC (28 KB)
Access Paper:
HTML (experimental)
view license
Current browse context:
cs.CL
< prev | next >
new | recent | 2026-06
Change to browse by:
cs
cs.CR
References & Citations
NASA ADS
Google Scholar
Semantic Scholar
Export BibTeX Citation
Bookmark
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Demos
Related Papers
About arXivLabs
Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)