T2T: Captioning Smartphone Activities Using Mobile Traffic
Computer Science > Cryptography and Security
[Submitted on 7 Apr 2026]
Jiyu Liu, Yong Huang, Yanzhao Lu, Yun Tie, Wanqing Tu
This paper studies the generation of textual descriptions of user activities and interactions on smartphones. Our approach, which relies on encrypted mobile traffic, improves on traditional smartphone activity classification methods in model scalability and output readability. The paper addresses two obstacles to realizing this idea: the semantic gap between traffic features and smartphone activity captions, and the lack of textually annotated traffic data. To overcome these challenges, we introduce a novel smartphone activity captioning system called T2T (Traffic-to-Text). T2T consists of a flow feature encoder that converts low-level traffic characteristics into meaningful latent features and a caption decoder that yields readable transcripts of smartphone activities. In addition, T2T annotates mobile traffic with text automatically by feeding synchronized screen capture videos into the Qwen-VL-Max vision-language model, and it uses multi-stage losses for effective cross-model training. We evaluate T2T on 40,000 traffic-description pairs collected in two real-world environments, involving 8 smartphone users and 20 mobile apps. T2T achieves a BLEU-4 score of 58.1, a METEOR score of 38.3, a ROUGE-L score of 70.5, and a CIDEr score of 108.7. The quantitative and qualitative analyses show that T2T can generate semantically accurate captions that are comparable to those of the vision-language model.
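As an illustration of one of the reported metrics, the following is a minimal sketch of sentence-level ROUGE-L, which scores a generated caption by its longest common subsequence (LCS) with a reference caption. It is not the authors' evaluation code: the whitespace tokenization, the default `beta=1.0`, the function names, and the example captions are all assumptions made for illustration, and the abstract's scores appear to be reported on a 0-100 scale rather than the 0-1 scale returned here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming over one row at a time to keep memory small.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate and a reference caption.

    Assumption: lowercase whitespace tokenization; real evaluation
    toolkits may tokenize differently and use another beta.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical captions, not taken from the paper's dataset.
score = rouge_l("the user watches a video", "the user is watching a video")
```

Here the LCS is "the user a video" (4 tokens), so precision is 4/5, recall is 4/6, and with `beta=1.0` the score is their harmonic mean. A larger `beta` weights recall more heavily.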
Comments: Accepted by IEEE International Joint Conference on Neural Networks 2026
Subjects: Cryptography and Security (cs.CR)
Cite as: arXiv:2604.05642 [cs.CR]
(or arXiv:2604.05642v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2604.05642
Submission history
From: Yong Huang
[v1] Tue, 7 Apr 2026 09:44:27 UTC (3,728 KB)