A Partition Cover Approach to Tokenization

Speaker(s): LIM Jia Peng, PhD Candidate, School of Computing and Information Systems, Singapore Management University
Date: 5 November 2025, Wednesday
Time: 3:30pm – 3:50pm
Venue: Meeting room 5.1, Level 5, School of Computing and Information Systems 1, Singapore Management University, 80 Stamford Road, Singapore 178902

We look forward to seeing you at this research seminar. Please register by 3 November 2025.
About the Talk

Tokenization is the process of encoding strings into tokens drawn from a fixed-size vocabulary, and is widely used in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates tokenization as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm, GREEDTOK. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which admits a simple (1 − 1/e)-approximation greedy algorithm, GREEDWMC. Through empirical evaluations on real-world corpora, we show that GREEDTOK outperforms BPE and UNIGRAM on compression and achieves a covering score comparable to that of GREEDWMC. Finally, in extensive pre-training of two transformer-based language models with 1 billion parameters, comparing BPE and GREEDTOK as the tokenizer, GREEDTOK achieves lower bits per byte even when we control for either the total dataset proportion or the total number of training tokens.
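For readers unfamiliar with the relaxation mentioned above, the following is an illustrative sketch (not the authors' GREEDTOK or GREEDWMC code) of the classic greedy algorithm for weighted maximum coverage: given candidate sets over weighted elements and a budget k, repeatedly pick the set that covers the most uncovered weight. This simple loop is what yields the (1 − 1/e) approximation guarantee.

```python
def greedy_weighted_max_coverage(sets, weights, k):
    """Classic greedy for weighted maximum coverage (illustrative only).

    sets    -- dict mapping a set's name to the elements it covers
    weights -- dict mapping each element to its weight
    k       -- budget: number of sets allowed to be chosen
    Returns the chosen set names and the total covered weight.
    """
    covered, chosen = set(), []
    for _ in range(k):
        # Pick the set with the largest marginal gain over uncovered elements.
        best_name, best_gain = None, 0
        for name, elems in sets.items():
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # nothing left to gain
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen, sum(weights[e] for e in covered)


# Tiny hypothetical instance: with budget 2, the greedy picks "A" (gain 7),
# then "C" (gain 4), covering total weight 11.
chosen, total = greedy_weighted_max_coverage(
    {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}},
    {1: 1, 2: 1, 3: 5, 4: 2, 5: 2},
    k=2,
)
```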
This is a pre-conference talk for the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).

About the Speaker

LIM Jia Peng is a Ph.D. candidate in Computer Science at the SMU School of Computing and Information Systems, supervised by Associate Professor Hady W. Lauw. His research focuses mainly on Natural Language Processing.