
Pre-Conference Talk by LIM Jia Peng | A Partition Cover Approach to Tokenization


 


A Partition Cover Approach to Tokenization
 

Speaker:


LIM Jia Peng
PhD Candidate
School of Computing and Information Systems
Singapore Management University

Date: 5 November 2025, Wednesday

Time: 3:30pm – 3:50pm

Venue: Meeting room 5.1, Level 5, School of Computing and Information Systems 1, Singapore Management University, 80 Stamford Road, Singapore 178902

We look forward to seeing you at this research seminar.

Please register by 3 November 2025.

About the Talk

Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GREEDTOK. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple (1 − 1/e)-approximation algorithm GREEDWMC. Through empirical evaluations on real-world corpora, we show that GREEDTOK outperforms BPE and UNIGRAM on compression and achieves a covering score comparable to GREEDWMC. Finally, our extensive pre-training for two transformer-based language models with 1 billion parameters, comparing the choices of BPE and GREEDTOK as the tokenizer, shows that GREEDTOK achieves a lower bit per byte even when we control for either the total dataset proportion or total training tokens.
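For readers less familiar with the weighted maximum coverage problem mentioned above, the following is a minimal Python sketch of the textbook greedy heuristic, which is known to give a (1 − 1/e)-approximation. It is an illustration only and not the speaker's GREEDTOK or GREEDWMC implementation; the function name, the toy sets, and the weights are all hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Set


def greedy_weighted_max_coverage(
    sets: Dict[str, Set[Hashable]],
    weights: Dict[Hashable, float],
    k: int,
) -> List[str]:
    """Pick up to k sets, each time taking the set that adds the most uncovered weight."""
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    for _ in range(k):
        best_name: Optional[str] = None
        best_gain = 0.0
        for name, elements in sets.items():
            if name in chosen:
                continue
            # Marginal gain: total weight of elements not yet covered.
            gain = sum(weights[e] for e in elements - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            # No remaining set adds any weight, so stop early.
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen


# Hypothetical toy instance: element 4 is worth 5, every other element is worth 1.
toy_sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}, "D": {1, 6}}
toy_weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 5.0, 5: 1.0, 6: 1.0}

print(greedy_weighted_max_coverage(toy_sets, toy_weights, k=2))  # ['C', 'A']
```

In this toy instance the first pick is C (gain 7), after which A adds the most new weight (gain 3). In a tokenization setting the sets would loosely correspond to candidate tokens and the budget k to the vocabulary size, but that mapping is an assumption made here for intuition rather than the formulation presented in the talk.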

This is a Pre-Conference talk for The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).

About the Speaker

LIM Jia Peng is a Ph.D. candidate in Computer Science at the SMU School of Computing and Information Systems, supervised by Associate Professor Hady W. Lauw. His research mainly focuses on Natural Language Processing.