A Partition Cover Approach to Tokenization

Speaker(s): LIM Jia Peng, PhD Candidate, School of Computing and Information Systems, Singapore Management University
Date: 5 November 2025, Wednesday
Time: 3:30pm – 3:50pm
Venue: Meeting room 5.1, Level 5, School of Computing and Information Systems 1, Singapore Management University, 80 Stamford Road, Singapore 178902

We look forward to seeing you at this research seminar. Please register by 3 November 2025.
About the Talk

Tokenization is the process of encoding strings into tokens drawn from a fixed-size vocabulary, and is widely used in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates tokenization as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm, GREEDTOK. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which admits a simple (1 − 1/e)-approximation greedy algorithm, GREEDWMC. Through empirical evaluations on real-world corpora, we show that GREEDTOK outperforms BPE and UNIGRAM on compression and achieves a covering score comparable to that of GREEDWMC. Finally, in extensive pre-training of two transformer-based language models with 1 billion parameters, comparing BPE and GREEDTOK as the tokenizer, GREEDTOK achieves lower bits per byte even when we control for either the total dataset proportion or the total number of training tokens.
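For readers unfamiliar with the relaxation mentioned above, the following is an illustrative sketch (not the authors' GREEDTOK or GREEDWMC code) of the classic greedy algorithm for weighted maximum coverage: given candidate sets over weighted elements and a budget k, repeatedly pick the set that covers the most uncovered weight. This simple loop is what yields the (1 − 1/e) approximation guarantee.

```python
def greedy_weighted_max_coverage(sets, weights, k):
    """Classic greedy for weighted maximum coverage (illustrative only).

    sets    -- dict mapping a set's name to the elements it covers
    weights -- dict mapping each element to its weight
    k       -- budget: number of sets allowed to be chosen
    Returns the chosen set names and the total covered weight.
    """
    covered, chosen = set(), []
    for _ in range(k):
        # Pick the set with the largest marginal gain over uncovered elements.
        best_name, best_gain = None, 0
        for name, elems in sets.items():
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # nothing left to gain
            break
        chosen.append(best_name)
        covered |= sets[best_name]
    return chosen, sum(weights[e] for e in covered)


# Tiny hypothetical instance: with budget 2, the greedy picks "A" (gain 7),
# then "C" (gain 4), covering total weight 11.
chosen, total = greedy_weighted_max_coverage(
    {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}},
    {1: 1, 2: 1, 3: 5, 4: 2, 5: 2},
    k=2,
)
```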
This is a pre-conference talk for the Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).

About the Speaker

LIM Jia Peng is a Ph.D. candidate in Computer Science at the SMU School of Computing and Information Systems, supervised by Associate Professor Hady W. Lauw. His research focuses mainly on Natural Language Processing.