PhD Dissertation Proposal by TAN Minghuan | Chinese Idiom Understanding with Neural Network Models

Chinese Idiom Understanding with Neural Network Models

TAN Minghuan

PhD Candidate

School of Information Systems

Singapore Management University

Research Area

Data Science & Engineering

Dissertation Committee

Research Advisor

Associate Prof. Jing JIANG

Committee Members

Assistant Prof. FANG Yuan

Assistant Prof. GAO Wei

Date

11 August 2020 (Tuesday)

Time

9:30am - 10:30am

Venue

This is a virtual seminar. Please register before 12:00pm on 10 August; the Webex link will be sent to registered attendees by 5:45pm.

We look forward to seeing you at this research seminar.

About The Talk

Chinese idioms are fixed phrases that have special meanings usually derived from an ancient story. The meanings of these idioms are oftentimes not directly related to their component characters. In this dissertation, we propose to study the understanding of Chinese idioms using neural network models.

We first propose a BERT-based dual embedding model for the Chinese idiom prediction task: given a context with a missing Chinese idiom and a set of candidate idioms, the model must find the correct idiom to fill in the blank. Our method is based on the observation that part of an idiom's meaning comes from a long-range context that contains topical information, and part comes from a local context that encodes more of its syntactic usage. We use BERT to process the contextual words and match the embedding of each candidate idiom with both the hidden representation corresponding to the blank in the context and the hidden representations of all the tokens in the context through context pooling. We also propose to use two separate idiom embeddings for the two kinds of matching. Experiments on a recently released Chinese idiom cloze test dataset show that our proposed method performs better than the existing state of the art. Ablation experiments also show that both context pooling and dual embedding contribute to the performance improvement.

Observing some of the limitations of existing work, we further propose a two-stage model. During the first stage, we re-train a Chinese BERT model by masking out idioms from a large Chinese corpus with a wide coverage of idioms. During the second stage, we fine-tune the re-trained, idiom-oriented BERT on a specific idiom recommendation dataset. We evaluate this method on the ChID and CCT datasets and find that it achieves state-of-the-art results on both. Ablation studies show that both stages of training are critical for the performance gain. Finally, we list two future directions that we plan to explore for this thesis, namely sentiment analysis with idioms and explaining Chinese Chengyu recommendation models.
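
For readers who want a concrete picture of the dual-embedding matching described in the abstract, the following is a minimal sketch, not the speaker's implementation. It assumes toy hidden-state vectors in place of real BERT outputs, element-wise max pooling as the context pooling step, and a simple sum of the two matching scores; the function names (score_candidates, max_pool) and the example vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_pool(hidden_states):
    """Element-wise max over token vectors (an assumed pooling choice)."""
    return [max(column) for column in zip(*hidden_states)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_candidates(hidden_states, blank_index, local_emb, global_emb):
    """Return a probability for each candidate idiom.

    hidden_states: one vector per context token (stand-in for BERT output).
    blank_index:   position of the blank token in the context.
    local_emb:     candidate -> embedding matched against the blank state.
    global_emb:    candidate -> embedding matched against the pooled context.
    """
    blank_state = hidden_states[blank_index]
    pooled_state = max_pool(hidden_states)
    candidates = list(local_emb)
    # Assumption: the local and global matching scores are simply added.
    scores = [
        dot(local_emb[c], blank_state) + dot(global_emb[c], pooled_state)
        for c in candidates
    ]
    return dict(zip(candidates, softmax(scores)))


# Toy 3-dimensional hidden states for a 4-token context; token 2 is the blank.
hidden_states = [
    [0.1, 0.0, 0.2],
    [0.0, 0.3, 0.1],
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
]
# Two separate (hypothetical) embeddings per candidate idiom.
local_emb = {"画蛇添足": [1.0, 0.0, 0.0], "守株待兔": [0.0, 0.0, 1.0]}
global_emb = {"画蛇添足": [0.5, 0.5, 0.0], "守株待兔": [0.0, 0.1, 0.5]}

probs = score_candidates(hidden_states, 2, local_emb, global_emb)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Keeping two embedding tables lets each candidate be represented differently for local (blank-position) and global (pooled-context) matching, which is the idea the abstract attributes to the dual embedding design.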

Speaker Biography

Minghuan TAN is a PhD candidate advised by Associate Professor Jing JIANG in the School of Information Systems, Singapore Management University. His research focuses on natural language processing, and he is currently working on Chinese idiom understanding.
