
PhD Dissertation Defense by Mudiyanselage Dulanga Kaveesha WEERAKOON | Enabling and Optimizing Multi-Modal Sense-Making for Human-AI Interaction Tasks


 
 

Enabling and Optimizing Multi-Modal Sense-Making for Human-AI Interaction Tasks

Mudiyanselage Dulanga Kaveesha WEERAKOON

PhD Candidate 
School of Computing and Information Systems 
Singapore Management University 
 


Research Area

Dissertation Committee

Research Advisor

  • Archan MISRA, Professor of Computer Science, Singapore Management University

Co-Research Advisor

  • Vigneshwaran SUBBARAJU, Research Fellow, Agency for Science Technology and Research

Dissertation Committee Member

External Member

  • Nairan ZHANG, Engineering Manager, Amazon
 

Date

8 May 2024 (Wednesday)

Time

11:00am – 12:00pm

Venue

Meeting room 5.1, Level 5 
School of Computing and Information Systems 1, Singapore Management University, 80 Stamford Road, Singapore 178902

Please register by 7 May 2024.

We look forward to seeing you at this research seminar.

 

ABOUT THE TALK

Natural human-human interaction is inherently multi-modal: we use a variety of modalities, including verbal commands, gestures, facial expressions, visual cues, gaze, and even vocal nuances (e.g., tone and rhythm), to mutually convey our intent. Motivated by such human-human interaction scenarios, this thesis investigates methods to enable multi-modal sense-making for human-AI interaction tasks on resource-constrained wearable and edge devices. In particular, we consider object acquisition as an exemplary human-AI collaboration task that can benefit from naturalistic multi-modal interaction. To address it, we leverage Referring Expression Comprehension (REC), or visual grounding, models developed in the computer vision and NLP literature. Given an image along with verbal and/or gestural inputs, these models identify the bounding box of the referred object. We then introduce a number of sense-making models and optimization techniques to support low-latency inference with such models on pervasive devices.

In this thesis, our emphasis is predominantly on exploring diverse dynamic optimizations for the comprehension of task instructions. Throughout these investigations, we rely on a common guiding principle: not all instructions pose the same level of task complexity. To illustrate, consider the varying complexities introduced by different types of instructions. In a cluttered environment, identifying a target object often necessitates a more intricate execution pipeline to ensure accurate identification; users may employ a combination of language instructions and pointing gestures, which help the model disambiguate among closely situated objects. The presence of multiple modalities thus helps alleviate task complexity. Conversely, in a less cluttered space, a simple pointing gesture may suffice for object identification, requiring a less complex execution pipeline. This nuanced understanding of task complexity serves as the foundation for the dynamic optimizations explored in this thesis.
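The complexity-aware routing idea above can be illustrated with a minimal sketch. This is not code from the thesis: the function name, pipeline labels, and clutter threshold are all hypothetical, chosen only to show how scene clutter and available modalities might steer a query to a cheaper or heavier execution path.

```python
def select_pipeline(num_candidate_objects: int,
                    has_language: bool,
                    has_gesture: bool) -> str:
    """Pick an execution pipeline for one referring-expression query.

    Illustrative only: in a sparse scene a pointing gesture alone may
    suffice (cheap pipeline), while a cluttered scene with both language
    and gesture warrants the full multi-modal REC pipeline.
    """
    CLUTTER_THRESHOLD = 5  # assumed cutoff for a "cluttered" scene
    cluttered = num_candidate_objects > CLUTTER_THRESHOLD

    if not cluttered and has_gesture:
        # Sparse scene: a lightweight gesture-driven shortcut saves
        # latency and energy on wearable/edge hardware.
        return "lightweight-gesture-pipeline"
    if has_language and has_gesture:
        # Cluttered scene with both modalities: run the full pipeline,
        # using the gesture to disambiguate closely situated objects.
        return "full-multimodal-rec-pipeline"
    if has_language:
        return "language-only-rec-pipeline"
    return "gesture-only-rec-pipeline"

# Sparse scene with a pointing gesture takes the cheap path.
print(select_pipeline(3, has_language=False, has_gesture=True))
# Cluttered scene with both modalities takes the full pipeline.
print(select_pipeline(12, has_language=True, has_gesture=True))
```

The design choice being illustrated is that the router inspects only cheap-to-compute signals (a candidate-object count, which modalities are present) before committing to an expensive model, which is what makes the optimization pay off at inference time.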

This dissertation is organized into two parts. Part 1 focuses on model optimizations applied to REC models, which process a single static image along with language and, optionally, gestural modalities. Part 2 extends our methodologies to more complex scenarios involving videos as the vision input, moving beyond single static images.

 

ABOUT THE SPEAKER

Dulanga WEERAKOON is a PhD candidate at the School of Computing and Information Systems (SCIS), Singapore Management University (SMU). Under the supervision of Prof. Archan Misra and the co-supervision of Dr. Vigneshwaran Subbaraju, his research focuses on developing multi-modal human instruction understanding models that integrate language, vision, and gestural cues. He primarily aims to optimize these models for pervasive and mobile devices, emphasizing low latency and energy overheads.