
SMU Assistant Professor Zhou Pan is working to train AI models at lower cost, in less time, and with a smaller environmental footprint.
By Alvin Lee
SMU Office of Research – AI tools’ capabilities have expanded beyond many people’s expectations, with software such as DALL-E (image generation), Cursor (AI assistant for coding), and Claude (information processing) delivering real-world impact that would have been unimaginable merely five years ago. Among the better-known AI models, ChatGPT and Gemini continue to iterate and improve, pushing AI adoption rates ever higher.
These popular AI models are trained using optimisers, whose function and high costs SMU Assistant Professor of Computer Science Zhou Pan explained to the Office of Research in a 2024 article. In his latest research project, which clinched a Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2 grant, Professor Zhou examined optimisers such as SGD (Stochastic Gradient Descent) and AdamW (Adaptive Moment Estimation with decoupled Weight decay) that are used to train these AI models, and identified three main issues that contribute to high training costs:
- Slow convergence speed caused by inaccurate stochastic gradients;
- High communication costs, i.e., the large amounts of data that must be shared among GPUs during training to synchronise stochastic gradients; and
- Huge GPU memory costs due to maintaining optimisation states such as the first-order moment in Adam (see the sketch after this list).
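To make these costs concrete, here is a minimal NumPy sketch of an AdamW-style update, written for this article rather than taken from the FoCo project; the function and variable names are illustrative. It shows the extra per-parameter state (the moment buffers) behind the memory cost, and notes where gradient synchronisation would create communication cost in multi-GPU training.

```python
import numpy as np

def adamw_step(params, grads, m, v, t, lr=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW-style update. m and v are persistent optimiser state:
    for a model with N parameters they add roughly 2N extra floats of
    GPU memory on top of the parameters and gradients themselves."""
    m = beta1 * m + (1 - beta1) * grads          # first-order moment
    v = beta2 * v + (1 - beta2) * grads ** 2     # second-order moment
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * (m_hat / (np.sqrt(v_hat) + eps)
                            + weight_decay * params)  # decoupled weight decay
    return params, m, v

# In multi-GPU training, `grads` would first be averaged across devices
# (e.g., via an all-reduce), which is the communication cost cited above.
params = np.zeros(4)
m, v = np.zeros_like(params), np.zeros_like(params)
grads = np.array([0.1, -0.2, 0.3, -0.4])
params, m, v = adamw_step(params, grads, m, v, t=1)
```

For a model with billions of parameters, those two moment buffers alone can occupy as much GPU memory as the model itself, which is why optimiser state is a prime target for efficiency research.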
Not FOMO but FoCo
Professor Zhou’s project, “FoCo: Fast, Communication- and Memory-Efficient Optimizers for Training Large AI Models”, aims to address those concerns with the following objectives:
- Design an advanced optimiser that converges faster than popular ones such as Adam and AdamW when training large AI models;
- Develop novel approaches to reduce the communication cost of optimisers, particularly the proposed optimisers in objective 1; and
- Reduce the GPU memory cost of optimisers, particularly those in objectives 1 and 2, for large-scale training.
“Improvements in one aspect positively affect others,” Professor Zhou elaborates. “For instance, reducing memory costs enables larger minibatches that can reduce gradient noise and typically accelerate optimisers. Given the widespread adoption and immense potential of large AI models across various fields, along with their current challenges like high training costs, lengthy development cycles, significant electricity consumption and carbon dioxide emissions, the study of FoCo is more necessary and urgent than ever before.”
Much of the work will involve reducing ‘gradient noise’. The ‘gradient’ tells the AI how to change its parameters to reach ‘convergence’, while ‘gradient noise’ refers to the random fluctuations in the gradient caused by using a small sample of training data points, rather than the entire training set, to compute the gradient at each iteration. The goal is to reach convergence as soon as possible, Professor Zhou explains.
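As a toy illustration of gradient noise (a hypothetical example, not from the project), the sketch below estimates the gradient of a simple quadratic loss from minibatches of different sizes; the spread of the estimates around the full-batch gradient shrinks as the minibatch grows, which is why larger minibatches reduce gradient noise.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)          # toy dataset
w = 2.0                              # current parameter value

# Loss per point: (w - x_i)^2, so the per-point gradient is 2 * (w - x_i).
full_grad = np.mean(2 * (w - x))     # the "true" full-batch gradient

for batch_size in (8, 64, 512):
    # Draw many minibatch gradients and measure their spread (the noise).
    estimates = [np.mean(2 * (w - rng.choice(x, batch_size)))
                 for _ in range(1_000)]
    print(f"batch={batch_size:4d}  mean={np.mean(estimates):+.3f}  "
          f"std={np.std(estimates):.3f}  (full-batch grad={full_grad:+.3f})")
```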
“When training an AI model, we iteratively update its parameters to minimise the training loss,” says Professor Zhou, referring to mistakes made, such as identifying a handwritten ‘8’ as ‘B’. “In each training step, the optimiser adjusts the parameters but it cannot immediately reach the optimal values – instead, it gradually approaches them over many iterations. Convergence occurs when the parameters stabilise, meaning further updates no longer significantly improve the model (the training loss cannot be reduced).
“The convergence speed refers to the number of training steps required to reach this stable state. If the required training step number is big, then the convergence speed is slow; otherwise, the speed is fast. By improving the optimiser’s update strategy, we can reduce the number of steps needed, accelerating training without sacrificing model performance.”
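The following toy sketch, again illustrative rather than from the project, makes the idea of convergence speed concrete: gradient descent on a one-parameter loss runs until the updates stop changing the parameter meaningfully, and the number of steps taken is the convergence speed.

```python
# Toy convergence demo: minimise loss(w) = (w - 3)^2 with gradient descent.
# Fewer steps to reach a stable w means faster convergence.

def train(lr, w=0.0, tol=1e-6, max_steps=10_000):
    for step in range(1, max_steps + 1):
        grad = 2 * (w - 3)           # gradient of (w - 3)^2
        update = lr * grad
        w -= update
        if abs(update) < tol:        # parameter has stabilised: converged
            return step, w
    return max_steps, w

for lr in (0.01, 0.1, 0.5):
    steps, w = train(lr)
    print(f"lr={lr:<5} converged in {steps:5d} steps, w={w:.4f}")
```

A better update strategy plays the same role as a well-chosen learning rate here: it reaches the stable state in fewer steps without degrading the final result.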
Artificial intelligence, real impact
FoCo-derived AI improvements can have significant real-world benefits in dynamic environments, such as self-driving cars. “These systems can be updated or fine-tuned more frequently and cost-effectively, leading to quicker deployment of safer, more responsive, and contextually aware AI. Moreover, a smaller carbon footprint aligns with ESG goals for tech companies,” says Professor Zhou.
Additionally, FoCo could significantly lower training costs and reduce resource demands such as memory and GPU usage, while its optimisations would democratise access to large AI models, observes the computer scientist. “Smaller companies, startups, or academic labs with limited computing infrastructure will be better positioned to train or fine-tune state-of-the-art models without prohibitive investment in hardware.”
He adds: “This research is poised to shift how the AI community approaches large model training – from relying solely on hardware improvements to embracing algorithmic efficiency. For models like GPT and LLaMA, it could enable more sustainable scaling, continuous training, and faster experimentation. Moreover, FoCo’s innovations may inspire new directions in optimiser design, setting a benchmark for how future foundation models are trained globally – faster, greener, and more economically.”