This paper proposes a novel active inference based approach for LLM inference task offloading and resource allocation in cloud-edge computing that outperforms mainstream DRL methods, achieves higher data utilization efficiency, and adapts better to changing task load scenarios.
With the growing popularity of and demand for large language model applications on mobile devices, resource-limited mobile terminals struggle to run large-model inference tasks efficiently. Traditional deep reinforcement learning (DRL) based approaches have been used to offload large language model (LLM) inference tasks to servers. However, existing DRL solutions suffer from data inefficiency, insensitivity to latency requirements, and poor adaptability to task load variations, all of which degrade the performance of LLM services. In this paper, we propose a novel active inference based approach for LLM inference task offloading and resource allocation in cloud-edge computing. Extensive simulation results show that our method outperforms mainstream DRL baselines, achieves higher data utilization efficiency, and adapts better to changing task load scenarios.
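As a rough illustration of the decision principle named above, and not the paper's actual implementation, the minimal sketch below shows a toy discrete active-inference agent that picks an offloading target by minimizing expected free energy (risk plus ambiguity). The action set, hidden load states, likelihood matrices, and preference values are all illustrative assumptions introduced here for exposition.

```python
# Minimal sketch (illustrative only, not the paper's method): a discrete
# active-inference agent choosing an offloading action by minimizing
# expected free energy. All state spaces and probabilities are assumed.
import numpy as np

ACTIONS = ["local", "edge", "cloud"]        # assumed offloading targets
STATES = ["low_load", "high_load"]          # assumed hidden server-load states
OUTCOMES = ["low_latency", "high_latency"]  # assumed observable outcomes

# Likelihood P(outcome | state, action): rows = states, cols = outcomes (assumed values).
A = {
    "local": np.array([[0.4, 0.6], [0.4, 0.6]]),   # local run: slow, load-independent
    "edge":  np.array([[0.9, 0.1], [0.3, 0.7]]),   # edge: fast unless overloaded
    "cloud": np.array([[0.7, 0.3], [0.6, 0.4]]),   # cloud: moderate, fairly stable
}

# Preference distribution over outcomes: the agent "prefers" low latency.
C = np.array([0.95, 0.05])

def expected_free_energy(q_s: np.ndarray, action: str) -> float:
    """G(action) = risk (KL[predicted outcomes || preferences]) + ambiguity."""
    lik = A[action]
    q_o = q_s @ lik                                  # predicted outcome distribution
    risk = np.sum(q_o * (np.log(q_o + 1e-12) - np.log(C + 1e-12)))
    ambiguity = -np.sum(q_s * np.sum(lik * np.log(lik + 1e-12), axis=1))
    return float(risk + ambiguity)

def select_action(q_s: np.ndarray) -> str:
    """Pick the offloading action with the lowest expected free energy."""
    g = {a: expected_free_energy(q_s, a) for a in ACTIONS}
    return min(g, key=g.get)

if __name__ == "__main__":
    belief_over_load = np.array([0.3, 0.7])          # agent believes load is likely high
    print(select_action(belief_over_load))           # prints "cloud" under these numbers
```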