Achieving and maintaining cooperation between agents to accomplish a common objective
is one of the central goals of Multi-Agent Reinforcement Learning (MARL).
Nevertheless, in many real-world scenarios, separately trained and specialized agents
are deployed into a shared environment, or the environment requires multiple objectives
to be achieved by different coexisting parties. These variations among specialties and
objectives are likely to cause mixed motives that eventually result in a social dilemma where
all the parties are at a loss. In order to resolve this issue, we propose the Incentive Q-Flow (IQ-Flow)
algorithm, which modifies the system’s reward setup with an incentive regulator agent such that the
cooperative policy also corresponds to the self-interested policy for the agents. Unlike the existing
methods that learn to incentivize self-interested agents, IQ-Flow does not make any assumptions about
agents’ policies or learning algorithms, which enables the generalization of the developed framework to
a wider array of applications. IQ-Flow performs an offline evaluation of the optimality of the learned
policies using the data provided by other agents to determine cooperative and self-interested policies.
Next, IQ-Flow uses meta-gradient learning to estimate how the policy evaluation changes with the given
incentives and modifies the incentives such that the greedy policies for the cooperative and self-interested
objectives yield the same actions. We present the operational characteristics of IQ-Flow in Iterated Matrix Games.
We demonstrate that IQ-Flow outperforms the state-of-the-art incentive design algorithm in Escape Room and
2-Player Cleanup environments. We further demonstrate that the pretrained IQ-Flow mechanism significantly
outperforms the shared reward setup in the 2-Player Cleanup environment.
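The following is a minimal, illustrative sketch of the meta-gradient incentive idea summarized above; it is not the authors' implementation. It assumes a toy single-transition setting with tabular critics, a small neural incentive regulator, and a soft KL surrogate for "the greedy policies yield the same actions" (all names, shapes, and the surrogate loss are assumptions made for illustration).

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes; real environments would use learned state encoders instead.
N_STATES, N_ACTIONS = 4, 2

class IncentiveRegulator(nn.Module):
    # Maps a state to a per-action incentive added on top of the environment reward.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_STATES, 32), nn.ReLU(),
                                 nn.Linear(32, N_ACTIONS))
    def forward(self, state):
        return self.net(state)

def q_update(q_table, s, a, reward, s_next, gamma=0.99, lr=0.5):
    # One differentiable Q-learning-style update; gradients flow through `reward`.
    target = reward + gamma * q_table[s_next].max()
    updated = q_table.clone()
    updated[s, a] = (1.0 - lr) * q_table[s, a] + lr * target
    return updated

# One placeholder transition standing in for offline data provided by the agents.
state_onehot = torch.eye(N_STATES)[0:1]
s, a, s_next = 0, 1, 2
r_self = torch.tensor(0.5)   # agent's own (self-interested) reward
r_coop = torch.tensor(1.0)   # system-level (cooperative) reward

regulator = IncentiveRegulator()
optimizer = torch.optim.Adam(regulator.parameters(), lr=1e-3)

q_self = torch.zeros(N_STATES, N_ACTIONS)  # self-interested critic
q_coop = torch.zeros(N_STATES, N_ACTIONS)  # cooperative critic

for _ in range(200):
    incentive = regulator(state_onehot)[0, a]

    # Inner step: evaluate both objectives; the incentive shapes only the
    # self-interested reward.
    q_self_new = q_update(q_self, s, a, r_self + incentive, s_next)
    q_coop_new = q_update(q_coop, s, a, r_coop, s_next)

    # Outer (meta) step: push the self-interested greedy policy toward the
    # cooperative one via a soft KL surrogate (an assumption of this sketch).
    p_self = F.softmax(q_self_new[s], dim=-1)
    p_coop = F.softmax(q_coop_new[s], dim=-1).detach()
    loss = F.kl_div(p_self.log(), p_coop, reduction="sum")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the (detached) inner updates as the running offline critics.
    q_self, q_coop = q_self_new.detach(), q_coop_new.detach()

The key design point mirrored here is that the regulator never queries the agents' learning algorithms: it only differentiates through its own estimate of how the policy evaluation would respond to the incentives it emits.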