| For any smart agent, a chief question is how to maximize reward in a system also inhabited by other smart agents. In many settings this is particularly difficult, as the long-term best outcome is to cooperate with other agents, but the immediate best action is to not do so. However, if none cooperate, agents receive the worst possible payoff. This problem is known as a Social Dilemma and modeled in various Iterated Prisoner's Dilemma games. In systems with such payoffs, we want individual agents to learn to play the most rewarding sequence of actions. Learning is important here, because this sequence may not be known beforehand (or there may be multiple such sequences), and the sequence may depend on the behavior of other agents: for instance rather than cooperate, it may be possible to exploit other agents. In more complicated Social Dilemma's, there may also be multiple, alternate cooperation modes. We propose the development of reinforcement learning algorithms for noncooperative multi-agent systems. Game theory shows that cooperation can be enforced in Social Dilemma's as a sub-game perfect equilibrium when agents will credibly punish other agents for undesired behavior: a form of teaching. The proposed learning algorithms explicitly model the ``teaching'' effect an agent can have on other agents by learning about the other agent(s), and then optimizes an agent's expected future reward. We propose to scale this approach by allowing agents to explicitly punish each other. Additionally, we will derive performance guarantees and conjectural equilibrium results. |