Q-learning agents in a Cournot oligopoly model

https://doi.org/10.1016/j.jedc.2008.01.003

Abstract

Q-learning is a reinforcement learning model from the field of artificial intelligence. We study the use of Q-learning for modeling the learning behavior of firms in repeated Cournot oligopoly games. Based on computer simulations, we show that Q-learning firms generally learn to collude with each other, although full collusion usually does not emerge. We also present some analytical results. These results provide insight into the underlying mechanism that causes collusive behavior to emerge. Q-learning is one of the few learning models available that can explain the emergence of collusive behavior in settings in which there is no punishment mechanism and no possibility for explicit communication between firms.

Introduction

In this paper, we model the learning behavior of firms in repeated Cournot oligopoly games using Q-learning. Q-learning is a reinforcement learning model of agent behavior originally developed in the field of artificial intelligence (Watkins, 1989). The model is based on two assumptions. First, for each possible strategy an agent is assumed to remember some value indicating that strategy's performance. This value, referred to as a Q-value, is determined based on the agent's past experience with the strategy. Basically, the Q-value of a strategy is calculated as a weighted average of the payoffs obtained from the strategy in the past, where more recent payoffs are given greater weight. The second assumption of Q-learning states that, based on the Q-values, an agent probabilistically chooses which action to play. A logit model is used to describe the agent's choice behavior. The assumptions made by Q-learning can also be found in other reinforcement learning models. The models of Sarin and Vahid, 1999, Sarin and Vahid, 2001 and Kirman and Vriend (2001) use ideas similar to Q-values, while the models of, for example, Mookherjee and Sopher (1997) and Camerer and Ho (1999) use a logit model to describe the way in which an agent chooses an action. Q-learning distinguishes itself from other reinforcement learning models in that it combines these two elements in a single model. In the economic literature, the combination of these elements has, to our knowledge, not been studied before.
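To make the two building blocks concrete, the following minimal Python sketch combines a recency-weighted Q-value update with a logit (Boltzmann) choice rule. The function names, the learning-rate and temperature parameters, and the example values are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

def update_q(q_values, action, payoff, learning_rate=0.1):
    """Recency-weighted average of past payoffs: recent payoffs get more weight."""
    q_values[action] = (1.0 - learning_rate) * q_values[action] + learning_rate * payoff

def logit_choice(q_values, temperature=1.0):
    """Choose an action with probability proportional to exp(Q / temperature)."""
    weights = [math.exp(q / temperature) for q in q_values]
    r = random.random() * sum(weights)
    for action, weight in enumerate(weights):
        r -= weight
        if r <= 0.0:
            return action
    return len(q_values) - 1

# Example: three candidate strategies, one learning step.
q = [0.0, 0.0, 0.0]
a = logit_choice(q)
update_q(q, a, payoff=10.0)
```

A lower temperature makes the choice rule more exploitative (nearly always the highest-Q action), while a higher temperature makes it more exploratory.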

In this paper, we show that the use of Q-learning for modeling the learning behavior of firms in repeated Cournot oligopoly games generally leads to collusive behavior.1 This is quite a remarkable result, since most Q-learning firms that we study do not have the ability to remember what happened in previous stage games. The firms therefore cannot use trigger strategies, that is, they cannot threaten to punish each other in case of non-collusive behavior. There is also no possibility for explicit communication between firms. However, despite the absence of punishment and communication mechanisms, collusive behavior prevails among firms. Apart from Q-learning, there are almost no models of the learning behavior of individual economic agents that predict collusive behavior in Cournot games. The only model of which we are aware is the so-called trial-and-error model studied by Huck et al. (2004a). Yet, experimental results (for an overview, see Huck et al., 2004b) indicate that with two firms collusive behavior is quite common in Cournot games. Q-learning is one of the few models that does indeed predict this kind of behavior.

Models of the learning behavior of economic agents are studied both in agent-based computational economics (e.g., Tesfatsion, 2003, Tesfatsion, 2006) and in game theory (e.g., Fudenberg and Levine, 1998). In agent-based computational economics the methodology of computer simulation is typically adopted, whereas in game theory the analytical methodology is predominant. It seems rather difficult to obtain analytical results for the behavior of multiple Q-learning agents interacting with each other in a strategic setting. In the field of artificial intelligence, it has been proven that under certain conditions a single Q-learning agent operating in a fixed environment learns to behave optimally (Watkins and Dayan, 1992). However, for settings with multiple agents learning simultaneously almost no analytical results are available. Given the difficulty of obtaining analytical results, most of the results that we present in this paper are based on computer simulations. Analytical results are provided only for the special case in which Q-learning firms in a Cournot duopoly game can choose between exactly two production levels, the production level of the Nash equilibrium and some other, lower production level. The analytical results turn out to be useful for obtaining some basic intuition why Q-learning firms may learn to collude with each other.

The remainder of this paper is organized as follows. First, in Sections 2 and 3, we provide an overview of related research and we introduce Q-learning. Then, in Section 4, we discuss the Cournot oligopoly model with which we are concerned throughout the paper. We consider our computer simulations in Sections 5 and 6, in which we discuss the simulation setup and present the simulation results. We provide some analytical results in Section 7. Finally, in Section 8, we draw conclusions.

Section snippets

Related research

The literature on modeling the learning behavior of economic agents is quite large. Overviews of this literature are provided by Brenner (2006) and Duffy (2006). One can distinguish between individual learning models and social learning models (Vriend, 2000). In individual learning models an agent learns exclusively from its own experience, whereas in social learning models an agent also learns from the experience of other agents. Below, we first discuss the modeling of individual learning

Q-learning

In this paper, Q-learning is applied as follows. An agent plays a repeated game. At the beginning of the stage game in period t, the agent's memory is in some state st. This state may be determined by, for example, the actions played by the agent and its opponents in the stage game in period t-1. Taking into account the state of its memory, the agent chooses to play some action at. The choice of an action is made probabilistically based on the so-called Q-values of the agent. Playing action at
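As a rough illustration of the state-based variant described here, the sketch below applies the standard Q-learning update (Watkins, 1989), in which the Q-value of the state-action pair is moved toward the received payoff plus the discounted value of the best action in the next state. The notation (alpha for the learning rate, gamma for the discount factor) and the example states are assumptions made for illustration.

```python
def q_update(Q, state, action, payoff, next_state, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward the payoff plus the discounted value of the
    best action available in the next state."""
    best_next = max(Q[next_state].values())
    Q[state][action] = (1.0 - alpha) * Q[state][action] + alpha * (payoff + gamma * best_next)

# Example: the state is the opponent's previous action, actions are "low"/"high" output.
Q = {"low": {"low": 0.0, "high": 0.0}, "high": {"low": 0.0, "high": 0.0}}
q_update(Q, state="low", action="high", payoff=12.0, next_state="high")
```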

Cournot oligopoly model

We consider a simple Cournot oligopoly model with the following characteristics: the number of firms is fixed, firms produce perfect substitutes, the demand function is linear, firms have identical cost functions, and marginal cost is constant. The inverse demand function is given by $p = \max\left(u - v \sum_{i=1}^{n} q_i,\, 0\right)$, where n denotes the number of firms, p denotes the market price, $q_i$ denotes firm i's production level, and $u>0$ and $v>0$ denote two parameters. Firm i's total cost equals $c_i = w q_i$ for $i = 1, \ldots, n$, where the
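For reference, the sketch below evaluates this model numerically and computes the standard benchmarks for symmetric linear Cournot competition with constant marginal cost: the per-firm Cournot-Nash quantity $(u-w)/(v(n+1))$ and the per-firm joint-profit-maximizing quantity $(u-w)/(2vn)$. The parameter values (u=40, v=1, w=4) are illustrative assumptions, not necessarily those used in the paper.

```python
def price(quantities, u=40.0, v=1.0):
    """Inverse demand: p = max(u - v * sum of quantities, 0)."""
    return max(u - v * sum(quantities), 0.0)

def profit(i, quantities, u=40.0, v=1.0, w=4.0):
    """Firm i's profit with constant marginal cost w."""
    return (price(quantities, u, v) - w) * quantities[i]

def nash_quantity(n, u=40.0, v=1.0, w=4.0):
    """Per-firm Cournot-Nash quantity in the symmetric linear model."""
    return (u - w) / (v * (n + 1))

def collusive_quantity(n, u=40.0, v=1.0, w=4.0):
    """Per-firm quantity that maximizes joint profit (monopoly output shared equally)."""
    return (u - w) / (2 * v * n)

n = 2
print(nash_quantity(n), collusive_quantity(n))  # 12.0 and 9.0 with these parameter values
```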

Setup of the computer simulations

In this paper, we focus on the long-run behavior of Q-learning agents when the probability of experimentation approaches zero. In this respect, the approach that we take is similar to the approach that is typically taken to analyze evolutionary game-theoretic learning models (e.g., Vega-Redondo, 1997, Alós-Ferrer, 2004, Bergin and Bernhardt, 2005). We further focus on settings in which the learning behavior of all agents is modeled using Q-learning. An alternative would be to consider settings
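A hedged sketch of a simulation of this kind is given below: memory-free Q-learning firms repeatedly play the Cournot game while the temperature of the logit rule decays toward zero, so that experimentation slowly vanishes. The quantity grid, cooling schedule, and all parameter values are illustrative assumptions rather than the paper's exact setup.

```python
import math
import random

def boltzmann(q_values, tau):
    """Logit choice; the temperature tau controls the amount of experimentation."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    r = random.random() * sum(weights)
    for action, weight in enumerate(weights):
        r -= weight
        if r <= 0.0:
            return action
    return len(q_values) - 1

def simulate(n_firms=2, n_actions=10, periods=100_000, alpha=0.1,
             u=40.0, v=1.0, w=4.0, tau_start=50.0, tau_end=0.01):
    """Memory-free Q-learning firms in a repeated Cournot game."""
    # Evenly spaced quantity grid up to the competitive (price = marginal cost) per-firm output.
    grid = [(k + 1) * (u - w) / (v * n_firms * n_actions) for k in range(n_actions)]
    Q = [[0.0] * n_actions for _ in range(n_firms)]
    for t in range(periods):
        tau = tau_start * (tau_end / tau_start) ** (t / periods)  # exponential cooling
        actions = [boltzmann(Q[i], tau) for i in range(n_firms)]
        quantities = [grid[a] for a in actions]
        price = max(u - v * sum(quantities), 0.0)
        for i in range(n_firms):
            payoff = (price - w) * quantities[i]
            Q[i][actions[i]] = (1.0 - alpha) * Q[i][actions[i]] + alpha * payoff
    # Report each firm's greedy (highest-Q) quantity at the end of the run.
    return [grid[max(range(n_actions), key=lambda a: Q[i][a])] for i in range(n_firms)]

print(simulate())
```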

Results of the computer simulations

In this section, we present the results of the computer simulations that we performed. We first consider the simulations with firms that did not have a memory, and we then consider the simulations with firms that did have a memory.

Simulations with firms that did not have a memory were performed for various values for both the number of firms n and the learning rate α. For each combination of values for n and α, Table 1 shows firms’ joint quantity produced and joint profit. Since we focus on the

Analytical results

In the previous section, we presented simulation results showing that the use of Q-learning for modeling the learning behavior of firms in a Cournot oligopoly game generally leads to collusive behavior. This turned out to be the case not only for firms with a memory but also for firms without a memory. This is quite remarkable, since firms without a memory cannot use trigger strategies, that is, they cannot threaten to punish each other in case of non-collusive behavior. So, collusive behavior
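To build intuition for the two-action duopoly case, the following sketch (with illustrative parameters u=40, v=1, w=4 and two firms) tabulates stage-game profits when each firm chooses between the Cournot-Nash quantity and the lower, joint-profit-maximizing quantity. With these numbers the stage game has the familiar prisoner's-dilemma structure: mutual collusion beats mutual Nash play, but each firm gains by unilaterally producing the Nash quantity.

```python
def profit(own_q, other_q, u=40.0, v=1.0, w=4.0):
    """Stage-game profit of one firm in a symmetric linear Cournot duopoly."""
    price = max(u - v * (own_q + other_q), 0.0)
    return (price - w) * own_q

q_nash = (40.0 - 4.0) / 3.0  # 12: per-firm Cournot-Nash quantity
q_coll = (40.0 - 4.0) / 4.0  #  9: per-firm joint-profit-maximizing quantity

for own in (q_nash, q_coll):
    for other in (q_nash, q_coll):
        print(f"own={own:4.1f}, other={other:4.1f}: profit={profit(own, other):6.1f}")
# Mutual Nash play gives 144, mutual collusion gives 162,
# and deviating to the Nash quantity against a colluding rival gives 180.
```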

Conclusions

We have studied the use of Q-learning for modeling the learning behavior of firms in repeated Cournot oligopoly games. Q-learning, which belongs to the family of reinforcement learning models, combines two elements that, individually, can also be found in other models of the reinforcement learning type. On the one hand, the way in which the performance of a strategy is measured is similar to the way in which this is done in the models of Sarin and Vahid, 1999, Sarin and Vahid, 2001 and Kirman

Acknowledgments

We would like to thank Maarten Janssen, Joost van Rosmalen, three anonymous referees, the associate editor, and the editor for their comments. These comments have significantly improved the paper.
