UTILIZATION OF DEEP REINFORCEMENT LEARNING FOR DISCRETE RESOURCE ALLOCATION PROBLEM IN PROJECT MANAGEMENT – A SIMULATION EXPERIMENT

Abstract: This paper tests the applicability of deep reinforcement learning (DRL) algorithms to simulated problems of constrained discrete and online resource allocation in project management. DRL is an extensively researched method in various domains, although no similar case study was found when writing this paper. The hypothesis was that a carefully tuned RL agent could outperform an optimisation-based solution. The RL agents (VPG, A2C, and PPO) were compared against a random baseline and a classic constrained-optimisation agent in three simulated environments of varying difficulty; the PPO agent achieved the best results in the more uncertain settings, while the optimisation agent remained superior in the easiest one.


Introduction
According to selected definitions, project management is a set of actions, including the "application of knowledge, skills, tools, and techniques to project activities to meet project requirements. Project management refers to guiding project work to deliver the intended outcomes" (Institute, 2021; Manikantan and Gurusamy, 2016). Work on each project is usually divided into several phases: conceptualisation, definition, planning, execution, and termination (Schwindt, 2010). Optimal planning, including estimation, scheduling, and constrained allocation, has been addressed in many research studies because project activities are subject to precedence, resource, and time constraints on limited resources (Selaru, 2012). The need for better precision and speed in decision-making has generated significant demand for automated systems, which have slowly started to replace older solutions (Gupta, Modgil, Bhattacharyya, and Bose, 2022). Numerous recommendations for so-called data-driven companies have been created, including the need for systems for predictive modelling, forecasting, optimisation, and planning (Anderson, 2015; Sharda, Delen, and Turban, 2020). In particular, in fields like Operations Research (OR), which focuses on improving day-to-day company decisions (Gupta et al., 2022), fusion with machine learning and big data has become more critical as the volume and velocity of data grow every year (Bhimani and Willcocks, 2014; Duan, Edwards, and Dwivedi, 2019).
This paper aimed to assess the applicability of a rapidly growing family of machine learning algorithms, called Reinforcement Learning (RL), to the problem of constrained resource allocation in project management under strong environmental uncertainty. The research hypothesis of the study is that a carefully chosen RL-based agent can outperform a classic constrained-optimisation approach in a simulated environment.
The paper is structured as follows: Sections 1.1 and 1.2 give an overview of the existing research on resource allocation problems and the applicability of RL techniques. Section 1.3 presents some theoretical background necessary to understand RL mechanics. Part 2 describes the experimental setup: Sections 2.1 and 2.2 present the simulator design and the possible actions, while 2.3 and 2.4 give more detail on the objective function and the experiment variants. Section 2.5 focuses on the algorithms selected for the comparative study. Part 3 presents a detailed breakdown of the results and their analysis.

Previous work on optimisation techniques for resource allocation
Optimisation has been a research subject for over a century and laid the foundation for fields such as operations research (Ackoff, 1956). It is one of the critical tools of decision-making, as it helps managers to choose the most promising options out of the available alternatives, often over a long-term horizon (Schwindt, 2010). A review and meta-analysis of numerous publications from the late 1990s up to the present reveals an increasing interest in the design of optimisation systems with meta-heuristics, evolutionary algorithms, and parallel computing that operate under strong uncertainty (Chiang and Lin, 2020; Farhang Moghaddam, 2019).
Existing solutions to such problems include traditional linear programming (mostly mixed-integer constraint optimisation) (Islam, 2011; Kane and Tissier, 2012), entropy minimisation (Ye, Shi, Li, and Shi, 2014), and combinatorial multi-armed bandits (Zuo and Joe-Wong, 2021). The applicability of swarm and evolutionary algorithms, constraint satisfaction, and linear optimisation has been assessed in key areas such as human resource allocation (Chiang and Lin, 2020).

Previous work on reinforcement learning applications
Over the years, multiple approaches to utilising reinforcement learning for resource allocation have been proposed. Some of them were based on the fundamental principles of Markov Decision Processes (MDP) and their use in Supply Chain Management (Giannoccaro and Pontrandolfo, 2002); others directly followed early RL formulations, such as Q-learning for Business Process Management (Huang, van der Aalst, Lu, and Duan, 2011). Numerous publications describe attempts to apply different RL tools to constrained task scheduling and packing problems (Jędrzejowicz and Ratajczak-Ropel, 2013; Mao, Alizadeh, Menache, and Kandula, 2016) and logistics (Yan et al., 2021; Yuan, Li, and Ji, 2021).
RL solutions for production planning and dynamic worker scheduling have been reviewed extensively (Koulinas, Xanthopoulos, Kiatipis, and Koulouriotis, 2018; Shyalika, Silva, and Karunananda, 2020); these reviews highlight the benefits and drawbacks of different algorithm families, assessed with criteria such as convergence speed and sampling efficiency. Production planning differs from the problems presented in this study, but the initial choice of RL algorithms was based on previous research in this area and the existing recommendations (Shyalika et al., 2020; Yu, Zhang, Jiang, Yang, and Shang, 2021).
RL is widely used in technical constrained resource allocation problems, e.g. server load balancing, channel allocation in telecommunications, and network management (Li et al., 2018; Xu et al., 2021; Ye, Li, and Juang, 2019). Most of these approaches utilise different Q-learning variants.
To date, no publications have been found that focus directly on RL for sequential resource allocation in project management studied through simulation experiments.

Reinforcement learning overview
Reinforcement learning (RL) is a subfield of machine learning focused on an autonomous agent interacting with an environment. The agent receives rewards by acting (initially at random) in the environment and gradually improves its performance (Schulman, 2016). The goal is to learn the actions that maximise the expected cumulative reward over time (Mousavi, Schukat, and Howley, 2018). Typically, reinforcement learning problems are modelled as an MDP, formalised as follows (Arulkumaran, Deisenroth, Brundage, and Bharath, 2017):
1. S is a set of states that describe the environment.
2. A is a set of actions that can be performed in the environment.
3. T(s_{t+1} | s_t, a_t) is the transition dynamics that describes the consequences of executing action a_t in state s_t. This part can be probabilistic, leading to different s_{t+1} values.
4. R is a set of rewards obtained in transitions.
5. A sequence of states, actions, and rewards collected during an episode is called a trajectory τ (rollout).
6. All rewards accumulated during the trajectory rollout are called the return, denoted as R. The agent attempts to account for the consequences of its actions while discounting future rewards by a factor γ. The return is formalised as follows:
$$R = \sum_{t=0}^{N} \gamma^{t} r_{t}.$$
The goal of any agent is to define a policy π that represents the agent's behaviour during interaction. It is a strategy function that maps encountered states s_t into actions, denoted as (Sutton, Bach, and Barto, 2018):
$$\pi : S \rightarrow A.$$
An agent should learn a policy that achieves the maximum return from all the states:
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}\left[ R \mid \pi \right].$$
During the learning process, the agents use additional functions that help them construct the policy. The policy action value (Mnih et al., 2016; Sutton et al., 2018),
$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R \mid s_t, a_t \right],$$
describes the expected return from executing action a_t while being in state s_t and is often called the "q-function" or "q-value". It can be expressed in a recursive form, known as the Bellman equation, which laid the foundation for more advanced RL algorithms (Arulkumaran et al., 2017; Bellman, 1954):
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma \, Q^{\pi}(s_{t+1}, \pi(s_{t+1})) \right].$$
This equation can be interpreted as the relationship between the current and future Q-values: the current value consists of the actual reward and the discounted future Q-value, obtained by following policy π from the next state onwards.
The value function of state s_t under policy π describes the expected return when following π from state s_t:
$$V^{\pi}(s_t) = \mathbb{E}\left[ R \mid s_t \right].$$
In practice, there are multiple paradigms and approaches for learning such mappings. Two prominent families of algorithms are on-policy and off-policy learning. In the on-policy setting, an agent executing the algorithm is restricted to following the policy learnt so far, which it can improve in a later phase. In the off-policy approach, an agent can gain experience using functions other than the current policy (Sutton et al., 2018). Typical examples of these approaches are the SARSA and Q-learning algorithms, where the former is on-policy and the latter is off-policy. SARSA is formalised as follows (see Sutton et al., 2018):
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],$$
where α is a learning-speed parameter. This update corrects the current estimate of the q-value for a given state-action pair after observing the reward and the next-step value obtained by following the current policy. In contrast, the off-policy Q-learning update utilises the maximal next-time-step q-value, not necessarily the one chosen by policy π:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right].$$
In the deep learning RL variant (DRL), both functions are approximated with a set of (at least one) neural networks with parameters θ, denoted Q_θ and V_θ (Mnih et al., 2016; Sutton et al., 2018). One of the significant improvements over the previous methods was the introduction of the so-called policy gradient. It optimises the policy directly by calculating the gradient of the expected return with respect to the policy parameters. Formally (Schulman, 2016):
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(\tau) \, R(\tau) \right],$$
where τ is a trajectory rollout under policy parameters θ and R(τ) is the total reward of the trajectory. Implementations of this general idea include approaches such as the "Vanilla" Policy Gradient (VPG), which calculates a policy gradient after each episode termination, effectively performing a Monte Carlo update (Schulman, 2016). Due to its instability and sampling inefficiency, such an approach was further extended by learning a policy and a value function simultaneously, the so-called actor-critic method. The "actor" (parametrised by θ) learns policy π, which is verified by the "critic", a value-function estimator parameterised by ϕ (Mnih et al., 2016; Schulman, 2016; Sutton et al., 1999). In that case, the learned value function is treated as a baseline in the policy gradient, serving normalisation and correction purposes (Arulkumaran et al., 2017). Stability and variance improvements were achieved with the so-called "advantage estimations", where the raw returns per time step are replaced by quantities emphasising the difference between an action's value and the default state value (Mnih et al., 2016; Schulman, Moritz, Levine, Jordan, and Abbeel, 2016). In that context, the advantage can be defined as
$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t).$$
Several variants of this calculation exist, including Generalised Advantage Estimation (GAE), which utilises an exponentially weighted moving average of subsequent time-step advantages (Schulman et al., 2016):
$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$
where λ is an additional smoothing parameter, used along with the discounting factor γ.
Such calculations can be used optionally in the VPG algorithm and its later extensions, e.g. Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Advantage Actor-Critic (A2C) (Wu et al., 2017), and Proximal Policy Optimisation (PPO) (Schulman, Wolski, Dhariwal, Radford, and Klimov, 2017). The latter algorithm further improves policy stability by introducing a clipped surrogate objective function, which limits the magnitude of policy changes on each iteration (Schulman et al., 2017).
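To make the advantage-estimation and clipping mechanics above concrete, the following Python sketch shows a generic GAE computation and the PPO clipped surrogate term for a single sample. It is an illustrative implementation of the published formulas, not the code used in this study; the function names and default parameter values (e.g. γ = 0.99, λ = 0.95) are assumptions.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation for one finished episode.

    `values` must contain one extra entry: the value estimate of the state
    following the last reward (use 0.0 for a terminal state).
    """
    deltas = [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running  # exponentially weighted sum of deltas
        advantages[t] = running
    return advantages

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective for a single sample.

    `ratio` is pi_new(a|s) / pi_old(a|s); `eps` is the clipping parameter
    (set to 0.2 in this study).
    """
    return min(ratio * advantage, np.clip(ratio, 1 - eps, 1 + eps) * advantage)
```

In a full agent, the clipped term would be averaged over a minibatch and maximised by gradient ascent on the actor parameters, while the critic is fitted to the observed returns.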

Simulator overview and observation space
In order to assess the applicability of RL to resource allocation in project management, a dedicated simulator was designed. It can be treated as an analogy to a constrained human resource allocation task for a company over subsequent months, or to any other venture that requires sequential allocation of limited resources.
It is composed of multiple discrete time steps; at each timestep, two projects are available. Each project has a different resource allocation requirement, an associated probability (chance) of success, and a payout per resource allocated. The setting can be formalised as follows:
1. b_i - the balance of the simulated "company" at time step i, i.e. the total amount of money available. The starting balance is denoted b_0.
2. res_i - the number of resources available to the simulated "company" at step i. The starting resources are denoted res_0.
3. C⁻ - the upkeep cost for each idle resource unit. If a unit is not allocated to any project, the "company" has to pay this amount for its upkeep.
4. C⁺ - the resource increase cost, incurred when the agent wants to increase the number of available resources. It replicates real-world market conditions (e.g. the recruitment process for new employees).
5. N - the number of discrete time steps, equal to the number of decision points.
6. p_i^1, p_i^2 - for the i-th timestep, the probabilities of success of projects one and two.
7. pay_i^1, pay_i^2 - for the i-th timestep, the reward (payout) per resource allocated in each project.
8. d_i^1, d_i^2 - for the i-th timestep, the maximum demand for resources in projects one and two.
Therefore, the observation space at each timestep is a vector composed of the following elements:
$$s_i = (b_i, res_i, p_i^1, p_i^2, pay_i^1, pay_i^2, d_i^1, d_i^2).$$
Vectors with this structure are used as input in subsequent timesteps for all agents in the simulations.
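As an illustration only, the observation above can be represented as a simple container that flattens into the eight-element vector s_i; the field and class names below are hypothetical and not taken from the study's code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StepObservation:
    """One timestep of the simulator state, following the list above (names are illustrative)."""
    balance: float        # b_i
    resources: int        # res_i
    p_success_1: float    # p_i^1
    p_success_2: float    # p_i^2
    payout_1: float       # pay_i^1
    payout_2: float       # pay_i^2
    demand_1: int         # d_i^1
    demand_2: int         # d_i^2

    def as_vector(self) -> np.ndarray:
        """Flat vector s_i fed to every agent."""
        return np.array([self.balance, self.resources,
                         self.p_success_1, self.p_success_2,
                         self.payout_1, self.payout_2,
                         self.demand_1, self.demand_2], dtype=np.float32)
```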

Action space
After receiving the observation vector described above, the agent must pick an action. In this discrete control setting, it can choose one of the following actions:
1. a_1 - try to allocate the demand for project 1, resulting in alloc_i^1 = min(d_i^1, res_i).
2. a_2 - try to allocate the demand for project 2, resulting in alloc_i^2 = min(d_i^2, res_i).
3. a_12 - try to allocate half of the demand for project 1 and half for project 2, resulting in alloc_i^1 = min(d_i^1/2, res_i/2) and alloc_i^2 = min(d_i^2/2, res_i/2).
4. a_+10, a_+25, a_+50 - increase the number of resources by 10/25/50% respectively, while keeping all of them idle (incurring the upkeep cost); the pool grows by x × res_i for x ∈ {0.10, 0.25, 0.50}. If the number of resources changes as a result of the selected action, the modified number of resources becomes the starting value for the next time step.
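The sketch below illustrates one possible mapping from the six discrete actions to allocations and the next resource pool, following the rules above. The action labels, the integer halving for the split action, the rounding of the enlarged pool, and the assumption that the pool simply carries over after allocation actions are all illustrative choices, not details confirmed by the study.

```python
def apply_action(action: str, res_i: int, d1: int, d2: int):
    """Map a discrete action to (alloc1, alloc2, next_resources).

    A sketch of the rules described above; how fractional allocations are
    rounded and whether allocated units return to the pool are assumptions.
    """
    alloc1 = alloc2 = 0
    next_res = res_i
    if action == "a1":                        # allocate to project 1 only
        alloc1 = min(d1, res_i)
    elif action == "a2":                      # allocate to project 2 only
        alloc2 = min(d2, res_i)
    elif action == "a12":                     # split between both projects
        alloc1 = min(d1 // 2, res_i // 2)
        alloc2 = min(d2 // 2, res_i // 2)
    elif action in ("a+10", "a+25", "a+50"):  # grow the resource pool, all units idle
        growth = {"a+10": 0.10, "a+25": 0.25, "a+50": 0.50}[action]
        next_res = round(res_i * (1 + growth))
    return alloc1, alloc2, next_res
```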

Simulation objective and rewards
After the allocation is selected, the simulator checks whether the selected project(s) succeeded; each project is an independent Bernoulli trial with success probability p_i^1 or p_i^2. The agent then receives a reward that depends on the allocation choice and on whether project j succeeded at time step i (indicated as succ_i^j). In general terms, the per-step reward adds the payout of every successful project (alloc_i^j × pay_i^j), subtracts the upkeep cost C⁻ for every idle resource unit, and subtracts the increase cost C⁺ when the agent chose to grow the resource pool. The accumulated rewards change the running balance of the agent. The goal is to maximise the rewards accumulated during the whole episode. The simulation terminates in one of three situations: when the balance drops to zero or below, when the agent runs out of resources, or after a predefined number of time steps.
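A minimal sketch of a single simulator step consistent with this description is given below. The exact cost accounting (per-unit upkeep for idle resources and a one-off increase cost) is reconstructed from the definitions above and should be read as an assumption rather than the study's actual implementation.

```python
import random

def step_reward(alloc1, alloc2, p1, p2, pay1, pay2,
                res_i, upkeep_cost, increase_cost, grew_pool):
    """One possible reading of the per-step reward: payouts of successful
    projects minus upkeep for idle units and the cost of growing the pool."""
    succ1 = random.random() < p1          # independent Bernoulli trial, project 1
    succ2 = random.random() < p2          # independent Bernoulli trial, project 2
    reward = 0.0
    reward += alloc1 * pay1 if succ1 else 0.0
    reward += alloc2 * pay2 if succ2 else 0.0
    idle = res_i - alloc1 - alloc2
    reward -= idle * upkeep_cost          # C^- paid per idle unit (assumed per-unit)
    if grew_pool:
        reward -= increase_cost           # C^+ paid when enlarging the pool (assumed one-off)
    return reward
```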

Simulation variants
In order to judge how different agents behave under different conditions, three scenarios were used in the simulation. The distinguishing parameter of these environments was the probability of project success, drawn as described below and capped to a fixed min/max range (with a lower bound of 0.1).
1. In an "easy" environment, the probability of success was a random number drawn from a truncated normal distribution with μ_1 = 0.7, σ_1 = 0.2. The mean chance for a project to succeed in this setting is 70%. With the initial resource pool and starting balance, it should be easy for an agent to generate large incomes during the simulation without risking "bankruptcy" or running out of resources too early.
2. In a "moderate" environment, the probability of success was a random number drawn from a truncated normal distribution with μ_2 = 0.5, σ_2 = 0.2. On average, only 50% of the projects succeed. With the initial balance and resource pool, the agent should choose projects wisely, potentially splitting the allocation or investing in new resources, to avoid bankruptcy or the inability to operate.
3. In a "hard" environment, the probability of success was a random number drawn from a truncated normal distribution with μ_3 = 0.3, σ_3 = 0.2. Only 30% of projects succeed in this setting. It is tough for an agent to generate any income and avoid failure even in the early stages of the simulation.
The success probabilities described above are presented in Figure 2. The upkeep costs were kept constant across all variants. The payouts for each project were random numbers drawn from a truncated normal distribution with μ_pay = 1.0, σ_pay = 0.5, capped to a min/max of −0.5/+1.5. Each environment consisted of 300 timesteps before termination.
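For illustration, success probabilities of this kind can be drawn with scipy's truncated normal distribution. The helper below is a sketch; the upper truncation bound is a placeholder, because the exact cap is not fully recoverable from the text.

```python
from scipy.stats import truncnorm

def success_probability(mu: float, sigma: float, low: float = 0.1, high: float = 0.9) -> float:
    """Draw a project success probability from a truncated normal distribution.

    mu/sigma follow the environment variant (0.7, 0.5, or 0.3 with sigma 0.2);
    `low`/`high` are assumed bounds (only the 0.1 lower bound is mentioned).
    """
    a = (low - mu) / sigma    # scipy expects bounds in standard-deviation units
    b = (high - mu) / sigma
    return float(truncnorm.rvs(a, b, loc=mu, scale=sigma))

# Example: a draw for the "moderate" environment
p_moderate = success_probability(0.5, 0.2)
```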

Agents in the study
Five different types of agents were tested.
1. Random agent -treated as a benchmark, selects random actions at every timestep.
2. Optimisation agent - performs classic constraint optimisation at each step. It computes the expected reward for each action based on the probability of success and therefore seeks the action that maximises
$$a_i^{*} = \arg\max_{a} \mathbb{E}\left[ r_i \mid a \right],$$
where r_i is the per-step reward defined in the previous section. The predictions (decisions) of this agent are deterministic.
3. Reinforcement learning, policy gradient agents (based on the algorithms described in Section 1.3):
a. The VPG agent, composed of "Actor" and "Critic" neural networks. The agent was improved with a GAE calculation over the full Monte Carlo episode for better performance and variance stability.
b. The A2C agent with N-step GAE advantage estimation. During parameter tuning, the optimal number of steps was found to be 25.
c. The PPO agent with N-step GAE advantage estimation and the surrogate objective clipping parameter set to 0.2.
The Actor and Critic neural networks of all the aforementioned agents have the same architecture, presented in Figures 3 and 4.
The inner network layers utilised a hyperbolic tangent activation function due to its scaling properties.The outer layer of a critic network utilised a linear activation function, while the actor-network performed a softmax classification to one of the available actions.
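A PyTorch sketch matching this description is shown below. The activation functions follow the text; the number of hidden layers and their width are assumptions, since Figures 3 and 4 are not reproduced here, and the code is not the study's original implementation.

```python
import torch.nn as nn

OBS_DIM, N_ACTIONS, HIDDEN = 8, 6, 64   # hidden width is an assumption

class Actor(nn.Module):
    """Policy network: tanh hidden layers, softmax over the six discrete actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, N_ACTIONS), nn.Softmax(dim=-1),
        )

    def forward(self, obs):
        return self.net(obs)            # action probabilities

class Critic(nn.Module):
    """Value network: tanh hidden layers, linear scalar output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, obs):
        return self.net(obs)            # state-value estimate
```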

Experimental design
The testing procedure remained the same in all the environments.It consisted of the following steps: 1. Check the score of each agent on 500 independent, random simulation iterations before the training or fitting procedure.
2. Fit the agent (if applicable) on 300 independent, random simulation iterations.
3. Check the final scores on 500 independent, random simulation iterations.Store the results.
4. Perform statistical tests on the gathered data (the scores after training).
5. Assumptions:
a. Each test score is independent of the others and identically distributed: each model is evaluated on 500 independent random episodes, and the simulation environment is reset every time.
b. Each set of results per model was checked for normality using the Shapiro-Wilk test; because of skewness, the tests did not confirm that the data come from a normal distribution.
c. Equality of variance across model scores was checked using the Levene test. In each case the null hypothesis was rejected, so the population variances cannot be considered equal.
6. According to the research (Arcuri and Briand, 2014; Colas, Sigaud, and Oudeyer, 2019), the Welch t-test with p-value correction, followed by post-hoc pairwise testing, is considered the most robust way to compare RL algorithm results that violate the equality-of-variance and normality assumptions. Considering the large number of independent samples (N = 500), the Welch t-test at a stricter significance level α = 0.01 can be used as the primary testing procedure (Arcuri and Briand, 2014); it yielded the fewest false positive errors compared to the non-parametric Mann-Whitney or ranked t-tests (Colas et al., 2019). Therefore, the following methods were utilised for each simulation setup:
a. Perform a single Welch ANOVA at significance level α = 0.01 to test whether there is a significant difference in the scores between all agents.
b. Perform a series of post-hoc tests between pairs of agents to judge which one in each pair performs better. The method of choice for pairwise comparison was the Games-Howell test, as it can be combined with Welch ANOVA in a non-equal-variances setup (Bagherzadeh, Kahani, and Briand, 2021; Games and Howell, 1976).
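The snippet below sketches the assumption checks and a single Welch-type pairwise comparison using scipy. It is illustrative only: the study's primary procedure (Welch ANOVA followed by Games-Howell post-hoc tests with p-value correction) is not part of scipy and is therefore not reproduced here.

```python
from scipy import stats

def compare_two_agents(scores_a, scores_b, alpha=0.01):
    """Assumption checks plus one pairwise comparison between two agents' scores.

    A Welch t-test stands in for a single pairwise comparison; the full study
    used Welch ANOVA and Games-Howell tests across all agents.
    """
    _, p_norm_a = stats.shapiro(scores_a)                      # normality, agent A
    _, p_norm_b = stats.shapiro(scores_b)                      # normality, agent B
    _, p_levene = stats.levene(scores_a, scores_b)             # equality of variances
    t_stat, p_welch = stats.ttest_ind(scores_a, scores_b,
                                      equal_var=False)         # Welch t-test
    return {"shapiro_a": p_norm_a, "shapiro_b": p_norm_b,
            "levene": p_levene, "welch_t": t_stat, "welch_p": p_welch,
            "significant": p_welch < alpha}
```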

Results overview
Table 1 presents the mean, median, and standard deviation of the results achieved by each agent in the simulations. The numbers in brackets indicate the standard deviation of the scores. The maximal score for a given setup is given in bold. All scores were rounded to three decimal places. The boxplot below presents the distribution of each model's scores in each environment. The bounds of each box represent the 25th and 75th percentiles, while the line in the middle marks the median. The upper/lower whiskers are the highest/lowest values that lie within 2.5 standard deviations of the mean. The longer or more spread out the box, the more stretched the score distribution.
Looking at the table and plots, it is clear that the PPO algorithm performed best in the most challenging environments, i.e. "moderate" and "hard". As described in the section "Experimental design", a series of statistical tests was performed to judge the significance of the differences between the models: a Welch ANOVA at significance level α = 0.01 to evaluate the overall differences, and a series of pairwise comparisons using the Games-Howell method, followed by a p-value correction to control the false discovery rate; the results are presented below. In all the tables, the top row represents the Welch ANOVA test with degrees of freedom 1 (treatment levels/groups), degrees of freedom 2 (no. of observations minus no. of groups), the test statistic, and the p-value. The subsequent rows represent pairwise model comparisons, with the score difference, its standard deviation, the test statistic, and Hedge's effect size (*/**/*** for small/medium/large) (Ferguson, 2009; Sullivan and Feinn, 2012). Insignificant comparisons (with p-value ≥ 0.01) were greyed out. In each pair, the model with better results was highlighted in bold and underlined. In the "easy" environment, where the chance of success for each project is relatively high, the agent utilising classic optimisation procedures performed best, followed by the PPO and AC agents, which proved slightly less effective. This difference can be attributed to the stochastic nature of the RL algorithms and their inherent instability: as machine learning procedures, their results may vary. When the chance of success is very high, the sequential problem reduces to a series of single-step, greedy optimisation tasks, for which classic tools proved to be the best choice.
In both the "moderate" and "hard" environments, the PPO agent significantly outperformed both the classic optimisation method and the other RL algorithms. The effect sizes and the absolute differences are large, measured in the thousands for the "moderate" environment. This effect can be attributed to the fact that the RL algorithms can plan over a long-term horizon, thanks to reward discounting and advantage estimation, which is not possible for simpler, greedy optimisation tools.
In conclusion, the study shows that deterministic optimisation remains the best choice for stable, well-defined environments, while advanced RL methods, such as PPO, are best suited for challenging, stochastic environments characterised by uncertainty.

Discussion
The simulation presented in this study is far from being a perfect model of a real resource allocation scenario.
The first improvement and future study direction could be to transform it into a continuous control problem, where an agent can allocate an exact number of resources, or a proportion of them, to projects instead of discrete chunks. Simplifying the problem to discrete actions instead of numerical ones is standard practice for many RL algorithms, but such an approach can be problematic when the dimensionality of the action vector is large (Dulac-Arnold et al., 2015; Lillicrap et al., 2016; Smart and Kaelbling, 2000). Therefore, a natural extension would be to transform the simulator presented in this study to operate in a numerical action space and to adjust the agents' policies accordingly.
Other improvements could include resource-locking long-term projects, nondeterministic project shutdowns, and sudden resource outages to make the simulation more realistic.

Conclusion
The simulation experiment performed in this study was designed to replicate sequential resource allocation processes similar to those found in various project management tasks. A classic resource optimisation algorithm was compared with RL algorithms at different difficulty levels to judge the performance of such methods. Rigorous statistical testing showed that the PPO algorithm performed best of all the tested methods in the more challenging setups associated with uncertainty.
The implementation of a simulator presented in this paper can be treated as a starting point for more complicated designs that include resource locking, changing market conditions, and continuous control.

Fig. 5. Distribution of scores per each model. Source: own work.

Table 1. Scores of each agent in each environment

Table 2. Pairwise comparisons of agents on the "easy" environment