Delayed gratification, the 50-move rule, discounted rewards, and an exercise from Sutton and Barto, 2nd edition
There are earlier discussions on the lczero group (see the bottom of this post) about Leela Zero aimlessly moving about in a won position, apparently constrained only by the 50-move rule to preserve its anticipation of a win. I don't like this behaviour, but I have seen it defended by appeal to Zero Knowledge principles: a win is a win, and we do not care for elegance. I would like to make the point here, backed by higher authority, that we should care for elegance and that it is appropriate to favour quicker gratification.
The following is from the second edition of Sutton and Barto, Reinforcement Learning. (It is Exercise 3.7 in the online edition of May 15, 2018; Exercise 3.8 in an earlier online edition.)
Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward. After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?
The book does not provide answers to the exercises, but the problem is placed in the context of a discussion of discounting. Note that a strategy of random moves will eventually let the robot escape and obtain its reward, so under the stated reward scheme training cannot improve on random play: every escaping episode earns exactly +1. Presumably, for the authors, this failure to improve over random moves is what is "going wrong". It seems clear to me that the intended answer to the second question is that the human has not effectively communicated what is to be achieved. The point of the exercise is then that the problem specification is defective: the human surely meant to design a robot that escapes from the maze sooner rather than later, and should have included discounting in the training goal.
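The failure mode can be made concrete with a small sketch. The function below is illustrative (it is not from the book): with no discounting, a 10-step escape and a 1000-step escape earn identical returns, so the agent has nothing to optimize; with any discount factor below 1, the quicker escape is strictly preferred.

```python
def episode_return(steps_to_escape, gamma=1.0):
    """Return of a maze episode that escapes after `steps_to_escape` steps,
    with reward +1 at escape, 0 at all other times, discounted by gamma."""
    return gamma ** (steps_to_escape - 1) * 1.0

# Undiscounted: all escaping episodes look identical to the learner.
assert episode_return(10) == episode_return(1000) == 1.0

# Discounted: the quicker escape now earns a strictly larger return.
assert episode_return(10, gamma=0.99) > episode_return(1000, gamma=0.99)
```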
With this interpretation of Exercise 3.7, and with the authority of Sutton and Barto behind me, I think that the problem specification of LCZero is defective in the same way. It should not be viewed as a violation of the Zero Knowledge principle to insist that a win is best enjoyed sooner rather than later; the reward of +1 for delivering mate, or -1 for suffering mate, should be discounted in time (moves or plies).
This does, unfortunately, require the specification of a discounting rate. Fortunately we have a natural time scale given by the 50-move rule. I think a natural starting guess would be to discount by a factor of 1/e per 50 moves (100 plies), i.e. by exp(-0.01) each ply. This discount rate might then be tuned by the Zero Knowledge criterion of maximizing the expected score as computed without discounting, favouring more rapid discounting when there is no statistically discernible advantage the other way.
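As a sketch of the proposed schedule (the rate is the starting guess above, not a tuned value, and the helper name is mine): a per-ply factor of exp(-0.01) compounds to exactly 1/e over the 100 plies of 50 moves, and makes a quick mate worth more than a slow one.

```python
import math

PER_PLY_DISCOUNT = math.exp(-0.01)  # proposed per-ply discount factor

# Over 50 moves (100 plies) the cumulative discount is exp(-1) = 1/e.
cumulative = PER_PLY_DISCOUNT ** 100
assert abs(cumulative - 1 / math.e) < 1e-12

def discounted_mate_reward(reward, plies_to_mate):
    """Discount a terminal reward of +1 (win) or -1 (loss) by the
    number of plies until mate is delivered."""
    return reward * PER_PLY_DISCOUNT ** plies_to_mate

# A mate in 10 plies is now worth more than the same mate in 60 plies.
assert discounted_mate_reward(+1, 10) > discounted_mate_reward(+1, 60)
```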
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. Cambridge, MA: MIT Press, 2018. Online: http://incompleteideas.net/book/the-book-2nd.html.
Related earlier discussions on the lczero Google Group:
2018-03-23 Missing mates;
2018-04-08 Poor endgame;
2018-05-03 Bad evaluation by Leela !!;
2018-05-18 Expected remaining moves.
See also the page "Missing Shallow Tactics" in the GitHub leela-zero wiki.