Note: this works only in modern browsers, so make sure you're on the newest version 🤘
This is a project I have been working on for quite some time now. These cars learned how to drive by themselves. They received feedback on which actions are good and which are bad, based on their current speed, which serves as their reward. All of it is powered by neural networks.
You can drag the mouse to draw obstacles, which the cars must avoid. Play around with this demo and get excited about machine learning!
The following is a more detailed description of how this works. You may stop here and just play with the demo if you're not interested in the technical background!
Neural networks
The agents learn by adjusting the weights of their neural networks, which act as function approximators. In this case there are two networks: a state-to-action net (3 layers, 150 neurons) and a state-plus-action-to-Q-value net (2 layers, 200 neurons). The Q-value describes how good an action is. By training the second network, the value network, you can obtain policy gradients, which you can then use to train the first network. The first network, the actor network, then becomes your decision maker. This algorithm is called Deep Deterministic Policy Gradient, or DDPG for short. Combining it with state-of-the-art techniques results in the cars you can see above. These techniques include prioritised experience replay buffers, ReLU non-linearities and the Adam optimiser. As reasonable as this may sound at first, a lot of the trouble with neural networks these days lies in the hyper-parameter search. There are at least a dozen parameters you need to tune to achieve good results, which is something of a drawback. In the future this might be overcome by automatic hyper-parameter search, which iterates over sets of hyper-parameters and picks the best one.
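To make the "policy gradients" mentioned above a bit more concrete: in DDPG the actor μ (with parameters θ) is trained by nudging its output in the direction that increases the value network's estimate Q. Roughly, with states s drawn from the replay buffer, the actor update follows the deterministic policy gradient from the original DDPG paper:

$$ \nabla_{\theta} J \;\approx\; \mathbb{E}_{s}\left[ \left.\nabla_{a} Q(s, a)\right|_{a = \mu(s \mid \theta)} \, \nabla_{\theta}\, \mu(s \mid \theta) \right] $$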
Sensors
The agent's state (the input to the neural nets) consists of two time-steps: the current time-step and the previous one. This helps the agent make decisions based on how things have moved over time. For each time-step the agent receives information about its environment. This includes 19 distance sensors, arranged at different angles. You can think of these sensors as beams that stop when they hit an object. The shorter the beam, the higher the input the agent receives (0 for no hit, up to 1 for a very short beam). In addition, a time-step contains the current speed of the agent. In total, the input to the neural networks is 158-dimensional.
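To illustrate the beam encoding described above, here is a minimal sketch (my own illustration, not the actual implementation; the maximum beam range is a made-up parameter):

```javascript
// Encode a single distance sensor reading as a network input, following the rule
// above: 0 means the beam hit nothing, values approaching 1 mean a very short beam.
// `maxRange` is a hypothetical maximum beam length used only for this illustration.
function encodeSensor(hitDistance, maxRange) {
  if (hitDistance === null || hitDistance >= maxRange) {
    return 0;                          // no obstacle within the beam's reach
  }
  return 1 - hitDistance / maxRange;   // shorter beam -> larger input value
}

// Example: an obstacle one fifth of the range away yields a strong signal.
console.log(encodeSensor(3, 15));      // 0.8
console.log(encodeSensor(null, 15));   // 0 (no hit)
```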
Exploration
A major issue with DDPG is exploration. In regular DQN (deep Q-networks) you have a discrete set of actions to choose from, so you can easily explore the state-action space by epsilon-greedily randomising actions. In continuous action spaces (as is the case with DDPG) this is not as easy. In this project I used dropout as a way to explore: randomly dropping some neurons of the last layer of the actor network, thereby obtaining some variation in the actions.
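To make the contrast concrete, below is a small sketch of both ideas (my own illustration, not the project's actual code; the dropout rate and the rescaling are assumptions):

```javascript
// Epsilon-greedy exploration for discrete actions, as used with regular DQN:
// with probability epsilon pick a random action index, otherwise the greedy one.
function epsilonGreedy(qValues, epsilon) {
  if (Math.random() < epsilon) {
    return Math.floor(Math.random() * qValues.length);   // random action
  }
  return qValues.indexOf(Math.max(...qValues));          // greedy action
}

// Dropout-based exploration for continuous actions: randomly zero out some of the
// actor's last-layer activations, so repeated forward passes give varied actions.
// Rescaling by 1 / (1 - rate) keeps the expected activation roughly unchanged.
function dropoutExplore(activations, rate) {
  return activations.map(a => (Math.random() < rate ? 0 : a / (1 - rate)));
}

// Example usage with made-up values:
console.log(epsilonGreedy([0.1, 0.7, 0.2], 0.1));        // usually 1
console.log(dropoutExplore([0.4, -0.2, 0.9], 0.25));     // e.g. [0.533, 0, 1.2]
```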
Multi-agent learning
In addition to applying dropout to the actor network, I put four agents into the virtual environment at the same time. All of these agents share the same value network but have their own actor networks, so they respond differently to the same states and each explores a different area of the state-action space. All in all this resulted in better and faster convergence.
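Structurally, the setup can be pictured like this (an illustrative sketch only, not the library's actual API; the network factory and agent fields are made up):

```javascript
// Placeholder network factory, just to make the sharing explicit.
function makeNetwork(name) {
  return { name, weights: [] };
}

// One value (critic) network shared by everyone...
const sharedCritic = makeNetwork('critic');

// ...but each of the four agents keeps its own actor network.
const agents = Array.from({ length: 4 }, (_, i) => ({
  id: i,
  actor: makeNetwork('actor-' + i),   // separate actor -> different behaviour
  critic: sharedCritic                // same object -> shared value estimates
}));

// Every agent's experience improves the single shared critic, while the distinct
// actors push the agents into different parts of the state-action space.
console.log(agents.every(agent => agent.critic === sharedCritic));   // true
```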
The code for the demo above, along with the JavaScript library I made, is available on GitHub. If you want to hear more about the progress of the project as I add new features, I encourage you to follow me on Twitter @janhuenermann! Also feel free to share the project on social media, so more people can get excited about AI!