On Nature 2.0: A Token Engineering Framework for Designing On-chain Behaviour of Intelligent Agents

Google DeepMind’s AlphaStar plays StarCraft II professionally.

This essay was inspired by Trent McConaghy’s presentation on Token Engineering at Ocean Protocol for the Ocean Protocol Study Group, in which TokenSpice2 is showcased as an EVM-in-the-loop agent-based simulator for token economic modelling. It led me to consider the implications of training intelligent agents in these simulators and unleashing them on main-net protocol interactions.

Later I found out that Trent thinks about these topics a lot and has been writing about them for years under the general concept of Nature 2.0. Nature 2.0 describes the point at which human-created technology becomes as resilient, diverse, and autonomous as the nature we see in the forest or in the ocean. One way this could happen is through an AI DAO, an autonomous organization run by AI, or through self-owning and self-expanding natural ecosystems or infrastructure, as demonstrated by Terra0.

The remainder of the article is reproduced from the original which was hosted on Notion: https://www.notion.so/Reinforcement-Learning-in-Token-Engineering-51534090de3342b0b0ada26b92158f7b

You can see a video presentation of the article here, as presented to the Ocean Protocol Study Group at the Token Engineering Academy:

Presented to the Ocean Protocol Study group in the Token Engineering Academy

I have this theory that we should be able to combine two powerful frameworks that each revolve around agent-based modelling: one from the field of general artificial intelligence, and one from the domain of token engineering. First, Stable Baselines, the result of decades of research, open-source development, and the gumption of Elon Musk, is a high-level interface to well-established and tested reinforcement learning algorithms that come with stable hyper-parameter tunings. Second, hailing from the domain of rigorous verification engineering and CAD, there is the mind-boggling agent-based economic simulator and EVM interface TokenSpice2, made by Trent McConaghy for modelling the Ocean Protocol ecosystem. I believe that these two frameworks can be combined for developing, testing, and deploying intelligent agent-based economic networks. If my theory is correct, then watch out world: the impacts of this technology could be as profound as the Bitcoin whitepaper itself. Satoshi enabled web-based economies. We are enabling AI-based economies.

The field of AI advances in tandem with the sophistication of the testing simulations that are framed for it. From Chess to Atari, from Go to StarCraft, the algorithms’ capacity to learn tends to fit the complexity of the simulation in which they are tested. Standardized problem sets, or games, lead to reproducible results, and thus the scientific method can be applied to advance the field and disseminate working examples. The reason TokenSpice2 is so profound is its capacity to be a next-generation sandbox for AI research in economics. Just like StarCraft, it is an incredibly complex and meaningful environment with quantifiable outcomes: in StarCraft, “did we win the game?”; in economics, “did we make more money?”

Since this talk is directed at a study group of token engineers, background on token engineering is not the focus. I’ll give more background on reinforcement learning, since that is the lesser-known subject and, to be honest, my favourite subject of all.

Silver. https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf

Reinforcement Learning is as profound as it is powerful. There is so much to be learned about the human experience and the human mind from studying reinforcement learning. For example, a fundamental topic of RL is the exploration-exploitation trade-off. Exploration is trying new things, taking risks, making discoveries. Exploitation is utilizing the information that we have, using our understanding of the world to make the right decisions. Do you take the risk of quitting your stable job to work in the phantasmal, high-flying, volatile and fantastic blockchain industry? Or remain a suitor of hierarchical promises of ascension and popular approval? It turns out that a dynamic balance of both is required to perform well in reward optimization. Typically, exploration is favoured at the beginning of time, and exploitation towards the end. Be mindful of your objective function. At LTF we warn of the dangers of mono-reward optimization and always advocate for multi-dimensional objectives.
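A rough sketch of how this balance is commonly implemented is epsilon-greedy action selection: explore with probability epsilon, exploit otherwise, and decay epsilon over time (a toy illustration of the principle, not code from the talk):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: try something new
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best current estimate

# Favour exploration early and exploitation later by decaying epsilon each step.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.999
for step in range(10_000):
    # action = epsilon_greedy(q_values_for_current_state, epsilon)  # hypothetical Q-values
    epsilon = max(epsilon_min, epsilon * decay)
```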

Another analogous experience shared between humans and machines is the effectiveness of constant negative reward, which has been shown to be an essential piece of the recipe in basic RL tasks such as maze solving. If an agent does not receive a constant negative reward, it has no incentive to finish the game sooner. This speaks to the fundamental insight that came to the Buddha as the First Noble Truth, Duhkha: life is suffering. I’m sure that every reader will be familiar with this experience: you find yourself blessed, with a cherished life, grateful for everything around you, yet you can’t shake a deeper, underlying anxiety. The void. That pull you feel inside of you. That’s constant negative reward, and it’s how all of your ancestors survived.
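As a toy illustration of the idea, a maze environment can encode this simply by returning a reward of -1 on every step, so the only way for the agent to suffer less is to reach the goal sooner (`maze.move` below is an assumed helper, not a real library call):

```python
def step(position, action, goal, maze):
    """One step of a toy maze MDP with a constant negative reward."""
    next_position = maze.move(position, action)  # assumed helper: apply the move within the walls
    reward = -1.0                                # every step hurts a little...
    done = next_position == goal                 # ...so the agent learns to finish quickly
    return next_position, reward, done
```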

Another example is gamma, the discount factor. This parameter shows up everywhere in RL; it is the amount by which we discount future rewards in favour of immediate rewards. This is obviously an important parameter to consider as humans: how much should we invest in our future versus pursuing short-term gratification?
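A small sketch of the standard discounted return, where each reward one step further into the future is multiplied by one more factor of gamma:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum future rewards, discounting each additional step into the future by gamma."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```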

All of the lessons above were learned from my study under Dr. Richard Sutton at the University of Alberta, and I highly recommend that you check out his book. Indeed, life is just Windy Gridworld (Example 6.5 in Sutton’s text). The analogies between the human experience and the field of reinforcement learning are endless; this is why, to me, it appears to be such a promising lead on the construction of general intelligence.

Silver. https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf

Reinforcement learning is significantly different from supervised and unsupervised learning. Supervised learning has labelled targets for each observation, unsupervised learning has no labels, and reinforcement learning has labels that are generated on the fly by agent-environment interaction. Reinforcement learning is unique in its temporal nature: it can be formally modelled as a Markov decision process, in which the agent finds itself in a particular state at each step in time.

Silver. https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf

Reinforcement Learning has two fundamental data structures: the agent and the environment.

Steps to creating a reinforcement learning setup:

  1. Define the reward that our agent is trying to optimize
  2. Define the action set that can be sampled from the agent
  3. Define the observation set that can be sampled from the environment
  4. Define the agent’s policy that maps observations to actions
  5. Define the process in which the agent updates its policy
  6. Run the agent through experiences or simulations to allow it to optimize its policy

Stable Baselines conforms to the standard agent-environment architecture of the OpenAI Gym framework. Let’s run a basic example.
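A minimal sketch of such an example, assuming stable_baselines3’s PPO and the classic Gym CartPole environment, looks roughly like this:

```python
import gym
from stable_baselines3 import PPO

# CartPole: keep a pole balanced on a cart by nudging the cart left or right.
env = gym.make("CartPole-v1")

# device="cuda" assumes a supported GPU; fall back to device="cpu" if needed.
model = PPO("MlpPolicy", env, verbose=1, device="cuda")
model.learn(total_timesteps=100_000)

# Roll out the trained policy and watch it balance the pole.
obs = env.reset()
for _ in range(1_000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
env.close()
```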

Play around with CartPole, the hello world of RL! This was an awesome environment to study back when I was learning RL under Dr. Sutton. See example 3.5 in the book for the full breakdown of this setup.

Careful with the device option that is passed to the agent. CUDA is how the deep learning backend sends work to the GPU, and it may not work with older graphics cards or AMD chips. The GTX 670 on my desktop is too old to be running the default deep learning backend for stable_baselines3 (I believe it’s PyTorch), so I hopped over to my laptop, which has a more modern GPU. Alternatively, you can change the device parameter to 'cpu', but be warned: the training will be incredibly slow.

The above example is incredibly simple, but it contains all of the essential pieces of reinforcement learning: the agent, the environment, the action space, the observation space, an objective, and a policy. The algorithm used is Proximal Policy Optimization (PPO), which learns from online experience and maintains a multi-layer-perceptron policy that maps observations to actions, as well as an advantage estimate of the value of particular actions in particular states.

If you want to continue diving deep into the ocean of RL, check out the research papers published by Google’s DeepMind. For a nice introduction to a formal understanding of deep reinforcement learning, I recommend seeking out the content of David Silver. For more sensationalized and altruistic research, check out Elon Musk’s initiative OpenAI.

OpenAI’s mission is to ensure that artificial general intelligence (AGI) — by which we mean highly autonomous systems that outperform humans at most economically valuable work — benefits all of humanity. We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

Economics is right there in the middle of their mission statement, so it must be important. Now that we have a deeper understanding of RL, let’s frame it in the context of token engineering and TokenSpice2.

In the few years since the economic collapse of crypto in 2018, when the ICO bubble burst, there has been a rush of mathematicians and engineers into blockchain land. It became clear that there was a need for rigorous systems engineering and verification methods in crypto-economic systems. Around this time, the token engineering community hit the scene, BlockScience went to work on cadCAD, and systems thinkers around the world set about building a next-generation deployment of economic thinking. A gamut of mental models emerged, inspired by frameworks such as design thinking and control theory, and coalesced into an emerging discipline called token engineering.

Voshmgir, Zargham. https://epub.wu.ac.at/7782/1/Foundations of Cryptoeconomic Systems.pdf

The Ethereum Virtual Machine (EVM) is a global, decentralized computer that allows for the storage of data and execution of code. The EVM maintains a set of accounts; each account is controlled by a private key and has an associated public key. Accounts can be internal or external. Internal accounts are smart contracts, which have an external owner, and external accounts are wallets held by humans or machine users. The EVM maintains a universally agreed-upon state of its computation; the synchronicity of this state is achieved via the Ethereum blockchain, a consensus mechanism that uses proof of work to maintain a linked list of hashed blocks, referencing backwards in time.

The Ethereum main net is the primary chain that is typically referred to when people discuss Ethereum. There are also test networks such as Rinkeby, which enable testing dapps across the web at no cost. Using Ganache, we can run our own Ethereum blockchain right on our local system and use it to test smart contracts. TokenSpice2 enables us to test crypto-economic networks from a token modelling point of view.

Ganache running a local development chain.
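As a minimal sketch of talking to that local chain from Python, assuming web3.py and Ganache’s usual default RPC endpoint:

```python
from web3 import Web3

# Ganache exposes a JSON-RPC endpoint, typically at http://127.0.0.1:8545 for ganache-cli.
w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))
assert w3.isConnected(), "Is Ganache running?"

# Ganache pre-funds a handful of test accounts to experiment with.
account = w3.eth.accounts[0]
balance_wei = w3.eth.get_balance(account)
print(account, w3.fromWei(balance_wei, "ether"), "ETH")
```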

The ability of TokenSpice2 to target any of these chains is one of its most powerful features. It enables us to utilize the analogue signal of the EVM for our SPICE. This leads to an architecture referred to as ‘Simulator in the loop’.

Simulator in the Loop Architecture — A concept described by Trent McConaghy.

I could see this being a very good architecture for optimizing agent policies against performance metrics such as maximum wallet value, or for accuracy in modelling the properties of the live system, the Ethereum main net in this case. This architecture could go hand in hand with the reinforcement learning strategy of prioritized experience replay. In this case, online data would be collected and then used in the simulator to seed realistic simulations for the agents. The signal from the live market would carry analogue signals that reflect the entropy and emergent properties of the real marketplace.
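A rough sketch of what seeding such a buffer from live observations might look like; the transition format and priorities here are my own assumptions, not features of TokenSpice2:

```python
import random

class PrioritizedReplayBuffer:
    """Minimal prioritized replay: sample stored transitions in proportion to their priority."""

    def __init__(self):
        self.transitions = []
        self.priorities = []

    def add(self, transition, priority=1.0):
        self.transitions.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        return random.choices(self.transitions, weights=self.priorities, k=batch_size)

buffer = PrioritizedReplayBuffer()
# buffer.add((obs, action, reward, next_obs), priority=1.0)  # one entry per decoded main-net transition
# batch = buffer.sample(32)                                  # mix live-seeded data into simulated training
```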

To run the default simulation, just follow the instructions on the GitHub page. LTF has made a fork of the repo; our version is streamlined for Linux environments, so if you are using Windows, check the original Ocean Protocol version. LTF plans on being a community maintainer of TokenSpice, as we see it fitting into our broader mission of seeding and optimizing the longtail economy.

TokenSpice2 is super fun to run. It’s well designed, well engineered, well documented, and well tested. It was easy to have it deploy a Ganache network, migrate the Ocean Protocol contracts to that network, and then have agents interact with those contracts on the local chain. The included driver file run_1.py is a great example of how to get started with running a simulation, and the state of the simulations is easy to identify and modify, given that the key values are all labelled as magic numbers. Reading the agent modules is a joy, as they clearly define the state spaces of the agents and give an idea of the bounds of the system. By identifying all of the agents and their actions, one can have a complete picture of the bounds of the network.

The original purpose of TokenSpice2 is to model the Ocean Protocol ecosystem. The ecosystem comprises data publishers, data speculators, and data consumers. There is a grant component to the ecosystem as well, where grant takers can receive funds from grant givers, and there are several protocol agents like the pool, router, burner, and minter.

The progression of the state over time can be observed in time-series plots of important variables such as Ocean DAO income, the amount of OCEAN minted and burned, the R&D spend per year, and the fundamental valuation of the network. The following are a few plots from running TokenSpice2 with its default configuration.

So far, the simulation has simple probabilistic behaviour built into the agents with regard to buying and selling OCEAN and staking and unstaking on Ocean data pools. An intelligent agent would take the state of the ecosystem into account when making its decisions. This is the magic of RL: we can have the policies of the agents change over time based on the changing state of the environment. Agents can also adapt to their own influence on the environment.

What if we give the agents the objective of maximizing their OCEAN holdings? Simulations could be run for hundreds of simulated years to give the agents enough time to adapt their policies to be highly effective. How will that affect the overall price of the token over time? Speculation feedback loops may become identifiable. Let’s sketch what this might look like:

Protocol Speculator Agent

The observation space is a 4-dimensional vector of:

  • Steps since previous speculation
  • Current price of OCEAN
  • OCEANDAO Revenue
  • Ocean Network Valuation

The action space is a tuple ({0,1,2}, p):

  • 0 = No action
  • 1 = sell BPT
  • 2 = stake OCEAN
  • p = percentage to sell or stake

The policy is optimized by the RL algorithm.
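A rough sketch of how this observation and action space could be expressed as a Gym environment; the `sim` hooks are hypothetical placeholders for a TokenSpice-style simulation, not an existing API:

```python
import numpy as np
import gym
from gym import spaces

class SpeculatorEnv(gym.Env):
    """Hypothetical speculator agent interface to a TokenSpice-style simulation."""

    def __init__(self, sim):
        super().__init__()
        self.sim = sim  # assumed wrapper around the simulation state
        # Observation: steps since last speculation, OCEAN price, OceanDAO revenue, network valuation.
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(4,), dtype=np.float32)
        # Action: (0 = no action, 1 = sell BPT, 2 = stake OCEAN) plus a percentage p.
        self.action_space = spaces.Tuple((
            spaces.Discrete(3),
            spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),
        ))

    def reset(self):
        self.sim.reset()
        return self._observe()

    def step(self, action):
        choice, fraction = action
        self.sim.apply(choice, float(fraction[0]))  # assumed hook: execute the speculation
        self.sim.tick()                             # assumed hook: advance one simulation step
        reward = self.sim.ocean_holdings_delta()    # reward: change in the agent's OCEAN holdings
        return self._observe(), reward, self.sim.done(), {}

    def _observe(self):
        s = self.sim
        return np.array([s.steps_since_speculation(), s.ocean_price(),
                         s.dao_revenue(), s.network_valuation()], dtype=np.float32)
```

Note that most Stable Baselines 3 algorithms do not support Tuple action spaces out of the box, so in practice the percentage p might be folded into a single Box action alongside the discrete choice.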

See the original version of this essay for a pseudo-code implementation of a Stable Baselines agent that learns and makes decisions inside of the TokenSpice simulation. This idea was later implemented in collaboration between Marc Minnee and myself for an energyweb solution at the Ocean Protocol data hackathon in January 2021. We give an overview of our solution, along with a deep dive into the technical components, in a two-part episode of the Token Engineering Commons lab from February 2021.

My goal has been to inspire you to consider the importance of understanding the intersection of these two fields, Token Engineering and AI. I hope that you now see the potential for the impending realization of Nature 2.0. It’s more important now than ever that we steward these technologies as abundance machines for the public good and the commons. We need narrative weavers, policy makers, musicians, artists, philosophers, scientists, and engineers to play crucial roles in the steering of these emergent systems. To become a steward of this technology yourself, get involved in the Commons Stack, the Token Engineering Commons, Longtail Financial, Ocean Protocol, and the Token Engineering Academy.

Special thanks to Angela Kreitenweis for encouraging me to give this talk in the Ocean Protocol Study group, and for the incredible work she has done in the TEA. Special thanks to Trent McConaghy for pioneering such bold ideas in academia, startupland, and now blockchainland, with integrity and good intentions, and for being such an integral mentor for the community and the field of token engineering.

Don’t just read about Nature 2.0, live it! You can start by following YGG on Twitter for Nature 2.0 in real time.

Happy Token Engineering!

Yogi Hacker