Preface
The information presented here comes from personal experience and industry practices.
Many concepts apply not only to Machine Learning platforms but to other systems as well.
Goal
While most Machine Learning platforms are either internal to an organization or offered as a service, this platform prioritizes a different aspect: giving expert Machine Learning engineers a path that takes them from implementation to users.
Principles
These principles are stated first and foremost to provide context to every architectural design decision. They aim to inform the trade-offs and consequences of every choice. Note that neither end of the spectrum is intrinsically beneficial or detrimental to the architecture — finding the right balance is critical.
- Integrated experience — modular architecture
- Fundamentally stateful — practically stateless
- Use when possible — build when needed
- Cohesive design — organic evolution
An integrated experience means that systems are not isolated from their environment. From user interaction to communication with other services, the best experience is achieved through tight, human-centered connections between components that avoid context switches. In contrast, a modular architecture defines clearly separated components with interfaces through which data moves across the system. Modularity heavily influences maintenance: the more components depend on each other, the more effort it takes to introduce changes.
Applications are always fundamentally stateful. Ignoring this fact and assuming instead that each system component is a pure computation unit leads to ad-hoc interfaces and de facto contracts between services. The first step is acknowledging that data management is the foundation for every other element of the platform. State is organized in declaratively defined resources, and interfaces are resource-driven. In contrast, deployed services should run as close to stateless as practical. That is, storage and compute are logically independent entities: storage maintains state and compute mutates it.
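As a rough illustration of this split, the sketch below keeps all state in a store and lets a stateless function mutate it. Every name here (`ModelResource`, `ResourceStore`, `mark_ready`) is hypothetical and chosen only for the example; it is not taken from the platform itself.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelResource:
    """Hypothetical declarative resource: describes what exists, not how to build it."""
    name: str
    version: int
    status: str = "pending"


class ResourceStore:
    """Storage side (hypothetical): the only place where state is kept."""

    def __init__(self) -> None:
        self._items: Dict[str, ModelResource] = {}

    def get(self, name: str) -> Optional[ModelResource]:
        return self._items.get(name)

    def put(self, resource: ModelResource) -> None:
        self._items[resource.name] = resource


def mark_ready(store: ResourceStore, name: str) -> ModelResource:
    """Compute side (hypothetical): keeps no state of its own between calls.

    It reads a resource, derives the new state, and writes it back to the store.
    """
    current = store.get(name)
    if current is None:
        raise KeyError(f"unknown resource: {name}")
    updated = replace(current, status="ready")  # new immutable copy
    store.put(updated)
    return updated


store = ResourceStore()
store.put(ModelResource(name="sentiment", version=1))
result = mark_ready(store, "sentiment")
print(result.status)
```

Because `mark_ready` receives the store as an argument and remembers nothing between calls, any number of identical workers could run it against the same storage. An in-memory dictionary stands in for a real database here.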
To use when possible means reaching for tools, practices and information that are already available and commonly used in the industry. Avoid locking into an internal solution with limited functionality if reasonable open-source or commercial off-the-shelf solutions exist. In most cases, however, the cost of integrating an existing solution is much higher than that of a tailor-made one that solves a very specific problem. To build when needed is to know when to choose one over the other, since the decision strongly depends on the consequences of migrating to the other option later.
Lastly, components should follow a cohesive design over time. It is not sustainable in practice, of course, for every aspect of the platform to be implemented in the same style. Even so, aiming for this level of standardization helps maintenance, improves developer experience and shortens onboarding time. On the other hand, enforcing the necessarily strict rules requires human reviewers, additional effort spent on developer tools, and time for new team members to learn them. Most importantly, it limits the capacity of engineers and scientists to experiment and raises the barrier to entry for people unfamiliar with the system. That is, it reduces the organic evolution of the system. Because users and developers are the closest to day-to-day issues, they should be encouraged to fix those issues without much overhead.