[Julia blog] On Machine Learning and Programming Languages

06 Dec 2017

Any sufficiently complicated machine learning system contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of a programming language.[1]

By Mike Innes (Julia Computing), David Barber (UCL), Tim Besard (UGent), James Bradbury (Salesforce Research), Valentin Churavy (MIT), Simon Danisch (MIT), Alan Edelman (MIT), Stefan Karpinski (Julia Computing), Jon Malmaud (MIT), Jarrett Revels (MIT), Viral Shah (Julia Computing), Pontus Stenetorp (UCL) and Deniz Yuret (Koç University)

As programming languages (PL) people, we have watched with great interest as machine learning (ML) has exploded – and with it, the complexity of ML models and the frameworks people are using to build them. State-of-the-art models are increasingly programs, with support for programming constructs like loops and recursion, and this brings out many interesting issues in the tools we use to create them – that is, programming languages.

While machine learning does not yet have a dedicated language, several efforts are effectively creating hidden new languages underneath a Python API (like TensorFlow) while others are reusing Python as a modelling language (like PyTorch). We’d like to ask – are new ML-tailored languages required, and if so, why? More importantly, what might the ideal ML language of the future look like?

Pig Latin, and Other Hidden Languages

TensorFlow (TF) and its ilk[2] are already programming languages, albeit limited ones. This may seem surprising given that one uses Python to program TF. However, consider that TF requires you to write Python code to build an expression tree in its internal language, which it then evaluates.

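To make this concrete, here is a minimal sketch using the TensorFlow 1.x graph API (tensor names chosen purely for illustration): the Python code below only builds an expression tree in TF's internal language, and nothing is computed until the graph is run in a session.

import tensorflow as tf

# Building the expression tree: no arithmetic happens here, we are only
# constructing nodes in TF's graph language.
x = tf.placeholder(tf.float32, name='x')
y = tf.placeholder(tf.float32, name='y')
z = x + y  # z is a graph node, not a number

# Evaluating the tree by running it in a session.
with tf.Session() as sess:
    print(sess.run(z, feed_dict={x: 1.0, y: 2.0}))  # 3.0
    print(sess.run(z, feed_dict={x: 4.0, y: 2.0}))  # 6.0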

In fact, you can program in “lazy” TensorFlow style in any language. Consider the following JavaScript code, which implements a trivial function (add) in this style:

function add(a,b) {
  return `${a}+${b}`;
}
x = 1; y = 2
z = add('x', 'y') // 'x+y'
eval(z) // 3
x = 4
eval(z) // 6

Here we are metaprogramming – writing code that writes code. In this case both the meta-language and the target language are the same (JavaScript) but they could just as well be different languages (as in the C preprocessor for the C language), or we can use a data structure (an AST) instead of a string – the principle is the same. In TensorFlow, Python serves as a meta-language for writing programs in TF’s graph-based language.[3] If you’re not convinced, consider that TensorFlow’s graph even supports constructs like variable scoping and control flow – but rather than using Python syntax, you manipulate these constructs through an API.

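For instance, here is a conditional expressed through the TensorFlow 1.x graph API rather than with Python's if statement, a rough sketch that shows how control flow itself becomes API calls that build graph nodes:

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[], name='x')

# A conditional built through an API call instead of Python's `if`:
# both branches are functions that build their own sub-graphs.
y = tf.cond(x > 0, lambda: x * 2.0, lambda: -x)

with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: 3.0}))   # 6.0
    print(sess.run(y, feed_dict={x: -3.0}))  # 3.0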

TensorFlow and similar tools present themselves as “just libraries”, but they are extremely unusual ones. Most libraries provide a simple set of functions and data structures, not an entirely new programming system and runtime. Why is such a complex approach necessary?

Why create a new language?

The core reason for building new languages is simple: ML research has extremely high computational demands, and simplifying the modelling language makes it easier to add domain-specific optimisations and features. Training models requires excellent hardware support, good numerics, low interpreter overhead and multiple kinds of parallelism. Where general-purpose languages like Python struggle to provide these features, TensorFlow can handle them seamlessly.

There’s a snag, though. These impressive optimisations rely on simplifying assumptions (ML models won’t be recursive, or need custom gradients, right?), which make it easier to apply optimisations or deploy to small devices. Unfortunately for engineers, model complexity has increased and researchers thoroughly enjoy violating these assumptions. Models now demand conditional branching (ok, easy enough to hack in), loops for recurrence (less easy but possible), and even recursion over trees (virtually impossible to deal with). In many areas of ML, including neural networks and probabilistic programming, models are becoming increasingly like programs, including ones that reason about other programs (e.g. program generators and interpreters), and with non-differentiable components like Monte Carlo Tree Search. It’s enormously challenging to build runtimes that provide complete flexibility while achieving top performance, but increasingly the most powerful models and groundbreaking results need both.

[Image] Using ML with complex tree-structured data, like the Stanford Sentiment Treebank, requires differentiable, recursive algorithms.

Another practical downside of this approach, at least in its current incarnations, is the need for meta-programming of the kind discussed above. Building and evaluating expression trees imposes significant additional burdens on both the programmer and the compiler. It becomes tricky to reason about because the code now has two execution times, each with different language semantics, and things like step-through debugging are much harder. This could be resolved by creating a syntactic language for the new runtime, but this means no less than creating a full new programming language. Is this worthwhile when we have popular numerical languages already?

Can we just use Python?

As ML models began to need the full power of a programming language, Chainer and others pioneered a “define-by-run” approach wherein a Python program is itself the model, using runtime automatic differentiation (AD) for derivatives. This is fantastic from a usability standpoint: If you want a recursive model that operates over trees, simply write that down, and let the AD do its magic! The difference in feel is hard to overstate, and a frictionless approach to playing with novel ideas is invaluable to research.

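As an illustration of the define-by-run style, here is a minimal PyTorch-flavoured sketch (the TreeRNN class and the tuple encoding of trees are invented just for this illustration): the recursion is ordinary Python, and AD differentiates whatever control flow actually ran.

import torch
import torch.nn as nn

class TreeRNN(nn.Module):
    """Combine leaf vectors bottom-up over an arbitrary binary tree."""
    def __init__(self, dim):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, tree):
        # A leaf is a tensor; an internal node is a (left, right) pair.
        if isinstance(tree, tuple):
            left = self.forward(tree[0])
            right = self.forward(tree[1])
            return torch.tanh(self.combine(torch.cat([left, right], dim=-1)))
        return tree

model = TreeRNN(4)
leaf = lambda: torch.randn(4, requires_grad=True)
out = model(((leaf(), leaf()), leaf()))  # any tree shape works
out.sum().backward()                     # AD follows the recursion that ran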

However, getting Python to scale to ML’s heavy computational demands is far harder than you might expect. A huge amount of work goes into replicating optimisations that fast languages get for free, and the PL boneyard is full of high-profile yet failed efforts to make Python faster. Python’s semantics also make it fundamentally difficult to provide model-level parallelism or compile models for small devices.

Efforts like Gluon for MXNet are finding ways to get the best of both, at least to some extent. The idea is to combine basic dynamic AD with code-tracing approaches that produce “static sub-graphs” that can be optimised. This is unfortunately something of a mashing together of disparate implementations and APIs. It’s also limited; MXNet uses its graph not just for kernel-level optimisations but also for high-level graph scheduling, such as splitting a model across multiple GPUs. It’s unclear how these hybrids will handle this, other than by adding another new API for graph containers whose nodes can be dynamic computations.

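Roughly, the Gluon style looks like the sketch below (API details as of late 2017, so treat the specifics as approximate): the network runs eagerly until hybridize() is called, after which its forward passes are traced into a static sub-graph that MXNet can optimise.

from mxnet import nd
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(64, activation='relu'))
net.add(nn.Dense(10))
net.initialize()

net.hybridize()  # subsequent calls are traced into an optimisable static sub-graph
out = net(nd.ones((1, 128)))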

What might a tailor-made ML language look like?

There are few domains as demanding about language-level design issues as machine learning. But it’s not unprecedented, and in areas like formal reasoning and verification or cluster computing, new, tailor-made languages have proved an effective solution. Similarly, we expect to see new or existing languages customised for the kind of numerical, differentiable, parallelisable and even probabilistic computations needed in ML.

An obvious current challenge for ML languages is achieving generality alongside performance, and the early hybrid approaches will need much more development. We expect that future ML runtimes will need to support arbitrary mixing of approaches (computational graphs that are static within dynamic within static …) and will need to get better at compiling dynamic code for deployment. Ideally, there will be only a single, flexible “graph format” (or AST). The AST should have a syntax and statically describe dynamic behaviour (e.g. with a written for loop) – in other words, it should look a lot more like a standard programming language.

Programmable semantics would open up new levels of flexibility, and could be provided by a feature similar to macros. This would allow features like multi-GPU training to be built on top of the core system, by specifying where the code should have pure dataflow semantics (as opposed to standard imperative semantics, which are more flexible but may include side-effects that are unsafe to optimise). It could also allow the kinds of program manipulation needed by probabilistic programming languages, or the vectorisation (batching) passes usually implemented by hand in NLP models.

As well as the PL community, ML engineers should pay close attention to the traditional Automatic Differentiation (AD) community. ML languages can take inspiration from pioneering work on languages designed for truly first-class derivative support. Such languages can easily mix symbolic with runtime techniques (helping with the tradeoffs mentioned above), mix forward and reverse mode AD (for improved performance and memory usage), and differentiate GPU kernels – all with no loss in performance.

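To give a flavour of what first-class derivative support builds on, here is a self-contained sketch of forward-mode AD using dual numbers, in plain Python with no framework: every value carries its derivative alongside it, and the ordinary arithmetic rules propagate both.

class Dual:
    """A value together with its derivative (forward-mode AD)."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    return f(Dual(x, 1.0)).eps

print(derivative(lambda x: 3 * x * x + 2 * x, 5.0))  # 32.0, i.e. 6*5 + 2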

ML research will increasingly need more powerful type systems, user-defined types and more means for extension. Gone are the days when it was enough to hard-code support for strided arrays on NVIDIA GPUs; cutting-edge techniques like sparse machine learning, new hardware like TPUs, Nervana and FPGAs, and diverse deployment targets like ARM chips or the iPhone’s CoreML chip all call for greater levels of flexibility. Large-scale refactoring of core C++ code for each new development will not scale.

Consider a world where adding new hardware support – or new kinds of data representations – could easily be accomplished by a user in high-level code, without changes to the original system. Here we expect ML systems to take inspiration from existing numerical computing languages, which can already handle these tasks with ease.

Type systems can also have safety benefits, but current ones are not suited to array-heavy code where array dimensions are meaningful (for example, spatial vs channel vs batch dimensions in images). These distinctions are left to pure convention, and hairy dimension-permuting code has no protection from mistakes, leaving much room for more array-aware type systems. We expect the trend towards dynamic typing to continue,[4] mainly due to practitioners’ preference for interactivity and scripting, but hope to see further innovations like CNTK’s optionally dynamic dimensions.

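As a small illustration of the problem (using NumPy, with shapes picked arbitrarily): nothing in an array's type says which axis is batch, channel, height or width, so a wrong permutation runs without complaint.

import numpy as np

x = np.zeros((32, 3, 224, 224))   # batch, channel, height, width (NCHW)

nhwc = x.transpose(0, 2, 3, 1)    # intended: convert to NHWC
oops = x.transpose(0, 3, 2, 1)    # wrong permutation, but it also "works"

print(nhwc.shape)  # (32, 224, 224, 3)
print(oops.shape)  # (32, 224, 224, 3): same shape, silently wrong layout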

ML engineers are increasingly interested in traditional software engineering problems like the maintenance and extension of production systems. The ML programming model makes it harder to create abstraction barriers and interfaces between components, and re-training of a model can easily break backwards compatibility. ML languages will likely be able to incorporate solutions to these problems just as regular languages do, but this remains an open design problem.

[Image] Software Engineering 2.0? (via XKCD)

A downside to any new language is that it requires a new library ecosystem, as only code written for the new runtime benefits from it. For example, rather than reusing the Python ecosystem, the TensorFlow developers need to rewrite libraries for tasks like image processing and file IO in the graph language, throwing out the vast effort behind projects like SciPy. This may well be the only way forward, but ML practitioners should not split from the wider numerical and HPC community. An ideal ML ecosystem is an ideal numerical one, and vice versa, and collaboration between these communities will multiply everyone’s efforts.

We expect to see these developments coming from several angles. Graph IRs and formats like XLA, ONNX and NNVM are becoming ever more sophisticated and will likely take more inspiration from traditional language design,[5] perhaps even adding surface syntax to become fully-fledged programming languages. TensorFlow’s XLA has started a push towards special-purpose compiler stacks that now includes TVM, DLVM, myelin, and other ongoing work. Meanwhile, projects like the PyTorch JIT, Gluon and Tangent are efforts to make Python itself a better modelling language, in spite of the significant challenges. Having just argued that ML is a numerical programming languages problem, we in the Julia community feel that it is an excellent substrate for experimenting with these kinds of language-level issues, and will continue to push the boundaries with projects like Knet, Flux, Cassette, CUDAnative, DataFlow.jl, and more.

Conclusion: An Inference about Machine Learning

Machine learning models have become extremely general information-processing systems that build ever higher-level and more complex abstractions; recurrence, recursion, higher-order models, even stack machines and language interpreters, all implemented as compositions of basic components. ML is a new programming paradigm, albeit a strange one that’s heavily numerical, differentiable and parallel. And as in any engineering field, the tooling available will have a profound impact on the scope and quality of future work.

All this suggests that designers of ML systems have a momentous challenge ahead of them. But while that’s true, there’s some good news: The very same problems have been deeply explored, if not already solved, by language researchers over the last few decades! To really take this new field to its full potential, the machine learning and programming languages communities will have to combine forces, and the real challenge is integrating the disparate expertise of these two groups.

Can we build systems that treat numerics, derivatives and parallelism as first-class features, without sacrificing traditional programming ideas and wisdom? This is the foundational question which languages over the coming decade will have to answer.

[1] After Philip Greenspun.

[2] We use TensorFlow as an example, but could substitute other “define-before-run” frameworks like CNTK or MXNet.

[3] TensorFlow’s graph is effectively a dataflow-based AST (Abstract Syntax Tree).

[4] Though we note that, internally, current systems span the gamut from fully dynamic (PyTorch and its ATen backend) to unusually static (TensorFlow’s XLA and MXNet, where all dimensions are known before the graph is run).

[5] Google Brain’s increasing hiring of programming languages experts, such as Chris Lattner, is an interesting development on this point.