It's Not Just Ruby
A few weeks ago, I needed to parse Jasmine's jasmine.yml in some C# code. I spent some time looking at existing YAML parsers for .NET and decided that spending a couple of hours writing a lightweight, purpose-specific parser for jasmine.yml made more sense for my use case than including an off-the-shelf YAML parser, since every one I looked at turned out to be quite a heavyweight project.
Having made this choice, I then spent a little time reading the YAML specification. So when, a week or so later, the first of what would become a series of malicious-YAML attacks on Rails began to hit the news, I started paying attention. A couple of weeks after that, yet another YAML-based security flaw was corrected in Rails, and then rubygems.org was compromised in still another malicious-YAML attack, separate from the Rails bugs. That one risked compromising any machine that runs gem install on a regular basis, although it doesn't seem that this has actually happened so far.
Because all of these attacks landed within the Ruby community, observers have occasionally characterized this as a crisis for Rails or Ruby. I think that's (at least a little) misguided. The real focus of our attention should be YAML.
Update: Aaron Patterson, one of the maintainers of Ruby's Psych parser, has an excellent discussion of the Ruby-specific aspects of this issue.
It is very easy to demonstrate that the same vulnerabilities exist on other platforms. Here's a two-line "attack" on Clojure's YAML parser from Justin Leitgeb. I have to use the term "attack" loosely here, because all he is doing is deserializing an arbitrary class, which the YAML spec allows. But, in most environments, deserializing an arbitrary class is tantamount to code execution. The PyYAML library for Python has nearly the same vulnerability (though there is a workaround for it). I think that YAML-based attacks have tended to target Ruby projects simply because use of YAML is quite a bit more common amongst Rubyists, and because certain prominent Ruby libraries use YAML internally. Users of these libraries may have no idea that the JSON input they supply, for example, might end up in front of an internal YAML parser.
User-Controlled, Arbitrary Object Instantiation is Remote Code Execution
The introduction to the YAML spec states,
YAML leverages these primitives, and adds a simple typing system and aliasing mechanism to form a complete language for serializing any native data structure.
That is, in practice, remote code execution. If an external actor, such as the author of a YAML document, can cause your application to instantiate an arbitrary type, then they can probably execute code on your server.
This is easier in some environments than others. If you happen to be running within a framework where one of the indispensable types calls eval on a property assignment, then it is very, very easy indeed.
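To make that concrete, here is a minimal Python sketch. Everything in it is hypothetical: Template stands in for a framework type that evaluates code when a property is assigned, and naive_load stands in for a deserializer that instantiates whatever type the document names. It is not the API of any real YAML library, and a harmless expression stands in for a real payload.

```python
# Hypothetical types and loader, for illustration only; not a real YAML library.

class Template:
    """Stand-in for a framework type that evaluates code on property assignment."""

    def __init__(self):
        self._source = None
        self.result = None

    @property
    def source(self):
        return self._source

    @source.setter
    def source(self, value):
        self._source = value
        # The dangerous part: assigning the property evaluates the string.
        self.result = eval(value)


# Every type the application happens to have loaded is reachable by name.
KNOWN_TYPES = {"Template": Template}


def naive_load(document):
    """Instantiate whatever type the document names, then assign its fields."""
    cls = KNOWN_TYPES[document["tag"]]
    obj = cls()
    for name, value in document["fields"].items():
        setattr(obj, name, value)
    return obj


# The "document" is attacker-controlled data, already parsed into a dict here.
# Asking for the process id is harmless; an attacker would ask for more.
untrusted = {"tag": "Template", "fields": {"source": "__import__('os').getpid()"}}
loaded = naive_load(untrusted)
print(loaded.result)
```

The loader never calls eval itself. It only constructs an object and sets a field, which is exactly what a type-instantiating deserializer is supposed to do.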
On the other hand, even in environments which make a very careful distinction between data type construction and code execution, one can imagine vulnerable code. Consider the following Haskell data type:
data LaunchCodes = LaunchCodes (Int, Int)
…and some code, elsewhere in the application:
case input of
  LaunchCodes (targetId, presidentialPassword) -> launchMissiles( --…
Contrived, sure. But you may have more innocuously-named types which you don't plan for random users to spin up. Haskell, indeed, makes such attacks harder, but it's not a free pass.
It's difficult to overstate the danger of remote code execution. If someone can execute code on your server, they can probably own your data center.
The YAML spec is largely silent on the issue of security. The word "security" does not appear in the document at all, and malicious user input isn't discussed, as far as I can see.
Types in the YAML Spec
The encoding of arbitrary types is discussed in the last section of the YAML spec, "Recommended Schemas." It specifies "tag resolution," which is, in practice, the mapping of YAML content to instantiated types during deserialization. This section defines four schemas which a compliant parser should understand. The first three, "Failsafe," "JSON," and "Core," define tag resolutions for common types like strings, numbers, lists, and maps, and don't appear dangerous to me.
However, the last section of "Recommended Schemas" is a catchall called "Other Schemas." It notes,
None of the above recommended schemas preclude the use of arbitrary explicit tags. Hence YAML processors for a particular programming language typically provide some form of local tags that map directly to the language’s native data structures (e.g.,
While such local tags are useful for ad-hoc applications, they do not suffice for stable, interoperable cross-application or cross-platform data exchange.
In practice, most YAML deserialization code will, by default, attempt to instantiate any type specified in the YAML file.
Some YAML parsers have a "safe" mode where they will only deserialize types specified in the tag resolution. For example, PyYAML has safe_load. Its documentation notes,
Warning: It is not safe to call yaml.load with any data received from an untrusted source! yaml.load is as powerful as pickle.load and so may call any Python function. Check the yaml.safe_load function though.
(emphasis in original)
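The comparison to pickle can be tried with the Python standard library alone. The sketch below is an analogy, not YAML: it shows pickle.loads calling a function named in the data, and then an allow-list loader built on pickle.Unpickler.find_class that refuses to do so. The names Payload, AllowListUnpickler, and safe_loads are made up for this example, and os.getpid is a harmless stand-in for whatever an attacker would actually call.

```python
import io
import os
import pickle


class Payload:
    """An object whose pickled form asks the loader to call a function."""

    def __reduce__(self):
        # On load, pickle calls os.getpid() and uses the return value
        # as the "deserialized object".
        return (os.getpid, ())


blob = pickle.dumps(Payload())

# Plain loading calls the function named in the data.
unsafe_result = pickle.loads(blob)


class AllowListUnpickler(pickle.Unpickler):
    """Resolves only the globals that appear on an explicit allow-list."""

    ALLOWED = {("builtins", "set"), ("builtins", "frozenset")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"{module}.{name} is not allowed")


def safe_loads(data):
    """Load pickled bytes, rejecting any type or function not on the allow-list."""
    return AllowListUnpickler(io.BytesIO(data)).load()


# The same payload is rejected before anything is called.
try:
    safe_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# Plain data (strings, numbers, lists, maps) still loads normally.
plain = safe_loads(pickle.dumps({"name": "jasmine", "files": [1, 2, 3]}))
```

The allow-list plays the same role as a parser's "safe" mode: common data types pass through, and anything that names an arbitrary type or function is refused rather than instantiated.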
Notably, however, Ruby's Psych parser has no such method at this time.
So Should We All Dump YAML?
Some Rubyists have questioned why YAML is so pervasive in the Ruby community when other formats, like JSON or Ruby itself (à la Rake), are perfectly usable in most cases. It's a good question to ask, especially in cases where YAML parsing has not been explicitly requested.
On the other hand, it's not hard to imagine cases where allowing arbitrary object instantiation makes sense, such as in an application configuration file. An example in the .NET space would be XAML files. If you are defining a form or a workflow, then you want to be able to allow the instantiation of custom controls. There is no standard way to do this with, say, JSON files, so using a format like YAML makes sense here. (So does using a non-domain-specific language like Ruby, but that presumes that Ruby is available and might not be suitable for cross-language work.)
For the most part, you never want to accept YAML from the outside world. The risks are too high, and the benefits of the YAML format are largely not relevant here. Note, for example, that while gem files include manifests in YAML format, the jQuery plugin repository does essentially the same thing with JSON documents.
Why Don't YAML Parsers with a "Safe" Mode Use It By Default?