Back in November 2016, DigitalOcean released go-libvirt, an open source project containing a pure Go interface to libvirt. Using go-libvirt, developers could manage virtual machines leveraging all the power of libvirt’s extensive API without leaving the comfortable environment of Go. But there was a catch.
While the libvirt library has close to 400 API calls, initial versions of go-libvirt implemented only a handful of those calls. But go-libvirt is open source, so you can just add your own implementations for the routines you need, right?
Well, yes you could. But go-libvirt talks to libvirt by exchanging XDR-encoded buffers using an RPC mechanism based on the venerable ONC RPC (or Sun RPC), so you would first have to familiarize yourself with those RPCs. Then, you would have to locate the argument and return value structures in the libvirt protocol definition file, and write code to marshal and unmarshal them on send and receive. By that time you might be asking yourself, “Why don’t I just give up and use CGO?” But hang on. Tedious, repetitive work; that sounds like what we invented computers for. Maybe they can help?
This is the tale of how we used code generation to extend go-libvirt to cover every one of the libvirt API calls, and how we made it more resilient to future changes in the libvirt API.
Sun RPC and the Missing Toolchain
If you’re working with Sun RPC in C, you write a protocol file describing the messages you want to exchange and feed it to a utility called “rpcgen”. The output of rpcgen includes header files and stubs for both client and server. The stubs contain generated code to marshal and unmarshal the message bodies for each of the messages. This is exactly how libvirt works: the protocol files are right there in the libvirt source repo (look for source files ending in .x), and during the build they get processed by rpcgen into .c and .h files.
If rpcgen could output Go code we’d be all set, but it doesn’t. Sun RPC isn’t a popular option for native Go programs, and although there are libraries for handling its on-the-wire data representation—XDR—there aren’t any libraries around for parsing its protocol files into Go.
Time to roll up our sleeves!
Learning the Language
Sun RPC protocol files look a lot like a collection of C declarations. We could throw a parser together with regexes and custom code, but when the source files start getting complex that path often ends in tears. The protocol files we need to parse definitely meet the complexity threshold: like C, data types can be nested inside other data types, and this is exactly the kind of thing that regexes are ill-equipped to handle. To be reliable we’ll want a real stateful parser. We could write one, but there’s a better way.
Parser generators have been around since the 1970s, and Go includes a port of one of the oldest, yacc, in golang.org/x/tools/cmd/goyacc. Using goyacc to generate our parser means we don’t have to write the state machine that makes up the bulk of the parser by hand (and yes, it also means our code generator is itself generated). With a generated parser we’re left with three pieces of code to write: the language grammar, which is consumed by the parser generator to build the parser state machine, the actions, which run when the parser identifies a bit of grammar, and the lexer.
The grammar definition lives in its own file, sunrpc.y, and goyacc uses the same syntax for the contents of this file as yacc did before it. Luckily, some of the documentation for Sun RPC includes grammar definitions in exactly the format goyacc expects, and we used that as a starting point for writing the grammar.
The actions are simply Go code mixed in with the grammar. When the parser identifies an element of the grammar, it will execute any actions defined at that point in the grammar file. In our case, the actions build an internal representation of the protocol file that we’ll use later to output our generated Go code.
Alexa, Where’s My Lexer?
That leaves the lexer, also called a tokenizer. The lexer is called by the parser, and each time it’s called it returns the next token in the input stream, where a token is a unit of the grammar. For our grammar, if the input stream looks like this:
const REMOTE_STRING_MAX = 4194304;
The lexer will return the token CONST, then IDENTIFIER, =, CONSTANT, and ;. That matches one of the valid forms of const_definition from our grammar file (const_ident is elsewhere defined as IDENTIFIER):
const_definition
    : CONST const_ident '=' CONSTANT ';'
        { AddConst($2.val, $4.val) }
    ;
The Go code inside the braces after the grammar rule is the action the parser will execute when it sees this sequence of tokens. So the parser will call AddConst(), passing in the value of the second and fourth tokens, in this case the const_ident and the CONSTANT. The resulting call will be AddConst("REMOTE_STRING_MAX", "4194304"), because in our grammar the value of any token is the original string.
If you’re familiar with yacc, at this point you might be wondering, “Where’s the Go port of lex? Is there a golex?” The answer is no; lex isn’t part of the standard library. (To get an idea of why this might be so, and an excellent introduction to lexers in general, you might want to see this talk by Rob Pike, from back in the early days of Go, in 2011.)
So instead we have a handwritten lexer, lvlexer.go. It’s pretty straightforward, about 330 lines long, and uses no regular expressions. To work with the parser, the lexer has to satisfy an interface consisting of two functions: Lex() and Error().
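A toy version of such a lexer, stripped down to just the const-definition tokens, shows the shape of that interface. Everything here is a simplified sketch, not code from lvlexer.go; in the real project the token constants and the yySymType struct are generated by goyacc from the grammar:

```go
package main

import (
	"fmt"
	"unicode"
)

// Token kinds. In the real parser these constants come out of goyacc.
const (
	EOF        = 0
	CONST      = iota + 256 // 257
	IDENTIFIER              // 258
	CONSTANT                // 259
)

// yySymType carries a token's value back to the parser; goyacc
// generates the real one from the grammar's %union declaration.
type yySymType struct{ val string }

// Lexer satisfies the two-method interface the generated parser needs.
type Lexer struct {
	input string
	pos   int
}

func (l *Lexer) Error(s string) { fmt.Println("parse error:", s) }

// Lex returns the next token's kind and fills in its string value.
func (l *Lexer) Lex(lval *yySymType) int {
	for l.pos < len(l.input) && unicode.IsSpace(rune(l.input[l.pos])) {
		l.pos++
	}
	if l.pos >= len(l.input) {
		return EOF
	}
	c := l.input[l.pos]
	switch {
	case unicode.IsLetter(rune(c)):
		start := l.pos
		for l.pos < len(l.input) && (unicode.IsLetter(rune(l.input[l.pos])) ||
			unicode.IsDigit(rune(l.input[l.pos])) || l.input[l.pos] == '_') {
			l.pos++
		}
		lval.val = l.input[start:l.pos]
		if lval.val == "const" {
			return CONST
		}
		return IDENTIFIER
	case unicode.IsDigit(rune(c)):
		start := l.pos
		for l.pos < len(l.input) && unicode.IsDigit(rune(l.input[l.pos])) {
			l.pos++
		}
		lval.val = l.input[start:l.pos]
		return CONSTANT
	default:
		// Single-character tokens ('=', ';', ...) are returned
		// as their character codes, just as yacc expects.
		l.pos++
		lval.val = string(c)
		return int(c)
	}
}

func main() {
	lex := &Lexer{input: "const REMOTE_STRING_MAX = 4194304;"}
	var lval yySymType
	for tok := lex.Lex(&lval); tok != EOF; tok = lex.Lex(&lval) {
		fmt.Println(tok, lval.val)
	}
}
```

Fed the const definition from earlier, this returns CONST, IDENTIFIER, '=', CONSTANT, and ';' in turn, which is exactly the sequence the parser matches against the grammar.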
Generating the Output
The actions, as well as the code that drives the parser, are found in generate.go, which gets compiled together with the lexer and the parser into a standalone binary. The generator calls the parser, and once the parser has finished its work, the generator is left holding an internal representation of the protocol file. All that remains is to tie everything together and output some Go code.
Up until now we’ve been talking about libvirt and Sun RPC, because libvirt is using many of the pieces that make up Sun RPC. But if you look at remote_protocol.x in the libvirt sources, you’ll notice something surprising: the procedure definitions, which would describe the argument and return types for each RPC procedure, are missing. There is an enum containing procedure numbers, but nothing that resembles a function prototype.
This is where libvirt departs from Sun RPC. Rather than use rpcgen to build the procedure stubs for client and server, the libvirt developers implemented their own method for calling remote routines (have a look at libvirt’s callFull() in remote_driver.c if you’re curious).
So instead of a procedure definition in the protocol file, the procedure, its arguments, and its return values are associated by name. All arguments and return values in libvirt are structures. We can start from the remote_procedure enum in the protocol file. The procedure REMOTE_NODE_ALLOC_PAGES, for example, has procedure number 347. To find its arguments structure we convert the name to lowercase and append _args; for the return structure we append _ret. We can apply this pattern to every procedure in the protocol file. If a procedure takes no arguments or returns no values, the corresponding struct is simply absent.
This gives us enough information to generate the Go client functions for each procedure. We’ll drop the remote_ prefix, since it’s common to every procedure, and we’ll convert the names to camel case so they look natural in Go. For REMOTE_NODE_ALLOC_PAGES, that means our generated Go routine would look like this:
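The two name derivations are mechanical enough to sketch in a few lines. This is an illustrative reimplementation, not the generator's actual code (which has to handle more naming corner cases than this):

```go
package main

import (
	"fmt"
	"strings"
)

// argsStructName derives the name of a procedure's argument struct
// using libvirt's convention: the procedure name lowercased, plus "_args".
func argsStructName(proc string) string {
	return strings.ToLower(proc) + "_args"
}

// goFuncName derives the generated Go function name: drop the
// common REMOTE_ prefix and convert SNAKE_CASE to CamelCase.
func goFuncName(proc string) string {
	name := strings.TrimPrefix(proc, "REMOTE_")
	parts := strings.Split(strings.ToLower(name), "_")
	for i, p := range parts {
		parts[i] = strings.Title(p)
	}
	return strings.Join(parts, "")
}

func main() {
	fmt.Println(argsStructName("REMOTE_NODE_ALLOC_PAGES"))
	fmt.Println(goFuncName("REMOTE_NODE_ALLOC_PAGES"))
}
```

For REMOTE_NODE_ALLOC_PAGES this yields remote_node_alloc_pages_args for the argument struct and NodeAllocPages for the Go function.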
func (l *Libvirt) NodeAllocPages(args NodeAllocPagesArgs) (ret NodeAllocPagesRet, err error)
That’s not a bad start, but it forces the caller to construct the arguments structure and decode the return structure. Putting all the arguments into a structure makes the function awkward to use, and doesn’t match the libvirt documentation for virNodeAllocPages. We can do better.
API, Deconstructed
Because we used a real parser to process the protocol definition, our generator has full type information available. In fact for every struct definition in the protocol, we have our own Go struct containing all the type information for the struct’s elements. So the generator can replace the struct in the arguments for our generated function with a list of its contents, simply by doing this:
Gen.Procs[i].Args = Gen.Structs[j].Members
Gen.Procs is an array of generated procedures, and Args is an array of arguments. With that statement we set the arguments for a generated function to the array of members in the corresponding arguments structure. We do the same thing for the return values, and then our generated NodeAllocPages looks like this:
func (l *Libvirt) NodeAllocPages(PageSizes []uint32, PageCounts []uint64, StartCell int32, CellCount uint32, Flags uint32) (rRet int32, err error)
This is very close to the libvirt documentation; the Go version just omits two size arguments since slices carry size information with them.
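The flattening step itself is a single assignment once the types are in place. Here is a simplified sketch of the idea; the type names and the struct members shown are assumptions for illustration, not the generator's actual internal types:

```go
package main

import "fmt"

// Decl is one typed name; Structure and Proc are stand-ins for
// the generator's internal representation of parsed structs and
// remote procedures.
type Decl struct{ Name, Type string }

type Structure struct {
	Name    string
	Members []Decl
}

type Proc struct {
	Name string
	Args []Decl
}

func main() {
	// The parsed _args struct for REMOTE_NODE_ALLOC_PAGES
	// (member types assumed for the example).
	args := Structure{
		Name: "remote_node_alloc_pages_args",
		Members: []Decl{
			{"PageSizes", "[]uint32"},
			{"PageCounts", "[]uint64"},
			{"StartCell", "int32"},
			{"CellCount", "uint32"},
			{"Flags", "uint32"},
		},
	}

	proc := Proc{Name: "NodeAllocPages"}
	// The key step: replace the single struct argument with
	// the struct's own member list.
	proc.Args = args.Members

	for _, a := range proc.Args {
		fmt.Println(a.Name, a.Type)
	}
}
```

Because the parser preserved full type information for every struct member, this one assignment is all it takes to turn a struct-typed argument into a natural Go parameter list.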
Templates
The last step in our code generator is to iterate through all the type information we’ve collected and build the actual Go files. We use Go’s text/template library with a couple of fairly simple template files to do this. The generated procedures are output to libvirt.gen.go, and some constants to a separate file, internal/constants/constants.gen.go. We even generate comments so that godoc has something to work with.
Go, Generate
Code generation is exactly what the go generate command was created for. If you’ve never explored this part of the Go toolchain before, there’s an excellent introduction on the Go blog. The generate command is intended to be run separately from go build, and much less often; typically the generated code for a project only needs to be re-created when an external dependency changes. To rebuild the generated files for a particular version of libvirt, you simply set the environment variable LIBVIRT_SOURCE to the path of the libvirt sources and run go generate ./... from the go-libvirt directory. That command descends through the go-libvirt sources executing the generate instructions embedded in the source files.
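A generate instruction is just a specially formatted comment in a Go source file. The directive below is a hypothetical example of the pattern (the generator path and protocol file location are illustrative, not go-libvirt's actual directive); go generate expands $LIBVIRT_SOURCE from the environment before running the command:

```go
package libvirt

// Running `go generate` executes the command that follows the
// go:generate marker, here invoking a (hypothetical) generator
// binary against libvirt's protocol definition file.
//go:generate go run ./internal/lvgen $LIBVIRT_SOURCE/src/remote/remote_protocol.x
```

Because the directive lives next to the code it produces, anyone checking out the repo can regenerate everything with a single command.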