Mar/081
The LolSharp Compiler part 1
Last Things First
In this part, we’re going to build a very simple compiler to turn a subset of the LolCode language into a .NET executable. I’ve decided to take a depth first approach, as opposed to a breadth first approach, let me explain. I could starting by writing the whole parser and then building each part of the back-end one at a time. While that’s certainly one way of doing it, I’m not that patient. In order to have something shiny to play with faster, the approach will be to build a working-but-limited program from end-to-end. Then, in subsequent parts, we’ll add new features and look at what that means for each part of the compiler. By the end of this part, the compiler will be able to handle programs like this:
HAI I CAN HAS STDIO? VISIBLE "HAI WORLD!" KTHXBYE
Exciting, I know. Feel free to download the end product if you want to play around with it first:
- Subersion – http://svn.benhoffman.net/lolsharp/stage1 (I like TortoiseSVN)
- Zip -
First Things Second
Ok, now to get the development environment set up. I’m using Visual C# 2008 Express Edition, available as a free download. If you want to use a different environment it should still be pretty straightforward. You’re going to have to download a few things from the ANTLR downloads page:
- ANTLR build tools for .NET
- Download ‘Binary of ANTLR tool’ for .NET/Mono under ANTLR Tool Binaries
- Download ‘Binary of ANTLR tool’ for .NET/Mono under ANTLR Tool Binaries
- ANTLR runtime for .NET (from same page as above)
- Download the ANTLR 3.0.1 source distribution
- The runtime is under
runtime\CSharp\bin\net-2.0\after you unpack it
- ANTLRWorks (optional, but highly recommended; requires Java)
Startup Visual Studio and create a new C# Console Application project, I called mine LolSharp. I do a File > Save All to ensure that the directory structure gets created. Under the project directory, create two subdirectories named tools and lib. Under tools I created another directory called antlr and put the ANTLR build tool in there. It should contain the following files:
- Antlr3.Tool.exe
- IKVM.GNU.Classpath.dll
- IKVM.Runtime.dll
Under the lib directory, I created antlr-csharp-runtime. From the ANTLR source distribution (see above) it contains:
- antlr.runtime.dll
- Antlr3.Runtime.dll
- Antlr3.Utility.dll
- StringTemplate.dll
Ok, groovy. You can go ahead and delete the Program.cs file, or rename it to LolCompiler.cs. Next, we add references to all of the files under lib/antlr-csharp-runtime by right-clicking the References folder, selecting Add Reference... and choosing the Browse tab. Now select Project > LolSharp Properties... and choose the Build Events tab on the side. For the Pre-build event command line enter:
cmd && cd ..\.. && .\tools\antlr\Antlr3.Tool LolSharp.g LolSharpGen.g && exit
This will run the ANTLR code generator on the files LolSharp.g and LolSharpGen.g before compiling the project. This ensures that the generated code associated with the grammars we’ll be writing is always up to date. There is probably a better way to integrate this into VS, but I’m not sure how to do it. If you have any suggestions, leave a comment and let me know. If you want to build the test application executable from Test.lol automatically, put the following in the Post-build event command line:
LolSharp Test.lol
Next, switch to the Debug tab on the left. When we test our compiler, we want it to run our default Test.lol program. Set the Command line arguments as:
Test.lol --interpret
Also, set the Working directory to ..\..\ then go ahead and close the project properties.
I Can Has Gramr?
Now it’s time to get to the fun stuff, writing the ANTLR grammars to parse our language. If you’ve played around with this sort of thing before you might find my using grammars in the plural a bit odd. Basically, the first grammar (LolSharp.g) takes the input and creates a parse tree from it. The second grammar (LolSharpGen.g) is what ANTLR calls a tree grammar; it walks over a tree (provided in this case by the first grammar) and either transforms it or produces some output. We’ll get into the details of the tree grammar soon. First though, let’s pick apart LolSharp.g. You’re going to want to add this file to the C# project.
grammar LolSharp;
options
{
language=CSharp;
output=AST;
ASTLabelType=CommonTree;
}
tokens
{
PROGRAM;
USING;
PRINT;
STRINGLITERAL;
USINGDECLS;
}
program : 'HAI' '\n' (using_decl '\n')* (statement '\n')+ 'KTHXBYE' '\n'? -> ^(PROGRAM ^(USINGDECLS using_decl*) statement+)
;
statement
: print_statement
;
using_decl
: 'I CAN HAS' ID '?' -> ^(USING ID)
;
print_statement
: 'VISIBLE' literal_string -> ^(PRINT literal_string)
;
literal_string
: RAW_QUOTED_STRING -> ^(STRINGLITERAL RAW_QUOTED_STRING)
;
ID : LETTER (LETTER | DIGIT)*
;
LETTER
: 'a'..'z' | 'A'..'Z'
;
DIGIT : '0'..'9'
;
WS : (' ' | '\t' | '\n' | '\r') { $channel=HIDDEN; }
;
RAW_QUOTED_STRING
: '"' ( ('\\' ESCAPE_CODE) | ~( '\\' | '\n' | '"'))* '"'
;
ESCAPE_CODE
: 'n'
| 't'
| '\\'
| '"'
| DIGIT DIGIT DIGIT DIGIT
;
The file begins with the declaration grammar LolSharp; which simply identifies the type of grammar and what it is called. Following a Java convention, the name of the file must match the name of the grammar plus the .g extension. Next up is the options section:
options
{
language=CSharp;
output=AST;
ASTLabelType=CommonTree;
}
The language property sets the output language for the generated parser, in this case C#. The default is Java and a number of languages, including Python and C, are supported. The line output=AST tells the ANTLR tool that our parser will generate an abstract syntax tree. The third property tells the generator that the type for a node in our AST is CommonTree, a default-yet-useful type provided by the ANTLR runtime. Next up we declare some tokens for ‘virtual’ nodes of our AST. These tokens never actually appear in the input stream, but will get consumed by the tree grammar we create next. I’ll explain these when we talk about constructing the output tree below. After that, it’s time for the productions. A production starts with a rule name, for example program, followed by a colon. After that, each alternative for the rule is given, seperated by a pipe character: |. The alternatives for a rule are written in Extended Backus–Naur form (EBNF), which sounds complicated but it’s pretty simple. Basically, you specify a sequence of other rules and you can use ?, *, or + after each item to indicate optional, zero-or-more, or one-or-more cardinalities. Unlike most other parser generators, ANTLR includes the specification for lexical analysis right in with that for the parser. So, take a look at the production for program:
program : 'HAI' '\n' (using_decl '\n')* (statement '\n')+ 'KTHXBYE' '\n'? ;
The name of the rule is program, and there’s only one way to match it. First, you need the string 'HAI' and then the string '\n', which represents a newline character. We use newlines to delimit statements because LOLCODE doesn’t have an explicit delimiter like the semi-colon in C-style languages. This is one of the interesting things about parsing LOLCODE and it’s actually more complicated than this, but this simple rule will do for now. Next is the (using_decl '\n')* part. This means that we need to match a using_decl, which is another rule, followed by a new-line. The star means we can match this sequence zero or more times. This corresponds to the I CAN HAS x? declarations in our program, which are like using declarations in C#. We then require statement productions separated by newlines. The plus indicates that we have to include at least one statement, so our compiler will reject a program that doesn’t do anything. Finally, we look for the string 'KTHXBYE' followed by an optional newline, signified by the question mark.
The rest of the productions should be simple enough to figure out and there are plenty of references for EBNF on the internet. I cut out a key part of the production above though:
-> ^(PROGRAM ^(USINGDECLS using_decl*) statement+)
This is a tree rewrite rule. You can perform actions when matching a rule by including code between curly braces (see the WS rule for an example). However, building an AST is such a common task that there’s some special syntax for it. The right-arrow operator (->) indicates that we’re using a rewrite rule. The caret (^) is called the ’subtree root operator’. Yeah, that’s a mouthful, but all it does is make a tree out of the items in the list it prefixes by taking the first item and making it the root of the tree. So in the rule above, you’d end up with a tree that had a PROGRAM node at the root with a USINGDECLS child and a child for each statement. The USINGDECLS node also has a child for each using_decl rule matched by the program rule. If you’ll notice, PROGRAM and USINGDECLS aren’t rules in our grammar, but they are defined in the tokens section near the top of the file. We create these nodes to more clearly express the AST and to make it simpler to use when we actually generate output.
If you’re having some trouble wrapping your head around all of this: open up LolSharp.g in ANTLRWorks, comment out the language=CSharp; line and run it in ANTLRWorks’ debugger with some code like the example program near the top of this page. You can step through the parsing of the text and see how it generates the final AST.
So Wait, What’s the Point Again?
So now that we have this spiffy AST, what do we do with it? First, let’s define what the goals are. A proper compiler directly emits bytecode as the final product. Getting to that point with this project seems like overkill, however emitting assembly code and running it through an assembler (ilasm, in our case) seems like a reasonable goal. However, at this stage of the game we’re just going to emit C# and use Microsoft’s CodeDom API to turn it into an executable. Don’t worry, we’ll do it in assembly eventually. Promise. So, taking an input file like:
Listing – Test.lol
HAI I CAN HAS STDIO? VISIBLE "HAI WORLD!" KTHXBYE
We’d want to output something like:
Listing – Test.lol.cs
//using STDIO; - NYI
public class LolSharpProgram
{
public static void Main(string[] args)
{
System.Console.WriteLine("HAI WORLD!");
}
}
Enter: tree grammars and StringTemplate. ST is a library that integrates with ANTLR (they are written by the same guy) which allows you to easily create templates and populate them with data. Create a file called CSharpTarget.stg and add it to the project. Make sure Copy to Output Directory is set to Copy always in the file’s VS properties pane. This file will hold our StringTemplate definitions that we used to generate our C# code. The cool thing is that we could modify this file slightly and target a new language. So, without actually modifying our parser or code generator, we could emit code in Java, Python, Perl, or whatever.
Listing – CSharpTarget.stg
group CSharpTarget;
using(id) ::= <<
//using <id>; - NYI<\n>
>>
echo_string(str) ::= <<
<str>
>>
print(txt) ::= <<
System.Console.WriteLine(<txt>);<\n>
>>
program(import, statement) ::= <<
<import>
public class LolSharpProgram
{
public static void Main(string[] args)
{
<statement>
}
}
>>
The first line works just like the grammar line in LolCode.g. Each template has a few parts. On the left side of the ::= operator you define the name of the template and any arguments it takes. On the right side you specify the output the rule generates; that’s everything between the << and >> symbols. You could also just specify a quoted string, but for larger templates this way works better and I like to be consistent. If you look at the definition for the print template you see two replacement points: <txt> and <\n>. The first subsitutes the value of the txt argument and the latter inserts an extra newline (whitespace at the beginning and end of a template is trimmed normally). If you look, you’ll see that we can build up everything in Test.lol.cs with these templates. That’s what we’re going to do in the tree grammar. Add LolSharpGen.g to your project.
Listing – LolSharpGen.g
tree grammar LolSharpGen;
options
{
language=CSharp;
output=template;
tokenVocab=LolSharp;
ASTLabelType=CommonTree;
}
program : ^(PROGRAM
^(USINGDECLS (imports+=using_decl)*)
(stmts+=statement)+)
-> program(import={$imports}, statement={$stmts})
;
statement
: print_statement -> {$print_statement.st}
;
using_decl
: ^(USING ID) -> using(id={$ID})
;
print_statement
: ^(PRINT literal_string) -> print(txt={$literal_string.st})
;
literal_string
: ^(STRINGLITERAL RAW_QUOTED_STRING) -> echo_string(str={$RAW_QUOTED_STRING.text})
;
The tree grammar works in a manner very similar to our standard grammar. Note first that a couple of options have changed. Our output type is now template to indicate that we’ll be generating StringTemplate objects. There’s also the new option tokenVocab=LolSharp; which instructs the generator to share the set of tokens we defined in our LolSharp grammar. The rules of our tree grammar match sub-trees instead of sequences of tokens. Behind the scenes they actually work the same way, but the tree grammar uses the special tokens DOWN and UP which indicate a change in depth in the tree. The right arrow operator now indicates what StringTemplate object to build and how to parameterize it for a given matched subtree. You’ll notice that the bits between the curly braces are evaluated to parameterize the templates. These are actions written in C# with some special added magic. Things on the left hand side of the -> operator are referred to by prefixing their name with a dollar sign. The fields text and st are defined on the nodes of the AST. The text field correlates to some text in the input stream while the st field stores the StringTemplate instance for that node. So for example, when the tree parser sees a PRINT node with a literal_string child, it evaluates to the print template with a txt parameter equal to the template of the literal_string child. In the case of our sample program that will generate:
System.Console.WriteLine("HAI WORLD!");
Tying it All Together
At this point, you should build the project in Visual Studio. This will generate three files: LolSharpLexer.cs, LolSharpParser.cs, and LolSharpGen.cs. Add these to the project. Next, add Test.lol, if you have not done so already, containing the sample program we’ve been using. Make sure that the Copy to Output Directory property is set to Copy always for this file, just like for CSharpTarget.stg. Finally, put the following code into LolCompiler.cs (and add that to the project if you need to).
// System stuff
using System;
using System.IO;
using System.Reflection;
// ANTLR stuff
using TokenStreamException = antlr.TokenStreamException;
using Antlr.Runtime;
using Antlr.Runtime.Tree;
using Antlr.StringTemplate;
// C# CodeGen Stuff
using Compiler = System.CodeDom.Compiler;
using Microsoft.CSharp;
namespace LolSharp
{
class LolCompiler
{
//the command line looks like: <input-filename> [--interpret]
//the --interpret flag causes immediate execution in lieu of generating an executable
static void Main(string[] args)
{
string filename = args[0];
string programName = filename.Substring(0, filename.LastIndexOf('.'));
//TODO - use a more robust command line argument mechanism
bool inMemory = (args.Length > 1) && ("--interpret".StartsWith(args[1], StringComparison.OrdinalIgnoreCase));
try
{
//CSharpTarget.stg contains string templates used for targeting C#
TextReader stgReader = new StreamReader(File.Open("CSharpTarget.stg", FileMode.Open));
StringTemplateGroup stg = new StringTemplateGroup(stgReader);
//lex
ICharStream input = new ANTLRFileStream(filename);
LolSharpLexer lex = new LolSharpLexer(input);
//parse
LolSharpParser parse = new LolSharpParser(new CommonTokenStream(lex));
LolSharpParser.program_return parseRet = parse.program();
//generate C#
ITreeNodeStream nodeStream = new CommonTreeNodeStream(parseRet.Tree);
LolSharpGen gen = new LolSharpGen(nodeStream);
gen.TemplateLib = stg;
LolSharpGen.program_return ret = gen.program();
//compile to executable and optionally run
Assembly assembly = Generate(ret.Template.ToString(), programName, inMemory);
if (inMemory)
{
assembly.EntryPoint.Invoke(null, new object[] { new string[] { } });
}
}
catch (TokenStreamException e)
{
Console.Error.WriteLine("exception: " + e);
}
catch (RecognitionException e)
{
Console.Error.WriteLine("exception: " + e);
}
}
static Assembly Generate(string code, string name, bool inMemory)
{
//TODO - allow specification of output dir
string outputDir = Directory.GetCurrentDirectory();
//dump the generated code
using (FileStream fs = File.OpenWrite(Path.Combine(outputDir, name + ".lol.cs")))
{
StreamWriter writer = new StreamWriter(fs);
writer.Write(code);
writer.Flush();
writer.Close();
}
Compiler.CompilerParameters compilerParams = new Compiler.CompilerParameters();
compilerParams.GenerateInMemory = inMemory;
if (!inMemory)
{
compilerParams.OutputAssembly = Path.Combine(outputDir, name + ".exe");
}
compilerParams.TreatWarningsAsErrors = false;
compilerParams.CompilerOptions = "/optimize";
compilerParams.MainClass = "LolSharpProgram";
compilerParams.GenerateExecutable = true;
CSharpCodeProvider provider = new CSharpCodeProvider();
Compiler.CompilerResults results = provider.CompileAssemblyFromSource(compilerParams, code);
if (results.Errors.HasErrors)
{
throw new Exception("Internal compiler error."); //this means a bug in our compiler, not the user's code
}
return results.CompiledAssembly;
}
}
}
All of the stuff we’re interested in falls inside the try block of the Main method. First, we create a StringTemplateGroup from our CSharpTarget.stg file. If we wanted to re-target our code generation to a different language, we could simply open a different file here. I think that’s pretty awesome, but it’s also 2:15 a.m. so I might just be going mad. The next two sections setup the lexer and the parser then run them:
//lex ICharStream input = new ANTLRFileStream(filename); LolSharpLexer lex = new LolSharpLexer(input); //parse LolSharpParser parse = new LolSharpParser(new CommonTokenStream(lex)); LolSharpParser.program_return parseRet = parse.program();
We call the program() method of our parser to start parsing at that rule, there’s a method declared for each rule in our LolSharp.g grammar. Next, we create a CommonTreeNodeStream from the Tree property of the object returned by the parser. WTF does this mean? Basically the CommonTreeNodeStream walks the AST we generated and feeds each node to the tree parser as a token it understands. When it travels down to a child node, it also emits that special DOWN token we talked about. Likewise it emits an UP when returning to a parent node. Once we have that, we create an instance of our code generator (the tree grammar parser) and set it’s TemplateLib property to the StringTemplateGroup we created earlier. We then call it’s program() method to start at that rule. The return value has a Template property which, when turned into a string, evaluates to the C# code we’ve been trying to generate.
The End of the Beginning
After we have the code, it’s just a matter of piping it through the CodeDom API with all the right parameters. This stuff is pretty well documented and fairly straightforward, so I’ll let you pick it apart for yourself. If you have any questions feel free to shoot them my way. Assuming you’ve followed the instructions in this article, and that I didn’t louse it up and forget something, you should be able to hit F5 (or Ctrl+F5 if you don’t want the window to close right away…) and see HAI WORLD! output to the terminal. Yay!
Ok, so it’s not that bloody impressive. I promise that in the next installment we’ll actually get into the fun bits and implements some useful language features. As it stands though, we have a convenient environment to build and test our compiler. Stay tuned.
Enjoy this article?
Leave a comment
No trackbacks yet.
4:16 pm on July 20th, 2009
I can’t believe you used antlr instead of mgrammar. Ouch.