I have been writing a Scheme interpreter in C, and for me the most interesting aspect so far has been the level at which I am programming.
At the beginning it was very traditional C; symbol and AST manipulation were a PITA (at that point everything was malloc'd arrays of `expressions`).
After I had a base language working I started to use the language I had implemented so far to further the implementation. This culminated this weekend, when I did a large refactor to remove most of my C arrays and replace them with Scheme pairs and lists.
For example, here [1] I implement the function form of define in terms of lambda, specifically `new_lambda(env, cons(args, cons(body, null)))`.
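Expanded, the idea looks roughly like the sketch below. `new_lambda`, `cons`, and `null` are the names from the commit; `car`, `cdr`, `is_pair`, `eval`, and `env_define` are hypothetical stand-ins for the rest of the API, so treat this as a sketch rather than the actual code:

```c
/* Desugaring the function form of define:
 * (define (name args...) body) becomes (define name (lambda (args...) body)).
 * car/cdr/is_pair/eval/env_define are invented stand-ins. */
static struct value *eval_define(struct env *env, struct value *form) {
    struct value *target = car(form);          /* (name args...) or name */
    if (is_pair(target)) {
        struct value *name = car(target);
        struct value *args = cdr(target);
        struct value *body = car(cdr(form));   /* single body expression */
        /* (define (f x) e) is treated exactly as (define f (lambda (x) e)) */
        return env_define(env, name,
                          new_lambda(env, cons(args, cons(body, null))));
    }
    return env_define(env, target, eval(env, car(cdr(form))));
}
```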
In hindsight this seems so obvious, but I have found the whole process extremely interesting, specifically looking at how the implemented language starts to influence the implementation language.
I really cannot stress enough how enjoyable the process of writing my interpreter has been; I thoroughly recommend it to anyone who is interested in programming languages.
I have been working on a small R7RS Scheme interpreter [1] and it has been extremely educational. I had never used Scheme previously, but learning it from SICP [2] has been rather enlightening; it surprisingly lived up to expectations.
I haven't considered whether I will go down the compiler route, but either way that resource looks very interesting, especially the coverage of tail call optimisation at the assembly level.
Does anyone have any further resources discussing tail call optimisation? I would like something a bit more in depth.
The other topic I would like to read more on is continuations; as a still-new Schemer I find the idea of implementing them quite daunting.
The best way to deal with tail call "optimization" is to not treat it as an optimization at all. Tail calls are a syntactic property of source code. You can easily determine whether a given call is in a tail position by recursively traversing the AST. Then you just have to generate tail calls in a way that uses constant stack space. This is mostly a matter of properly designing the calling convention. See: www.complang.tuwien.ac.at/schani/diplarb.ps.
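For example, the tail-position traversal can be as small as the sketch below. The node kinds and field names are invented for illustration; the point is just that tail-ness flows from a node to some of its children:

```c
/* Marking tail calls by walking the AST. `tail` is true when the value
 * of this expression becomes the enclosing function's return value.
 * Node kinds and fields are hypothetical. */
enum kind { K_CALL, K_IF, K_SEQ, K_CONST };

struct node {
    enum kind kind;
    struct node *a, *b, *c;   /* children: callee/args, cond/then/else, ... */
    int is_tail_call;         /* filled in by mark_tails for K_CALL nodes */
};

static void mark_tails(struct node *n, int tail) {
    if (!n) return;
    switch (n->kind) {
    case K_CALL:
        n->is_tail_call = tail;   /* a call is a tail call iff it sits in  */
        mark_tails(n->a, 0);      /* tail position; its callee and argument */
        mark_tails(n->b, 0);      /* expressions never are                  */
        break;
    case K_IF:
        mark_tails(n->a, 0);      /* the condition is not in tail position */
        mark_tails(n->b, tail);   /* but both branches inherit tail-ness   */
        mark_tails(n->c, tail);
        break;
    case K_SEQ:
        mark_tails(n->a, 0);      /* all but the last expression: no   */
        mark_tails(n->b, tail);   /* the last one inherits tail-ness   */
        break;
    case K_CONST:
        break;
    }
}
```

The entry point is `mark_tails(function_body, 1)`, since the body of a function is itself in tail position.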
I like that way of thinking about it. I see it as just reusing the existing stack frame, since a call in tail position doesn't need anything from the current frame (returning to this frame would add no extra meaning).
Having never implemented them, though, I feel like my ideas still need some fleshing out; my naive approach seems to be similar to the idea of trampolining (something like the sketch below).
Really looking forward to reading the linked document, very thorough and looks to be just what I was looking for (it even covers architecture dependent aspects!).
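For reference, here is what I mean by trampolining, as a minimal self-contained toy. All the names are mine, and a real interpreter would bounce closures rather than a fixed argument struct:

```c
/* A minimal trampoline: instead of making a tail call, a function
 * returns a description of the next call, and a driver loop makes it.
 * Stack usage stays constant without any calling-convention tricks. */
#include <stdio.h>

struct bounce {
    int  done;     /* 1 = result is final */
    long result;   /* final value, when done */
    long n, acc;   /* arguments for the next bounce, when not done */
};

/* One step of factorial, written in tail-call style. */
static struct bounce fact_step(long n, long acc) {
    if (n <= 1) return (struct bounce){ .done = 1, .result = acc };
    return (struct bounce){ .done = 0, .n = n - 1, .acc = n * acc };
}

static long trampoline(long n) {
    struct bounce b = fact_step(n, 1);
    while (!b.done)                 /* the loop stands in for the tail call */
        b = fact_step(b.n, b.acc);
    return b.result;
}

int main(void) {
    printf("%ld\n", trampoline(10));   /* 3628800 */
    return 0;
}
```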
If you've got the freedom to design your calling convention, you don't need trampolines or anything like that.
The basic idea is to compile every call in a tail position to a jump. You just need to clean up your stack before making the call, and make sure to arrange your calling convention such that A can call B, which can jump to C, which can return directly to A.
In the A -> B -> C example, this means the following (there is a rough sketch of the whole dance after the list):
1) Using a callee-pops convention so that callees pop arguments off the stack rather than callers. If B and C take different numbers of arguments, A doesn't know how many arguments to pop. So functions should pop their arguments off the stack themselves before they return.
2) B must restore its callee-save registers before jumping to C, which will save those registers itself if it clobbers them.
3) To accommodate variadic functions, there has to be a hidden argument to let callees know how many arguments were actually pushed by the caller, so they can pop the right number.
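Here is a toy model of that dance in C. B and C live inside one function only because ISO C cannot jump between functions; in real generated code each would be a separate entry point reached with a plain jmp. The explicit operand stack and all the names are my own invention:

```c
/* Callee-pops calling convention with a tail jump, modeled with an
 * explicit operand stack: A calls B, B tail-jumps to C, and C's
 * return goes directly back to A. */
#include <stdio.h>

static long stack[64];
static int  sp = 0;                        /* next free slot */

static void push(long v) { stack[sp++] = v; }
static long pop(void)    { return stack[--sp]; }

static long B_then_C(void) {
    long argc, x;

    /* B: pops its own arguments (callee-pops). */
    argc = pop(); x = pop(); (void)argc;

    /* B cleans up, pushes C's arguments plus the hidden argc,
     * and tail-JUMPS to C instead of calling it. */
    push(x * 2);
    push(1);
    goto C;

C:  /* C: also pops its own arguments, then returns directly to A. */
    argc = pop(); x = pop(); (void)argc;
    return x + 1;
}

int main(void) {                           /* A */
    push(5);                               /* argument for B */
    push(1);                               /* hidden argc, for variadic callees */
    long r = B_then_C();                   /* A "calls" B... */
    printf("%ld\n", r);                    /* ...and receives C's result: 11 */
    return 0;
}
```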
Thanks, that document looks very impressive and will now sit at the top of my reading list :)
Oddly following that link gives me a 404 ('The requested URL /~dyb/papers/3imp.pdf‎ was not found on this server.', seems there is a trailing character '%E2%80%8E').
And depending on the style of compiler you're writing, it is not a given that you'll do much symbolic manipulation of data structures. E.g. a Wirth-style compiler pretty much only maintains a very simple, lightweight symbol table (along the lines of the sketch below).
(My first proper compiler started out in M68000 assembler, btw; the amount of symbolic manipulation of data structures was more than small enough that this was not much of a challenge. Once the language started taking shape, one of the first things I added was support for inline assembly, and then I gradually started using the language's constructs to rewrite the compiler in itself; the implementation language really is not a huge deal.)
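For the curious, a Wirth-style symbol table can be as little as this: a linked list per scope, searched linearly, thrown away when the scope closes. Names and fields here are illustrative, not from any particular compiler:

```c
/* The light-weight symbol table of a single-pass, Wirth-style compiler:
 * a per-scope linked list with linear search. */
#include <stdlib.h>
#include <string.h>

struct sym {
    const char *name;
    int         kind;      /* variable, constant, procedure, ... */
    int         offset;    /* stack offset or address */
    struct sym *next;      /* next entry in the same scope */
};

static struct sym *scope = NULL;   /* innermost scope, most recent first */

static void declare(const char *name, int kind, int offset) {
    struct sym *s = malloc(sizeof *s);
    s->name = name; s->kind = kind; s->offset = offset;
    s->next = scope;               /* shadowing falls out for free */
    scope = s;
}

static struct sym *lookup(const char *name) {
    for (struct sym *s = scope; s; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;                   /* undeclared identifier */
}
```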
I'm going to offer an alternate suggestion. If you want to learn how to program compilers, or any other new domain for that matter, start in a language you're already comfortable with. That way, you're only learning one new thing at a time.
While true, it can be quite an interesting experience, depending on how well the language supports data manipulation.
For example, I have some books about writing Pascal compilers in 48K ZX Spectrum Basic.
Quite an interesting read about code optimization.
Another tip is trying to find a generic programming library to get rid of some of the boilerplate tree traversal you will end up doing. Dunno about the OCaml ecosystem, but for Haskell you could use something like Uniplate or SYB.
Most "tutorials" I've seen as of late (including this one) seem to walk through creating the grammar, and then just hand wave the actual creation of the AST (and the rest of the steps) to "yacc magic" or some other friends.
I would not call that building a compiler.
Are there any modern tutorials/references of hand-generating the grammar, hand-coding the parser, and hand-coding whatever comes next (because I have no idea, thanks to these new-age tutorials) - without a toolchain, so that the entire process can be seen from start to finish?
Writing a parser isn't a terribly interesting or difficult part of writing a compiler. You can describe a simple one in a page of pseudocode. If you really are interested in writing a compiler, I wouldn't get hung up on this point.
If you want more depth, you probably want to pick up a compiler textbook. I liked "Modern Compiler Implementation" (I've used both the C and ML versions) in my undergrad.
I've been thinking about what I've said in this post and I want to reword slightly. Writing a parser can be very interesting, and there's a lot of neat techniques, and since we're talking about learning here and not making production compilers, it's certainly a worthwhile endeavor.
What I really mean to say is that your parser doesn't have a lot of effect on the rest of your compiler design. In the end, it's a function that takes input text to an AST. You can go back and replace a YACC generated one later, if you want to know more.
Of course, if you choose a simple-to-parse input language (like Lisp) then you can write a simple handwritten parser right off the bat.
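To make that concrete, a hand-written S-expression reader really is tiny. This sketch is illustrative, not from any real project, and it omits error handling and memory cleanup:

```c
/* A minimal hand-written S-expression reader. */
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct sexp {
    enum { ATOM, PAIR, NIL } tag;
    char *atom;                        /* when tag == ATOM */
    struct sexp *car, *cdr;            /* when tag == PAIR */
};

static struct sexp *parse(const char **s);

static struct sexp *mk(int tag) {
    struct sexp *e = calloc(1, sizeof *e);
    e->tag = tag;
    return e;
}

static void skip_ws(const char **s) {
    while (isspace((unsigned char)**s)) (*s)++;
}

static struct sexp *parse_list(const char **s) {   /* after the '(' */
    skip_ws(s);
    if (**s == ')') { (*s)++; return mk(NIL); }
    struct sexp *p = mk(PAIR);
    p->car = parse(s);
    p->cdr = parse_list(s);
    return p;
}

static struct sexp *parse(const char **s) {
    skip_ws(s);
    if (**s == '(') { (*s)++; return parse_list(s); }
    const char *start = *s;                        /* an atom: scan it */
    while (**s && !isspace((unsigned char)**s) && **s != '(' && **s != ')')
        (*s)++;
    struct sexp *a = mk(ATOM);
    size_t n = (size_t)(*s - start);
    a->atom = malloc(n + 1);
    memcpy(a->atom, start, n);
    a->atom[n] = '\0';
    return a;
}
```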
There is not much point in writing a parser without using a parser generator. Now, if you want to know how to make a parser generator, I recommend starting here:
Also, this wonderful page too: http://compilers.iecc.com/crenshaw/
It's exactly what you're looking for. No automated tools, libraries, etc.; well, in fact you will need one automated tool: a Pascal compiler, to compile your compiler. But I guess that is not a problem, right? :p The tutorial is really nice; I'm at tutor6 right now and the guy really teaches step-by-step each little thing we are going to work on. The tutorial is unfinished, but don't worry, you can learn a lot from the content he has provided so far. (sorry english, not my native language)
This was some of my motivation for my series, though it needs a rewrite/tightening up some day if/when I complete it. I started with code generation and worked my way up from there, partly because I've always been very annoyed that so many compiler texts are focused on parser theory. Most production compilers end up with hand-written recursive-descent parsers, because they are easy to write and easy to add error handling to, unlike most of the ones generated with fancy tools; yet lots of texts insist on trying to teach tons of parsing theory that most people will never, ever use before touching on any of the other parts.
(I just finished part 34 last night, and the next part goes up on my site in a week or so; I'll shorten my buffer quite a bit over the next two months, I think - I currently have a lead time of about 5 months... )
It starts out fairly generic in terms of code generation, but then I got the stupid or brilliant (you choose) idea of writing a Ruby compiler, which on one hand is very interesting and on the other hand is exceedingly frustrating, and which I think cost me some focus. I really should extract the simple, non-Ruby-specific sections into a separate text and clean it up.
In terms of other people's series, look up Crenshaw's "Let's Build a Compiler" and Niklaus Wirth's "Compiler Construction". Both are available for free online. Frankly, if you google 'How to write a compiler' the results are actually very useful (both books show up on the first page for me).
I would also recommend looking at the code generated by Lex and Yacc as well as hand-written code to parse a Lisp or Scheme. That might give an idea of how to organize a hand-written parser.
Though, without reading Modern Compiler Implementation or something similar, the use of shift-reduce tables in generated code might not make sense. Also, my computer science curriculum included a course where we wrote a top-down LL(1) parser, which is organized very differently: a function for parsing each rule, which calls other functions for parsing its sub-rules, etc.
I've hand-written a few mangled char-by-char state machines as well, which served as lexers for various DSLs.
The reason you see people using yacc instead of hand-coding it is that it's really annoying to hand-write a bottom-up parser for an LR grammar. If you are going to be writing things by hand then it's more intuitive to use a top-down recursive-descent parser instead. However, top-down parsing requires more "tricks" to parse the usual languages (for example, you need to rewrite the grammar to an equivalent one without left-recursion), so most introductions you see use bottom-up parsing instead.
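The left-recursion rewrite is the classic example. `expr := expr '+' term | term` would recurse forever in a top-down parser, so it becomes `expr := term ('+' term)*` and the recursion turns into a loop. A toy sketch (it evaluates on the fly instead of building a tree, just to keep it short; all names are mine):

```c
/* Recursive descent after eliminating left recursion:
 * expr := term ('+' term)*   -- the loop preserves left-associativity
 * term := number | '(' expr ')' */
#include <stdio.h>
#include <stdlib.h>

static const char *p;                      /* cursor into the input */

static long expr(void);

static long term(void) {
    if (*p == '(') {
        p++;                               /* consume '(' */
        long v = expr();
        p++;                               /* consume ')' */
        return v;
    }
    return strtol(p, (char **)&p, 10);
}

static long expr(void) {
    long v = term();
    while (*p == '+') {                    /* the loop that replaced the */
        p++;                               /* left recursion             */
        v += term();
    }
    return v;
}

int main(void) {
    p = "1+2+(3+4)";
    printf("%ld\n", expr());               /* prints 10 */
    return 0;
}
```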
For an advanced version: the C++ Grandmaster Certification MOOC, where participants build their own complete C++11 compiler in C++11, is still going strong:
http://cppgm.org/
My favorite part is this sentence: "The rules for constructing virtual tables of the class are a combination of the rules from Categories 2 and 3, and can generally be determined inductively."
Ah yes, very comforting. The rules can be generally determined inductively (except of course, when they can't be!)
Well, I would extend that and ask: has anybody completed implementing the lexer/parser phase? Once you get the syntax tree or IR, the complexity is more or less (if not exactly) the same as for other languages.
WalterBright[0] wrote the first native code C++ compiler, Zortech C++ (later called Symantec C++ and more recently DigitalMars C++), and is the man behind the D programming language.
Mr. Bright, you're my hero. Really. Not only did you make a C++ compiler using a recursive-descent parser, the first ever to generate native code and the first commercial one too (and dmd doesn't even generate assembly), you also made the D language and the dmd compiler. A world-saving tool, coming from C and C++. It will rock the world. Could you imagine one of the next world-famous pieces of software, like Windows or *BSD, written in D? Also, I'm on my way into the compiler business too (joke :-)). (sorry english, not my native language)
Nitpicking, certainly, but I would say use something like ANTLR rather than PEG.js.
ANTLR "schemas" are extremely close to EBNF (some exceptions, obviously) and the compiler-creation club could benefit from that.
ANTLR 3 is usable from a variety of languages (although 4 is a hell of a lot simpler and solves a fair bit around left-recursive grammars, but I suspect non-JVM support is still non-existent).
I started writing my own compile-to-JS language a while back after reading this book: http://createyourproglang.com/ and I'm loving it!
However, I'm having a bit of trouble right now with my grammar. I'm using Jison (http://jison.org) and the error messages are kind of confusing (I didn't even know they were error messages at first; I thought they were some kind of logging). I apparently have shift-reduce conflicts just about everywhere, but I didn't bother solving them as I was building the rest of the language, since everything is working fine! (Well, I'm sure there are edge cases that I haven't run into yet.)
So if anyone here has a good link on the basics of parsing, grammars, LR(1) and whatnot, I'd like to understand what I'm doing ;)
I am currently taking a compiler class, and I must say that I am really amazed and impressed. For an assignment, we of course had to write an interpreter for our own mini language. The feeling you have after creating such an interpreter is overwhelming.
For the lexer and parser, we used SableCC, an object-oriented framework that generates a compiler in Java. I've never used anything else (yacc, lex, etc.), so I can't compare tools, but it provides a rich, useful, and easy-to-use interface.
If only SableCC could compile its Java 1.7 grammar without blowing up in memory usage, I would love it even more. As it is, I used it in my research when I needed a parser. Can't recommend it enough!
Kids these days... when I wrote my first compiler I had to write my own yacc equivalent, and my own parser.
FYI: my generator allowed for dynamic resolution of shift/reduce conflicts, which lets you compile some languages yacc won't let you (languages that let you change the priority of operators, for example).
Did something similar in 6800 assembly; left recursion of course limits you even more, but it helps fit that tiny, tiny compiler into 2K.
I actually like yacc/bison - they're good for most purposes, and designing a non-LALR (or more, a non-LR) language on purpose (rather than because you don't know any better) is usually silly. You do need to 'get' the concept of building a parse tree from the bottom up, assembling it from larger and larger snippets as you go.
Oh, and yacc/bison run about 10 times faster than the equivalent I wrote 10 years before they existed, so I'm not complaining.
I tend to use the same bespoke lexical analyser I've used for years and hack it to suit - it includes support for symbol tables etc. and runs really fast; no need to reinvent the wheel.
You can also learn a lot by studying (and even contributing to) other open source compilers. Here are about 100 javascript-based compilers, including tools for compiler writers at the bottom: https://github.com/jashkenas/coffee-script/wiki/List-of-lang...
[1] https://github.com/mkfifo/plot/commit/07272bd69e51979ab71fa0...