Join the Compiler Creation Club (tech.pro)
93 points by skiskilo on Nov 25, 2013 | hide | past | favorite | 56 comments


I have been writing a Scheme interpreter in C, and for me the most interesting aspect so far has been the level at which I am programming.

At the beginning it was very traditional C; symbol and AST manipulation were a PITA (at that point everything was malloc'd arrays of `expression`s). After I had a base language working, I started to use the language I had implemented so far to further the implementation. This peaked this weekend, when I did a large refactor to remove most of my C arrays and replace them with Scheme pairs and lists.

For example, here [1] I implement the define function form in terms of lambda, specifically new_lambda(env, cons(args, cons(body, null))).

In hindsight this seems so obvious, but I have found the whole process extremely interesting, specifically watching how the implemented language starts to influence the implementation language.

I really cannot stress enough how enjoyable the process of writing my interpreter has been. I thoroughly recommend it to anyone who is interested in programming languages.

[1] https://github.com/mkfifo/plot/commit/07272bd69e51979ab71fa0...


Advice to anyone jumping into writing compilers: just pick a functional language, especially one from the ML family.

Symbolic manipulation of data structures is plain joy compared to what is required in C or the Pascal family of languages.


My recommendation would be to compile a Lisp. See: http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf, an easy-to-follow tutorial for compiling Scheme to x86.

This advice is especially true if you think syntax and parsing are the boring parts of writing a compiler.


I have been working on a small R7RS Scheme interpreter [1] and it has been extremely educational. I had never used Scheme previously, but learning it from SICP [2] has been rather enlightening; it has surprisingly lived up to expectations.

I haven't decided whether I will go down the compiler route, but either way that resource looks very interesting, especially the coverage of tail call optimisation at the assembly level.

Does anyone have any further resources discussing tail call optimisation? I would like something a bit more in depth.

The other topic I would like to read more on is continuations; as a still-newbie Schemer, I find the idea of implementing them quite daunting.

[1] https://github.com/mkfifo/plot [2] https://mitpress.mit.edu/sicp/


The best way to deal with tail call "optimization" is to not treat it as an optimization at all. Tail calls are a syntactic property of source code. You can easily determine whether a given call is in a tail position by recursively traversing the AST. Then you just have to generate tail calls in a way that uses constant stack space. This is mostly a matter of properly designing the calling convention. See: www.complang.tuwien.ac.at/schani/diplarb.ps


Thanks for those words and that link.

I like that way of thinking about it. I see it as just reusing the existing stack frame, since a call in tail position doesn't need anything from the current stack (returning to this frame adds no extra meaning to it).

Having never implemented them, though, I feel my ideas still need some fleshing out; my naive approach seems similar to the idea of trampolining.

Really looking forward to reading the linked document; it's very thorough and looks to be just what I was looking for (it even covers architecture-dependent aspects!).


If you've got the freedom to design your calling convention, you don't need trampolines or anything like that.

The basic idea is to compile every call in a tail position to a jump. You just need to clean up your stack before making the call, and make sure to arrange your calling convention such that A can call B, which can jump to C, which can return directly to A.

In the A -> B -> C example, this means:

1) Using a callee-pops convention so that callees pop arguments off the stack rather than callers. If B and C take different numbers of arguments, A doesn't know how many arguments to pop. So functions should pop their arguments off the stack themselves before they return.

2) B must restore its callee-save registers before jumping to C, which will save those registers itself if it clobbers them.

3) To accommodate variadic functions, there has to be a hidden argument to let callees know how many arguments were actually pushed by the caller, so they can pop the right number.


Good old 3imp might be interesting to you, http://www.cs.indiana.edu/~dyb/papers/3imp.pdf

It's by Dybvig himself; you might recognize him as the author of the Scheme specification.


Thanks, that document looks very impressive and will now sit at the top of my reading list :)

Oddly, following that link gives me a 404 ('The requested URL /~dyb/papers/3imp.pdf‎ was not found on this server.'); it seems there is a trailing character ('%E2%80%8E').

I was able to find it on Google and open it; the working link is http://www.cs.indiana.edu/~dyb/papers/3imp.pdf


Oh, weird. Sorry about that. Thanks for fixing it.


It's more geared towards ML, but I found Appel's Compiling with Continuations very interesting.


Yes, that tutorial is quite good.


This! Check out MiniML http://andrej.com/plzoo/html/miniml.html

It's tiny and way more powerful than yet another C dialect.


If you like functional languages.

And depending on the style of compiler you're writing, it is not a given that you'll do much symbolic manipulation of data structures. E.g. a Wirth-style compiler pretty much only maintains a very simple, lightweight symbol table.

(My first proper compiler started out in M68000 assembler, btw; the amount of symbolic manipulation of data structures was small enough that this was not much of a challenge. Once the language started taking shape, one of the first things I added was support for inline assembly, and then I gradually started using the language's constructs to rewrite the compiler in the language itself; the implementation language really is not a huge deal.)


I'm going to offer an alternate suggestion. If you want to learn how to program compilers, or any other new domain for that matter, start in a language you're already comfortable with. That way, you're only learning one new thing at a time.


While true, it can still be quite an interesting experience, depending on how well the language supports data manipulation.

For example, I have some books about writing Pascal compilers in 48K ZX Spectrum BASIC.

Quite an interesting read about code optimization.


Another tip is to find a generic programming library to get rid of some of the boilerplate tree traversal you will end up doing. Dunno about the OCaml ecosystem, but for Haskell you could use something like Uniplate or SYB.


Most "tutorials" I've seen as of late (including this one) seem to walk through creating the grammar, and then just hand-wave the actual creation of the AST (and the rest of the steps) away as "yacc magic" or the like.

I would not call that building a compiler.

Are there any modern tutorials/references that hand-write the grammar, hand-code the parser, and hand-code whatever comes next (because I have no idea, thanks to these new-age tutorials), without a toolchain, so that the entire process can be seen from start to finish?


Writing a parser isn't a terribly interesting or difficult part of writing a compiler. You can describe a simple one in a page of pseudocode. If you really are interested in writing a compiler, I wouldn't get hung up on this point.

If you want more depth, you probably want to pick up a compiler textbook. I liked "Modern Compiler Implementation" (I've used both the C and ML versions) in my undergrad.


I've been thinking about what I've said in this post and I want to reword slightly. Writing a parser can be very interesting, and there's a lot of neat techniques, and since we're talking about learning here and not making production compilers, it's certainly a worthwhile endeavor.

What I really mean to say is that your parser doesn't have a lot of effect on the rest of your compiler design. In the end, it's a function that takes input text to an AST. You can go back and replace a yacc-generated one later, if you want to learn more.

Of course, if you choose a simple-to-parse input language (like Lisp) then you can write a simple handwritten parser right off the bat.


Thank you for the recommendation!


There is not much point in writing a parser without using a parser generator. Now, if you want to know how to make a parser generator, I recommend starting here:

https://en.wikipedia.org/wiki/Shift-reduce_parsing

https://en.wikipedia.org/wiki/LALR_parser

If you want lots of detail, there is always this:

http://www.amazon.com/Compilers-Principles-Techniques-Tools-...


I couldn't disagree more. Parser generators are ok for trivial grammars with limited error handling. They're generally horrible for writing compilers.


Also, this wonderful page: http://compilers.iecc.com/crenshaw/ It's exactly what you're looking for. No automated tools, libraries, etc. Well, you will need one automated tool: a Pascal compiler to compile your compiler. But I guess that's not a problem, right? :p The tutorial is really nice. I'm at chapter 6 right now, and the guy really teaches step by step each little thing we're going to work on. The tutorial is unfinished, but don't worry, you can learn a lot from the content he has provided so far. (sorry for my English, it's not my native language)


I would recommend Niklaus Wirth's book "Compiler Construction" over that tutorial.

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

Of course, the best is to do both. :)


Thanks very much for that book. Of course I'm going to read it too. :)


This was some of my motivation for my series, though it needs a rewrite/tightening up some day if/when I complete it. I started with code generation and worked my way up from there, partly because I've always been very annoyed that so many compiler texts are so focused on parser theory. Most production compilers end up with hand-written recursive descent parsers, because they are easy to write and easy to add error handling to, unlike most of the ones generated with fancy tools; yet lots of texts insist on teaching tons of parsing theory that most people will never, ever use before touching on any of the other parts.

My series is here: http://www.hokstad.com/compiler/

(I just finished part 34 last night, and the next part goes up on my site in a week or so. I'll shorten my buffer quite a bit over the next two months, I think; I currently have a lead time of about 5 months...)

It starts out fairly generic in terms of code generation, but I then got the stupid or brilliant (you choose) idea of writing a Ruby compiler, which on one hand is very interesting and on the other is exceedingly frustrating, and which I think lost me some focus. I really should extract the simple/non-Ruby-specific sections into a separate text and clean it up.

In terms of other people's series, look up Crenshaw's "Let's Build a Compiler" and Niklaus Wirth's "Compiler Construction". Both are available for free online. Frankly, if you google 'how to write a compiler' the results are actually very useful (both books show up on the first page for me).


I would also recommend looking at the code generated by Lex and Yacc as well as hand-written code to parse a Lisp or Scheme. That might give an idea of how to organize a hand-written parser.

Though without reading Modern Compiler Implementation or something similar, the use of shift-reduce tables in generated code might not make sense. Also, my computer science curriculum included a course where we wrote a top-down LL(1) parser, which is organized very differently: a function for parsing each rule, which calls other functions for parsing sub-rules, etc.

I've hand-written a few mangled char-by-char state machines as well, which served as lexers for various DSLs.


The reason you see people using yacc instead of hand-coding is that it's really annoying to hand-write a bottom-up parser for an LR grammar. If you are going to be writing things by hand, then it's more intuitive to use a top-down recursive-descent parser instead. However, top-down parsing requires more "tricks" to parse usual languages (for example, you need to rewrite the grammar into an equivalent one without left recursion), so most introductions you see use bottom-up parsing instead.



For an advanced version: the C++ Grandmaster Certification MOOC, where participants build their own complete C++11 compiler in C++11, is still going strong: http://cppgm.org/


Writing a C++ compiler is something I wouldn't wish on my worst enemy.

My Google SoC project (ages ago) involved generating code to be ABI compatible with C++ virtual table layouts. This spec gave me nightmares: http://mentorembedded.github.io/cxx-abi/abi.html#vtable.

My favorite part is this sentence: "The rules for constructing virtual tables of the class are a combination of the rules from Categories 2 and 3, and can generally be determined inductively."

Ah yes, very comforting. The rules can generally be determined inductively (except, of course, when they can't be!)


Has literally anyone actually completed that certification?


Well, I would extend that and ask: has anybody completed implementing the lexer/parser phase? Once you get the syntax tree or IR, the complexity is more or less the same as for other languages.


I've written a complete C++98 compiler, front to back.


How long did something like that take you to do?


WalterBright [0] wrote the first native-code C++ compiler, Zortech C++ (later called Symantec C++ and recently DigitalMars C++), and is the man behind the D programming language.

[0] http://en.wikipedia.org/wiki/Walter_Bright


10 years


Mr. Bright, you're my hero. Really. Not only because you made a C++ compiler using a recursive descent parser, and it was the first ever to generate native code and the first commercial one too, and you don't even generate assembly in dmd. You also made the D language and the dmd compiler, a world-saving tool from C and C++. It will rock the world. Could you imagine one of the next world-famous pieces of software, like Windows or *BSD, written in D? Also, I'm on my way into the compiler business too (joke :-)). (sorry for my English, it's not my native language)


I am going to second arcft now. You are now my hero too! Wow, 10 years!


Nitpicking, certainly, but I would say use something like ANTLR rather than PEG.js.

ANTLR "schemas" are extremely close to EBNF (some exceptions, obviously), and the Compiler Creation Club could benefit from that. ANTLR 3 is usable from a variety of languages (although 4 is a hell of a lot simpler and solves a fair bit around left-recursive grammars, but non-JVM support is still nonexistent, I suspect).


I started writing my own compile-to-JS language a while back after reading this book: http://createyourproglang.com/ and I'm loving it!

However, I'm having a bit of trouble right now with my grammar. I'm using Jison (http://jison.org) and the error messages are kind of confusing (I didn't even know they were error messages at first; I thought they were some kind of logging). I apparently have shift-reduce conflicts just about everywhere, but I didn't bother solving them while building the rest of the language, since everything is working fine! (Well, I'm sure there are edge cases I haven't run into yet.)

So if anyone here has a good link on the basics of parsing, grammars, LR(1) and whatnot, I'd like to understand what I'm doing ;)


Although a bit out of date, the Dragon book is always worth a read.

http://dragonbook.stanford.edu/


Ugh, Hacker News you are better than this. THIS was posted yesterday: http://news.ycombinator.com/item?id=6792225.

It links to the original source. Why are we not giving the original author credit, instead of linking to blog spam on tech.pro?


Original author here. I work at tech.pro and decided to syndicate it across to the site (it's even referenced at the bottom of the article!).


ack! I seem to have yet again made a fool of myself.


I am currently taking a compiler class, and I must say that I am really amazed and impressed. For an assignment we of course had to write an interpreter for our own mini language. The feeling you have after creating this interpreter is overwhelming.

For the lexer and parser we used SableCC, an object-oriented framework that generates a compiler in Java. I've never used anything else (yacc, lex, etc.), so I can't compare tools, but it provides a rich, useful and easy-to-use interface.


If only SableCC could compile its Java 1.7 grammar without blowing up in memory usage, I would love it even more. As it is, I used it in my research when I needed a parser. Can't recommend it enough!


SableCC is quite good.

yacc and lex are handy to have around, but feel like Jurassic tools compared to more modern parser-generator tooling like ANTLR.


Kids these days... when I wrote my first compiler I had to write my own yacc equivalent, and my own parser.

FYI: my generator allowed for dynamic resolution of shift/reduce conflicts, which lets you compile some languages yacc won't (languages that let you change the priority of operators, for example).


Given that one of my CS specialization areas was compiler design, me too.

I implemented a left recursive parser in x86 Assembly for MS-DOS systems.

As I said, yacc and lex are nice to have, but nowadays there is little incentive to keep using them.

Especially since, as you say, they are not able to parse all types of languages.


Did something similar in 6800 assembly; left recursion of course limits you even more, but it helps fit that tiny, tiny compiler into 2k.

I actually like yacc/bison. They're good for most purposes, and deliberately designing a non-LALR (or rather, non-LR) language on purpose (rather than because you don't know any better) is usually silly. You do need to 'get' the concept of building a parse tree from the bottom up, assembling it from larger and larger snippets as you go.

Oh, and yacc/bison run about 10 times faster than the equivalent I wrote 10 years before they existed, so I'm not complaining.

I tend to use the same bespoke lexical analyser I've used for years and hack it to suit; it includes support for symbol tables etc. and runs really fast. No need to reinvent the wheel.


Nit: Not all languages have a separation between statements and expressions. Most lisps just have expressions.


Bigger nit: If lisps just have expressions (and not statements), then they clearly have a separation between the two :).


And outside of the LISP family, Algol was also much more expression-oriented and orthogonal than the languages it spawned.


You can also learn a lot by studying (and even contributing to) other open source compilers. Here are about 100 javascript-based compilers, including tools for compiler writers at the bottom: https://github.com/jashkenas/coffee-script/wiki/List-of-lang...



