2023年4月8日星期六 下午2:29
Here's my note for the course 'Build your own programming language' by Steve Kinney
I have always been interested in how operating system as well as programming language works. Things like Babel, and AST those kinds of stuff also appeal to me. This is a very cool opportunity to take a glance at how they work as a whole.
Here we will build our own customed programming language based on JavaScript, and Babel. The idea is to get to know what makes up a programming language, and how the compiler understands the keyword.
// our goal is to build a programming language as follow
(add 3 4) // output: 7
(multiply 4 5 (subtract 4 2)) // output: 40
It won't be easy to enclose the whole codebase into this note, but I'll try to record the most important concept here.
For any programming language. We need a compiler as a communication bridge between us the developer and the computer needed to read and operate the code. A compiler consists of mainly four parts:
Lexical analysis
: Read the source code one character at a time and converts it into meaningful tokens (Tokenization).Syntactic analysis
: Take the tokens and produce a parse tree as an output, properly nested data structure (Abstract Syntax Tree).Transformation
: Manipulate the AST and make modifications.Generation
: Based on our custom AST and making it fit in the Babel-compliant AST, it can then generate the JavaScript code.In this case where we're using JavaScript which used an interpreter, the idea is the same with the compiler, just the compiler's works run in runtime in interpreter.
We don't just want any single characters, by also the pieces as a whole with the keywords or the entire string. We need to break them into a meaningful chunk.
In other word, take the big string of code and turn it into tokens.
const tokenize = input => {
let cursor = 0;
const tokens = [];
while (cursor < input.length) {
// We need some helper function like regex to determine the type of the character
// Will use some customed helper function here like isParenthesis(), isNumber()...
const character = input[cursor];
if (isParenthesis(character)) {
tokens.push({
type: 'Parenthesis',
value: character,
});
cursor++;
continue;
}
if (isWhitespace(character)) {
cursor++;
continue;
}
// Also focus the logic on Number where it's more than 1 digit
if (isNumber(input[cursor])) {
let number = character;
while(isNumber(input[++cursor])) {
number += input[cursor];
}
tokens.push({
type: 'Number',
value: parseInt(number, 10);
});
continue;
}
// The case for letters skipped...
}
return tokens;
}
We will get the result like this:
// input: (add 2 3)
/* output: [
{ type: 'Parenthesis', value: '(' },
{ type: 'Name', value: 'add' },
{ type: 'Number', value: 2 },
{ type: 'Number', value: 3 },
{ type: 'Parenthesis', value: ')' },
];
*/
In this stage, we simply want to group the expressions altogether based on whether the tokens are in the same function or nested function based on the parenthesis.
Turn the tokens into an abstract syntax tree.
// input = "(add 1 3 (subtract 5 2))"
// From the lexer part, we have:
const tokens = [
{type: 'Parenthesis', value: '('},
{type: 'Name', value: 'add'},
{type: 'Number', value: 1},
{type: 'Number', value: 3},
{type: 'Parenthesis', value: '('},
{type: 'Name', value: 'subtract'},
{type: 'Number', value: 5},
{type: 'Number', value: 2},
{type: 'Parenthesis', value: ')'},
{type: 'Parenthesis', value: ')'},
];
// Then we need to turn the tokens into our customed but nearly standard AST:
const ast = {
type: 'CallExpression',
name: 'add',
arguments: [
{type: 'Number', value: 1},
{type: 'Number', value: 3},
{
type: 'CallExpression',
name: 'subtract',
arguments: [
{type: 'Number', value: 5},
{type: 'Number', value: 2},
]
}
]
}
Before assigning a new type, arguments and so, we need to make the tokens become a tree-like data structure. We will nest the data based on the parenthesis and throw it away after that.
const parenthesize = tokens => {
const token = pop(tokens);
if (isOpeningParenthesis(token.value)) {
const expression = [];
while (!isClosingParenthesis(peek(tokens).value)) {
expression.push(parenthesize(tokens));
}
pop(tokens);
return expression;
}
return token;
};
Here's what the results look like:
/* input: const tokens = [
{type: 'Parenthesis', value: '('},
{type: 'Name', value: 'add'},
{type: 'Number', value: 1},
{type: 'Number', value: 3},
{type: 'Parenthesis', value: '('},
{type: 'Name', value: 'subtract'},
{type: 'Number', value: 5},
{type: 'Number', value: 2},
{type: 'Parenthesis', value: ')'},
{type: 'Parenthesis', value: ')'},
];
*/
/* output: [
{ type: 'Name', value: 'add' },
{ type: 'Number', value: 1 },
{ type: 'Number', value: 3 },
[
{ type: 'Name', value: 'subtract' },
{ type: 'Number', value: 5 },
{ type: 'Number', value: 2 }
]
]
*/
The next step is to group the arguments of a function. The idea here is to see if the token type is Name
, or Number
. If it's a Number, we will push it into an array of arguments. Here's how we did it.
const parse = tokens => {
if (Array.isArray(tokens)) {
const [first, ...rest] = tokens;
return {
type: 'CallExpression',
name: first.value,
arguments: rest.map(parse),
};
}
const token = tokens;
if (token.type === 'Number') {
return {
type: 'NumericLiteral',
value: token.value,
};
}
if (token.type === 'Name') {
return {
type: 'Identifier',
name: token.value,
};
}
};
If we keep on feeding the result from the parenthesize function, we can get this:
console.log(parse(parenthesize(tokens));
/* output:
{
type: 'CallExpression',
name: 'add',
arguments: [
{ type: 'NumericLiteral', value: 1 },
{ type: 'NumericLiteral', value: 3 },
{ type: 'CallExpression',
name: 'subtract',
arguments: [
{type: 'NumbericLiteral', value: 5},
{type: 'NumbericLiteral', value: 2},
]
}
]
}
*/
Which is what we wanted.
In this step, we can manipulate the customed AST into what we wanted it to be. For example, we can replace all the +
sign to our own syntax add
.
Since we want to try out the transpile feature, we can transform the AST into a Babel-compliant AST. Which later on, Babel can help us generate JavaScript code from our own programming language, so we have to conform the Babal standard AST.
We can first try out transforming the +
sign into add
. Which can fit into our programming language. To make it easier to react based on the type of node, we will need the Visitor Pattern.
Visitor Pattern let us seperate an algorithm from an object structure, which means we don't have to change the class body in order to define a new operation.
This is very handy to make modifications in this case, when we try to expand our programming language and make new features.
const traverseNode = ({ node, parent, visitor }) => {
const methods = visitor[node.type];
if (methods && methods.enter) {
methods.enter({ node, parent });
}
if (node.arguments) { // traverse across siblings if exists
traverseArray({ array: node.arguements, parent: node, visitor });
}
if (methods && methods.exit) {
methods.exit({ node, parent });
}
};
const traverseArray = ({ array, parent, visitor }) => {
array.forEach(node => {
traverseNode({ node, parent, visitor });
});
};
const traverse = (node, visitor) => {
traverseNode({ node, visitor });
};
The visitor part will be defined outside of the file, and we can try it out:
// input token: '(+ 10 5)' instead of '(add 10 5)'
const ast = {
type: 'CallExpression',
name: '+',
arguments: [
{ type: 'NumericLiteral', value: 10 },
{ type: 'NumericLiteral', value: 5},
],
};
const visitor = {
CallExpression: {
enter({ node }) {
if (node.name === '+') {
node.name = 'add';
}
},
},
}
traverse(ast, visitor);
console.log(ast);
/* output
{
...,
name: 'add',
...
}
*/
The result shows that the name has been successfully replaced.
Let's try to transpile our programming language into JavaScript with Babel. We want it from our syntax (multiply 3 2 substract (4 1))
to this multiply(3, 2, substract(4, 1))
which is more like a JavaScript way.
Since we alreay have our own AST, the only thing we need to do is to make sure our AST conform the Babel-compliant AST. Then Babel will be able to take the AST to generate the JavaScript code.
const generate = require('@babel/generator').default;
const { traverse } = require('./traverse');
const babelVisitor = {
CallExpression: {
enter({ node} ) {
node.callee = { type: 'Identifier', name: node.name};
},
},
}
const toJavaScript = ast => {
traverse(ast, babelVisitor);
return generate(ast).code;
}
Here's the testing result:
const ast = {
type: 'CallExpression',
name: 'multiply',
arguments: [
{ type: 'NumericLiteral', value: 3 },
{ type: 'NumericLiteral', value: 2 },
{
type: 'CallExpression',
name: 'subtract',
arguments: [
{ type: 'NumericLiteral', value: 4 },
{ type: 'NumericLiteral', value: 1 },
],
},
],
};
console.log(toJavaScript(ast, babelVisitor);
// output: multiply(3, 2, subtract(4, 1))
Here we only show part of the work a compiler does, such as variable declaration is missing here. Which is also a common functionality of a programming language. But it's hard to show all the code here.
But it's not hard to implement a simple form of variable declaration. We can either do it in the syntactic analysis stage to assign a VariableDeclaration
type for it, then we can store the variable in the environment for later use. Or we can do it like a CallExpression
, and process it in the transform stage.
It's not easy to demonstrate the whole compile pipeline, but here we can have a glance at how a compiler works in an abstract way. I think it's cool to build your own programming language for fun and to get to know more about how it works behind the hood.