<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Vitor&apos;s thoughts</title><description>thoughts, PL, CS and math</description><link>https://vitorsalmeida.com/</link><item><title>Is Data Struct about memory?</title><link>https://vitorsalmeida.com/is-data-struct-about-memory/</link><guid isPermaLink="true">https://vitorsalmeida.com/is-data-struct-about-memory/</guid><pubDate>Sun, 27 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A principle that I try to follow is to question obvious concepts. From the moment we are introduced to society, we meet many concepts that we never question: some authority figure, like a teacher or a parent, tells us things, and we simply accept them.&lt;/p&gt;
&lt;p&gt;If you think about data structures, you probably relate them immediately to memory. Today, I want to discuss why data structures &lt;strong&gt;are not only&lt;/strong&gt; about memory.&lt;/p&gt;
&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;To be clear, there is nothing wrong with thinking about data structures as a way to organize data in memory. The problem is reducing them to that, because data structures are indifferent to any particular implementation.&lt;/p&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Data_structure#Language_support&quot;&gt;Data Structure&lt;/a&gt; is a way to organize, manage, and store data (yes, you can store things in math: if you think about &lt;a href=&quot;https://en.wikipedia.org/wiki/Matrix_(mathematics)&quot;&gt;matrices&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/Set_(mathematics)&quot;&gt;sets&lt;/a&gt;, you can see that each is a way to organize data; in practice, storing things is a way of organizing them). It is a collection of values on which we can perform operations. In essence, it is an &lt;strong&gt;Algebraic Structure&lt;/strong&gt; of data.&lt;/p&gt;
&lt;h3&gt;Algebraic Structure&lt;/h3&gt;
&lt;p&gt;An algebraic structure is a set of elements with one or more operations that satisfy specific properties (axioms). An example is Boolean algebra, a set with two binary operations, conjunction (AND) and disjunction (OR). You can follow this link to learn more about &lt;a href=&quot;https://en.wikipedia.org/wiki/Boolean_algebra#Boolean_algebras&quot;&gt;Boolean Algebra&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At this point, I want to highlight that algebraic structures are essentially about math. They describe the properties of a set of elements: a &lt;a href=&quot;https://en.wikipedia.org/wiki/Group_(mathematics)&quot;&gt;group&lt;/a&gt;, a &lt;a href=&quot;https://en.wikipedia.org/wiki/Ring_(mathematics)&quot;&gt;ring&lt;/a&gt;, an &lt;a href=&quot;https://en.wikipedia.org/wiki/Algebra_over_a_field&quot;&gt;algebra&lt;/a&gt;, etc. If you want to dive into this topic, read &lt;a href=&quot;https://en.wikipedia.org/wiki/Algebraic_structure&quot;&gt;Algebraic Structures&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It is a big mistake to take an abstract, mathematical concept and reduce it to an implementation that depends on physical, limited resources. You do not need any computer or memory to understand or use data structures. You need &lt;a href=&quot;https://en.wikipedia.org/wiki/Data&quot;&gt;data&lt;/a&gt;. And data in math is just a set of elements.&lt;/p&gt;
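&lt;p&gt;As a small sketch of this idea (in TypeScript, with names of my own choosing), a stack can be specified purely by its operations, with no mention of memory at all:&lt;/p&gt;

```typescript
// A stack described only by its operations: no memory layout is mentioned.
// The names (Stack, makeStack) are illustrative, not from any library.
interface Stack<T> {
  push(x: T): Stack<T>
  pop(): Stack<T>
  peek(): T | undefined
}

// One possible realization, backed by an immutable array. Any other
// realization satisfying the same laws (e.g. s.push(x).peek() === x)
// would be an equally valid stack.
const makeStack = <T>(values: T[] = []): Stack<T> => ({
  push: (x: T) => makeStack([...values, x]),
  pop: () => makeStack(values.slice(0, -1)),
  peek: () => values[values.length - 1],
})
```

&lt;p&gt;The law that &lt;code&gt;s.push(x).peek()&lt;/code&gt; equals &lt;code&gt;x&lt;/code&gt; is what makes the structure a stack, not the bytes behind it.&lt;/p&gt;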
&lt;h3&gt;Let&apos;s think about computational models&lt;/h3&gt;
&lt;p&gt;The root of this problem is that you may not know about computational models. This is an important step: once you see that in computer science we use math to model computational problems, and that computational problems are mathematical problems, you will probably never again reduce a computational problem to a physical, limited resource (like memory).&lt;/p&gt;
&lt;p&gt;Computer science problems are grounded in math problems, and we use math to model them. This means that the problem you&apos;re trying to solve can be modeled as a mathematical problem before you start thinking about any computational implementation.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;I hope you now understand that data structures &lt;strong&gt;are not only&lt;/strong&gt; about memory. They are mathematical concepts that we use to model computational problems.&lt;/p&gt;
&lt;p&gt;This text is a short talk about some ideas I wanted to share. I tried to give some references to help if you want to dive into the topic. Also, it&apos;s &lt;strong&gt;very&lt;/strong&gt; important to think about obvious concepts and try to understand &lt;strong&gt;why&lt;/strong&gt; things are the way they are.&lt;/p&gt;
</content:encoded></item><item><title>Introduction to Big O notation</title><link>https://vitorsalmeida.com/intro-bigo/</link><guid isPermaLink="true">https://vitorsalmeida.com/intro-bigo/</guid><pubDate>Sat, 12 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;In this post, we will explore the concept of algorithm efficiency and how to measure it using big O notation. Additionally, we will see how this can help us write more performant code. Big O notation allows us to evaluate the performance of an algorithm according to the size of its input.&lt;/p&gt;
&lt;h3&gt;What is big O notation and algorithm efficiency?&lt;/h3&gt;
&lt;p&gt;Algorithm efficiency is the ability to solve a problem in a reasonable time and with efficient use of computational resources, such as processing and memory. Big O notation is a way of measuring the efficiency of an algorithm by describing the asymptotic behavior of a function. This means we can evaluate a function&apos;s growth rate and compare algorithms with one another.&lt;/p&gt;
&lt;p&gt;Big O notation is represented by the letter O and is written as follows: O(f(n)), where f(n) describes how the cost grows with the size of the input (n). For example, an algorithm that grows quadratically, that is, proportionally to the square of the input size, is represented by O(n²).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Constant time: O(1)&lt;/li&gt;
&lt;li&gt;Linear time: O(n)&lt;/li&gt;
&lt;li&gt;Logarithmic time: O(log(n))&lt;/li&gt;
&lt;li&gt;Quadratic time: O(n²)&lt;/li&gt;
&lt;/ul&gt;
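&lt;p&gt;To make the first two entries concrete, here are minimal sketches (the function names are mine) of a constant-time and a linear-time operation:&lt;/p&gt;

```typescript
// O(1): reading one index touches a single element, regardless of array size.
function first(arr: number[]): number | undefined {
  return arr[0]
}

// O(n): summing must visit every element exactly once.
function sum(arr: number[]): number {
  let total = 0
  for (const x of arr) {
    total += x
  }
  return total
}
```

&lt;p&gt;Doubling the array doubles the work done by &lt;code&gt;sum&lt;/code&gt;, while &lt;code&gt;first&lt;/code&gt; stays unchanged.&lt;/p&gt;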
&lt;h3&gt;Big O notation in code&lt;/h3&gt;
&lt;p&gt;Suppose you have a list and need to find a certain element &apos;x&apos;. A simple algorithm for this is called sequential search.&lt;/p&gt;
&lt;p&gt;Another example is when you need to sort a list. A simple algorithm for that is called insertion sort.&lt;/p&gt;
&lt;p&gt;Sequential search example&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// linear time
function sequentialSearch(arr: number[], x: number) {
  for (let i = 0; i &amp;lt; arr.length; i++) {
    if (arr[i] === x) {
      return i
    }
  }
  return -1
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Insertion sort example&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// quadratic time
function insertionSort(arr: number[]) {
  for (let i = 1; i &amp;lt; arr.length; i++) {
    const currentVal = arr[i]
    let j = i - 1
    // shift larger elements one position to the right
    while (j &amp;gt;= 0 &amp;amp;&amp;amp; arr[j] &amp;gt; currentVal) {
      arr[j + 1] = arr[j]
      j--
    }
    arr[j + 1] = currentVal
  }

  return arr
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both codes work and solve the proposed problem, but one performs better than the other. The first code is linear, which means the for loop runs a number of times directly proportional to the size of the array. If the array has n elements, the loop runs at most n times, which can be represented by O(n).&lt;/p&gt;
&lt;p&gt;The advantage of this approach is that, for larger arrays, the code runs faster, as the number of iterations grows only linearly with the size of the array. The time complexity is bounded by the size of the array, resulting in a less steep growth curve compared to code with quadratic complexity. In other words, the first code is more performant and efficient in situations where the array can be very large.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/introbigo/bigo.jpg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The second code is an example of quadratic complexity, O(n²). The for loop inside the for loop runs a number of times proportional to the square of the size of the array. In other words, if the array has n elements, the inner loop body runs on the order of n * n times in the worst case, which can be represented by O(n²).&lt;/p&gt;
&lt;p&gt;The implication of this complexity is that, for larger arrays, the growth curve is steeper, and the code becomes slower much more quickly as the input grows.&lt;/p&gt;
&lt;p&gt;Now, let&apos;s see some examples of code with complexities O(log(n)) and O(n log(n)).&lt;/p&gt;
&lt;p&gt;Suppose you receive a sorted list of numbers and need to find a certain number x in it. For this, you can use the binary search algorithm, which has complexity O(log(n)).&lt;/p&gt;
&lt;p&gt;Another example is when you need to sort a list of numbers faster than a quadratic algorithm allows. For this, you can use the merge sort algorithm, which has complexity O(n log(n)).&lt;/p&gt;
&lt;p&gt;These algorithms are more performant and efficient than linear or quadratic approaches in situations where the size of the input can be very large.&lt;/p&gt;
&lt;p&gt;Binary search example&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// O(log(n)), assumes arr is sorted in ascending order
function binarySearch(arr: number[], x: number) {
  let left = 0
  let right = arr.length - 1

  while (left &amp;lt;= right) {
    let mid = Math.floor((left + right) / 2)
    if (arr[mid] === x) {
      return mid
    }
    if (arr[mid] &amp;lt; x) {
      left = mid + 1
    } else {
      right = mid - 1
    }
  }
  return -1
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Merge sort example&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// O(n log(n))
function mergeSort(arr: number[]) {
  if (arr.length &amp;lt;= 1) {
    return arr
  }

  let mid = Math.floor(arr.length / 2)
  let left = arr.slice(0, mid)
  let right = arr.slice(mid)

  return merge(mergeSort(left), mergeSort(right))
}

function merge(left: number[], right: number[]) {
  const result: number[] = []
  let i = 0
  let j = 0

  while (i &amp;lt; left.length &amp;amp;&amp;amp; j &amp;lt; right.length) {
    if (left[i] &amp;lt; right[j]) {
      result.push(left[i])
      i++
    } else {
      result.push(right[j])
      j++
    }
  }

  return result.concat(left.slice(i)).concat(right.slice(j))
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both examples are valid, but they have different time complexities. The first one is O(log(n)), which means the running time grows logarithmically with the size of the input. In the worst case, if the array has 8 elements, the loop runs about 3 times, since log2(8) = 3.&lt;/p&gt;
&lt;p&gt;The second example, O(n log(n)), indicates that the running time of an algorithm grows proportionally to the product of the input size and the logarithm of that size. This means that, in the worst case, if the array has 8 elements, the algorithm performs on the order of 24 steps, since 8 * log2(8) = 24.&lt;/p&gt;
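&lt;p&gt;You can check these counts empirically. As a sketch (the step counter is my addition), here is binary search instrumented to count its loop iterations:&lt;/p&gt;

```typescript
// Binary search over a sorted array, returning the index found (or -1)
// together with the number of loop iterations performed.
function binarySearchCounted(
  arr: number[],
  x: number
): { index: number; steps: number } {
  let left = 0
  let right = arr.length - 1
  let steps = 0

  while (left <= right) {
    steps++
    const mid = Math.floor((left + right) / 2)
    if (arr[mid] === x) return { index: mid, steps }
    if (arr[mid] < x) left = mid + 1
    else right = mid - 1
  }
  return { index: -1, steps }
}
```

&lt;p&gt;For a sorted array of 8 elements, no search takes more than log2(8) + 1 = 4 iterations, while a sequential search may need all 8.&lt;/p&gt;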
&lt;p&gt;It is important to remember, though, that the time complexity of an algorithm does not by itself determine speed. An algorithm with worse complexity can be faster than one with better complexity on specific inputs. In general, however, it is safe to say that the lower the complexity, the faster the algorithm will be as the input grows.&lt;/p&gt;
&lt;p&gt;For those who want to delve deeper into the subject, I recommend reading the book “Introduction to Algorithms” by Thomas H. Cormen et al.&lt;/p&gt;
</content:encoded></item><item><title>Proving natural numbers are infinity in Coq</title><link>https://vitorsalmeida.com/proving-naturals-infinity/</link><guid isPermaLink="true">https://vitorsalmeida.com/proving-naturals-infinity/</guid><pubDate>Sun, 13 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;An interesting fact about math is that you can demonstrate things. Many people think that to demonstrate something is to give an example of it, but it is not. It&apos;s common to see people saying that $1 + 1 = 2$, just like one apple plus one apple is two apples. But this is not a demonstration. A demonstration is a logical proof, done in a formal way, that something is true. You cannot say that 1 plus itself is always equal to two. Boolean algebra is a good example of this: in Boolean algebra, $1 + 1 = 1$. So, you can&apos;t say that $1 + 1 = 2$ always holds.&lt;/p&gt;
&lt;h2&gt;What is Coq?&lt;/h2&gt;
&lt;p&gt;Coq is a piece of software that allows you to write proofs. It is a proof assistant based on the calculus of inductive constructions, and its language is a functional programming language rooted in lambda calculus. I won&apos;t talk about lambda calculus or the calculus of inductive constructions here, but you can find more information about them on Wikipedia.&lt;/p&gt;
&lt;h2&gt;What is a formal proof?&lt;/h2&gt;
&lt;p&gt;A formal proof is a process, based on logical rules and axioms, used to demonstrate a theorem. The goal is to establish, by following strict rules, that some statement is true.&lt;/p&gt;
&lt;h2&gt;The proof&lt;/h2&gt;
&lt;p&gt;A simple way to prove that the natural numbers are infinite is to show that, given any natural number ($n: nat$), $n + 1$ is a natural number greater than $n$.&lt;/p&gt;
&lt;p&gt;This suffices because it shows that, given &lt;strong&gt;any&lt;/strong&gt; natural number, you can &lt;strong&gt;always&lt;/strong&gt; find a natural number greater than it by adding $1$ to it.&lt;/p&gt;
&lt;p&gt;So, let&apos;s define our theorem in Coq:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Theorem plus_1_natural : forall n : nat, 1 + n = S n /\ S n &amp;gt; n.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;forall&lt;/code&gt; keyword means the theorem is valid for all natural numbers. The $\land$ means the theorem is a conjunction, and $S$ is the successor function. So, the theorem says that given a natural number $n$, $1 + n$ is equal to $S n$, and $S n$ is greater than $n$.&lt;/p&gt;
&lt;p&gt;Let&apos;s prove the theorem:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Proof.
  intros n.
  split.
  - reflexivity.
  - apply le_n_S. apply le_n.
Qed.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;intros n&lt;/code&gt; introduces the universally quantified (&lt;code&gt;forall&lt;/code&gt;) natural number $n$ as an arbitrary variable. &lt;code&gt;split.&lt;/code&gt; splits the goal into two subgoals, one for each conjunct of the $\land$. The hyphens (bullets) focus each subgoal. &lt;code&gt;reflexivity.&lt;/code&gt; proves that something is equal to itself, like $1 + n = S n$, because $1 + n$ computes to $S n$. You can reduce it, like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$1 + 1 = 2$.&lt;/li&gt;
&lt;li&gt;$2 = 2$.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Coq has tactics that abstract away proof steps. We can watch this reduction using &lt;code&gt;simpl.&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/natinfinity/simpl.gif&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Coq gives us some theorems about the ordering of natural numbers, like &lt;code&gt;le_n_S&lt;/code&gt; and &lt;code&gt;le_n&lt;/code&gt;. These theorems can be applied to discharge steps of our proof. In math, &lt;code&gt;le&lt;/code&gt; is $\leq$, less than or equal.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;le_n&lt;/code&gt; says that any natural number is less than or equal to itself: $n \leq n$ is a true statement for any natural number $n$.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;le_n_S&lt;/code&gt; says that if $n \leq m$, then $S n \leq S m$. So, if $n \leq n$, then $S n \leq S n$.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The definition of &lt;code&gt;le_n_S&lt;/code&gt; is &lt;code&gt;le_n_S : forall n m : nat, n &amp;lt;= m -&amp;gt; S n &amp;lt;= S m&lt;/code&gt;, and that of &lt;code&gt;le_n&lt;/code&gt; is &lt;code&gt;le_n : forall n : nat, n &amp;lt;= n&lt;/code&gt;. So, &lt;code&gt;apply le_n_S. apply le_n.&lt;/code&gt; reads: &quot;Given a natural number &lt;code&gt;n&lt;/code&gt;, if &lt;code&gt;n &amp;lt;= n&lt;/code&gt;, then &lt;code&gt;S n &amp;lt;= S n&lt;/code&gt;&quot;. And this holds, because &lt;code&gt;n &amp;lt;= n&lt;/code&gt; is true by &lt;code&gt;le_n&lt;/code&gt;, so &lt;code&gt;S n &amp;lt;= S n&lt;/code&gt; is true too. So, the theorem is proved. &lt;code&gt;Qed.&lt;/code&gt; marks that the proof is finished.&lt;/p&gt;
&lt;p&gt;You can also see it on CoqIDE:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/natinfinity/le.gif&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;You can also print the proof term and evaluate the proof. So, let&apos;s do it.&lt;/p&gt;
&lt;p&gt;You can use &lt;code&gt;Print&lt;/code&gt; to show the definition of a theorem, and &lt;code&gt;Eval compute in&lt;/code&gt; to evaluate it. In the end, our file will be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Theorem plus_1_natural : forall n : nat, 1 + n = S n /\ S n &amp;gt; n.
Proof.
  intros n.
  split.
  - reflexivity.
  - apply le_n_S. apply le_n.
Qed.
Print plus_1_natural.
Eval compute in (plus_1_natural 1).
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the output will be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;plus_1_natural = fun n : nat =&amp;gt; conj eq_refl (le_n_S n n (le_n n))
: forall n : nat, 1 + n = S n /\ S n &amp;gt; n
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;= plus_1_natural 1
: 1 + 1 = 2 /\ 2 &amp;gt; 1
&lt;/code&gt;&lt;/pre&gt;
</content:encoded></item><item><title>Building a JSON Parser from scratch with JS</title><link>https://vitorsalmeida.com/building-json-parser-from-scratch/</link><guid isPermaLink="true">https://vitorsalmeida.com/building-json-parser-from-scratch/</guid><pubDate>Sat, 12 Aug 2023 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;A parser can have various applications in everyday life, and you probably use some parser daily. &lt;a href=&quot;https://babeljs.io/&quot;&gt;Babel&lt;/a&gt;, &lt;a href=&quot;https://webpack.js.org/&quot;&gt;webpack&lt;/a&gt;, &lt;a href=&quot;https://eslint.org/&quot;&gt;eslint&lt;/a&gt;, &lt;a href=&quot;https://prettier.io/&quot;&gt;prettier&lt;/a&gt;, and &lt;a href=&quot;https://github.com/facebook/jscodeshift&quot;&gt;jscodeshift&lt;/a&gt;. All of them, behind the scenes, run a parser that manipulates an Abstract Syntax Tree (AST) to do what you need - we&apos;ll talk about that later, don&apos;t worry.&lt;/p&gt;
&lt;p&gt;The idea of this text is to introduce the concept of lexing and parsing, implementing them using JavaScript to analyze expressions in JSON. The goal will be to separate this process into functions, explain these functions, and, in the end, have you implement a JSON parser generating an AST.&lt;/p&gt;
&lt;p&gt;It&apos;s worth noting that my repository is open, and you can access it &lt;a href=&quot;https://github.com/vit0rr/json-parser&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Lexing&lt;/h2&gt;
&lt;p&gt;A &lt;code&gt;lexer&lt;/code&gt; will be responsible for converting an expression, whatever it may be, into tokens. These tokens are identifiable elements that have an assigned meaning.&lt;/p&gt;
&lt;p&gt;You can divide these tokens in several ways, such as &lt;a href=&quot;https://en.wikipedia.org/wiki/Identifier_(computer_languages)&quot;&gt;identifier&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Reserved_word&quot;&gt;keyword&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Delimiter&quot;&gt;delimiter&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Operator_(computer_programming)&quot;&gt;operator&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Literal_(computer_programming)&quot;&gt;literal&lt;/a&gt;, and &lt;a href=&quot;https://en.wikipedia.org/wiki/Comment_(computer_programming)&quot;&gt;comment&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{ &quot;type&quot;: &quot;LEFT_BRACE&quot;, &quot;value&quot;: undefined }&lt;/code&gt; is an example of a delimiter. &lt;code&gt;{ &quot;type&quot;: &quot;STRING&quot;, &quot;value&quot;: &quot;name&quot; }&lt;/code&gt; is an example of a literal.&lt;/p&gt;
&lt;p&gt;Example: &lt;code&gt;{&quot;name&quot;:&quot;Vitor&quot;,&quot;age&quot;:18}&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[
  { &quot;type&quot;: &quot;LEFT_BRACE&quot;, &quot;value&quot;: undefined },
  { &quot;type&quot;: &quot;STRING&quot;, &quot;value&quot;: &quot;name&quot; },
  { &quot;type&quot;: &quot;COLON&quot;, &quot;value&quot;: undefined },
  { &quot;type&quot;: &quot;STRING&quot;, &quot;value&quot;: &quot;Vitor&quot; },
  { &quot;type&quot;: &quot;COMMA&quot;, &quot;value&quot;: undefined },
  { &quot;type&quot;: &quot;STRING&quot;, &quot;value&quot;: &quot;age&quot; },
  { &quot;type&quot;: &quot;COLON&quot;, &quot;value&quot;: undefined },
  { &quot;type&quot;: &quot;NUMBER&quot;, &quot;value&quot;: &quot;18&quot; },
  { &quot;type&quot;: &quot;RIGHT_BRACE&quot;, &quot;value&quot;: undefined }
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that in lexical analysis, you separate your expression into tokens, and each token has its identification.&lt;/p&gt;
&lt;p&gt;To code this, let&apos;s first understand what we will be doing exactly. The idea of the &lt;code&gt;lexer&lt;/code&gt; function is to receive an argument of type String and return an Array of tokens, which will be our JSON divided into specific types of information, as we have already seen and discussed.&lt;/p&gt;
&lt;p&gt;To achieve this, we will create a variable called &lt;code&gt;current&lt;/code&gt;, which will store the current position of the character in the &lt;code&gt;input&lt;/code&gt; being analyzed by the &lt;code&gt;lexer&lt;/code&gt;. In other words, it represents the position it is currently at in our JSON. Additionally, we will have a constant called &lt;code&gt;tokens&lt;/code&gt;, which will be an array that will hold all of our tokens in the end.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export const lexer = (input: string): Token[] =&amp;gt; {
  let current = 0
  const tokens: Token[] = []
}
&lt;/code&gt;&lt;/pre&gt;
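&lt;p&gt;The &lt;code&gt;Token&lt;/code&gt; type itself is not shown here; a shape consistent with the token examples above (my reconstruction, not necessarily the repository&apos;s exact definition) would be:&lt;/p&gt;

```typescript
// A token pairs a type tag with an optional value; only STRING and NUMBER
// tokens carry a value, matching the token list shown earlier.
type Token = {
  type: string
  value?: string
}

const example: Token = { type: 'STRING', value: 'name' }
```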
&lt;p&gt;Now, we need to run a loop that will iterate until all the characters of the input have been processed.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note that it&apos;s possible to refactor all these &lt;code&gt;if&lt;/code&gt; blocks into a &lt;code&gt;switch&lt;/code&gt; statement. However, I followed an imperative style.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;pre&gt;&lt;code&gt;export const lexer = (input: string): Token[] =&amp;gt; {
  let current = 0
  const tokens: Token[] = []

  while (current &amp;lt; input.length) {
    let char = input[current]

    if (char === &apos;{&apos;) {
      tokens.push(createToken(TOKEN_TYPES.LEFT_BRACE))
      current++
      continue
    }

    if (char === &apos;}&apos;) {
      tokens.push(createToken(TOKEN_TYPES.RIGHT_BRACE))
      current++
      continue
    }

    if (char === &apos;[&apos;) {
      tokens.push(createToken(TOKEN_TYPES.LEFT_BRACKET))
      current++
      continue
    }

    if (char === &apos;]&apos;) {
      tokens.push(createToken(TOKEN_TYPES.RIGHT_BRACKET))
      current++
      continue
    }

    if (char === &apos;:&apos;) {
      tokens.push(createToken(TOKEN_TYPES.COLON))
      current++
      continue
    }

    if (char === &apos;,&apos;) {
      tokens.push(createToken(TOKEN_TYPES.COMMA))
      current++
      continue
    }

    const WHITESPACE = /\s/
    if (WHITESPACE.test(char)) {
      current++
      continue
    }

    const NUMBERS = /[0-9]/
    if (NUMBERS.test(char)) {
      let value = &apos;&apos;
      while (NUMBERS.test(char)) {
        value += char
        char = input[++current]
      }
      tokens.push(createToken(TOKEN_TYPES.NUMBER, value))
      continue
    }

    if (char === &apos;&quot;&apos;) {
      let value = &apos;&apos;
      char = input[++current]
      while (char !== &apos;&quot;&apos;) {
        value += char
        char = input[++current]
      }
      char = input[++current]
      tokens.push(createToken(TOKEN_TYPES.STRING, value))
      continue
    }

    if (
      char === &apos;t&apos; &amp;amp;&amp;amp;
      input[current + 1] === &apos;r&apos; &amp;amp;&amp;amp;
      input[current + 2] === &apos;u&apos; &amp;amp;&amp;amp;
      input[current + 3] === &apos;e&apos;
    ) {
      tokens.push(createToken(TOKEN_TYPES.TRUE))
      current += 4
      continue
    }

    if (
      char === &apos;f&apos; &amp;amp;&amp;amp;
      input[current + 1] === &apos;a&apos; &amp;amp;&amp;amp;
      input[current + 2] === &apos;l&apos; &amp;amp;&amp;amp;
      input[current + 3] === &apos;s&apos; &amp;amp;&amp;amp;
      input[current + 4] === &apos;e&apos;
    ) {
      tokens.push(createToken(TOKEN_TYPES.FALSE))
      current += 5
      continue
    }

    if (
      char === &apos;n&apos; &amp;amp;&amp;amp;
      input[current + 1] === &apos;u&apos; &amp;amp;&amp;amp;
      input[current + 2] === &apos;l&apos; &amp;amp;&amp;amp;
      input[current + 3] === &apos;l&apos;
    ) {
      tokens.push(createToken(TOKEN_TYPES.NULL))
      current += 4
      continue
    }

    throw new TypeError(&apos;I don\&apos;t know what this character is: &apos; + char)
  }

  return tokens
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code may seem complicated, but it&apos;s actually quite simple. Inside my loop, I start by defining the variable &lt;code&gt;char&lt;/code&gt;, which will store the character currently being analyzed in that iteration of the loop. Then, for each type of character, we want a specific action.&lt;/p&gt;
&lt;p&gt;If the &lt;code&gt;char&lt;/code&gt; is equal to &lt;code&gt;{&lt;/code&gt;, we push a new token into the tokens array, passing it the token-type enum value, which holds the name of each token.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export enum TOKEN_TYPES {
  LEFT_BRACE = &apos;LEFT_BRACE&apos;,
  RIGHT_BRACE = &apos;RIGHT_BRACE&apos;,
  LEFT_BRACKET = &apos;LEFT_BRACKET&apos;,
  RIGHT_BRACKET = &apos;RIGHT_BRACKET&apos;,
  COLON = &apos;COLON&apos;,
  COMMA = &apos;COMMA&apos;,
  STRING = &apos;STRING&apos;,
  NUMBER = &apos;NUMBER&apos;,
  TRUE = &apos;TRUE&apos;,
  FALSE = &apos;FALSE&apos;,
  NULL = &apos;NULL&apos;,
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function &lt;code&gt;createToken&lt;/code&gt; simply returns an object with the &lt;code&gt;type&lt;/code&gt; and, if it exists, the &lt;code&gt;value&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export const createToken = (type: TOKEN_TYPES, value?: string): Token =&amp;gt; {
  return {
    type,
    value,
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After that, we increment the &lt;code&gt;current&lt;/code&gt; variable by 1 to move to the next &lt;code&gt;char&lt;/code&gt; in our string. This process is quite repetitive and straightforward, so I won&apos;t explain each case, but it&apos;s worth paying attention to the ones that deviate a bit from the pattern.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if (char === &apos;&quot;&apos;) {
  let value = &apos;&apos;
  char = input[++current]
  while (char !== &apos;&quot;&apos;) {
    value += char
    char = input[++current]
  }

  char = input[++current]
  tokens.push(createToken(TOKEN_TYPES.STRING, value))
  continue
}

if (
  char === &apos;t&apos; &amp;amp;&amp;amp;
  input[current + 1] === &apos;r&apos; &amp;amp;&amp;amp;
  input[current + 2] === &apos;u&apos; &amp;amp;&amp;amp;
  input[current + 3] === &apos;e&apos;
) {
  tokens.push(createToken(TOKEN_TYPES.TRUE))
  current += 4
  continue
}

if (
  char === &apos;f&apos; &amp;amp;&amp;amp;
  input[current + 1] === &apos;a&apos; &amp;amp;&amp;amp;
  input[current + 2] === &apos;l&apos; &amp;amp;&amp;amp;
  input[current + 3] === &apos;s&apos; &amp;amp;&amp;amp;
  input[current + 4] === &apos;e&apos;
) {
  tokens.push(createToken(TOKEN_TYPES.FALSE))
  current += 5
  continue
}

if (
  char === &apos;n&apos; &amp;amp;&amp;amp;
  input[current + 1] === &apos;u&apos; &amp;amp;&amp;amp;
  input[current + 2] === &apos;l&apos; &amp;amp;&amp;amp;
  input[current + 3] === &apos;l&apos;
) {
  tokens.push(createToken(TOKEN_TYPES.NULL))
  current += 4
  continue
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the &lt;code&gt;char&lt;/code&gt; is equal to &lt;code&gt;&quot;&lt;/code&gt;, we enter a loop that keeps reading the subsequent characters until we find another quotation mark, which indicates the end of the string. All the characters read during this loop are concatenated into the &lt;code&gt;value&lt;/code&gt; variable. Then, a new token is added to the &lt;code&gt;tokens&lt;/code&gt; array.&lt;/p&gt;
&lt;p&gt;The next blocks handle the literal keywords. If the current character is &lt;code&gt;t&lt;/code&gt; and the subsequent characters spell the word &lt;code&gt;true&lt;/code&gt;, we add a &lt;code&gt;TRUE&lt;/code&gt; token to the &lt;code&gt;tokens&lt;/code&gt; array. The same process is repeated for &lt;code&gt;false&lt;/code&gt; and &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In all cases, the &lt;code&gt;current&lt;/code&gt; variable is incremented to point to the next character to be processed, just like we did throughout the rest of our code.&lt;/p&gt;
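&lt;p&gt;To see the whole thing working end to end, here is a condensed, self-contained sketch of the lexer (delimiters, strings, and numbers only, for brevity; the full version also handles &lt;code&gt;true&lt;/code&gt;, &lt;code&gt;false&lt;/code&gt;, and &lt;code&gt;null&lt;/code&gt;):&lt;/p&gt;

```typescript
type Token = { type: string; value?: string }

// A condensed lexer handling delimiters, strings, and numbers: enough to
// tokenize the running example from the beginning of the post.
const lex = (input: string): Token[] => {
  const tokens: Token[] = []
  let current = 0
  const delims: Record<string, string> = {
    '{': 'LEFT_BRACE', '}': 'RIGHT_BRACE',
    '[': 'LEFT_BRACKET', ']': 'RIGHT_BRACKET',
    ':': 'COLON', ',': 'COMMA',
  }

  while (current < input.length) {
    let char = input[current]

    if (delims[char]) {
      tokens.push({ type: delims[char] })
      current++
      continue
    }
    if (/\s/.test(char)) {
      current++
      continue
    }
    if (/[0-9]/.test(char)) {
      let value = ''
      while (/[0-9]/.test(char)) {
        value += char
        char = input[++current]
      }
      tokens.push({ type: 'NUMBER', value })
      continue
    }
    if (char === '"') {
      let value = ''
      char = input[++current]
      while (char !== '"') {
        value += char
        char = input[++current]
      }
      current++ // skip the closing quote
      tokens.push({ type: 'STRING', value })
      continue
    }
    throw new TypeError('Unknown character: ' + char)
  }

  return tokens
}

const tokens = lex('{"name":"Vitor","age":18}')
```

&lt;p&gt;Running it on the example input produces the nine tokens listed at the beginning of the post.&lt;/p&gt;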
&lt;h2&gt;Parsing&lt;/h2&gt;
&lt;p&gt;A parser is responsible for transforming a sequence of tokens into a data structure, in this case, an &lt;a href=&quot;https://en.wikipedia.org/wiki/Abstract_syntax_tree&quot;&gt;AST&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Illustration of an AST taken from the book &quot;Modern Compiler Implementation in ML.&quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yafyeyuq1phdzbjhhnud.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;An Abstract Syntax Tree (AST) is a data structure that represents the syntactic structure of a program. Within the AST, there are several nodes, and each node represents a valid syntactic construct of the program. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;parser: {
  &quot;type&quot;: &quot;Program&quot;,
  &quot;body&quot;: [
    {
      &quot;type&quot;: &quot;ObjectExpression&quot;,
      &quot;properties&quot;: [
        {
          &quot;type&quot;: &quot;Property&quot;,
          &quot;key&quot;: {
            &quot;type&quot;: &quot;STRING&quot;,
            &quot;value&quot;: &quot;name&quot;
          },
          &quot;value&quot;: {
            &quot;type&quot;: &quot;StringLiteral&quot;,
            &quot;value&quot;: &quot;Vitor&quot;
          }
        },
        {
          &quot;type&quot;: &quot;Property&quot;,
          &quot;key&quot;: {
            &quot;type&quot;: &quot;STRING&quot;,
            &quot;value&quot;: &quot;age&quot;
          },
          &quot;value&quot;: {
            &quot;type&quot;: &quot;NumberLiteral&quot;,
            &quot;value&quot;: &quot;18&quot;
          }
        }
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the AST of the JSON I exemplified at the beginning. In this example, we have a total of 8 nodes. The &lt;code&gt;Program&lt;/code&gt; node represents the main program. &lt;code&gt;ObjectExpression&lt;/code&gt; represents an object; &lt;code&gt;Property&lt;/code&gt; represents a property within an object, consisting of a key and a value. &lt;code&gt;STRING&lt;/code&gt; represents a string used as a key, &lt;code&gt;StringLiteral&lt;/code&gt; represents a string value within a property, and &lt;code&gt;NumberLiteral&lt;/code&gt; represents a numeric value within a property.&lt;/p&gt;
&lt;p&gt;Through an AST, it&apos;s possible to optimize code, transform one code into another, perform static analysis, generate code, and more. For example, you could implement a new syntax, create a parser, and generate JavaScript code that would execute normally.&lt;/p&gt;
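&lt;p&gt;As a tiny illustration of traversing such a tree (a sketch; the node shapes follow the example above, and the function name is mine), here is a function that collects every property key in the AST:&lt;/p&gt;

```typescript
// Minimal node shape matching the example AST above.
type Node = {
  type: string
  body?: Node[]
  properties?: { type: string; key: { type: string; value: string }; value: Node }[]
  value?: string
}

// Recursively walk the tree, collecting the name of every Property key.
const collectKeys = (node: Node): string[] => {
  const keys: string[] = []
  for (const child of node.body ?? []) {
    keys.push(...collectKeys(child))
  }
  for (const prop of node.properties ?? []) {
    keys.push(prop.key.value)
    keys.push(...collectKeys(prop.value)) // descend into nested objects
  }
  return keys
}

// The AST from the example above.
const ast: Node = {
  type: 'Program',
  body: [
    {
      type: 'ObjectExpression',
      properties: [
        {
          type: 'Property',
          key: { type: 'STRING', value: 'name' },
          value: { type: 'StringLiteral', value: 'Vitor' },
        },
        {
          type: 'Property',
          key: { type: 'STRING', value: 'age' },
          value: { type: 'NumberLiteral', value: '18' },
        },
      ],
    },
  ],
}
```

&lt;p&gt;Here, &lt;code&gt;collectKeys(ast)&lt;/code&gt; returns &lt;code&gt;[&apos;name&apos;, &apos;age&apos;]&lt;/code&gt;. The same traversal pattern underlies linters and code transformers.&lt;/p&gt;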
&lt;p&gt;To generate our AST, we will need a function that receives our array of tokens, iterates through it, and generates the AST according to the tokens it encounters. For this purpose, we will create a function called &lt;code&gt;walk&lt;/code&gt;, which will traverse the tokens and return the nodes of the AST.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export const parser = (tokens: Array&amp;lt;{ type: string; value?: any }&amp;gt;) =&amp;gt; {
    let current = 0;


    const walk = () =&amp;gt; {
        let token = tokens[current];


        if (token.type === TOKEN_TYPES.LEFT_BRACE) {
            token = tokens[++current];


            const node: {
                type: string;
                properties?: Array&amp;lt;{ type: string; key: any; value: any }&amp;gt;;
            } = {
                type: &apos;ObjectExpression&apos;,
                properties: [],
            };


            while (token.type !== TOKEN_TYPES.RIGHT_BRACE) {
                const property: { type: string; key: any; value: any } = {
                    type: &apos;Property&apos;,
                    key: token,
                    value: null,
                };


                token = tokens[++current];


                token = tokens[++current];
                property.value = walk();
                node.properties.push(property);


                token = tokens[current];
                if (token.type === TOKEN_TYPES.COMMA) {
                    token = tokens[++current];
                }
            }


            current++;
            return node;
        }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first check we perform is whether the current token is &lt;code&gt;{&lt;/code&gt;. If it is, we create a new node of type &lt;code&gt;ObjectExpression&lt;/code&gt; and iterate through the following tokens, adding them as properties of the object until we find the closing &lt;code&gt;}&lt;/code&gt; of the object. Each property is represented by a node of type &lt;code&gt;Property&lt;/code&gt;. This &lt;code&gt;Property&lt;/code&gt; type has a value that is generated by the &lt;code&gt;walk()&lt;/code&gt; function, which is called recursively.&lt;/p&gt;
&lt;p&gt;Note that I use &lt;code&gt;tokens[++current]&lt;/code&gt; to advance the cursor to the next token. And if a token of type &lt;code&gt;,&lt;/code&gt; (comma) is found, we advance the cursor again to skip the comma.&lt;/p&gt;
&lt;p&gt;The rest of the code is quite similar to what I&apos;ve just explained, so it&apos;s worth the effort to look and try to understand or implement the rest on your own. It&apos;s not very complex.&lt;/p&gt;
&lt;p&gt;Finally, I create the constant &lt;code&gt;ast&lt;/code&gt;, which will contain the type &lt;code&gt;Program&lt;/code&gt; and the body of the AST, generated by the &lt;code&gt;walk()&lt;/code&gt; function. The while loop ensures that &lt;code&gt;current&lt;/code&gt; does not exceed the size of the tokens array.&lt;/p&gt;
&lt;p&gt;After that, we simply return the AST.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;export const parser = (tokens: Array&amp;lt;{ type: string; value?: any }&amp;gt;) =&amp;gt; {
  let current = 0


  const walk = () =&amp;gt; {
    let token = tokens[current]


    if (token.type === TOKEN_TYPES.LEFT_BRACE) {
      token = tokens[++current]


      const node: {
        type: string
        properties?: Array&amp;lt;{ type: string; key: any; value: any }&amp;gt;
      } = {
        type: &apos;ObjectExpression&apos;,
        properties: [],
      }


      while (token.type !== TOKEN_TYPES.RIGHT_BRACE) {
        const property: { type: string; key: any; value: any } = {
          type: &apos;Property&apos;,
          key: token,
          value: null,
        }


        token = tokens[++current]


        token = tokens[++current]
        property.value = walk()
        node.properties.push(property)


        token = tokens[current]
        if (token.type === TOKEN_TYPES.COMMA) {
          token = tokens[++current]
        }
      }


      current++
      return node
    }


    if (token.type === TOKEN_TYPES.RIGHT_BRACE) {
      current++
      return {
        type: &apos;ObjectExpression&apos;,
        properties: [],
      }
    }


    if (token.type === TOKEN_TYPES.LEFT_BRACKET) {
      token = tokens[++current]


      const node: {
        type: string
        elements?: Array&amp;lt;{ type?: string; value?: any }&amp;gt;
      } = {
        type: &apos;ArrayExpression&apos;,
        elements: [],
      }


      while (token.type !== TOKEN_TYPES.RIGHT_BRACKET) {
        node.elements.push(walk())
        token = tokens[current]


        if (token.type === TOKEN_TYPES.COMMA) {
          token = tokens[++current]
        }
      }


      current++
      return node
    }


    if (token.type === TOKEN_TYPES.STRING) {
      current++
      return {
        type: &apos;StringLiteral&apos;,
        value: token.value,
      }
    }


    if (token.type === TOKEN_TYPES.NUMBER) {
      current++
      return {
        type: &apos;NumberLiteral&apos;,
        value: token.value,
      }
    }


    if (token.type === TOKEN_TYPES.TRUE) {
      current++
      return {
        type: &apos;BooleanLiteral&apos;,
        value: true,
      }
    }


    if (token.type === TOKEN_TYPES.FALSE) {
      current++
      return {
        type: &apos;BooleanLiteral&apos;,
        value: false,
      }
    }


    if (token.type === TOKEN_TYPES.NULL) {
      current++
      return {
        type: &apos;NullLiteral&apos;,
        value: null,
      }
    }


    throw new TypeError(token.type)
  }


  const ast = {
    type: &apos;Program&apos;,
    body: [],
  }


  while (current &amp;lt; tokens.length) {
    ast.body.push(walk())
  }


  return ast
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;const tokens = lexer(&apos;{&quot;name&quot;:&quot;Vitor&quot;,&quot;age&quot;:18}&apos;)
console.log(&apos;tokens&apos;, tokens)
const json = parser(tokens)


console.log(&apos;parser:&apos;, JSON.stringify(json, null, 2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Parsing can be fun, but in practice parsers are rarely written by hand, and you probably don&apos;t need to write your own. If you are implementing a programming language, for example, there are already various tools that can do this job for you, such as OCamllex, Menhir, or Nearley.&lt;/p&gt;
&lt;p&gt;It&apos;s also essential to note that, as this is an introductory article, I didn&apos;t cover the different tokenization and parsing techniques, such as &lt;a href=&quot;https://en.wikipedia.org/wiki/LR_parser&quot;&gt;LR(0)&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Canonical_LR_parser&quot;&gt;LR(1)&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/SLR_grammar&quot;&gt;SLR(1)&lt;/a&gt;, etc. However, be aware that these techniques exist, and you can research more about them. There are also many books that cover these topics.&lt;/p&gt;
&lt;p&gt;If you want to see how the AST of popular languages looks, I recommend the &lt;a href=&quot;https://astexplorer.net/&quot;&gt;AST Explorer&lt;/a&gt;. It supports various languages, and you can view the complete AST and navigate through the nodes. If you want to go further, you can try to copy some logic from an existing parser and implement it in your own, such as calculating an expression according to precedence order, for example: &lt;code&gt;1 + 2 * 3&lt;/code&gt; (which is 7, not 9).&lt;/p&gt;
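&lt;p&gt;To make the precedence idea concrete, here is a minimal precedence-climbing sketch. It is not part of this article&apos;s parser: it evaluates the expression directly instead of building an AST, and every name in it (&lt;code&gt;evalExpr&lt;/code&gt;, &lt;code&gt;prec&lt;/code&gt;, etc.) is made up for illustration.&lt;/p&gt;

```typescript
// Minimal precedence-climbing sketch over an already-tokenized expression.
// Higher numbers bind tighter.
const prec: Record<string, number> = { '+': 1, '-': 1, '*': 2, '/': 2 };

const evalExpr = (tokens: string[]): number => {
  let pos = 0;

  // A primary expression here is just a number literal.
  const parsePrimary = (): number => Number(tokens[pos++]);

  // Keep consuming operators whose precedence is >= minPrec;
  // the recursive call with a higher minPrec makes tighter
  // operators grab their operands first.
  const parseExpr = (minPrec: number): number => {
    let left = parsePrimary();
    while (pos < tokens.length && prec[tokens[pos]] >= minPrec) {
      const op = tokens[pos++];
      const right = parseExpr(prec[op] + 1);
      if (op === '+') left += right;
      else if (op === '-') left -= right;
      else if (op === '*') left *= right;
      else left /= right;
    }
    return left;
  };

  return parseExpr(1);
};

console.log(evalExpr(['1', '+', '2', '*', '3'])); // 7
```

&lt;p&gt;The trick is that &lt;code&gt;parseExpr(prec[op] + 1)&lt;/code&gt; lets a higher-precedence operator like &lt;code&gt;*&lt;/code&gt; bind its operands before the surrounding &lt;code&gt;+&lt;/code&gt; does.&lt;/p&gt;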
&lt;p&gt;If you&apos;re interested in learning more, I recommend the book &quot;Modern Compiler Implementation in ML.&quot; Despite the ML in the title, you can study from it without necessarily writing ML code, as there are other versions written in C and Java.&lt;/p&gt;
</content:encoded></item><item><title>Introduction to Machine Code</title><link>https://vitorsalmeida.com/introduction-to-machine-code/</link><guid isPermaLink="true">https://vitorsalmeida.com/introduction-to-machine-code/</guid><pubDate>Sun, 03 Nov 2024 00:00:00 GMT</pubDate><content:encoded>&lt;h3&gt;What is machine code?&lt;/h3&gt;
&lt;p&gt;Machine code is code that a computer can execute directly.&lt;/p&gt;
&lt;p&gt;It consists of machine language instructions that control a CPU. This is also why some programming languages have virtual machines: a VM simulates a single, uniform CPU, which lets the same code run on different platforms without caring about specific architectures.&lt;/p&gt;
&lt;p&gt;An opcode (Operation Code) is typically one byte wide, has an arbitrary but unique value, and is the first byte of an instruction.&lt;/p&gt;
&lt;p&gt;An opcode specifies the operation (op) to be performed. In other words, an opcode is the machine-code representation of an instruction.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;MOV AL, 34h
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The opcode here is the MOV instruction. The other parts are the operands. In this case, the operands are the register AL and the value 34 hex. In other words, the operands are the parameters of the instruction. If that&apos;s not clear, it may help to think about math. Think about &lt;strong&gt;3 + 6 = 9&lt;/strong&gt;.
The &quot;opcode&quot; here, the operation, is the &lt;strong&gt;+&lt;/strong&gt; symbol, which means addition. The operands are &lt;strong&gt;3&lt;/strong&gt; and &lt;strong&gt;6&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Assembler, disassembler&lt;/h3&gt;
&lt;p&gt;Of course, no one writes code in machine code — at least not in modern times. Instead, tools called assemblers translate assembly language into machine code.&lt;/p&gt;
&lt;p&gt;In assembly language, each instruction on the computer is represented by a mnemonic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;section .data
    msg db &quot;Result: &quot;             ; message to display
    number db &apos;0&apos;                 ; single character to store the result

section .text
    global _start

_start:
    mov rax, 1                    ; Put number 1 in rax
    add rax, 2                    ; Add 2 to it

    ; Step 2: Convert result to character
    add rax, &apos;0&apos;                  ; Convert number to ASCII character
    mov [number], al              ; Store the character

    ; Step 3: Print &quot;Result: &quot;
    mov rax, 1                    ; System call for write
    mov rdi, 1                    ; File descriptor 1 is stdout
    mov rsi, msg                  ; Address of our message
    mov rdx, 8                    ; Message length
    syscall

    ; Step 4: Print the number
    mov rax, 1                    ; System call for write
    mov rdi, 1                    ; File descriptor 1 is stdout
    mov rsi, number               ; Address of our number
    mov rdx, 1                    ; Length is 1 character
    syscall

    ; Step 5: Exit program
    mov rax, 60                   ; System call for exit
    xor rdi, rdi                  ; Return code 0
    syscall
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code was written for an assembler called NASM and adds &lt;strong&gt;1&lt;/strong&gt; and &lt;strong&gt;2&lt;/strong&gt;. NASM translates the mnemonics into machine code. Historically, this part is exciting. Nowadays we already have compilers and all the tools needed to write code that humans can easily understand. But think for a second about the process of creating a new programming language.&lt;/p&gt;
&lt;p&gt;Of course, you will need some language to code your new language. For example, I&apos;m writing a toy language called Monkey using Go. If you&apos;re creating a language for production use, eventually you&apos;ll do the bootstrapping: implement your language using the language itself. Following my example of Monkey, it is like rewriting Monkey in Monkey instead of Go.&lt;/p&gt;
&lt;p&gt;Now, moving some years back, someone needed to write the first assembler/disassembler directly in binary. Which machine was the first functional computer depends on what you understand by &quot;computer&quot;, &quot;functional&quot;, and other things, but I&apos;ll follow the idea that it was the &lt;a href=&quot;https://en.wikipedia.org/wiki/EDSAC&quot;&gt;EDSAC&lt;/a&gt;. Initially, programs were entered in binary using a set of 18 switches on EDSAC&apos;s control panel. Each instruction was nominally 18 bits long, but the topmost bit was unavailable due to timing constraints, so only 17 bits were usable. Operators would physically flip these switches to represent 1s and 0s. And, of course, they used paper tape for program storage, but the initial loader had to be entered manually.
Then came the first simple assembler, a tiny 31-instruction program (31 words). It was the world&apos;s first assembler. It could read paper tape and convert simple symbolic notation into machine code. The basic format was a letter plus a number, like A 32, which means &quot;add the content of memory location 32&quot;.&lt;/p&gt;
&lt;p&gt;The first bootstrap came from this basic assembler. The new version could handle more symbols, basic arithmetic in addresses, and some macros, and this process was repeated. If you want to learn more about this, I recommend reading the &lt;a href=&quot;https://pt.wikipedia.org/wiki/EDSAC&quot;&gt;EDSAC&lt;/a&gt; article on Wikipedia and watching &lt;a href=&quot;https://youtu.be/nc2q4OOK6K8?si=W7rzWRtFQDXfqA2z&quot;&gt;Bootstrapping EDSAC: Initial Orders—Computerphile&lt;/a&gt; from Computerphile.&lt;/p&gt;
&lt;h3&gt;Starting with bytes&lt;/h3&gt;
&lt;p&gt;Excellent. I hope machine code is clearer in your mind at this point. Now we&apos;ll write some Go code to build our machine code instructions.&lt;/p&gt;
&lt;p&gt;Let&apos;s start by defining the bytecode format:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;package code

type Instructions []byte // byte is an alias for uint8

type Opcode byte
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;Instructions&lt;/code&gt; is a slice of bytes, and an &lt;code&gt;Opcode&lt;/code&gt; is a byte. Note how this matches the descriptions above. Let&apos;s define the first opcode, which tells the VM (or the processor, if you&apos;re compiling directly to machine code for a specific processor architecture) to push something onto the stack - we won&apos;t build a VM in this article.&lt;/p&gt;
&lt;p&gt;Back to the opcode: it won&apos;t be called &quot;push&quot;, because it won&apos;t be solely about pushing things. Let&apos;s think about the expression &lt;code&gt;1 + 2&lt;/code&gt;. There are three instructions, two of which tell the VM/processor to push &lt;code&gt;1&lt;/code&gt; and &lt;code&gt;2&lt;/code&gt; onto the stack. A first instinct might be to define a &quot;push&quot; instruction with an integer as its operand, the idea being that the VM/processor takes the integer operand and pushes it onto the stack. For integers this would work, because I could encode them and put them directly into the bytecode. For string literals, putting them into the bytecode is also possible, since it&apos;s just made of bytes. Still, it would be a lot of bloat and would sooner or later become unwieldy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Variable size: A string can be any length, like &quot;a&quot; or &quot;a much longer text like this one&quot;;&lt;/li&gt;
&lt;li&gt;You&apos;d have multiple copies of the same string if it appears several times;&lt;/li&gt;
&lt;li&gt;Bytecode loading performance would be impacted;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Notice how that would be a bad design? Here, I introduce the idea of &lt;code&gt;constants&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Constants&lt;/h3&gt;
&lt;p&gt;In this context, &lt;code&gt;constants&lt;/code&gt; is short for &quot;constant expressions&quot; and refers to expressions whose value doesn&apos;t change: the value is &lt;code&gt;constant&lt;/code&gt; and can be determined at &lt;code&gt;compile time&lt;/code&gt;. That means we don&apos;t need to run the program to know what these expressions evaluate to. A compiler can find them in the code and store the value they evaluate to. After that, it can &lt;em&gt;reference&lt;/em&gt; the constants in the instructions it generates instead of embedding the value directly in them. A plain integer does the job fine and can serve as an index into a data structure that holds all constants, often called a constant pool. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Instead of having bytecode like:
PUSH 987654321 // Takes up a lot of space if the number is big
PUSH &quot;Hello World&quot; // Takes even more space for string

// I can have:
PUSH_CONST 0 // Where 0 is just an index into the constant pool
PUSH_CONST 1 // Much more compact

Constant Pool:
[0] -&amp;gt; 987654321
[1] -&amp;gt; &quot;Hello World&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;More compact bytecode (indices are smaller than full values)&lt;/li&gt;
&lt;li&gt;Deduplication (the same constant only needs to be stored once)&lt;/li&gt;
&lt;li&gt;Better memory usage&lt;/li&gt;
&lt;li&gt;It is easier to manage complex constants like strings&lt;/li&gt;
&lt;/ul&gt;
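&lt;p&gt;A constant pool with deduplication can be sketched in a few lines of Go. This is only an illustration, not part of the &lt;code&gt;code&lt;/code&gt; package we are building, and the names (&lt;code&gt;ConstantPool&lt;/code&gt;, &lt;code&gt;addConstant&lt;/code&gt;) are made up:&lt;/p&gt;

```go
package main

import "fmt"

// ConstantPool stores each distinct constant once and hands out indices.
type ConstantPool struct {
	constants []any
	index     map[any]int // value -> index into constants (for dedup)
}

func NewConstantPool() *ConstantPool {
	return &ConstantPool{index: map[any]int{}}
}

// addConstant returns the index of value, storing it only on first sight.
func (p *ConstantPool) addConstant(value any) int {
	if i, ok := p.index[value]; ok {
		return i
	}
	p.constants = append(p.constants, value)
	i := len(p.constants) - 1
	p.index[value] = i
	return i
}

func main() {
	pool := NewConstantPool()
	a := pool.addConstant("Hello World") // index 0
	b := pool.addConstant(987654321)     // index 1
	c := pool.addConstant("Hello World") // deduplicated: index 0 again
	fmt.Println(a, b, c, len(pool.constants))
}
```

&lt;p&gt;Instructions then carry only the returned index, and a repeated constant is stored a single time.&lt;/p&gt;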
&lt;p&gt;Well, that said, let&apos;s define the &lt;code&gt;OpConstant&lt;/code&gt;. This opcode has one operand: the number I previously assigned to the constant. When the VM/processor executes &lt;code&gt;OpConstant&lt;/code&gt;, it retrieves the constant using the operand as an index and pushes it on to the stack.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// [...]

const (
	OpConstant Opcode = iota
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;iota&lt;/code&gt; will generate increasing &lt;code&gt;byte&lt;/code&gt; values because I don&apos;t care about the actual values the opcodes represent. They only need to be distinct from each other and fit in one byte. Now, let&apos;s define the part that says &lt;code&gt;OpConstant&lt;/code&gt; has one operand.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// [...]

type Definition struct {
	Name          string
	OperandWidths []int
}

var definitions = map[Opcode]*Definition{
	OpConstant: {&quot;OpConstant&quot;, []int{2}},
}

func Lookup(op byte) (*Definition, error) {
	def, ok := definitions[Opcode(op)]
	if !ok {
		return nil, fmt.Errorf(&quot;opcode %d undefined&quot;, op)
	}
	return def, nil
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;Lookup&lt;/code&gt; helper is not strictly needed, but it&apos;s nice for debugging:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;debug, _ := Lookup(instructions[0])
fmt.Println(debug)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;go test ./code -v
=== RUN   TestMake
&amp;amp;{OpConstant [2]}
--- PASS: TestMake (0.00s)
PASS
ok      github.com/vit0rr/introduction-to-machine-code/code     0.179s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;Definition&lt;/code&gt; for an &lt;code&gt;Opcode&lt;/code&gt; has two fields: &lt;code&gt;Name&lt;/code&gt; and &lt;code&gt;OperandWidths&lt;/code&gt;. &lt;code&gt;Name&lt;/code&gt; helps to make an &lt;code&gt;Opcode&lt;/code&gt; readable, and &lt;code&gt;OperandWidths&lt;/code&gt; contains the number of bytes each operand takes up.&lt;/p&gt;
&lt;p&gt;The definition for &lt;code&gt;OpConstant&lt;/code&gt; says that its only operand is two bytes wide, which makes it a uint16 and limits its maximum value to &lt;code&gt;65535&lt;/code&gt;; including 0, the number of representable values is &lt;code&gt;65536&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;With this, it&apos;s already possible to create the first bytecode instruction - finally.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// [...]

func Make(op Opcode, operands ...int) []byte {
	def, ok := definitions[op]
	if !ok {
		return []byte{}
	}

	instructionLen := 1
	for _, w := range def.OperandWidths {
		instructionLen += w
	}

	instruction := make([]byte, instructionLen)
	instruction[0] = byte(op)

	offset := 1
	for i, o := range operands {
		width := def.OperandWidths[i]
		switch width {
		case 2:
			binary.BigEndian.PutUint16(instruction[offset:], uint16(o))
		}
		offset += width
	}

	return instruction
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function creates a bytecode instruction for the given &lt;code&gt;Opcode&lt;/code&gt; and operands. It looks up the &lt;code&gt;Definition&lt;/code&gt; for the &lt;code&gt;Opcode&lt;/code&gt; and calculates the instruction&apos;s length. Then it creates a byte slice with the correct length and sets the first byte to the &lt;code&gt;Opcode&lt;/code&gt;. The function then iterates over the operands, encoding them into the instruction according to their width.&lt;/p&gt;
&lt;p&gt;And of course we&apos;ll write tests for all of this.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;package code

import &quot;testing&quot;

func TestMake(t *testing.T) {
	tests := []struct {
		op       Opcode
		operands []int
		expected []byte
	}{
		{OpConstant, []int{65534}, []byte{byte(OpConstant), 255, 254}},
	}

	for _, tt := range tests {
		instructions := Make(tt.op, tt.operands...)

		if len(instructions) != len(tt.expected) {
			t.Errorf(&quot;instructions has wrong length. want=%d, got=%d&quot;,
				len(tt.expected), len(instructions))
		}

		for i, b := range tt.expected {
			if instructions[i] != tt.expected[i] {
				t.Errorf(&quot;wrong byte at pos %d. want=%d, got=%d&quot;,
					i, b, instructions[i])
			}
		}
	}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I only pass &lt;code&gt;OpConstant&lt;/code&gt; and the operand &lt;code&gt;65534&lt;/code&gt; to the &lt;code&gt;Make&lt;/code&gt; function. Then I expect to get back a &lt;code&gt;[]byte&lt;/code&gt; holding three bytes. The first has to be the opcode, &lt;code&gt;OpConstant&lt;/code&gt;, and the other two should be the big-endian encoding of &lt;code&gt;65534&lt;/code&gt;. That&apos;s also why I use &lt;code&gt;65534&lt;/code&gt; instead of &lt;code&gt;65535&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;65534 in decimal = 1111 1111 1111 1110 in binary
                 = 0xFF 0xFE in hexadecimal (two bytes)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Big-endian means &quot;most significant byte first&quot;. Like reading left-to-right:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In big-endian: 65534 = [0xFF, 0xFE]&lt;/li&gt;
&lt;li&gt;In little-endian: 65534 = [0xFE, 0xFF]&lt;/li&gt;
&lt;li&gt;With &lt;code&gt;65535&lt;/code&gt;, both bytes are the same ([0xFF, 0xFF]), so you can&apos;t tell the order.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The expected output is 3 bytes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[OpConstant, 0xFF, 0xFE]
 ^           ^      ^
 |           |      Second byte of 65534
 |           First byte of 65534
 The instruction opcode
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first thing I&apos;m doing here is determining how long the resulting instruction will be. That allows me to allocate a byte slice with the proper length. Note that I don&apos;t use the &lt;code&gt;Lookup&lt;/code&gt; function to get the definition, which gives me a more usable function signature for &lt;code&gt;Make&lt;/code&gt; in the tests later.&lt;/p&gt;
&lt;p&gt;As soon as we have the final value of &lt;code&gt;instructionLen&lt;/code&gt;, we allocate the instruction &lt;code&gt;[]byte&lt;/code&gt; and add the Opcode as its first byte by casting it into one. Then comes the tricky part: I iterate over the defined operand widths, take the matching element from operands, and put it into the instruction. Depending on its width, I do that in a switch statement with a different method for each operand width.&lt;/p&gt;
&lt;p&gt;I only ensure that a two-byte operand is encoded in big-endian. After encoding the operand, I increment the offset by its width and move on to the next iteration of the loop. Since the &lt;code&gt;OpConstant&lt;/code&gt; opcode in the test case has only one operand, the loop performs only one iteration before &lt;code&gt;Make&lt;/code&gt; returns &lt;code&gt;instruction&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;And that&apos;s it! First bytecode instruction is done.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;go test ./code
ok      github.com/vit0rr/introduction-to-machine-code/code     0.537s
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;References:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.teach-ict.com/as_as_computing/ocr/H447/F453/3_3_8/features/miniweb/pg4.htm&quot;&gt;Opcodes and Operands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Enumeration&quot;&gt;Enumeration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Opcode&quot;&gt;Opcode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://archive.org/details/machine-code-for-beginners&quot;&gt;Machine Code for Beginners (Usborne Computer Books)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://compilerbook.com/&quot;&gt;Writing a Compiler in Go (by Thorsten Ball)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://minnie.tuhs.org/Tecs/book/chapter07.pdf&quot;&gt;Virtual Machine I: Stack Arithmetic (Chapter 7)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/EDSAC&quot;&gt;EDSAC&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded></item><item><title>Introduction to the Untyped Lambda Calculus (β-reduction and α-conversion)</title><link>https://vitorsalmeida.com/introduction-untyped-lc/</link><guid isPermaLink="true">https://vitorsalmeida.com/introduction-untyped-lc/</guid><pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_calculus&quot;&gt;Lambda calculus&lt;/a&gt; is a simple programming language, and a model of computation (akin to Turing machines and recursive functions).&lt;/p&gt;
&lt;p&gt;It&apos;s composed only of abstractions and applications using variables (e.g. &lt;code&gt;(λx.x) y&lt;/code&gt;). The whole language is made up of &lt;em&gt;lambda terms&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Lambda terms can be one of these three things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A variable like &lt;code&gt;x&lt;/code&gt; is a valid lambda term.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An abstraction &lt;code&gt;(λx.y)&lt;/code&gt;, or &lt;code&gt;(λx.t)&lt;/code&gt;, where &lt;code&gt;x&lt;/code&gt; is its parameter and &lt;code&gt;y&lt;/code&gt;/&lt;code&gt;t&lt;/code&gt; is another lambda term. &lt;code&gt;(λx.x)&lt;/code&gt; could be written in JS as &lt;code&gt;const id = x =&amp;gt; x&lt;/code&gt;. Everything after the &lt;em&gt;dot&lt;/em&gt; works as the body of the function, so you can read &lt;code&gt;λx&lt;/code&gt; as a function receiving &lt;code&gt;x&lt;/code&gt; as its parameter. That&apos;s why &lt;code&gt;(λx.x)&lt;/code&gt; is what we call the identity function: it returns what it receives.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An application &lt;code&gt;(t s)&lt;/code&gt; where both &lt;code&gt;t&lt;/code&gt; and &lt;code&gt;s&lt;/code&gt; are lambda terms.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since lambda calculus has only functions, we use &lt;a href=&quot;https://en.wikipedia.org/wiki/Church_encoding&quot;&gt;church encoding&lt;/a&gt; to model arithmetic, booleans and data structures by representing data types in the lambda calculus.&lt;/p&gt;
&lt;p&gt;Church numerals let you represent natural numbers under Church encoding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0 := &lt;code&gt;λf.λx. x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;1 := &lt;code&gt;λf.λx. f x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;2 := &lt;code&gt;λf.λx. f (f x)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;3 := &lt;code&gt;λf.λx. f (f (f x))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each numeral is a function that takes two arguments, &lt;code&gt;f&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt;, and applies &lt;code&gt;f&lt;/code&gt; &lt;em&gt;n&lt;/em&gt; times to &lt;code&gt;x&lt;/code&gt;. That&apos;s how we know which number each term represents: by counting how many times &lt;code&gt;f&lt;/code&gt; was applied to &lt;code&gt;x&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is just a &lt;a href=&quot;https://en.wikipedia.org/wiki/Higher-order_function&quot;&gt;higher-order function&lt;/a&gt;; in JavaScript it would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const zero = f =&amp;gt; x =&amp;gt; x
const one = f =&amp;gt; x =&amp;gt; f(x)
const two = f =&amp;gt; x =&amp;gt; f(f(x))
const three = f =&amp;gt; x =&amp;gt; f(f(f(x)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&apos;re not familiar with the notation, you can read it as: given a function &lt;code&gt;f&lt;/code&gt; and a value &lt;code&gt;x&lt;/code&gt;, apply &lt;code&gt;f&lt;/code&gt; to &lt;code&gt;x&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A successor function is quite similar. Since two applies &lt;code&gt;f&lt;/code&gt; to &lt;code&gt;x&lt;/code&gt; two times, to get its successor you apply &lt;code&gt;f&lt;/code&gt; one more time. So, given a numeral &lt;code&gt;n&lt;/code&gt;, it returns a numeral that applies &lt;code&gt;f&lt;/code&gt; (n+1) times.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;succ := &lt;code&gt;λn.λf.λx.f (n f x)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&apos;s break it into small pieces, because it&apos;s not that easy to understand the first time.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1 := λf.λx. f x
succ := λn.λf.λx. f (n f x)

(λn.λf.λx. f (n f x)) (λf.λx. f x)
    (λf.λx. f ((λf.λx. f x) f x)) // now it is a function that expects f and x
    |           (λx. f x) x
    |               f x
    (λf.λx. f (f x)) // it just becomes the definition of two
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By reducing the full expression, we can now see why applying succ to 1 (&lt;code&gt;λf.λx. f x&lt;/code&gt;) returns 2 (&lt;code&gt;λf.λx. f (f x)&lt;/code&gt;).&lt;/p&gt;
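&lt;p&gt;We can also check this reduction by running the JavaScript encodings from earlier. &lt;code&gt;toNumber&lt;/code&gt; is a small helper made up here: it counts the applications of &lt;code&gt;f&lt;/code&gt; by instantiating &lt;code&gt;f&lt;/code&gt; as &lt;code&gt;k =&amp;gt; k + 1&lt;/code&gt; and &lt;code&gt;x&lt;/code&gt; as &lt;code&gt;0&lt;/code&gt;:&lt;/p&gt;

```javascript
// Church numerals from earlier
const one = f => x => f(x);
const two = f => x => f(f(x));

// succ := λn.λf.λx. f (n f x)
const succ = n => f => x => f(n(f)(x));

// toNumber is a helper made up here: it counts how many times f was
// applied by passing an increment function and starting from 0.
const toNumber = n => n(k => k + 1)(0);

console.log(toNumber(succ(one))); // 2
console.log(toNumber(two));       // 2
```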
&lt;p&gt;It&apos;s possible to do the same for &lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_calculus#Arithmetic_in_lambda_calculus&quot;&gt;plus, mult, pow, pred, true, false, and, or, not, ...&lt;/a&gt;. Church encoding can also be used to create a pair (2-tuple), lists with their common functions like head, tail, cons...&lt;/p&gt;
&lt;h2&gt;Evaluation and Reduction&lt;/h2&gt;
&lt;p&gt;The process of computing the value of a lambda term is what we call evaluation or reduction, and here we will take a look at β-reduction.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(λx.x + 1) 5
(5 + 1)
(6)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So we&apos;re applying a function to an argument by replacing the bound variable with that argument.&lt;/p&gt;
&lt;p&gt;A more general rule could be defined as &lt;code&gt;(λx. M) N → M[x := N]&lt;/code&gt;, where &lt;code&gt;M&lt;/code&gt; is the body of the abstraction and &lt;code&gt;N&lt;/code&gt; is its argument. &lt;code&gt;M[x := N]&lt;/code&gt; replaces all free occurrences of &lt;code&gt;x&lt;/code&gt; in &lt;code&gt;M&lt;/code&gt; with &lt;code&gt;N&lt;/code&gt; (I did a bunch of reductions by hand to understand this process well). But this rule requires alpha-conversion to avoid variable capture, which happens when a free variable in &lt;code&gt;N&lt;/code&gt; becomes bound after substitution.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(λx. λy. x) y
     (λy. y)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It changed the meaning of our previous expression. The argument &lt;code&gt;y&lt;/code&gt; was a free variable, but after substitution it became bound by the inner &lt;code&gt;λy&lt;/code&gt;, so it no longer refers to the same variable.&lt;/p&gt;
&lt;p&gt;With alpha-conversion, we rename bound variables.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(λx. λy. x) y
(λx. λz. x) y
     (λz. y)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&apos;re giving each scope a unique variable name.&lt;/p&gt;
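&lt;p&gt;As a sketch of how this renaming can be done mechanically, here is a small capture-avoiding substitution in JavaScript. It is only an illustration (the article builds the real interpreter in OCaml below), and all the helper names here are made up:&lt;/p&gt;

```javascript
// Tiny term constructors mirroring Var / Abs / App.
const Var = name => ({ tag: 'Var', name });
const Abs = (param, body) => ({ tag: 'Abs', param, body });
const App = (fn, arg) => ({ tag: 'App', fn, arg });

let fresh = 0;
const freshName = base => `${base}_${fresh++}`;

// Free variables of a term.
const freeVars = t =>
  t.tag === 'Var' ? new Set([t.name])
  : t.tag === 'Abs' ? new Set([...freeVars(t.body)].filter(v => v !== t.param))
  : new Set([...freeVars(t.fn), ...freeVars(t.arg)]);

// subst(t, x, s) computes t[x := s], renaming bound variables
// (alpha-conversion) when they would capture a free variable of s.
const subst = (t, x, s) => {
  if (t.tag === 'Var') return t.name === x ? s : t;
  if (t.tag === 'App') return App(subst(t.fn, x, s), subst(t.arg, x, s));
  if (t.param === x) return t; // x is shadowed here; nothing to substitute
  if (freeVars(s).has(t.param)) {
    // Rename the bound variable to a fresh one to avoid capture.
    const renamed = freshName(t.param);
    const body = subst(t.body, t.param, Var(renamed));
    return Abs(renamed, subst(body, x, s));
  }
  return Abs(t.param, subst(t.body, x, s));
};

// One beta-reduction step at the root: (λx. M) N → M[x := N]
const betaStep = t =>
  t.tag === 'App' && t.fn.tag === 'Abs'
    ? subst(t.fn.body, t.fn.param, t.arg)
    : t;

// (λx. λy. x) y reduces to an abstraction with a renamed
// parameter whose body is the free y, not to (λy. y).
const result = betaStep(App(Abs('x', Abs('y', Var('x'))), Var('y')));
console.log(result);
```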
&lt;pre&gt;&lt;code&gt;AND TRUE FALSE
// remember TRUE λx.λy.x and FALSE λx.λy.y

(λp. λq. p q p) TRUE FALSE
     (λq. TRUE q TRUE) FALSE
     TRUE FALSE TRUE
     (λx. λy. x) FALSE TRUE
          (λy. FALSE) TRUE
               FALSE
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_calculus#Logic_and_predicates&quot;&gt;Try to reduce AND, OR, NOT, IF, etc...&lt;/a&gt;&lt;/p&gt;
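&lt;p&gt;The &lt;code&gt;AND&lt;/code&gt; reduction above can also be run directly in JavaScript. &lt;code&gt;toBool&lt;/code&gt; is a helper made up here that applies a Church boolean to the two native booleans:&lt;/p&gt;

```javascript
const TRUE = x => y => x;  // λx.λy.x
const FALSE = x => y => y; // λx.λy.y

// AND := λp.λq. p q p
const AND = p => q => p(q)(p);

// toBool is a helper made up here: a Church boolean selects
// its first or second argument, so we hand it true and false.
const toBool = b => b(true)(false);

console.log(toBool(AND(TRUE)(FALSE))); // false
console.log(toBool(AND(TRUE)(TRUE)));  // true
```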
&lt;h2&gt;Starting the interpreter&lt;/h2&gt;
&lt;p&gt;Well, I think what we know until now is enough to at least start writing an interpreter.&lt;/p&gt;
&lt;p&gt;Let&apos;s start defining the AST, which is just a term:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type term = Var of string | Abs of string * term | App of term * term
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you&apos;re not familiar with OCaml, this is just the type of a term. The &lt;code&gt;... of type&lt;/code&gt; notation means that a constructor carries a value, so the constructor &lt;code&gt;Var&lt;/code&gt; comes with a string value. If you wanna understand more about it, take a read about variant constructors and algebraic data types.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Var of string&lt;/code&gt; in a more practical use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let x = Var &quot;x&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Same for the others:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let abs = Abs (&quot;x&quot;, Var &quot;x&quot;)
&lt;/code&gt;&lt;/pre&gt;
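&lt;p&gt;An application, for example &lt;code&gt;(λx.x) y&lt;/code&gt;, just nests the other constructors (&lt;code&gt;app&lt;/code&gt; here is an arbitrary name):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let app = App (Abs (&quot;x&quot;, Var &quot;x&quot;), Var &quot;y&quot;)
&lt;/code&gt;&lt;/pre&gt;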
&lt;p&gt;As you probably have already noticed, this &lt;code&gt;abs&lt;/code&gt; is the same as &lt;code&gt;λx.x&lt;/code&gt;. It&apos;s possible to write a function that pretty-prints this for us. It should receive a term and transform it into a string.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec to_string ~term =
  match term with
  | Var x -&amp;gt; x
  | Abs (x, body) -&amp;gt; &quot;λ&quot; ^ x ^ to_string ~term:body
  | App (t1, t2) -&amp;gt; to_string ~term:t1 ^ to_string ~term:t2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nothing fancy, just a recursive function calling itself on subterms. The &lt;code&gt;^&lt;/code&gt; operator is how you concatenate strings in OCaml.&lt;/p&gt;
&lt;p&gt;I&apos;m using labeled arguments just because it is a tutorial and I wanna make it easier to read the code. Otherwise I would write it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec to_string t =
  match t with
  | Var x -&amp;gt; x
  | Abs (x, body) -&amp;gt; &quot;λ&quot; ^ x ^ to_string body
  | App (t1, t2) -&amp;gt; to_string t1 ^ to_string t2
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Take a look back at your defined &lt;code&gt;term&lt;/code&gt;. I mentioned the &lt;code&gt;of ...&lt;/code&gt; notation carries a value. We&apos;re using its value in the pattern matching: &lt;code&gt;Abs (x, body) -&amp;gt; &quot;λ&quot; ^ x ^ to_string ~term:body&lt;/code&gt;, where &lt;code&gt;x&lt;/code&gt; is the string and &lt;code&gt;body&lt;/code&gt; the term.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Now let&apos;s take a look at the output by adding an entry point that calls &lt;code&gt;to_string&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let () =
  let term = App (Abs (&quot;x&quot;, Var &quot;x&quot;), Abs (&quot;y&quot;, Var &quot;y&quot;)) in
  let result = to_string ~term in
  print_endline result
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; λxxλyy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s still a bit weird, because it&apos;s missing the &lt;code&gt;.&lt;/code&gt; and &lt;code&gt;()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec to_string ~term =
  match term with
  | Var x -&amp;gt; x
  | Abs (x, body) -&amp;gt; &quot;λ&quot; ^ x ^ &quot;.&quot; ^ to_string ~term:body
  | App (t1, t2) -&amp;gt; &quot;(&quot; ^ to_string ~term:t1 ^ &quot; &quot; ^ to_string ~term:t2 ^ &quot;)&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output: &lt;code&gt;(λx.x λy.y)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Seems good enough.&lt;/p&gt;
&lt;p&gt;Now we need to reduce expressions. Let&apos;s recap the rule: &lt;code&gt;(λx. M) N  →  M[x := N]&lt;/code&gt;, so, given the body &lt;code&gt;M&lt;/code&gt;, replace all free occurrences of &lt;code&gt;x&lt;/code&gt; with &lt;code&gt;N&lt;/code&gt;. Let&apos;s write it, but in OCaml.&lt;/p&gt;
&lt;p&gt;The signature of what we&apos;re gonna write: &lt;code&gt;in_term:term -&amp;gt; variable:string -&amp;gt; by_term:term -&amp;gt; term&lt;/code&gt;, so a call looks like &lt;code&gt;subst ~in_term:M ~variable:&quot;x&quot; ~by_term:N&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec subst ~in_term ~variable ~by_term =
  match in_term with
  | Var v -&amp;gt; if v = variable then by_term else Var v
  | App (t1, t2) -&amp;gt;
      App
        ( subst ~in_term:t1 ~variable ~by_term,
          subst ~in_term:t2 ~variable ~by_term )
  | Abs (x, t) -&amp;gt;
      if x = variable then Abs (x, t)
      else Abs (x, subst ~in_term:t ~variable ~by_term)
&lt;/code&gt;&lt;/pre&gt;
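&lt;p&gt;A quick sanity check of &lt;code&gt;subst&lt;/code&gt;: substituting &lt;code&gt;x&lt;/code&gt; by &lt;code&gt;λy.y&lt;/code&gt; inside &lt;code&gt;x z&lt;/code&gt; should give &lt;code&gt;(λy.y) z&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let result =
  subst
    ~in_term:(App (Var &quot;x&quot;, Var &quot;z&quot;))
    ~variable:&quot;x&quot;
    ~by_term:(Abs (&quot;y&quot;, Var &quot;y&quot;))
(* result = App (Abs (&quot;y&quot;, Var &quot;y&quot;), Var &quot;z&quot;) *)
&lt;/code&gt;&lt;/pre&gt;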
&lt;p&gt;To apply this rule, we also need a &lt;code&gt;reduce&lt;/code&gt; function, which is gonna call &lt;code&gt;subst&lt;/code&gt; to substitute terms recursively. Also worth remembering that we&apos;re doing &lt;a href=&quot;https://en.wikipedia.org/wiki/Lambda_calculus#Reduction_strategies&quot;&gt;normal-order reduction&lt;/a&gt;: we always reduce the leftmost, outermost redex first.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec reduce ~term =
  match term with
  | App (Abs (var, body), arg) -&amp;gt; subst ~in_term:body ~variable:var ~by_term:arg
  | App (lt, rt) -&amp;gt;
      let lt&apos; = reduce ~term:lt in
      if lt &amp;lt;&amp;gt; lt&apos; then App (lt&apos;, rt) else App (lt, reduce ~term:rt)
  | Abs (var, body) -&amp;gt; Abs (var, reduce ~term:body)
  | Var _ -&amp;gt; term
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, I don&apos;t know how much OCaml you already know, so I want to explain this code a bit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;| App (Abs (var, body), arg)&lt;/code&gt;: we want to check for the traditional case. It matches expressions like &lt;code&gt;(λx.x) y&lt;/code&gt;. For this case, we can just call &lt;code&gt;subst&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;| App (lt, rt)&lt;/code&gt;: now we need to handle things like &lt;code&gt;((λx.x)(λy.y))z&lt;/code&gt;. Since we&apos;re implementing normal-order reduction, we first reduce &lt;code&gt;lt&lt;/code&gt; (the left term). &lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt; in OCaml is structural inequality, so we&apos;re checking whether &lt;code&gt;lt&lt;/code&gt; and &lt;code&gt;lt&apos;&lt;/code&gt; differ; if so, we return an application with the left term reduced. Otherwise, the left term is already reduced, so we reduce the right one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;| Abs (var, body)&lt;/code&gt;: nothing special, we&apos;re reducing the only available term to reduce.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good, here&apos;s the code until now:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type term = Var of string | Abs of string * term | App of term * term

let rec to_string ~term =
  match term with
  | Var x -&amp;gt; x
  | Abs (x, body) -&amp;gt; &quot;λ&quot; ^ x ^ &quot;.&quot; ^ to_string ~term:body
  | App (t1, t2) -&amp;gt; &quot;(&quot; ^ to_string ~term:t1 ^ &quot; &quot; ^ to_string ~term:t2 ^ &quot;)&quot;

let rec subst ~in_term ~variable ~by_term =
  match in_term with
  | Var v -&amp;gt; if v = variable then by_term else Var v
  | App (t1, t2) -&amp;gt;
      App
        ( subst ~in_term:t1 ~variable ~by_term,
          subst ~in_term:t2 ~variable ~by_term )
  | Abs (x, t) -&amp;gt;
      if x = variable then Abs (x, t)
      else Abs (x, subst ~in_term:t ~variable ~by_term)

let rec reduce ~term =
  match term with
  | App (Abs (var, body), arg) -&amp;gt; subst ~in_term:body ~variable:var ~by_term:arg
  | App (lt, rt) -&amp;gt;
      let lt&apos; = reduce ~term:lt in
      if lt &amp;lt;&amp;gt; lt&apos; then App (lt&apos;, rt) else App (lt, reduce ~term:rt)
  | Abs (var, body) -&amp;gt; Abs (var, reduce ~term:body)
  | Var _ -&amp;gt; term

let () =
  let term = App (Abs (&quot;x&quot;, Var &quot;x&quot;), Var &quot;y&quot;) in
  let result = to_string ~term:(reduce ~term) in
  print_endline result
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It prints &lt;code&gt;y&lt;/code&gt;, which is fine. But let&apos;s try to reduce this other expression: &lt;code&gt;(λx.λy.x)y&lt;/code&gt; (&lt;code&gt;let term = App (Abs (&quot;x&quot;, Abs (&quot;y&quot;, Var &quot;x&quot;)), Var &quot;y&quot;)&lt;/code&gt;). It prints &lt;code&gt;λy.y&lt;/code&gt;, which is wrong: the free &lt;code&gt;y&lt;/code&gt; got captured by the inner binder, because we haven&apos;t implemented alpha-conversion yet. We need to fix the &lt;code&gt;Abs (x, t)&lt;/code&gt; match case of our &lt;code&gt;subst&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Let&apos;s write a function to check if a variable is free in a given term.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec free_in ~variable ~term =
  match term with
  | Var x -&amp;gt; x = variable
  | App (lt, rt) -&amp;gt; free_in ~variable ~term:lt || free_in ~variable ~term:rt
  | Abs (x, body) -&amp;gt;
      if x = variable then false else free_in ~variable ~term:body
&lt;/code&gt;&lt;/pre&gt;
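&lt;p&gt;A couple of checks on what &lt;code&gt;free_in&lt;/code&gt; should answer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(* x is free in λy.x *)
let _ = free_in ~variable:&quot;x&quot; ~term:(Abs (&quot;y&quot;, Var &quot;x&quot;)) (* true *)

(* x is bound in λx.x, so it is not free there *)
let _ = free_in ~variable:&quot;x&quot; ~term:(Abs (&quot;x&quot;, Var &quot;x&quot;)) (* false *)
&lt;/code&gt;&lt;/pre&gt;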
&lt;p&gt;We can check if a variable is free, but we still need to rename it somehow, so let&apos;s now write a function to create a new unique name:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let fresh_var =
  let counter = ref 0 in
  fun base -&amp;gt;
    incr counter;
    base ^ string_of_int !counter
&lt;/code&gt;&lt;/pre&gt;
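&lt;p&gt;Each call bumps the counter, so generated names never collide with each other:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let a = fresh_var &quot;y&quot; (* &quot;y1&quot; *)
let b = fresh_var &quot;y&quot; (* &quot;y2&quot; *)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(A source term could still literally contain a variable named &lt;code&gt;y1&lt;/code&gt;; we ignore that corner case here.)&lt;/p&gt;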
&lt;p&gt;This one is just a closure: we create unique variables by incrementing a counter and then concatenating the base name with it. In OCaml, &lt;code&gt;!&lt;/code&gt; means dereference, so we&apos;re reading the value stored inside the &lt;code&gt;counter&lt;/code&gt; ref.&lt;/p&gt;
&lt;p&gt;Now let&apos;s refactor &lt;code&gt;subst&lt;/code&gt; to check for free variables:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec subst ~in_term ~variable ~by_term =
  ...
  | Abs (x, t) -&amp;gt;
      if x = variable then Abs (x, t)
      else if free_in ~variable:x ~term:by_term then
        let x&apos; = fresh_var x in
        let t&apos; = subst ~in_term:t ~variable:x ~by_term:(Var x&apos;) in
        Abs (x&apos;, subst ~in_term:t&apos; ~variable ~by_term)
      else Abs (x, subst ~in_term:t ~variable ~by_term)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first &lt;code&gt;if&lt;/code&gt; checks if the variable we want to substitute is the same as the one bound by the lambda.
If they are equal, we don’t perform substitution inside, because within this lambda, that variable name refers to its parameter, not the outer variable we’re replacing.&lt;/p&gt;
&lt;p&gt;Otherwise, we check if the &lt;code&gt;x&lt;/code&gt; appears free in &lt;code&gt;by_term&lt;/code&gt;. If so, we must rename that parameter before substituting.&lt;/p&gt;
&lt;p&gt;Here&apos;s the full code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type term = Var of string | Abs of string * term | App of term * term

let rec to_string ~term =
  match term with
  | Var x -&amp;gt; x
  | Abs (x, body) -&amp;gt; &quot;λ&quot; ^ x ^ &quot;.&quot; ^ to_string ~term:body
  | App (t1, t2) -&amp;gt; &quot;(&quot; ^ to_string ~term:t1 ^ &quot; &quot; ^ to_string ~term:t2 ^ &quot;)&quot;

let rec free_in ~variable ~term =
  match term with
  | Var x -&amp;gt; x = variable
  | App (lt, rt) -&amp;gt; free_in ~variable ~term:lt || free_in ~variable ~term:rt
  | Abs (x, body) -&amp;gt;
      if x = variable then false else free_in ~variable ~term:body

let fresh_var =
  let counter = ref 0 in
  fun base -&amp;gt;
    incr counter;
    base ^ string_of_int !counter

let rec subst ~in_term ~variable ~by_term =
  match in_term with
  | Var v -&amp;gt; if v = variable then by_term else Var v
  | App (t1, t2) -&amp;gt;
      App
        ( subst ~in_term:t1 ~variable ~by_term,
          subst ~in_term:t2 ~variable ~by_term )
  | Abs (x, t) -&amp;gt;
      if x = variable then Abs (x, t)
      else if free_in ~variable:x ~term:by_term then
        let x&apos; = fresh_var x in
        let t&apos; = subst ~in_term:t ~variable:x ~by_term:(Var x&apos;) in
        Abs (x&apos;, subst ~in_term:t&apos; ~variable ~by_term)
      else Abs (x, subst ~in_term:t ~variable ~by_term)

let rec reduce ~term =
  match term with
  | App (Abs (var, body), arg) -&amp;gt; subst ~in_term:body ~variable:var ~by_term:arg
  | App (lt, rt) -&amp;gt;
      let lt&apos; = reduce ~term:lt in
      if lt &amp;lt;&amp;gt; lt&apos; then App (lt&apos;, rt) else App (lt, reduce ~term:rt)
  | Abs (var, body) -&amp;gt; Abs (var, reduce ~term:body)
  | Var _ -&amp;gt; term

let () =
  let term = App (Abs (&quot;x&quot;, Abs (&quot;y&quot;, Var &quot;x&quot;)), Var &quot;y&quot;) in
  let result = to_string ~term:(reduce ~term) in
  print_endline result

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And it evaluates to &lt;code&gt;λy1.y&lt;/code&gt; now, which is the correct result. The interpreter is already doing beta-reduction and avoiding shadowing by applying alpha-conversion.&lt;/p&gt;
&lt;h2&gt;De Bruijn index&lt;/h2&gt;
&lt;p&gt;Named variables are kinda messy: renaming variables, generating fresh ones, checking free vars... Even when two functions are structurally the same, if they use different variable names, our interpreter still treats them as different (e.g. &lt;code&gt;λx. λy. x&lt;/code&gt; and &lt;code&gt;λa. λb. a&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;All those names really do is indicate which binder a variable belongs to.&lt;/p&gt;
&lt;p&gt;De Bruijn indices drop the names and replace each variable with a number indicating how far away its binder is: &lt;code&gt;λx. λy. x&lt;/code&gt; -&amp;gt; &lt;code&gt;λ. λ. 1&lt;/code&gt; (index 0 refers to the innermost binder, 1 to the next one out). Try some exercises converting from De Bruijn to named, and from named to De Bruijn.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;λ. λ. λ. 2 (1 0)&lt;/code&gt; becomes &lt;code&gt;λx. λy. λz. x (y z)&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;Refactoring&lt;/h3&gt;
&lt;p&gt;The AST has no variable names anymore, so we&apos;re gonna drop all the strings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;type term = Var of int | Abs of term | App of term * term
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;let rec to_string ~term =
  match term with
  | Var i -&amp;gt; string_of_int i
  | Abs t -&amp;gt; &quot;λ.&quot; ^ to_string ~term:t
  | App (t1, t2) -&amp;gt; &quot;(&quot; ^ to_string ~term:t1 ^ &quot; &quot; ^ to_string ~term:t2 ^ &quot;)&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we substitute a term into another, the indices can move depending on how many binders we cross, and that&apos;s why a shift function is needed, to fix the numbers when you relocate a term.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec shift ~by ~depth ~term =
  match term with
  | Var i -&amp;gt; if i &amp;gt;= depth then Var (i + by) else Var i
  | Abs t -&amp;gt;
      let depth&apos; = depth + 1 in
      Abs (shift ~by ~depth:depth&apos; ~term:t)
  | App (t1, t2) -&amp;gt; App (shift ~by ~depth ~term:t1, shift ~by ~depth ~term:t2)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;depth&lt;/code&gt; parameter tracks how many binders we&apos;ve gone under. When we find a variable with index &lt;code&gt;i &amp;gt;= depth&lt;/code&gt;, it means this variable refers to something outside our current context, so we need to adjust it by &lt;code&gt;by&lt;/code&gt; amount.&lt;/p&gt;
&lt;p&gt;If we have &lt;code&gt;λ. 0&lt;/code&gt; and want to insert it into another context, we need to shift its free variables so they still point to the right binders.&lt;/p&gt;
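&lt;p&gt;For instance, shifting &lt;code&gt;λ. (0 1)&lt;/code&gt; up by 1 should leave the bound &lt;code&gt;0&lt;/code&gt; alone and bump the free &lt;code&gt;1&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let shifted = shift ~by:1 ~depth:0 ~term:(Abs (App (Var 0, Var 1)))
(* Abs (App (Var 0, Var 2)): 0 is bound by this λ, 1 was free and became 2 *)
&lt;/code&gt;&lt;/pre&gt;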
&lt;pre&gt;&lt;code&gt;let rec subst ~index ~arg ~term =
  match term with
  | Var i -&amp;gt; if i = index then arg else Var i
  | Abs t -&amp;gt;
      let index = index + 1 in
      Abs (subst ~index ~arg:(shift ~by:1 ~depth:0 ~term:arg) ~term:t)
  | App (t1, t2) -&amp;gt; App (subst ~index ~arg ~term:t1, subst ~index ~arg ~term:t2)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&apos;re checking if the variable index matches the one we want to replace. When we go under a lambda, we increment the index we&apos;re looking for, and shift the argument up by 1 so its free variables still refer to the correct binders.&lt;/p&gt;
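&lt;p&gt;For example, substituting &lt;code&gt;Var 2&lt;/code&gt; at index 0 inside &lt;code&gt;λ. 1&lt;/code&gt; shows both adjustments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let result = subst ~index:0 ~arg:(Var 2) ~term:(Abs (Var 1))
(* Abs (Var 3): under the λ we look for index 1 instead of 0,
   and the argument is shifted from 2 to 3 to skip that binder *)
&lt;/code&gt;&lt;/pre&gt;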
&lt;pre&gt;&lt;code&gt;let rec reduce ~term =
  match term with
  | App (Abs body, arg) -&amp;gt;
      shift ~by:(-1) ~depth:0
        ~term:(subst ~index:0 ~arg:(shift ~by:1 ~depth:0 ~term:arg) ~term:body)
  | App (t1, t2) -&amp;gt;
      let t1&apos; = reduce ~term:t1 in
      if t1 &amp;lt;&amp;gt; t1&apos; then App (t1&apos;, t2) else App (t1, reduce ~term:t2)
  | Abs t -&amp;gt; Abs (reduce ~term:t)
  | _ -&amp;gt; term
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;About the case &lt;code&gt;App (Abs body, arg)&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;First we shift the argument up by 1: &lt;code&gt;shift ~by:1 ~depth:0 ~term:arg&lt;/code&gt;. This prepares it to be placed inside the lambda body, where all free variables are one level deeper.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Then we substitute it at index 0: &lt;code&gt;subst ~index:0 ~arg:... ~term:body&lt;/code&gt;. Index 0 refers to the immediately bound variable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Then we shift down by 1: &lt;code&gt;shift ~by:(-1) ~depth:0&lt;/code&gt;. The substitution removed one lambda, so all free variable indices must decrease.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest is similar to the named version: we reduce from left to right (normal order) and also reduce inside abstractions.&lt;/p&gt;
&lt;p&gt;To fully normalize a term:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let rec normalize ~term =
  let t = reduce ~term in
  if term = t then term else normalize ~term:t
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This keeps reducing until the term doesn&apos;t change anymore.&lt;/p&gt;
&lt;h3&gt;Testing with Church numerals&lt;/h3&gt;
&lt;p&gt;Now let&apos;s test our interpreter with Church numerals.
Two is &lt;code&gt;λf.λx. f (f x)&lt;/code&gt;, which in De Bruijn notation becomes &lt;code&gt;λ. λ. 1 (1 0)&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let two = Abs (Abs (App (Var 1, App (Var 1, Var 0))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the successor function &lt;code&gt;λn.λf.λx. f (n f x)&lt;/code&gt; becomes &lt;code&gt;λ. λ. λ. 1 (2 1 0)&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let succ = Abs (Abs (Abs (App (Var 1, App (App (Var 2, Var 1), Var 0)))))
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;let () =
  let two = Abs (Abs (App (Var 1, App (Var 1, Var 0)))) in
  let succ = Abs (Abs (Abs (App (Var 1, App (App (Var 2, Var 1), Var 0))))) in
  print_endline (to_string ~term:(normalize ~term:(App (succ, two))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Output: &lt;code&gt;λ.λ.(1 (1 (1 0)))&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This is the Church numeral for 3, which is &lt;code&gt;λf.λx. f (f (f x))&lt;/code&gt;. It applies &lt;code&gt;f&lt;/code&gt; three times to &lt;code&gt;x&lt;/code&gt;.&lt;/p&gt;
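&lt;p&gt;As a last exercise, we can also try Church addition. &lt;code&gt;PLUS&lt;/code&gt; is usually written &lt;code&gt;λm.λn.λf.λx. m f (n f x)&lt;/code&gt;, which in De Bruijn notation becomes &lt;code&gt;λ. λ. λ. λ. 3 1 (2 1 0)&lt;/code&gt; (this one wasn&apos;t defined earlier in the post, so take it as an extra). Reusing &lt;code&gt;to_string&lt;/code&gt; and &lt;code&gt;normalize&lt;/code&gt; from above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;let plus =
  Abs (Abs (Abs (Abs (App (App (Var 3, Var 1), App (App (Var 2, Var 1), Var 0))))))

let () =
  let two = Abs (Abs (App (Var 1, App (Var 1, Var 0)))) in
  print_endline (to_string ~term:(normalize ~term:(App (App (plus, two), two))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It should print &lt;code&gt;λ.λ.(1 (1 (1 (1 0))))&lt;/code&gt;, the Church numeral for 4.&lt;/p&gt;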
</content:encoded></item></channel></rss>