Compiler

The PHP interpreter’s compiler processes PHP source code in three phases: lexical analysis, syntactic analysis, and bytecode generation. The following sections look at what happens in each phase.

Lexical Analysis

The first step in compiling PHP source code into PHP bytecode is lexical analysis. The lexer converts a sequence of characters into a sequence of tokens, the words that have a meaning in the programming language. Consider the following example:

for ($i = 0; $i < 4; $i++) {
    print $i . PHP_EOL;
}

After the lexical analysis, the stream of tokens shown below is what the compiler knows about our piece of code:

01 FOR             for          18 VARIABLE        $i
02 WHITESPACE                   19 INC             ++
03 OPEN_BRACKET    (            20 CLOSE_BRACKET   )
04 VARIABLE        $i           21 WHITESPACE
05 WHITESPACE                   22 OPEN_CURLY      {
06 EQUAL           =            23 WHITESPACE
07 WHITESPACE                   24 PRINT           print
08 LNUMBER         0            25 WHITESPACE
09 SEMICOLON       ;            26 VARIABLE        $i
10 WHITESPACE                   27 WHITESPACE
11 VARIABLE        $i           28 DOT             .
12 WHITESPACE                   29 WHITESPACE
13 LT              <            30 STRING          PHP_EOL
14 WHITESPACE                   31 SEMICOLON       ;
15 LNUMBER         4            32 WHITESPACE
16 SEMICOLON       ;            33 CLOSE_CURLY     }
17 WHITESPACE

While the compiler already knows from the lexical analysis that $i is a variable, it does not yet know whether a variable is allowed in that context. Nor does it know yet that PHP_EOL is the name of a constant. Making sense of the tokens is the job of the next compilation step, the syntactic analysis phase. We will get to that shortly.
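A token stream like the one above can be inspected from within PHP using the tokenizer extension's token_get_all() and token_name() functions. The following is a minimal sketch; describeTokens() is a hypothetical helper written for this example. Note that the real token names carry a T_ prefix and that single-character tokens such as ( or ; are returned as plain strings, so the output will not match the simplified names in the table exactly.

```php
<?php
// Sketch: list the tokens of the loop from the text using the tokenizer extension.
// describeTokens() is a hypothetical helper name, not part of PHP.

function describeTokens(string $source): array
{
    $result = [];

    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Named tokens arrive as [token id, text, line number].
            $result[] = [token_name($token[0]), $token[1]];
        } else {
            // Single-character tokens such as "(" or ";" arrive as plain strings.
            $result[] = ['(character)', $token];
        }
    }

    return $result;
}

// The tokenizer only recognizes PHP code after an opening tag.
$source = <<<'CODE'
<?php
for ($i = 0; $i < 4; $i++) {
    print $i . PHP_EOL;
}
CODE;

foreach (describeTokens($source) as [$name, $text]) {
    // Whitespace tokens are trimmed so that the listing stays compact.
    printf("%-14s %s\n", $name, trim($text));
}
```

The opening tag shows up as an additional T_OPEN_TAG token at the start of the list.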

The lexical analysis of PHP’s compiler has been improved for version 7 and now supports semi-reserved words. This means that the following keywords can now be used as names for constants, methods, and properties in classes, interfaces, and traits:

abstract, and, array, as, break, callable, case, catch, class, clone, const, continue, declare, default, die, do, echo, else, elseif, enddeclare, endfor, endforeach, endif, endswitch, endwhile, eval, exit, extends, final, finally, fn, for, foreach, function, global, goto, if, implements, include, include_once, instanceof, insteadof, interface, isset, list, namespace, new, or, parent, print, private, protected, public, require, require_once, return, self, static, switch, throw, trait, try, unset, use, var, while, xor, yield

There is one exception, though: while the keyword class can be used as the name for methods and properties, it cannot be used as the name for a class constant.
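To illustrate semi-reserved words, the sketch below declares a class whose constant and method names are keywords. The class Query and its members are hypothetical and exist only for this example; the code assumes PHP 7.0 or later and would be rejected by PHP 5.

```php
<?php
// Hypothetical class for illustration; requires PHP 7.0 or later.
final class Query
{
    const DEFAULT = 'all';               // keyword "default" as a class constant name

    public static function new(): self   // keyword "new" as a method name
    {
        return new self();
    }

    public function list(): array        // keyword "list" as a method name
    {
        return ['a', 'b'];
    }

    public function print(): string      // keyword "print" as a method name
    {
        return implode(', ', $this->list());
    }
}

echo Query::new()->print(), PHP_EOL;     // prints: a, b
echo Query::DEFAULT, PHP_EOL;            // prints: all
```

Declaring a constant named class in the same way would still fail, which is the exception described above.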

The following words cannot be used as names for classes, interfaces, traits, or namespaces:

bool, false, float, int, iterable, null, object, string, true, void

The words mixed, numeric, and resource are soft-reserved. They can still be used as names for classes, interfaces, traits, or namespaces, but doing so is discouraged.

Syntactic Analysis

The second step in compiling PHP source code into PHP bytecode is syntactic analysis. The parser analyses the tokens produced during the lexical analysis phase. It checks whether the sequence of tokens conforms to the grammar of the PHP language and builds an intermediate representation of the source code, from which bytecode is generated in the next compilation phase.

Before PHP 7, the PHP runtime’s compiler directly generated PHP bytecode from within the parser without a defined intermediate representation of the source code. PHP 7 moved away from this arcane way of compiling and adopted the standard practice of using an abstract syntax tree as the result of the parsing step. Figure 2.1 shows a graphical representation of the abstract syntax tree generated for the simple loop we looked at before.

Figure 2.1: Abstract Syntax Tree (AST) for a for loop

The rewrite of the PHP compiler’s parser to use abstract syntax trees makes changes and extensions to PHP’s syntax easier and less risky than in the past. Extensions for the PHP runtime such as php-ast or astkit provide access to the abstract syntax tree of PHP programs from within PHP code. This makes the development of static analysis tools in PHP a lot easier.
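As a rough sketch of what such access can look like, the following code prints the node kinds of the abstract syntax tree for our loop. It assumes that the optional php-ast extension is installed and uses its ast\parse_code(), ast\get_kind_name(), and ast\get_supported_versions() functions; dumpAst() is a hypothetical helper written for this example. Without the extension, the script only prints a notice.

```php
<?php
// Sketch only: relies on the optional php-ast extension (ast\ namespace).
// dumpAst() is a hypothetical helper written for this example.

function dumpAst($node, int $depth = 0): string
{
    $indent = str_repeat('  ', $depth);

    if (!$node instanceof ast\Node) {
        // Leaves are plain PHP values such as strings, integers, or null.
        return $indent . var_export($node, true) . "\n";
    }

    // Inner nodes carry a numeric kind that can be translated into a name.
    $out = $indent . ast\get_kind_name($node->kind) . "\n";

    foreach ($node->children as $child) {
        $out .= dumpAst($child, $depth + 1);
    }

    return $out;
}

if (extension_loaded('ast')) {
    $code = '<?php for ($i = 0; $i < 4; $i++) { print $i . PHP_EOL; }';

    // Assumption: the newest AST version supported by the installed extension is used.
    $version = max(ast\get_supported_versions());

    echo dumpAst(ast\parse_code($code, $version));
} else {
    echo "The php-ast extension is not installed.\n";
}
```

The exact node kinds and tree shape depend on the AST version that the installed extension supports.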

Bytecode Generation

The final step in compiling PHP source code is to generate PHP bytecode. Using the information learned in the syntactic analysis and stored in an abstract syntax tree, the bytecode generator will produce a so-called operations array, oparray for short, for each function or method of a PHP source code file. Code that is declared in the global scope (outside the scope of a function or method) of a source code file will be compiled into a separate oparray.

Each oparray contains one or more operation lines, or oplines for short. Each opline contains one operation code, or opcode for short, that specifies the operation to be performed as well as the operands it is carried out on.

compiled vars:  !0 = $i
line     #* E I O op          fetch          ext  return  operands
--------------------------------------------------------------------
   2     0  E >   ASSIGN                                  !0, 0
         1    >   IS_SMALLER                      ~1      !0, 4
         2      > JMPZNZ                       6          ~1, ->10
         3    >   POST_INC                        ~2      !0
         4        FREE                                    ~2
         5      > JMP                                     ->1
   3     6    >   CONCAT                          ~3      !0, '%0A'
         7        PRINT                           ~4      ~3
         8        FREE                                    ~4
   4     9      > JMP                                     ->3
   6    10    > > RETURN                                  1

For our simple loop, PHP 5 generates eleven instructions (see above) whereas PHP 7 only generates eight (see below).

compiled vars:  !0 = $i
line     #* E I O op          fetch          ext  return  operands
--------------------------------------------------------------------
   2     0  E >   ASSIGN                                  !0, 0
         1      > JMP                                     ->5
   3     2    >   CONCAT                          ~1      !0, '%0A'
         3        ECHO                                    ~1
   2     4        PRE_INC                                 !0
         5    >   IS_SMALLER                      ~1      !0, 4
         6      > JMPNZ                                   ~1, ->2
   6     7    > > RETURN                                  1

PHP bytecode is made up of over 170 opcodes, such as ASSIGN for assigning a value to a variable, ECHO for producing output, or JMPNZ for jumping to a specified address if a register value is not zero.
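To make the relationship between oparray, oplines, opcodes, and operands more tangible, the following toy interpreter executes the eight PHP 7 oplines shown above. This is a simplified model and not how the PHP runtime is implemented: runOparray() and the array layout are hypothetical, operands are written as strings with the same !n and ~n prefixes as in the listing, and only the seven opcodes needed for the loop are handled.

```php
<?php
// Toy model only: the real PHP runtime does not store oplines as PHP arrays.
// runOparray() and the array layout are hypothetical, chosen for readability.

function runOparray(array $oplines): string
{
    $vars   = [];   // compiled variables such as !0
    $temps  = [];   // temporary values such as ~1
    $output = '';
    $ip     = 0;    // index of the opline to execute next

    // Resolve an operand: "!n" reads a variable, "~n" reads a temporary,
    // anything else is treated as a literal (a simplification of this sketch).
    $read = function ($operand) use (&$vars, &$temps) {
        if (is_string($operand) && $operand !== '' && $operand[0] === '!') {
            return $vars[$operand];
        }
        if (is_string($operand) && $operand !== '' && $operand[0] === '~') {
            return $temps[$operand];
        }
        return $operand;
    };

    while (true) {
        [$opcode, $result, $op1, $op2] = $oplines[$ip++];

        switch ($opcode) {
            case 'ASSIGN':
                $vars[$op1] = $read($op2);
                break;
            case 'JMP':
                $ip = $op1;
                break;
            case 'CONCAT':
                $temps[$result] = $read($op1) . $read($op2);
                break;
            case 'ECHO':
                $output .= $read($op1);
                break;
            case 'PRE_INC':
                $vars[$op1]++;
                break;
            case 'IS_SMALLER':
                $temps[$result] = $read($op1) < $read($op2);
                break;
            case 'JMPNZ':
                if ($read($op1)) {
                    $ip = $op2;
                }
                break;
            case 'RETURN':
                return $output;
        }
    }
}

// The eight PHP 7 oplines from the listing: [opcode, result, operand 1, operand 2].
$oparray = [
    ['ASSIGN',     null, '!0', 0],
    ['JMP',        null, 5,    null],
    ['CONCAT',     '~1', '!0', "\n"],
    ['ECHO',       null, '~1', null],
    ['PRE_INC',    null, '!0', null],
    ['IS_SMALLER', '~1', '!0', 4],
    ['JMPNZ',      null, '~1', 2],
    ['RETURN',     null, 1,    null],
];

echo runOparray($oparray);
```

Running the sketch prints the numbers 0 to 3, one per line, just like the original loop. The jump targets are simply indexes into the array of oplines, which is why the loop condition can sit at the end of the oparray and be reached by the initial JMP.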

The performance improvements of PHP 7 over PHP 5 stem to some extent from generating better bytecode.