Compiler
The PHP interpreter’s compiler processes PHP source code in three phases: lexical analysis, syntactic analysis, and bytecode generation. The following sections look at what happens in each phase.
Lexical Analysis
The first step in compiling PHP source code into PHP bytecode is the lexical analysis. The lexer converts a sequence of characters into a sequence of words, so-called tokens, that have a meaning in the programming language. Consider the following example:
for ($i = 0; $i < 4; $i++) {
    print $i . PHP_EOL;
}
After the lexical analysis, the stream of tokens shown below is what the compiler knows about our piece of code:
01 FOR            for       18 VARIABLE       $i
02 WHITESPACE               19 INC            ++
03 OPEN_BRACKET   (         20 CLOSE_BRACKET  )
04 VARIABLE       $i        21 WHITESPACE
05 WHITESPACE               22 OPEN_CURLY     {
06 EQUAL          =         23 WHITESPACE
07 WHITESPACE               24 PRINT          print
08 LNUMBER        0         25 WHITESPACE
09 SEMICOLON      ;         26 VARIABLE       $i
10 WHITESPACE               27 WHITESPACE
11 VARIABLE       $i        28 DOT            .
12 WHITESPACE               29 WHITESPACE
13 LT             <         30 STRING         PHP_EOL
14 WHITESPACE               31 SEMICOLON      ;
15 LNUMBER        4         32 WHITESPACE
16 SEMICOLON      ;         33 CLOSE_CURLY    }
17 WHITESPACE
While the compiler already knows from the lexical analysis that $i is a variable, it does not yet know whether a variable is allowed in that context. It also does not know yet that PHP_EOL is the name of a constant, for instance. Making sense of the tokens is the job of the next compilation step, the syntactic analysis phase. We will get to that shortly.
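PHP exposes its lexer to PHP code through the tokenizer extension, so a token stream like the one above can be inspected directly. The following is a minimal sketch, assuming the tokenizer extension is enabled (it is by default); the helper name tokenNames() is hypothetical. The real token names carry a T_ prefix (T_FOR, T_VARIABLE, and so on), and single-character tokens such as ( or ; are returned as plain strings, so the output differs slightly from the simplified listing above.

```php
<?php
// Hypothetical helper: returns one readable name per token in $source.
// Assumes the tokenizer extension (bundled with PHP) is available.
function tokenNames(string $source): array
{
    $names = [];

    foreach (token_get_all($source) as $token) {
        // Named tokens are arrays of [id, text, line];
        // single-character tokens such as "(" or ";" are plain strings.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }

    return $names;
}

$source = '<?php for ($i = 0; $i < 4; $i++) { print $i . PHP_EOL; }';

foreach (tokenNames($source) as $name) {
    echo $name, PHP_EOL;
}
```

At this stage PHP_EOL is reported as a generic T_STRING token, which matches the observation that the lexer does not yet know it names a constant.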
The lexical analysis of PHP’s compiler has been improved for version 7 and now supports semi-reserved words. This means that the following keywords can now be used as names for constants, methods, and properties in classes, interfaces, and traits:
abstract, and, array, as, break, callable, case, catch, class, clone, const, continue, declare, default, die, do, echo, else, elseif, enddeclare, endfor, endforeach, endif, endswitch, endwhile, eval, exit, extends, final, finally, fn, for, foreach, function, global, goto, if, implements, include, include_once, instanceof, insteadof, interface, isset, list, namespace, new, or, parent, print, private, protected, public, require, require_once, return, self, static, switch, throw, trait, try, unset, use, var, while, xor, yield
There is one exception, though: while the keyword class can be used as the name for methods and properties, it cannot be used as the name for a class constant.
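As an illustration, the sketch below uses several of these keywords as member names. The class Finder and its members are hypothetical and exist only for this example; the code assumes PHP 7.0 or later, since earlier versions reject it with a parse error.

```php
<?php
// Hypothetical class for illustration only; requires PHP 7.0 or later.
final class Finder
{
    // A keyword used as the name of a class constant.
    const DEFAULT = 'all';

    // A keyword used as the name of a property.
    public $for = self::DEFAULT;

    // A keyword used as the name of a static method.
    public static function new(): self
    {
        return new self();
    }

    // Keywords used as the names of instance methods.
    public function for(string $what): self
    {
        $this->for = $what;

        return $this;
    }

    public function list(): array
    {
        return ['for' => $this->for];
    }
}

var_dump(Finder::new()->for('project')->list());
```

A constant declared as const CLASS would still be rejected, because Finder::class already has a fixed meaning: it evaluates to the fully qualified class name.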
The following words cannot be used as names for classes, interfaces, traits, or namespaces:
bool, false, float, int, iterable, null, object, string, true, void
The words mixed, numeric, and resource are soft-reserved. They can still be used as names for classes, interfaces, traits, or namespaces, but doing so is discouraged.
Syntactic Analysis
The second step in compiling PHP source code into PHP bytecode is the syntactic analysis. The parser analyses the tokens produced during the lexical analysis phase. It checks whether the sequence of tokens conforms to the grammar of the PHP language and builds an intermediate representation of the source code, from which bytecode is generated in the next compilation phase.
Before PHP 7, the PHP runtime’s compiler directly generated PHP bytecode from within the parser without a defined intermediate representation of the source code. PHP 7 moved away from this arcane way of compiling and adopted the standard practice of using an abstract syntax tree as the result of the parsing step. Figure 2.1 shows a graphical representation of the abstract syntax tree generated for the simple loop we looked at before.
The rewrite of the PHP compiler’s parser to use abstract syntax trees makes changes and extensions to PHP’s syntax easier and less risky than in the past. Extensions for the PHP runtime such as php-ast or astkit provide access to the abstract syntax tree of PHP programs from within PHP code. This makes the development of static analysis tools in PHP a lot easier.
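A minimal sketch of such access, assuming the php-ast extension is installed, could look as follows. The helper astKindNames() is hypothetical; it walks the tree depth-first and collects the kind name of every node. Because php-ast is not bundled with PHP, the helper returns null when the extension is missing instead of failing.

```php
<?php
// Hypothetical helper: lists the kind name of every AST node in $code.
// Returns null when the php-ast extension is not loaded.
function astKindNames(string $code): ?array
{
    if (!extension_loaded('ast')) {
        return null;
    }

    // php-ast requires an explicit AST format version;
    // this sketch simply picks the newest one the extension offers.
    $version = max(ast\get_supported_versions());
    $kinds   = [];

    $walk = function ($node) use (&$walk, &$kinds) {
        if (!$node instanceof ast\Node) {
            return; // leaves are plain scalars or null
        }

        $kinds[] = ast\get_kind_name($node->kind);

        foreach ($node->children as $child) {
            $walk($child);
        }
    };

    $walk(ast\parse_code($code, $version));

    return $kinds;
}

$kinds = astKindNames('<?php for ($i = 0; $i < 4; $i++) { print $i . PHP_EOL; }');

echo $kinds === null ? 'php-ast is not installed' : implode(PHP_EOL, $kinds), PHP_EOL;
```

With the extension available, the list should start with the statement list that forms the root of the tree, followed by a node for the for loop and the nodes for its four parts.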
Bytecode Generation
The final step in compiling PHP source code is to generate PHP bytecode. Using the information learned in the syntactic analysis and stored in an abstract syntax tree, the bytecode generator will produce a so-called operations array, oparray for short, for each function or method of a PHP source code file. Code that is declared in the global scope (outside the scope of a function or method) of a source code file will be compiled into a separate oparray.
Each oparray contains one or more operation lines, or oplines for short. Each opline contains one operation code, or opcode for short, that specifies the operation to be performed as well as the operands it is carried out on.
compiled vars:  !0 = $i
line  #*  E I O  op          fetch  ext  return  operands
--------------------------------------------------------------------
   2   0  E >    ASSIGN                          !0, 0
       1    >    IS_SMALLER              ~1      !0, 4
       2    >    JMPZNZ             6            ~1, ->10
       3    >    POST_INC                ~2      !0
       4         FREE                            ~2
       5    >    JMP                             ->1
   3   6    >    CONCAT                  ~3      !0, '%0A'
       7         PRINT                   ~4      ~3
       8         FREE                            ~4
   4   9    >    JMP                             ->3
   6  10    > >  RETURN                          1
For our simple loop, PHP 5 generates eleven instructions (see above) whereas PHP 7 only generates eight (see below).
compiled vars:  !0 = $i
line  #*  E I O  op          fetch  ext  return  operands
--------------------------------------------------------------------
   2   0  E >    ASSIGN                          !0, 0
       1    >    JMP                             ->5
   3   2    >    CONCAT                  ~1      !0, '%0A'
       3         ECHO                            ~1
   2   4         PRE_INC                         !0
       5    >    IS_SMALLER              ~1      !0, 4
       6    >    JMPNZ                           ~1, ->2
   6   7    > >  RETURN                          1
The PHP bytecode is made up of over 170 opcodes such as ASSIGN for assigning a value to a variable, ECHO for producing output, or JMPNZ for jumping to a specified address if a register value is not zero.
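To make the structure of oplines more concrete, the PHP 7 listing above can be replayed with a toy interpreter. This is only an illustrative sketch: the function runOplines() and the array layout used for an opline are hypothetical, the real virtual machine is implemented in C and stores operands differently, and only the opcodes that appear in the listing are modelled. The literal '%0A' from the listing is written as "\n" here.

```php
<?php
// Toy model, not the real Zend VM: executes a list of oplines and
// returns everything they would have printed.
function runOplines(array $oplines): string
{
    $slots  = [];  // compiled variables (!n) and temporaries (~n)
    $output = '';
    $ip     = 0;   // index of the opline to execute next

    // Operands such as "!0" or "~1" name a slot; anything else is a literal.
    $value = function ($operand) use (&$slots) {
        $isSlot = is_string($operand) && preg_match('/^[!~]\d+$/', $operand);

        return $isSlot ? $slots[$operand] : $operand;
    };

    while (true) {
        [$opcode, $a, $b, $c] = array_pad($oplines[$ip], 4, null);
        $ip++;

        switch ($opcode) {
            case 'ASSIGN':     $slots[$a] = $value($b); break;
            case 'JMP':        $ip = $a; break;
            case 'CONCAT':     $slots[$a] = $value($b) . $value($c); break;
            case 'ECHO':       $output .= $value($a); break;
            case 'PRE_INC':    $slots[$a]++; break;
            case 'IS_SMALLER': $slots[$a] = $value($b) < $value($c); break;
            case 'JMPNZ':      if ($value($a)) { $ip = $b; } break;
            case 'RETURN':     return $output;
            default:
                throw new InvalidArgumentException("Unknown opcode: $opcode");
        }
    }
}

// The eight oplines of the PHP 7 listing, result operand first.
$oplines = [
    ['ASSIGN', '!0', 0],
    ['JMP', 5],
    ['CONCAT', '~1', '!0', "\n"],
    ['ECHO', '~1'],
    ['PRE_INC', '!0'],
    ['IS_SMALLER', '~1', '!0', 4],
    ['JMPNZ', '~1', 2],
    ['RETURN', 1],
];

echo runOplines($oplines);
```

Tracing the run shows why the loop condition sits at the end of the oparray: the initial JMP skips straight to IS_SMALLER, and every iteration then needs only one conditional jump back to the loop body.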
The performance improvements of PHP 7 over PHP 5 stem to some extent from generating better bytecode.