PHP Deep Dive: How PHP Really Works

By Eric Koyanagi

Posted on 09/7/23

How does PHP really work? First, History Time!

One bit of trivia: no one actually cares that PHP stands for anything, like the often-stated "hypertext preprocessor". Does PHP really even stand for that? When originally "created" in 1994 (for perspective, Google wasn't founded until 1998 and Chrome wasn't released until 2008), it was basically the personal project of Rasmus Lerdorf, who had no intention of creating a programming language.

He wrote several Common Gateway Interface (CGI) programs in C, and used them to maintain his "personal home page". PHP was never designed to be a full programming language, but grew to become one.

It wasn't until 1997 that PHP became a "recursive acronym", the often-stated and utterly meaningless "PHP: Hypertext Preprocessor". PHP's history reminds me of a time when everything in tech had to be an acronym, even if that acronym was uselessly non-descriptive or complete nonsense.

I'm glad most modern languages dispense with useless acronyms in favor of actual names. There's a strange drive to acronym-ize everything -- a modern idea. Hell, the word "acronym" doesn't seem to appear in print until at least 1921. Of course it doesn't matter what "ASP" or "PHP" actually stands for because they are essentially proper nouns.

Does any of this obsession about acronyms matter to how PHP works? No, not really.

By 1999, PHP's core was rewritten with the release of the Zend Engine, whose name combines the first names of its creators (Zeev Suraski and Andi Gutmans). As of the writing of this article in 2023, the latest version of this engine is version 4, which backs PHP 8.

It's All C-Based in the End

Languages like PHP are "high-level" languages that are powerful and easy to use because they abstract away low-level details, enabling programmers to "do more with less code". Many developers with a general understanding of how PHP works know that the Zend Engine is a compiler and runtime engine, itself written in the C programming language, that translates higher-level PHP code into lower-level instructions and then executes them.

Of course, the nitty-gritty is more complex. The lifecycle of a PHP "execution" is not always simple and depends on whether it is run in a process-based model or a thread-based model. As we learned previously, there's a difference between a process and a thread.

As we know (and can see from simply using "top"), Apache spawns many PHP processes. Each has its own independent memory, which we know is crucial for parallelism. Since each process runs in its own context, multi-core systems can indeed "do multiple things at once" (parallelism) despite PHP being single-threaded.

Of course, PHP doesn't have to run in single-threaded mode. That is the default that most developers are familiar with, but it's possible to run PHP in ZTS mode (Zend Thread Safety, yet more acronyms). This is especially relevant for Windows-based systems -- yes, some people run PHP on Windows.
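
If you're curious which kind of build you are running, a quick check like the one below can tell you. This is only a minimal sketch: PHP_ZTS, PHP_SAPI and PHP_VERSION are built-in constants, while the describeBuild() helper is a name made up for this example.

```php
<?php
// Hypothetical helper: report whether this PHP binary was built with
// Zend Thread Safety and which SAPI (cli, fpm-fcgi, apache2handler, ...) runs it.
function describeBuild(): string
{
    $threading = PHP_ZTS ? 'thread-safe (ZTS)' : 'non-thread-safe (NTS)';

    return sprintf('PHP %s, %s build, SAPI: %s', PHP_VERSION, $threading, PHP_SAPI);
}

echo describeBuild(), PHP_EOL;
```

Run it once from the command line and once through your web server; the SAPI (and possibly the threading mode) may well differ between the two.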

This being said, according to Zeev himself it's far more robust and practical to use single-threaded PHP in the form of FPM/FastCGI. Quoting from this thread, he "can't imagine any situation where using ZTS inside of a Web Server context makes any sense at all". Considering that he's the one who started the ZTS project, let's believe him and ignore ZTS mode for now, as it is not as robust or performant as FPM.

We know the general idea, that PHP must "translate" our high-level code into something the machine can execute. Let's look at the general steps required to get there:

Files are "tokenized" by the engine
The engine builds a tree of symbols called an AST (Abstract Syntax Tree)
This tree is transformed into "opcodes". An opcode is a low-level instruction for the Zend virtual machine (not native machine code) that tells the engine to "do something"
The engine executes those opcodes one by one

Basically, it parses the code, compiles it, and executes the resulting instructions. The engine is written in the C programming language. Knowing a bit about its long history, we can understand why it's written in something as low-level as C. This general flow isn't a big surprise, as runtimes like Node.js work in a similar way (although its V8 engine is implemented in C++ instead of C).

It's worth remembering that, unless the compiled result is cached, this process happens with every single request. If you were to "zoom out" on the web request lifecycle, these low-level details help paint a picture of just how much work the server has to do to serve a request, especially with even higher levels of abstraction strapped on, like the Laravel framework.

What are Tokenization, the AST, and Opcodes in PHP?

Let's do a slightly deeper dive into how your PHP code is "translated".

The first step is tokenization, where your PHP code is broken down into the smallest meaningful units. For example, everywhere it sees an "if" keyword, it will translate that to "T_IF", and a variable becomes "T_VARIABLE". Single characters such as an open parenthesis don't get a named token at all; PHP's tokenizer passes them along as-is. While this step is fairly simple, it raises the question "why"?

Tokenizers help improve the efficiency of the parser's next steps by breaking code down into tokens that those steps can understand. You have to remember we're working at a very low level in "translating" source code into executable instructions -- understanding what a "bracket" does requires explicit character recognition logic. It isn't a given. Tokenizing reduces that burden for the parser, aids in error detection and security analysis, and can be used by IDEs to highlight syntax.
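
You can watch the tokenizer at work from PHP itself. The sketch below uses token_get_all() and token_name() from the bundled tokenizer extension; the describeTokens() helper and the sample snippet are invented for this example, and whitespace tokens are skipped only to keep the output short.

```php
<?php
// Hypothetical helper: return one readable label per token in $source.
function describeTokens(string $source): array
{
    $labels = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Named tokens arrive as [id, text, line number].
            [$id, $text] = $token;
            if ($id === T_WHITESPACE) {
                continue; // skip whitespace to keep the listing short
            }
            $labels[] = token_name($id) . ' ' . json_encode($text);
        } else {
            // Single-character tokens such as ( ) { } ; arrive as plain strings.
            $labels[] = $token;
        }
    }

    return $labels;
}

foreach (describeTokens('<?php if ($x) { echo "hi"; }') as $label) {
    echo $label, PHP_EOL;
}
```

The "if" shows up as T_IF and the variable as T_VARIABLE, while the parentheses, braces and semicolon come through as bare characters.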

The next step is building an AST. In the linked docs, the authors clearly outline that the AST is meant to serve "as an intermediary structure in our compilation process. This replaces the existing practice of emitting opcodes directly from the parser."

In other words, building the AST is the "first pass" in the compilation step. Is this wise, adding an extra step to the process when the "old practice" mentioned in the docs was emitting opcodes directly from the parser? Aren't fewer steps better? Not in this case.

If you think about this logically, any compiler needs at least two passes if it is going to optimize effectively.

Imagine you are reading through a grocery list (real stretch of the imagination, I know).

The first entry is "one apple". Okay, great. Let's grab an apple and bag it! The next line is "one apple". Again. Well, crap. You're reading the list only once. You can't go back and change the first line to "two apples" and save yourself some trouble. You can't read the next entry, either -- when you're reading through a list only once, you have no context about the next entries. You can't possibly optimize that list without that context. You've little choice but to bag another apple, zoom across the store, only to bag a third apple way later.

People who work on compilers are probably thinking "that's a really dumb example", but hopefully it shows how you can't possibly optimize a list of "instructions" without having context. To have context, you need to "do stuff" with that list at least twice: you must first read it before you can optimize it.
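
The grocery list translates almost directly into code. This is a deliberately silly sketch (the function names and the list are made up, and it has nothing to do with the real engine), but it shows the difference between acting on each line as you read it and reading everything first.

```php
<?php
$list = ['apple', 'milk', 'apple', 'bread', 'apple'];

// One pass, no context: act on every line the moment it is read.
function shopOnePass(array $list): array
{
    $actions = [];
    foreach ($list as $item) {
        $actions[] = "bag 1 $item";
    }

    return $actions;
}

// Two passes: read the whole list first, then act on the merged result.
function shopTwoPass(array $list): array
{
    $counts = array_count_values($list); // pass 1: gather context

    $actions = [];
    foreach ($counts as $item => $quantity) { // pass 2: act on it
        $actions[] = "bag $quantity $item";
    }

    return $actions;
}

echo count(shopOnePass($list)), ' trips vs ', count(shopTwoPass($list)), ' trips', PHP_EOL;
```

With the sample list, the one-pass shopper makes five trips while the two-pass shopper makes three, grabbing all three apples at once.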

Let's get back to the AST. An Abstract Syntax Tree is a general outline of the structure of the source code represented as a tree. The first word, "abstract" is important in the context of a two-stage compiler. This tree doesn't have every little detail of the source code because it doesn't need every little detail. As we know, this is an intermediary step, one designed to efficiently represent the source code structure.
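
To make the idea less abstract (sorry), here is a toy tree for the expression 2 * 3 + $x, plus the kind of optimization that a tree makes easy. Everything here is an assumption for illustration: the node shapes and the num(), variable(), binary(), foldConstants() and render() helpers are invented, and PHP's real AST is a set of C structures inside the engine.

```php
<?php
// Toy node constructors (hypothetical; the real engine uses C structs).
function num(int $value): array { return ['kind' => 'num', 'value' => $value]; }
function variable(string $name): array { return ['kind' => 'var', 'name' => $name]; }
function binary(string $op, array $left, array $right): array
{
    return ['kind' => 'binary', 'op' => $op, 'left' => $left, 'right' => $right];
}

// Walk the tree bottom-up and replace "number op number" with its result.
function foldConstants(array $node): array
{
    if ($node['kind'] !== 'binary') {
        return $node;
    }
    $left = foldConstants($node['left']);
    $right = foldConstants($node['right']);
    if ($left['kind'] === 'num' && $right['kind'] === 'num') {
        switch ($node['op']) {
            case '+': return num($left['value'] + $right['value']);
            case '*': return num($left['value'] * $right['value']);
        }
    }

    return binary($node['op'], $left, $right);
}

// Print the tree back out as an expression so the change is visible.
function render(array $node): string
{
    switch ($node['kind']) {
        case 'num': return (string) $node['value'];
        case 'var': return '$' . $node['name'];
        default:
            return '(' . render($node['left']) . ' ' . $node['op'] . ' ' . render($node['right']) . ')';
    }
}

// Tree for: 2 * 3 + $x
$ast = binary('+', binary('*', num(2), num(3)), variable('x'));

echo render($ast), PHP_EOL;                // ((2 * 3) + $x)
echo render(foldConstants($ast)), PHP_EOL; // (6 + $x)
```

Notice what the tree leaves out: whitespace, semicolons, and even parentheses are gone, because the nesting of the nodes already carries that information. That is the "abstract" part.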

The second phase of compilation uses the AST to create Zend VM opcodes, which are interpreted by the underlying Zend engine. This is where compilation is completed and the underlying engine has everything it needs to execute the program, with the engine's own compiled C code doing the work behind each opcode.
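
A toy version of this second phase might look like the sketch below. It flattens a tiny tree into a linear list of instructions and then runs them in a loop. The opcode names (PUSH_CONST, PUSH_VAR, ADD, MUL), the array-based tree and the stack-based design are all assumptions made to keep the example short; real Zend opcodes and the real VM are considerably more involved.

```php
<?php
// Toy AST for: 2 * 3 + $x   (node shapes are made up for this sketch)
$ast = ['+', ['*', 2, 3], '$x'];

// Second pass: flatten the tree into a linear list of instructions.
function compile($node, array &$ops): void
{
    if (is_int($node)) {
        $ops[] = ['PUSH_CONST', $node];
    } elseif (is_string($node)) {
        $ops[] = ['PUSH_VAR', ltrim($node, '$')];
    } else {
        [$operator, $left, $right] = $node;
        compile($left, $ops);
        compile($right, $ops);
        $ops[] = [$operator === '+' ? 'ADD' : 'MUL', null];
    }
}

// Toy virtual machine: loop over the instructions and dispatch on each opcode.
function execute(array $ops, array $vars): int
{
    $stack = [];
    foreach ($ops as [$opcode, $operand]) {
        switch ($opcode) {
            case 'PUSH_CONST': $stack[] = $operand; break;
            case 'PUSH_VAR':   $stack[] = $vars[$operand]; break;
            case 'ADD': $b = array_pop($stack); $a = array_pop($stack); $stack[] = $a + $b; break;
            case 'MUL': $b = array_pop($stack); $a = array_pop($stack); $stack[] = $a * $b; break;
        }
    }

    return array_pop($stack);
}

$ops = [];
compile($ast, $ops);
foreach ($ops as [$opcode, $operand]) {
    echo $opcode, $operand === null ? '' : " $operand", PHP_EOL;
}
echo execute($ops, ['x' => 4]), PHP_EOL; // 10
```

If you want to see the real thing rather than a toy, the OPcache configuration docs describe an opcache.opt_debug_level setting (for example 0x10000) that dumps the opcodes the compiler produced for a script.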

If you really want some pain, you can review the Zend engine source code in C here, but ultimately this is the step that "does all the work". This executes the application logic and "gets a result". If there were a heart to all PHP processes, this is it.

So that's what OPcache is all about!

Zooming out, we can see there are a lot of steps going on even before the CPU "does the work" to carry out the application logic. This happens for every request. Now go back and review the bootstrapping process for Laravel and consider the entire request lifecycle. Holy bananas, there's a lot going on under the hood to handle a request and serve a response!

This leads us to why OPcache is useful, especially with its preloading feature (available since PHP 7.4). Knowing how compilation works, we know that if a request is able to "skip" tokenization and compilation and use opcodes stored in memory, it will avoid a lot of hard work. That's exactly what OPcache does.

Yes, Varnish is great and can avoid a lot of problems, but it won't help for pages that still need to be dynamic (e.g. for logged-in users). OPcache can improve performance by eliminating these steps. No tokenization. No building the AST. No second pass to create opcodes.
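
To check whether any of this is actually happening on your server, you can ask OPcache directly. The sketch below relies on the extension's opcache_get_status() function; the opcacheSummary() helper is a made-up name, and it returns null whenever the extension is missing or disabled (which is common on the command line).

```php
<?php
// Hypothetical helper: summarize what OPcache reports for the current process.
// Returns null when the extension is missing or not enabled for this SAPI.
function opcacheSummary(): ?array
{
    if (!function_exists('opcache_get_status')) {
        return null; // extension not loaded
    }

    $status = @opcache_get_status(false); // false = skip the per-script list
    if ($status === false || empty($status['opcache_enabled'])) {
        return null; // loaded, but not enabled here
    }

    $stats = $status['opcache_statistics'];

    return [
        'cached_scripts' => $stats['num_cached_scripts'],
        'hits'           => $stats['hits'],
        'misses'         => $stats['misses'],
    ];
}

var_dump(opcacheSummary());
```

A high hit count relative to misses suggests most requests are skipping the tokenize, parse and compile steps entirely.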

Even without preloading this is obviously very useful -- you don't have to preload if you don't want to. The only difference is that preloading avoids the compilation process for the first request to a given script by...well, preloading it.
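
For the curious, a preload script can be quite small. The sketch below assumes a hypothetical layout where your classes live in a src directory next to the script, and that php.ini points the opcache.preload directive at this file. The preloadFiles() helper is invented for the example; opcache_compile_file() is the real function doing the work.

```php
<?php
// Hypothetical preload script, referenced from php.ini via opcache.preload.
// Compiles each file once at server start so later requests don't have to.
function preloadFiles(array $files): int
{
    if (!function_exists('opcache_compile_file')) {
        return 0; // OPcache not loaded, nothing to do
    }

    $compiled = 0;
    foreach ($files as $file) {
        // Skip missing paths; count only files OPcache accepted.
        if (is_file($file) && @opcache_compile_file($file)) {
            $compiled++;
        }
    }

    return $compiled;
}

// Assumption: application classes live under ./src next to this script.
$files = glob(__DIR__ . '/src/*.php') ?: [];
preloadFiles($files);
```

Load order and dependencies between classes are where real-world preload scripts tend to get complicated; this sketch ignores both.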

If you're struggling with OPcache preloading due to dependencies or load order...well, it's up to you how much to sweat for that one "first request" being slightly less optimized, especially since compilation time is not likely to be a bottleneck in that context. There are bigger fish to fry.

There might be some misunderstanding about OPcache working like an application-level cache. Yes, it improves performance by caching opcodes and eliminating a lot of redundant work by the compiler. No, it will not help your application logic itself go any faster beyond the time saved in compilation.

Conclusion

PHP wasn't really invented; it grew over time to serve a need for dynamic web pages. It still does this today, and it does it very well.

Still, PHP is a dynamic language and that means it pays a price for its flexibility. With an understanding of tokenization, building an AST, and generating opcodes, we can deploy tools like OPcache with much more confidence.

That's all I have to say for now, but hopefully you come back for more oh-so-exciting technical talk!
