Understanding Code Python Compilation and Execution

December 08, 2024

Traditional Compilation Process (e.g., C/C++)

Phases of Compilation:

Preprocessing (if applicable):
In languages like C/C++, a preprocessor handles directives such as #include, #define, and conditional compilation directives before the compiler sees the code. The result is a single expanded source file.
Lexical Analysis & Parsing:
The compiler converts source code into tokens (lexing) and then checks the sequence of tokens against the language’s grammar (parsing). From this, it constructs an Abstract Syntax Tree (AST).
Semantic Analysis:
The compiler verifies type correctness, variable declarations, and function usage. It ensures the AST represents logically consistent code.
Intermediate Representation (IR) Generation:
After semantic checks, the compiler transforms the AST into a language- and machine-independent IR. This IR simplifies optimizations and is often in Static Single Assignment (SSA) form.
Optimizations:
The IR undergoes various optimizations such as constant folding, dead code elimination, and loop transformations to improve runtime performance.
Code Generation:
The optimized IR is then translated into assembly language specific to the target architecture. This includes instruction selection, register allocation, and instruction scheduling.
Assembly & Linking:
The assembler translates assembly code into machine code (object files). The linker then combines these object files along with required libraries into a final executable.

IR Example (LLVM):
A simple function int add(int a, int b) { return a + b; } might compile to LLVM IR instructions like:

define i32 @add(i32 %a, i32 %b) 
{ 
    entry:%sum = add i32 %a,%b   
    ret i32 %sum 
 }

This IR is a simplified, platform-independent representation of the code’s logic.

Python Compilation and Execution

Key Differences from Traditional Compiled Languages:

Python is commonly known as an interpreted language. However, it does have a compilation step—from source code (.py) to bytecode (.pyc).
Python does not produce native machine code by default. Instead, it generates bytecode, which runs on the Python Virtual Machine (PVM).

How Python Code Compiles:

From Source to AST:
When you run a Python script or import a module, Python first parses the source code. It checks syntax and creates an AST.
From AST to Bytecode:
The AST is compiled into Python bytecode, a lower-level, platform-independent instruction set for the PVM. The result is often cached in __pycache__ as .pyc files to speed up future runs.

Bytecode and the Virtual Machine:

Python’s bytecode is a series of instructions like LOAD_FAST, BINARY_ADD, CALL_FUNCTION, and RETURN_VALUE.
The Python Virtual Machine is a stack-based interpreter. It:
- Loads constants and variables onto an operand stack.
- Executes instructions by popping arguments off the stack, performing operations, and pushing results back.
Each bytecode instruction corresponds to a small piece of C code within the CPython interpreter. The interpreter loop fetches an opcode, dispatches to the handler code (already compiled into native machine instructions), and executes it.

No Direct Assembly Code Generation by Default:

The standard CPython interpreter itself is compiled from C source code into a native binary by your system’s C compiler before you run Python programs.
The PVM reads and executes the Python bytecode instructions, but does not translate them into machine code at runtime.
Alternative Python implementations or tools (like PyPy or Numba) may introduce Just-In-Time (JIT) compilation, generating machine code on the fly for performance-critical sections. However, this is not the standard behavior of CPython.

Import Mechanism in Python

How import Works:

Locating the Module:
On encountering import mymodule, Python searches the directories in sys.path for mymodule.py or a suitable .pyc file.
Compilation if Needed:
- If a .pyc file is present and up-to-date, Python uses it directly.
- If not, Python compiles mymodule.py into bytecode and stores it in __pycache__ next to the source file.
Module Execution and Caching:
- The bytecode is executed once, running top-level statements and definitions.
- The resulting module object is cached in sys.modules.
  Any subsequent imports of the same module (even from different source files) will reuse the existing module object and .pyc file. There is no duplication of compiled code in separate __pycache__ directories for different importer files.

Multiple Imports of the Same Module:

If two different Python files import the same module, the module is compiled at most once.
The compiled .pyc file resides in the module’s own __pycache__ directory.
All importers share that single compiled bytecode cache. No extra .pyc files are created in the importing files’ directories.

Summary

Traditional Compilers (C/C++, etc.):
Convert source code all the way down to machine code via IR and optimization steps, producing a standalone binary.
Python (CPython):
Compiles .py files into .pyc bytecode files, which are executed by the Python Virtual Machine. The VM is implemented in C, and is already compiled into native machine code. Python bytecode instructions serve as data for this pre-compiled interpreter loop, not as instructions that get translated further into assembly during runtime.
Imports in Python:
Modules are compiled once into .pyc files. Subsequent imports of the same module, from anywhere, use the single cached bytecode.

This architecture makes Python extremely flexible and portable, at the cost of potentially slower execution than natively compiled languages. Tools like PyPy or Cython can bridge that performance gap by introducing JIT or static compilation techniques.