561 lines
17 KiB
Markdown
561 lines
17 KiB
Markdown
# Berry Repository Deep Architecture Analysis
|
|
|
|
## Executive Summary
|
|
Berry is a sophisticated embedded scripting language with a register-based virtual machine, one-pass compiler, and mark-sweep garbage collector. The architecture prioritizes memory efficiency and performance for embedded systems while maintaining full dynamic language capabilities.
|
|
|
|
---
|
|
|
|
## 1. CORE VIRTUAL MACHINE ARCHITECTURE
|
|
|
|
### 1.1 Virtual Machine Structure (`be_vm.h`, `be_vm.c`)
|
|
|
|
```c
|
|
struct bvm {
|
|
bglobaldesc gbldesc; // Global variable management
|
|
bvalue *stack; // Register stack (not call stack!)
|
|
bvalue *stacktop; // Stack boundary
|
|
bupval *upvalist; // Open upvalue chain for closures
|
|
bstack callstack; // Function call frames
|
|
bstack exceptstack; // Exception handling stack
|
|
bcallframe *cf; // Current call frame
|
|
bvalue *reg; // Current function base register
|
|
bvalue *top; // Current function top register
|
|
binstruction *ip; // Instruction pointer
|
|
struct bgc gc; // Garbage collector state
|
|
// ... performance counters, hooks, etc.
|
|
};
|
|
```
|
|
|
|
**Key Architectural Decisions:**
|
|
- **Register-based VM**: Unlike stack-based VMs (Python, Java), Berry uses registers for better performance
|
|
- **Unified Value System**: All values use `bvalue` structure with type tagging
|
|
- **Integrated GC**: Garbage collector is tightly integrated with VM execution
|
|
|
|
### 1.2 Value System (`be_object.h`)
|
|
|
|
```c
|
|
typedef struct bvalue {
|
|
union bvaldata v; // Value data (int, real, pointer, etc.)
|
|
int type; // Type tag (BE_INT, BE_STRING, etc.)
|
|
} bvalue;
|
|
```
|
|
|
|
**Type Hierarchy:**
|
|
```
|
|
Basic Types (not GC'd):
|
|
├── BE_NIL (0) - null value
|
|
├── BE_INT (1) - integer numbers
|
|
├── BE_REAL (2) - floating point
|
|
├── BE_BOOL (3) - boolean values
|
|
├── BE_COMPTR (4) - common pointer
|
|
└── BE_FUNCTION (6) - function reference
|
|
|
|
GC Objects (BE_GCOBJECT = 16):
|
|
├── BE_STRING (16) - string objects
|
|
├── BE_CLASS (17) - class definitions
|
|
├── BE_INSTANCE (18) - class instances
|
|
├── BE_PROTO (19) - function prototypes
|
|
├── BE_LIST (20) - dynamic arrays
|
|
├── BE_MAP (21) - hash tables
|
|
├── BE_MODULE (22) - module objects
|
|
└── BE_COMOBJ (23) - common objects
|
|
```
|
|
|
|
**Performance Optimization:**
|
|
- Simple types (int, bool, real) stored by value → no allocation overhead
|
|
- Complex types stored by reference → enables sharing and GC
|
|
- Type checking via bit manipulation for speed
|
|
|
|
---
|
|
|
|
## 2. COMPILATION SYSTEM
|
|
|
|
### 2.1 Lexical Analysis (`be_lexer.c`, `be_lexer.h`)
|
|
|
|
**Token Processing Pipeline:**
|
|
```
|
|
Source Code → Lexer → Token Stream → Parser → AST → Code Generator → Bytecode
|
|
```
|
|
|
|
**Key Features:**
|
|
- **One-pass compilation**: No separate AST construction phase
|
|
- **Integrated string interning**: Strings deduplicated during lexing
|
|
- **Error recovery**: Continues parsing after syntax errors
|
|
|
|
### 2.2 Parser (`be_parser.c`, `be_parser.h`)
|
|
|
|
**Expression Descriptor System:**
|
|
```c
|
|
typedef struct {
|
|
union {
|
|
struct { /* for suffix expressions */
|
|
unsigned int idx:9; // RK index (register/constant)
|
|
unsigned int obj:9; // object RK index
|
|
unsigned int tt:5; // object type
|
|
} ss;
|
|
breal r; // for real constants
|
|
bint i; // for integer constants
|
|
bstring *s; // for string constants
|
|
int idx; // variable index
|
|
} v;
|
|
int t, f; // true/false jump patch lists
|
|
bbyte not; // logical NOT flag
|
|
bbyte type; // expression type (ETLOCAL, ETGLOBAL, etc.)
|
|
} bexpdesc;
|
|
```
|
|
|
|
**Expression Types:**
|
|
- `ETLOCAL`: Local variables (register-allocated)
|
|
- `ETGLOBAL`: Global variables (by index)
|
|
- `ETUPVAL`: Upvalues (closure variables)
|
|
- `ETMEMBER`: Object member access (`obj.member`)
|
|
- `ETINDEX`: Array/map indexing (`obj[key]`)
|
|
- `ETREG`: Temporary registers
|
|
|
|
### 2.3 Code Generation (`be_code.c`)
|
|
|
|
**Bytecode Instruction Format:**
|
|
```
|
|
32-bit instruction = [8-bit opcode][24-bit parameters]
|
|
|
|
Parameter formats:
|
|
- A, B, C: 8-bit register/constant indices
|
|
- Bx: 16-bit constant index
|
|
- sBx: 16-bit signed offset (jumps)
|
|
```
|
|
|
|
**Register Allocation Strategy:**
|
|
- **Local variables**: Allocated to specific registers for function lifetime
|
|
- **Temporaries**: Allocated/freed dynamically during expression evaluation
|
|
- **Constants**: Stored in constant table, accessed via K(index)
|
|
|
|
---
|
|
|
|
## 3. MEMORY MANAGEMENT SYSTEM
|
|
|
|
### 3.1 Garbage Collection (`be_gc.c`, `be_gc.h`)
|
|
|
|
**Mark-Sweep Algorithm:**
|
|
```c
|
|
struct bgc {
|
|
bgcobject *list; // All GC objects
|
|
bgcobject *gray; // Gray objects (mark phase)
|
|
bgcobject *fixed; // Fixed objects (never collected)
|
|
struct gc16_t* pool16; // Small object pool (≤16 bytes)
|
|
struct gc32_t* pool32; // Medium object pool (17-32 bytes)
|
|
size_t usage; // Current memory usage
|
|
size_t threshold; // GC trigger threshold
|
|
bbyte steprate; // Threshold growth rate
|
|
bbyte status; // GC state
|
|
};
|
|
```
|
|
|
|
**GC Object Header:**
|
|
```c
|
|
#define bcommon_header \
|
|
struct bgcobject *next; \ // Linked list pointer
|
|
bbyte type; \ // Object type
|
|
bbyte marked // GC mark bits
|
|
```
|
|
|
|
**Tri-color Marking:**
|
|
- **White**: Unreachable (will be collected)
|
|
- **Gray**: Reachable but children not yet marked
|
|
- **Dark**: Reachable and children marked
|
|
|
|
**Memory Pools:**
|
|
- **Small objects (≤16 bytes)**: Pooled allocation for strings, small objects
|
|
- **Medium objects (17-32 bytes)**: Separate pool for medium objects
|
|
- **Large objects**: Direct malloc/free
|
|
|
|
### 3.2 String Management (`be_string.c`, `be_string.h`)
|
|
|
|
**String Interning System:**
|
|
```c
|
|
struct bstringtable {
|
|
bstring **table; // Hash table of interned strings
|
|
int size; // Table size
|
|
int count; // Number of strings
|
|
};
|
|
```
|
|
|
|
**String Types:**
|
|
- **Short strings**: Interned in global table, shared across VM
|
|
- **Long strings**: Not interned, individual objects
|
|
- **Constant strings**: Embedded in bytecode, never collected
|
|
|
|
---
|
|
|
|
## 4. BUILT-IN LIBRARY ARCHITECTURE
|
|
|
|
### 4.1 JSON Library (`be_jsonlib.c`) - **SECURITY CRITICAL**
|
|
|
|
**Recent Security Enhancements:**
|
|
```c
|
|
// Safe Unicode string length calculation
|
|
static size_t json_strlen_safe(const char *str, size_t len) {
|
|
size_t result = 0;
|
|
const char *end = str + len;
|
|
|
|
while (str < end) {
|
|
if (*str == '\\' && str + 1 < end) {
|
|
if (str[1] == 'u') {
|
|
// Unicode escape: \uXXXX → 1-3 UTF-8 bytes
|
|
result += 3; // Conservative allocation
|
|
str += 6; // Skip \uXXXX
|
|
} else {
|
|
result += 1; // Other escapes → 1 byte
|
|
str += 2; // Skip \X
|
|
}
|
|
} else {
|
|
result += 1;
|
|
str += 1;
|
|
}
|
|
}
|
|
return result;
|
|
}
|
|
```
|
|
|
|
**Security Features:**
|
|
- **Buffer overflow protection**: Proper size calculation for Unicode sequences
|
|
- **Input validation**: Rejects malformed Unicode and control characters
|
|
- **Memory limits**: MAX_JSON_STRING_LEN prevents memory exhaustion
|
|
- **Comprehensive testing**: 10 security test functions covering edge cases
|
|
|
|
### 4.2 Native Function Interface
|
|
|
|
**Function Registration:**
|
|
```c
|
|
typedef int (*bntvfunc)(bvm *vm);
|
|
|
|
// Native function descriptor
|
|
typedef struct {
|
|
const char *name;
|
|
bntvfunc function;
|
|
} bnfuncinfo;
|
|
```
|
|
|
|
**Calling Convention:**
|
|
- Arguments passed via VM stack
|
|
- Return values via `be_return()` or `be_returnvalue()`
|
|
- Error handling via exceptions
|
|
|
|
---
|
|
|
|
## 5. ADVANCED LANGUAGE FEATURES
|
|
|
|
### 5.1 Closure Implementation (`be_func.c`)
|
|
|
|
**Upvalue Management:**
|
|
```c
|
|
typedef struct bupval {
|
|
bcommon_header;
|
|
bvalue *value; // Points to stack slot or own storage
|
|
union {
|
|
bvalue val; // Closed upvalue storage
|
|
struct bupval *next; // Open upvalue chain
|
|
} u;
|
|
} bupval;
|
|
```
|
|
|
|
**Closure Lifecycle:**
|
|
1. **Open upvalues**: Point to stack slots of parent function
|
|
2. **Closing**: When parent function returns, copy values to upvalue storage
|
|
3. **Shared upvalues**: Multiple closures can share same upvalue
|
|
|
|
### 5.2 Class System (`be_class.c`)
|
|
|
|
**Class Structure:**
|
|
```c
|
|
typedef struct bclass {
|
|
bcommon_header;
|
|
bstring *name; // Class name
|
|
bclass *super; // Superclass (single inheritance)
|
|
bmap *members; // Instance methods and variables
|
|
bmap *nvar; // Native variables
|
|
// ... method tables, constructors, etc.
|
|
} bclass;
|
|
```
|
|
|
|
**Method Resolution:**
|
|
1. Check instance methods
|
|
2. Check class methods
|
|
3. Check superclass (recursive)
|
|
4. Check native methods
|
|
|
|
### 5.3 Module System (`be_module.c`)
|
|
|
|
**Module Loading Pipeline:**
|
|
```
|
|
Module Name → Path Resolution → File Loading → Compilation → Caching → Execution
|
|
```
|
|
|
|
**Module Types:**
|
|
- **Script modules**: `.be` files compiled to bytecode
|
|
- **Bytecode modules**: Pre-compiled `.bec` files
|
|
- **Native modules**: Shared libraries (`.so`, `.dll`)
|
|
|
|
---
|
|
|
|
## 6. PERFORMANCE OPTIMIZATIONS
|
|
|
|
### 6.1 Register-Based VM Benefits
|
|
|
|
**Comparison with Stack-Based VMs:**
|
|
```
|
|
Stack-based (Python): Register-based (Berry):
|
|
LOAD_FAST 0 # Already in register
|
|
LOAD_FAST 1 ADD R0, R1, R2
|
|
BINARY_ADD # Single instruction
|
|
STORE_FAST 2
|
|
```
|
|
|
|
**Advantages:**
|
|
- **Fewer instructions**: Direct register operations
|
|
- **Better locality**: Registers stay in CPU cache
|
|
- **Reduced stack manipulation**: No push/pop overhead
|
|
|
|
### 6.2 Constant Folding and Optimization
|
|
|
|
**Compile-time Optimizations:**
|
|
- **Constant folding**: `2 + 3` → `5` at compile time
|
|
- **Jump optimization**: Eliminate redundant jumps
|
|
- **Register reuse**: Minimize register allocation
|
|
|
|
### 6.3 Memory Pool Allocation
|
|
|
|
**Small Object Optimization:**
|
|
- **Pool allocation**: Reduces malloc/free overhead
|
|
- **Size classes**: 16-byte and 32-byte pools
|
|
- **Batch allocation**: Allocate multiple objects at once
|
|
|
|
---
|
|
|
|
## 7. SECURITY ARCHITECTURE
|
|
|
|
### 7.1 Memory Safety
|
|
|
|
**Buffer Overflow Protection:**
|
|
- **Bounds checking**: All array/string accesses validated
|
|
- **Size calculation**: Conservative memory allocation
|
|
- **Input validation**: Reject malformed input early
|
|
|
|
**Integer Overflow Protection:**
|
|
- **Size limits**: Maximum object sizes enforced
|
|
- **Wraparound detection**: Check for arithmetic overflow
|
|
- **Safe arithmetic**: Use checked operations where needed
|
|
|
|
### 7.2 Sandboxing Capabilities
|
|
|
|
**Resource Limits:**
|
|
- **Memory limits**: Configurable heap size limits
|
|
- **Execution limits**: Instruction count limits
|
|
- **Stack limits**: Prevent stack overflow attacks
|
|
|
|
**API Restrictions:**
|
|
- **Selective module loading**: Control which modules are available
|
|
- **Function filtering**: Restrict dangerous native functions
|
|
- **File system access**: Configurable file access permissions
|
|
|
|
---
|
|
|
|
## 8. TESTING AND QUALITY ASSURANCE
|
|
|
|
### 8.1 Test Suite Architecture
|
|
|
|
**Test Categories:**
|
|
```
|
|
Unit Tests (51 total):
|
|
├── Language Features (15 tests)
|
|
│ ├── assignment.be, bool.be, class.be
|
|
│ ├── closure.be, function.be, for.be
|
|
│ └── vararg.be, cond_expr.be, exceptions.be
|
|
├── Data Types (12 tests)
|
|
│ ├── list.be, map.be, range.be
|
|
│ ├── string.be, int.be, bytes.be
|
|
│ └── int64.be, bytes_fixed.be, bytes_b64.be
|
|
├── Libraries (8 tests)
|
|
│ ├── json.be (9168 lines - comprehensive security tests)
|
|
│ ├── math.be, os.be, debug.be
|
|
│ └── introspect.be, re.be, time.be
|
|
├── Parser/Compiler (6 tests)
|
|
│ ├── parser.be, lexer.be, compiler.be
|
|
│ └── suffix.be, lexergc.be, reference.be
|
|
└── Advanced Features (10 tests)
|
|
├── virtual_methods.be, super_auto.be
|
|
├── class_static.be, division_by_zero.be
|
|
└── compound.be, member_indirect.be
|
|
```
|
|
|
|
### 8.2 Security Testing
|
|
|
|
**JSON Security Test Suite (10 functions):**
|
|
1. **Unicode expansion buffer overflow protection**
|
|
2. **Invalid Unicode sequence rejection**
|
|
3. **Control character validation**
|
|
4. **Invalid escape sequence rejection**
|
|
5. **String length limits**
|
|
6. **Mixed Unicode and ASCII handling**
|
|
7. **Edge case coverage**
|
|
8. **Malformed JSON string handling**
|
|
9. **Nested Unicode stress testing**
|
|
10. **Security regression prevention**
|
|
|
|
---
|
|
|
|
## 9. BUILD AND DEPLOYMENT SYSTEM
|
|
|
|
### 9.1 Build Configuration
|
|
|
|
**Configuration System (`berry_conf.h`):**
|
|
```c
|
|
// Memory configuration
|
|
#define BE_STACK_TOTAL_MAX 2000 // Maximum stack size
|
|
#define BE_STACK_FREE_MIN 20 // Minimum free stack
|
|
|
|
// Feature toggles
|
|
#define BE_USE_PERF_COUNTERS 0 // Performance monitoring
|
|
#define BE_USE_DEBUG_GC 0 // GC debugging
|
|
#define BE_USE_SCRIPT_COMPILER 1 // Include compiler
|
|
|
|
// Integer type selection
|
|
#define BE_INTGER_TYPE 1 // 0=int, 1=long, 2=long long
|
|
```
|
|
|
|
### 9.2 Cross-Platform Support
|
|
|
|
**Platform Abstraction (`be_port.c`):**
|
|
- **File I/O**: Unified file operations across platforms
|
|
- **Time functions**: Platform-specific time implementation
|
|
- **Memory allocation**: Custom allocator hooks
|
|
- **Thread safety**: Optional threading support
|
|
|
|
### 9.3 Code Generation Tools (`tools/coc/`)
|
|
|
|
**Compile-on-Command System:**
|
|
- **String table generation**: Pre-compute string constants
|
|
- **Module compilation**: Convert `.be` files to C arrays
|
|
- **Constant optimization**: Merge duplicate constants
|
|
- **Size optimization**: Minimize memory footprint
|
|
|
|
---
|
|
|
|
## 10. PERFORMANCE CHARACTERISTICS
|
|
|
|
### 10.1 Memory Footprint
|
|
|
|
**Interpreter Core:**
|
|
- **Minimum size**: <40KiB executable
|
|
- **Runtime heap**: <4KiB minimum (ARM Cortex M4)
|
|
- **Per-VM overhead**: ~1-2KiB for VM state
|
|
- **GC overhead**: ~10-20% of allocated objects
|
|
|
|
### 10.2 Execution Performance
|
|
|
|
**Benchmark Characteristics:**
|
|
- **Function calls**: ~2-5x slower than C
|
|
- **Arithmetic**: ~3-10x slower than C
|
|
- **String operations**: Competitive due to interning
|
|
- **Object creation**: Fast due to pooled allocation
|
|
|
|
### 10.3 Compilation Speed
|
|
|
|
**Compilation Performance:**
|
|
- **One-pass**: No separate AST construction
|
|
- **Incremental**: Can compile individual functions
|
|
- **Memory efficient**: Minimal intermediate storage
|
|
- **Error recovery**: Continues after syntax errors
|
|
|
|
---
|
|
|
|
## 11. EXTENSIBILITY AND EMBEDDING
|
|
|
|
### 11.1 C API Design (`be_api.c`)
|
|
|
|
**API Categories:**
|
|
```c
|
|
// VM lifecycle
|
|
bvm* be_vm_new(void);
|
|
void be_vm_delete(bvm *vm);
|
|
|
|
// Script execution
|
|
int be_loadstring(bvm *vm, const char *str);
|
|
int be_pcall(bvm *vm, int argc);
|
|
|
|
// Stack manipulation
|
|
void be_pushnil(bvm *vm);
|
|
void be_pushint(bvm *vm, bint value);
|
|
bint be_toint(bvm *vm, int index);
|
|
|
|
// Native function registration
|
|
void be_regfunc(bvm *vm, const char *name, bntvfunc f);
|
|
```
|
|
|
|
### 11.2 Native Module Development
|
|
|
|
**Module Registration Pattern:**
|
|
```c
|
|
static int my_function(bvm *vm) {
|
|
int argc = be_top(vm);
|
|
if (argc >= 1 && be_isint(vm, 1)) {
|
|
bint value = be_toint(vm, 1);
|
|
be_pushint(vm, value * 2);
|
|
be_return(vm);
|
|
}
|
|
be_return_nil(vm);
|
|
}
|
|
|
|
static const bnfuncinfo functions[] = {
|
|
{ "my_function", my_function },
|
|
{ NULL, NULL }
|
|
};
|
|
|
|
int be_open_mymodule(bvm *vm) {
|
|
be_regfunc(vm, "my_function", my_function);
|
|
return 0;
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 12. FUTURE ARCHITECTURE CONSIDERATIONS
|
|
|
|
### 12.1 Potential Optimizations
|
|
|
|
**JIT Compilation:**
|
|
- **Hot path detection**: Identify frequently executed code
|
|
- **Native code generation**: Compile to machine code
|
|
- **Deoptimization**: Fall back to interpreter when needed
|
|
|
|
**Advanced GC:**
|
|
- **Generational GC**: Separate young/old generations
|
|
- **Incremental GC**: Spread collection across multiple cycles
|
|
- **Concurrent GC**: Background collection threads
|
|
|
|
### 12.2 Security Enhancements
|
|
|
|
**Enhanced Sandboxing:**
|
|
- **Capability-based security**: Fine-grained permission system
|
|
- **Resource quotas**: CPU time, memory, I/O limits
|
|
- **Audit logging**: Track security-relevant operations
|
|
|
|
**Cryptographic Support:**
|
|
- **Secure random numbers**: Cryptographically secure PRNG
|
|
- **Hash functions**: Built-in SHA-256, etc.
|
|
- **Encryption**: Symmetric/asymmetric crypto primitives
|
|
|
|
---
|
|
|
|
## CONCLUSION
|
|
|
|
Berry represents a sophisticated balance between simplicity and functionality. Its register-based VM, one-pass compiler, and integrated garbage collector provide excellent performance for embedded systems while maintaining the flexibility of a dynamic language. The recent security enhancements, particularly in JSON parsing, demonstrate a commitment to production-ready robustness.
|
|
|
|
The architecture's key strengths are:
|
|
- **Memory efficiency**: Minimal overhead for embedded deployment
|
|
- **Performance**: Register-based VM with optimized execution
|
|
- **Security**: Comprehensive input validation and buffer protection
|
|
- **Extensibility**: Clean C API for native integration
|
|
- **Maintainability**: Well-structured codebase with comprehensive testing
|
|
|
|
This deep analysis provides the foundation for understanding any aspect of Berry's implementation, from low-level VM details to high-level language features.
|