How to Understand a Codebase You Didn't Write
You just opened a 5000-line file with no comments. The guy who wrote it quit six months ago. You're screwed, unless you know how to dig. This isn't about fixing bugs. It's about switching your brain from developer to detective. Forget what the code should do. Start with what it actually does. That mental shift, from builder to forensic analyst, is the only thing standing between you and a full week of guessing.
The technical term for what you're dealing with is architectural erosion. Features piled on features, workarounds wrapped in other workarounds, business logic bleeding into view logic. Nobody planned it this way. It just grew, like mold. Your job isn't to judge the previous dev. Your job is to map the crime scene before you accidentally contaminate it.
Legacy code documentation
Here's the uncomfortable truth: the README.md is probably a lie. Not maliciously. It just hasn't been touched since the initial commit two years ago. It describes a system that no longer exists. You'll find references to a config file that was renamed, a deploy process that was replaced, and a database schema that evolved three times without anyone updating a single line of documentation. Treat the README as a historical artifact, not a manual.
The solution isn't to find better docs. It's to write your own as you go. Open a fresh file, call it discovery.md, and start logging everything you figure out. Not code, not fixes. Observations. "This function is called from three different controllers, but only two of them pass the correct argument type." "The config values in settings.php are partially overridden by environment variables that aren't documented anywhere." Raw, unfiltered field notes. This is your discovery log, and it will save you, and every developer after you, hours of re-investigation.
# discovery.md — started 2024-03-11
## auth/session.php
- SessionManager::start() is called before DB connection in some flows
- Magic number 1800 (line 214) = session timeout in seconds (not documented)
- $user_level compared against hardcoded integer 3 — no enum, no constant
- Side effect: writes to $_SESSION AND updates `users` table simultaneously
## Unclear
- Who calls cleanup_expired() ? Found no cron reference in codebase.
- "legacy_mode" flag in config — appears to disable half the validation layer
What this log tells you about the system
When your discovery log starts filling up with magic numbers and undocumented flags, you're looking at a system that grew faster than its abstractions. The legacy_mode flag is a classic code smell: someone patched around broken functionality instead of fixing it, then left the patch as permanent infrastructure. The dual write in SessionManager::start() is a side effect waiting to cause a race condition. These aren't just notes. They're a map of where the technical debt is load-bearing.
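One of the open questions in the log above is who calls cleanup_expired(). A rough way to chase that kind of question is to list every line that invokes a name while skipping its definition. The sketch below assumes PHP-style "function NAME(" definitions; find_callers is a hypothetical helper name, not something from the codebase.

```shell
# find_callers NAME [DIR]
# Hypothetical helper: lists lines that call NAME( but are not its definition.
# Searches every file under DIR, so crontabs and shell scripts are covered too.
find_callers() {
    name=$1
    dir=${2:-.}
    # First grep: any "NAME(" occurrence. Second grep: drop "function NAME(" lines.
    grep -rnE "${name}[[:space:]]*\(" "$dir" \
        | grep -vE "function[[:space:]]+${name}[[:space:]]*\(" || true
}
```

Run it as find_callers cleanup_expired . from the project root. Empty output is still a finding worth logging: the caller may live outside the repository, for example in a server crontab.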
Identifying entry points
Before you can trace anything, you need the head of the snake. In a PHP project, that's usually index.php. In a Node app, it's whatever package.json points to in the main field, or the file that bootstraps Express. In a legacy monolith, it might be a router file that's 800 lines long and does seventeen different things. Find it first.
Use Ctrl+Shift+F in VS Code or raw grep in the terminal to chase a specific UI element back to its source. Say a button triggers a payment. Search for the button's ID or class name, find the JS event handler, find the API endpoint it calls, find the controller method that handles that route. You're following a wire from the wall socket back to the power plant. Each hop is an entry point into a new layer of the system. Document every hop in your discovery log.
# Terminal: grep-based entry point tracing
# 1. Find where the frontend calls the payment endpoint
grep -rnE "processPayment|process-payment|/api/pay" ./src/
# 2. Found: assets/js/checkout.js line 88
#    $.post('/api/payment/process', formData)
# 3. Find the router registration (search routes/ as well as app/)
grep -rnE "payment/process|PaymentController" ./app/ ./routes/
# 4. Found: routes/api.php line 204
#    Route::post('/payment/process', [PaymentController::class, 'process']);
# 5. Now open PaymentController::process(): that's your real entry point
What this grep trail reveals
Following the wire manually like this exposes something static analysis tools often miss: the gap between where the code looks like it starts and where it actually starts. In a legacy codebase, you'll often find that the router has two conflicting registrations for the same path: one in the main routes file, one buried in a service provider that loads conditionally. The one that wins depends on load order, and nobody remembers which one that is. Your grep trail doesn't just find the entry point. It finds the landmines around it.
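Conflicting registrations like these can be surfaced mechanically. The sketch below assumes Laravel-style Route::verb('/path', ...) calls that sit on a single line, as in the trail above. duplicate_routes is a hypothetical helper name, and routes built dynamically or inside groups with prefixes will slip past it.

```shell
# duplicate_routes [DIR]
# Hypothetical helper: prints each "verb path" pair registered more than once.
duplicate_routes() {
    dir=${1:-.}
    # -h drops file names and -o keeps only the matched Route::verb('path' fragment
    grep -rhoE "Route::(get|post|put|patch|delete)\([[:space:]]*['\"][^'\"]+['\"]" "$dir" \
        | sed -E "s/Route::([a-z]+)\([[:space:]]*['\"]([^'\"]+)['\"]/\1 \2/" \
        | sort | uniq -d
}
```

Each line it prints is a path worth a second grep -rn to see which files register it, and a note in the discovery log about which registration loads last.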
Mapping software dependencies
Functions in a legacy system don't work in isolation. They pull data from the DB, reshape it, pass it to another function which reshapes it again, which passes it to a template that expects a completely different shape. The data moves like a hot potato: nobody holds it long enough to own it, and everyone assumes someone else validated it upstream. This is dependency hell in its natural habitat.
Start by drawing, literally, on paper or in a tool like draw.io, a rough map of how data flows between the major components. Which functions hit the database directly? Which ones receive pre-processed arrays? Where are the hardcoded links? Hardcoded file paths and URLs are especially brutal: include('/var/www/html/includes/config.php') works in exactly one environment, and everywhere else it fails with nothing more than a warning while the script keeps running. Static analysis tools like PHPStan or ESLint, or even a well-tuned grep/regex search for absolute paths, will usually surface these faster than manual reading.
// Classic dependency hell: real-world pattern
function getUserDashboardData($userId) {
    // Direct DB query: no repository, no abstraction
    $user = mysqli_query($conn, "SELECT * FROM users WHERE id = $userId");
    // Hardcoded path: breaks outside /var/www/html
    include('/var/www/html/includes/permissions.php');
    // Global scope contamination: $permissions set by included file
    if ($permissions['level'] >= 3) {
        // Calls external function that also queries the DB independently
        $stats = generateUserStats($userId); // <-- another DB hit, no caching
    }
    // Returns mixed array: caller has no idea what shape this is
    return array_merge($user, $stats ?? [], ['timestamp' => time()]);
}
Reading the dependency map
This single function is a case study in global scope contamination. The included file silently populates $permissions into the surrounding scope; if that include fails, the condition evaluates against an undefined variable. The extra DB hit inside generateUserStats() is a performance bottleneck that only shows up under load. And that array_merge at the end? The caller is getting a blob of data with no schema. Every piece of downstream code that consumes this return value is one refactor away from breaking in ways that are nearly impossible to predict without a full dependency map.
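The grep search for absolute paths mentioned earlier can be as small as the sketch below. It assumes PHP include/require statements whose argument is a string literal starting with "/". find_hardcoded_includes is a hypothetical helper name, and paths assembled from variables or constants will not be caught.

```shell
# find_hardcoded_includes [DIR]
# Hypothetical helper: lists include/require statements whose argument is a
# quoted absolute path, e.g. include('/var/www/html/includes/permissions.php').
find_hardcoded_includes() {
    dir=${1:-.}
    grep -rnE "(include|require)(_once)?[[:space:]]*\(?[[:space:]]*['\"]/" "$dir" || true
}
```

Statements built from __DIR__ or a base-path constant are skipped on purpose, since those are the portable form. Every hit belongs on the dependency map as an environment-specific link.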
Trace function calls without a debugger
Xdebug isn't configured. DevTools keeps losing the source map. The staging server logs are write-protected and the admin is on vacation. Welcome to the real world. You still have to figure out what this function actually does when it runs.
The manual approach: pick a variable. Any variable. Follow it from where it's first assigned, often deep in the global scope, all the way down to where it's returned, echoed, or written to the database. This is the trail of breadcrumbs. At each step, ask: what could modify this variable between here and there? Watch for array keys being silently overwritten, type coercions that turn a string into zero, and conditional branches that only activate in production because they check an environment variable that doesn't exist locally. Add temporary var_dump() or console.log() statements at each waypoint. Ugly? Yes. Effective? Completely.
// Tracing $token through a legacy auth flow: manual breadcrumb method
// Step 1: Origin. Where does $token come from?
$token = $_POST['auth_token'] ?? $_COOKIE['session_token'] ?? null;
// var_dump($token); // <-- breadcrumb #1
// Step 2: First transformation
$token = trim(htmlspecialchars($token)); // escapes &, <, > and quotes, so any token containing them is altered
// var_dump($token); // <-- breadcrumb #2
// Step 3: Passed into function. Does it arrive intact?
function validateToken($token) {
    // var_dump($token); // <-- breadcrumb #3
    global $config; // <-- global scope dependency, invisible from outside
    return hash_hmac('sha256', $token, $config['secret']) === $token;
    // Bug: comparing HMAC output against original input, so this is always false
}
What the breadcrumb trail exposes
The bug in validateToken() can easily slip through a code review because it looks plausible at a glance. The HMAC comparison is backwards, and it's been broken since it was written: the function hashes the token and then compares that hash against the token itself instead of against a separately supplied signature. The function always returns false, which means whatever call site checks its return value has a fallback path that's been silently handling all authentication for an unknown period of time. Without the manual trace, without that breadcrumb at step three, you'd probably never find this. The call stack doesn't lie, but you have to walk it yourself.
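Before placing breadcrumbs by hand, it can help to list every place the variable is written. The sketch below is a rough grep for plain and compound assignments to a PHP variable. trace_assignments is a hypothetical helper name, and the pattern is an approximation: it ignores writes through references, list() destructuring, and the ??= operator.

```shell
# trace_assignments VAR [DIR]
# Hypothetical helper: lists lines that assign to $VAR (=, .=, +=, -=, *=, /=)
# while skipping comparisons such as == and === and array arrows (=>).
trace_assignments() {
    var=$1
    dir=${2:-.}
    grep -rnE '\$'"$var"'[[:space:]]*[-+*/.]?=([^=>]|$)' "$dir" || true
}
```

Calling trace_assignments token . on the flow above would point at steps 1 and 2, the two places where $token changes, which are the natural spots for the first breadcrumbs.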
Isolated code testing
You found a function that looks responsible for the bug. But it's wired into six other systems. Running it in place risks a domino effect: one wrong input and you've cascaded a failure across the entire session layer. The sandbox method exists for exactly this situation.
Rip the function out. Physically. Copy it into a clean test_isolated.php or test_isolated.js file. Strip out every dependency you can: replace DB calls with hardcoded mock data, replace external API calls with static return values. Feed it the exact input you think is causing the problem. This is unit testing in field conditions, without a framework, without a test runner. It's crude and it works. The goal isn't coverage. The goal is to find the infection without crashing the server. Once you confirm the function misbehaves in isolation, you've localized the problem. Now you can fix it without touching everything else.
<?php
// test_isolated.php: sandbox for calculateDiscount()
// Mock dependencies: no DB, no session, no global config
$mockUser = [
    'id' => 42,
    'account_type' => 'premium',
    'purchase_history_count' => 15,
    'region' => 'EU'
];
$mockCartTotal = 250.00;
// Paste the isolated function directly, with zero external dependencies
function calculateDiscount($user, $cartTotal) {
    if ($user['account_type'] === 'premium' && $cartTotal > 200) {
        $rate = 0.15;
    } elseif ($user['purchase_history_count'] > 10) {
        $rate = 0.10;
    } else {
        $rate = 0.0;
    }
    // Bug hunting: EU users should get additional 5%. Is that handled?
    return round($cartTotal * $rate, 2);
}
$result = calculateDiscount($mockUser, $mockCartTotal);
var_dump($result);
// Expected for EU premium user with 15 purchases: 37.50 (15%) + 5% EU bonus
// Actual: 37.50, so the EU bonus is missing entirely
What isolation tells you that the system hides
Running the function in a sandbox immediately surfaces the missing EU discount logic, something that in the live system is obscured by the noise of session data, database round-trips, and the four other places where discounts might theoretically be applied. The sandbox proves the bug exists in this function, not somewhere upstream. That distinction matters enormously before you start refactoring: it's the difference between a targeted fix and a blind excavation. When you find the infection in isolation, you operate with a scalpel instead of a shovel.
You're not here to fix everything. You're here to leave the code slightly less broken than you found it, and to document what you learned so the next person doesn't start from zero. Pick one thing that's causing active pain, isolate it, fix it cleanly, and write a comment that explains why the fix works, not just what it does. That's your contribution to reducing the technical debt, one deliberate cut at a time. The codebase will outlive your sprint, your contract, maybe your tenure at the company. The notes you leave behind are the only thing standing between the next developer and the same week of confusion you just survived.
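The "rip it out" step can be scripted when the function is long enough that copy and paste gets error-prone. The sketch below pulls one function out of a file by counting braces. extract_function is a hypothetical helper name, and it assumes that braces inside strings and comments are balanced, so treat it as a rough cut and not a parser.

```shell
# extract_function NAME FILE
# Hypothetical helper: prints the source of "function NAME(...) { ... }".
extract_function() {
    awk -v name="$1" '
        # Start copying at the line that declares the function
        !found && $0 ~ ("function[ \t]+" name "[ \t]*[(]") { found = 1 }
        found {
            print
            line = $0
            opens  = gsub(/[{]/, "{", line)   # gsub returns the number of matches
            closes = gsub(/[}]/, "}", line)
            depth += opens - closes
            if (opens > 0) started = 1
            # Stop once the opening brace has been seen and every brace is closed
            if (started && depth == 0) exit
        }
    ' "$2"
}
```

For example, { echo '<?php'; extract_function calculateDiscount src/pricing.php; } > test_isolated.php would seed the sandbox file, where src/pricing.php is a made-up path standing in for wherever the function actually lives. The mock data and the var_dump() call still have to be added by hand.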