Robust Error Handling: A Structured Framework For Nexus
Hey guys! Let's dive into a critical aspect of software development: error handling. Specifically, we're going to explore a proposal for a structured error handling framework within the Nexus project. Currently, error handling is a bit all over the place, leading to inconsistencies and making debugging harder than it needs to be. This article breaks down the problems, proposes a solution, and outlines a migration path. Let's get started!
Summary
The current error handling system in Nexus suffers from fragmentation. We see inconsistent exception types being used, and crucial context is often missing when errors occur. This makes it difficult to diagnose and resolve issues effectively. A structured approach is needed to bring order to the chaos.
Current State
Take a look inside src/nexus/core/exceptions.py. You'll find more than 10 exception types defined. That sounds comprehensive, right? However, the problem isn't the number of exceptions, but rather the inconsistent way they are used throughout the codebase. This inconsistency leads to confusion and makes it harder to maintain the code.
Problems
Let's break down the specific issues we're facing with the current error handling implementation.
1. Missing Context
This is a big one! Often, when an exception is raised, there's not enough information to understand what actually went wrong. Consider this example:
# Bad: No useful context
raise NexusError("Failed")
This tells us something failed, but it's completely useless for debugging. What failed? Where did it fail? Why did it fail? We need more context!
To properly diagnose an error, we need to know the specifics. For instance, if a file operation failed, we should know the file path, the type of operation being performed (read, write, etc.), and any relevant backend information. Without this context, we're essentially flying blind.
Imagine trying to troubleshoot a problem reported by a user. They tell you, "Something went wrong!" That's not very helpful, is it? Similarly, a generic error message like "Failed" leaves developers struggling to pinpoint the root cause of the issue. We need to provide enough information so that developers can quickly understand the problem and take appropriate action.
To solve this, exceptions should carry these details as attributes rather than burying them in free-form strings: the path being processed, the operation that failed (e.g., read, write, delete), and any other context that helps debugging, such as user IDs or timestamps. With that information attached to the exception, it becomes much easier to see what went wrong and how to fix it.
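Here's a minimal sketch of what "raising with context" can look like. ContextError and read_config are hypothetical names made up for this illustration, not part of the Nexus codebase; the point is only that the path, the operation, and the original low-level exception all survive.

```python
class ContextError(Exception):
    """Hypothetical stand-in for a context-carrying Nexus exception."""

    def __init__(self, message, *, path=None, operation=None):
        super().__init__(message)
        self.path = path
        self.operation = operation


def read_config(path):
    """Read a file, attaching path and operation to any OS-level failure."""
    try:
        with open(path, "rb") as fh:
            return fh.read()
    except OSError as exc:
        # "from exc" keeps the original error reachable via __cause__.
        raise ContextError(
            f"Failed to read {path}", path=path, operation="read"
        ) from exc


try:
    read_config("/nonexistent/nexus.yaml")
except ContextError as err:
    print(err, err.path, err.operation, type(err.__cause__).__name__)
```

A handler that catches this error can now report what failed, where, and why without parsing the message string.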
2. No Exception Hierarchy
Currently, exceptions lack a proper hierarchy. This means that they all inherit directly from the base Exception class, rather than being organized into a meaningful structure. This makes it difficult to catch specific types of errors and handle them appropriately.
# All exceptions inherit from Exception directly
class NexusError(Exception): ...
class NexusFileNotFoundError(Exception): ...
class BackendError(Exception): ...
Ideally, we want to establish an "IS-A" relationship between exceptions. For example, a NexusFileNotFoundError is a type of NexusError. This allows us to catch NexusError and handle all its subclasses, or catch a specific subclass for more targeted error handling.
# Should be:
class NexusError(Exception): ...
class NexusFileNotFoundError(NexusError): ... # IS-A NexusError
class BackendError(NexusError): ...
By creating a well-defined exception hierarchy, we can write more robust and maintainable code. We can catch exceptions at different levels of granularity, allowing us to handle errors in a more flexible and targeted manner. This also makes it easier to add new exception types in the future without breaking existing code.
Think of it like organizing files on your computer. You don't just dump everything into one giant folder, do you? You create a hierarchy of folders and subfolders to keep things organized. The same principle applies to exceptions. A well-structured hierarchy makes it easier to find and handle the specific errors you're interested in.
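To make the "different levels of granularity" point concrete, here's a small sketch. It uses bare stand-ins for the three exception names from the snippets above, and classify is a hypothetical helper that exists only to show which except clause wins.

```python
class NexusError(Exception): ...
class NexusFileNotFoundError(NexusError): ...
class BackendError(NexusError): ...


def classify(exc):
    """Report which handler a given exception would land in."""
    try:
        raise exc
    except NexusFileNotFoundError:
        return "missing-file"   # most specific handler first
    except NexusError:
        return "nexus-generic"  # any other Nexus error
    except Exception:
        return "foreign"        # not part of the hierarchy


print(classify(NexusFileNotFoundError("a.txt")))
print(classify(BackendError("s3 timeout")))
print(classify(ValueError("bad input")))
```

With the flat layout where every class inherits from Exception directly, the second clause would never see a BackendError, so callers would have to enumerate every type by hand.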
3. Inconsistent Error Messages
The codebase currently uses various styles for constructing error messages. This lack of consistency makes it harder to understand the errors and can lead to confusion.
# Various styles found in codebase:
raise NexusFileNotFoundError(path)
raise NexusFileNotFoundError(f"File not found: {path}")
raise BackendError("read failed", backend="local", path=path)
We need to standardize the way error messages are formatted. This will make it easier to parse the messages and extract relevant information. A consistent format also improves the overall readability of the code.
For example, we could adopt a convention of always including the file path in the error message for file-related errors. This would make it easier to identify the file that caused the error. Similarly, we could include the operation being performed in the error message to clarify what went wrong.
Consistency is key to maintainability. When error messages are formatted in a consistent way, it's easier to write automated tools to analyze and process them. This can be useful for monitoring the health of the system and identifying potential problems before they escalate.
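As a rough illustration of why a fixed format pays off for tooling, the sketch below parses messages of the shape "message | key=value | key=value". That exact layout is an assumption for this example, and parse_error_message is a hypothetical helper.

```python
def parse_error_message(text):
    """Split 'message | key=value | ...' into a dict of fields."""
    head, *fields = text.split(" | ")
    parsed = {"message": head}
    for field in fields:
        key, sep, value = field.partition("=")
        if sep:  # ignore fragments that are not key=value pairs
            parsed[key] = value
    return parsed


print(parse_error_message("File not found: /data/a.txt | path=/data/a.txt | op=read"))
```

This only works because every message follows one convention. It would misparse a value that itself contains " | ", which is one reason to prefer structured attributes over string parsing where possible.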
4. No Structured Logging
The current logging practices are inconsistent and lack structure, which makes it difficult to analyze logs and identify patterns. We see a mix of print statements, calls on the root logging module, and calls on named loggers with hand-built messages, none of which attach context consistently.
# Mix of: print, logging, logger calls
print(f"Error: {e}") # Bad
logging.error(f"Error: {e}") # No context
logger.error(f"Error reading {path}: {e}") # Better but manual
We need to adopt structured logging, which involves logging data in a consistent, machine-readable format. This allows us to easily analyze the logs and extract meaningful insights. Libraries like structlog can help us achieve this.
Structured logging allows you to include key-value pairs in your log messages, making it easy to filter, sort, and analyze the data. For example, you could include the error type, file path, operation, and user ID in your log messages. This would allow you to quickly identify the root cause of errors and track them over time.
Imagine trying to analyze a large log file with unstructured log messages. It's like trying to find a needle in a haystack. With structured logging, you can easily filter the log messages based on specific criteria, making it much easier to find the information you're looking for.
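To show the key-value idea without extra dependencies, here's a sketch that uses only the standard library rather than structlog. JsonFormatter, the "fields" key passed through extra, and the logger name are all choices made for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        # "fields" is our own convention, supplied via extra={...}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # capture output so it can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

demo_logger = logging.getLogger("nexus.demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

demo_logger.error(
    "read failed",
    extra={"fields": {"path": "/data/a.txt", "operation": "read"}},
)

entry = json.loads(stream.getvalue())
print(entry["operation"], entry["path"])
```

Because each line is a JSON object, filtering by path or operation becomes a dictionary lookup instead of a regular expression.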
Proposed Framework
To address these problems, let's introduce a structured error handling framework. This framework will include a well-defined exception hierarchy, an error handler decorator, and structured logging integration.
Exception Hierarchy
We'll start by defining a base exception class, NexusError, which all other exceptions will inherit from. This class will include common attributes such as a message, path, operation, backend, cause, and context.
class NexusError(Exception):
    """Base exception for all Nexus errors."""

    def __init__(
        self,
        message: str,
        *,
        path: str | None = None,
        operation: str | None = None,
        backend: str | None = None,
        cause: Exception | None = None,
        context: dict | None = None,
    ):
        self.message = message
        self.path = path
        self.operation = operation
        self.backend = backend
        self.cause = cause
        self.context = context or {}
        super().__init__(self._format_message())

    def _format_message(self) -> str:
        parts = [self.message]
        if self.path:
            parts.append(f"path={self.path}")
        if self.operation:
            parts.append(f"op={self.operation}")
        if self.backend:
            parts.append(f"backend={self.backend}")
        return " | ".join(parts)

    def to_dict(self) -> dict:
        """Serialize for logging/API responses."""
        return {
            "error": self.__class__.__name__,
            "message": self.message,
            "path": self.path,
            "operation": self.operation,
            "backend": self.backend,
            "context": self.context,
        }
We'll then define specific exception classes that inherit from NexusError, such as FileSystemError, NexusFileNotFoundError, PermissionDeniedError, BackendError, and ValidationError. Each of these classes will represent a specific type of error and can include additional attributes specific to that error type.
class FileSystemError(NexusError):
    """Errors related to filesystem operations."""
    pass


class NexusFileNotFoundError(FileSystemError):
    """File or directory not found."""

    def __init__(self, path: str, **kwargs):
        # Default to "read", but let callers override the operation
        # without triggering a duplicate-keyword TypeError.
        kwargs.setdefault("operation", "read")
        super().__init__(
            f"File not found: {path}",
            path=path,
            **kwargs,
        )


class PermissionDeniedError(NexusError):
    """Permission check failed."""

    def __init__(
        self,
        path: str,
        permission: str,
        user: str,
        **kwargs,
    ):
        # Merge any caller-supplied context instead of passing the
        # "context" keyword twice.
        extra = kwargs.pop("context", None) or {}
        super().__init__(
            f"Permission denied: {user} cannot {permission} {path}",
            path=path,
            operation=permission,
            context={"user": user, "permission": permission, **extra},
            **kwargs,
        )


class BackendError(NexusError):
    """Backend storage operation failed."""
    pass


class ValidationError(NexusError):
    """Input validation failed."""
    pass
This hierarchical structure allows for more specific error handling and provides a clear and consistent way to represent different types of errors.
Error Handler Decorator
To ensure consistent error handling across the codebase, we'll create an error handler decorator. This decorator will catch exceptions and re-raise them as NexusError exceptions, adding context and ensuring that all errors are handled in a consistent manner.
from functools import wraps
import logging

# NexusError, NexusFileNotFoundError, and PermissionDeniedError are the
# exception classes defined in the hierarchy above.
logger = logging.getLogger(__name__)


def _extract_path(args, kwargs) -> str:
    """Best-effort lookup of the path argument of a wrapped method."""
    # Assumes a method signature of (self, path, ...), so a positional
    # path is args[1].
    return kwargs.get("path", args[1] if len(args) > 1 else "unknown")


def handle_errors(operation: str):
    """Decorator for consistent error handling."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except NexusError:
                raise  # Already a Nexus error, re-raise
            except FileNotFoundError as e:
                raise NexusFileNotFoundError(
                    _extract_path(args, kwargs), cause=e
                ) from e
            except PermissionError as e:
                # "context" may be absent or None, so fall back to {}.
                user = (kwargs.get("context") or {}).get("user", "unknown")
                raise PermissionDeniedError(
                    path=_extract_path(args, kwargs),
                    permission=operation,
                    user=user,
                    cause=e,
                ) from e
            except Exception as e:
                logger.exception("Unexpected error in %s", operation)
                raise NexusError(
                    f"Unexpected error: {e}",
                    operation=operation,
                    cause=e,
                ) from e
        return wrapper
    return decorator


# Usage
class NexusFS:
    @handle_errors("read")
    def read(self, path: str) -> bytes:  # remaining parameters elided
        ...
The @handle_errors decorator takes an operation argument, which describes the operation being performed. This allows us to add context to the error message, such as the operation that failed. The decorator catches specific exceptions, such as FileNotFoundError and PermissionError, and re-raises them as NexusError exceptions with added context. It also catches any unexpected exceptions and logs them before re-raising them as NexusError exceptions.
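To see the translation behavior end to end, here's a condensed, self-contained sketch. It is a stripped-down stand-in and not the proposed decorator itself: translate_errors and DemoFS are hypothetical names, and NexusError is reduced to the two attributes this demo needs.

```python
from functools import wraps


class NexusError(Exception):
    """Reduced stand-in for the proposed base class."""

    def __init__(self, message, *, path=None, operation=None):
        super().__init__(message)
        self.path = path
        self.operation = operation


def translate_errors(operation):
    """Wrap any non-Nexus failure in a NexusError tagged with the operation."""
    def decorator(fn):
        @wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except NexusError:
                raise
            except Exception as exc:
                raise NexusError(
                    f"{operation} failed: {exc}", operation=operation
                ) from exc
        return wrapper
    return decorator


class DemoFS:
    @translate_errors("read")
    def read(self, path):
        """Read raw bytes from path."""
        with open(path, "rb") as fh:
            return fh.read()


try:
    DemoFS().read("/nonexistent/file.bin")
except NexusError as err:
    caught = err
    print(caught.operation, type(caught.__cause__).__name__)
```

Two details matter here: raise ... from exc keeps the original exception on __cause__ so the traceback still shows the root failure, and functools.wraps keeps the decorated method's metadata intact for introspection and docs.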
Structured Logging Integration
To improve logging, we'll integrate structlog into the error handling framework. This will allow us to log errors in a structured format, making it easier to analyze and debug them.
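import structlog

logger = structlog.get_logger()


class NexusError(Exception):
    # Added to the NexusError base class shown earlier; __init__,
    # _format_message, and to_dict are unchanged and omitted here.
    def log(self, level: str = "error"):
        """Emit this error as a structured log event."""
        log_fn = getattr(logger, level)
        log_fn(
            self.message,
            error_type=self.__class__.__name__,
            **self.to_dict(),
        )


# Usage
try:
    content = fs.read(path)
except NexusError as e:
    e.log()  # Structured log with all context
    raise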
The log method of the NexusError class logs the error message, error type, and any other relevant context information using structlog. This provides a consistent and structured way to log errors, making it easier to analyze and debug them.
Migration Path
Migrating to the new error handling framework will require a phased approach.
- Update exception hierarchy: Modify existing exceptions to inherit from the appropriate base classes in the new hierarchy.
- Add context to existing raises: Update existing raise statements to include relevant context information.
- Add error handler decorator: Apply the @handle_errors decorator to functions that need error handling.
- Integrate structlog: Replace existing logging calls with structlog calls.
- Update API error responses: Modify API error responses to use the structured error format.
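For the last step, one possible shape for a structured API error response is sketched below. The status codes, the STATUS_BY_TYPE table, and to_api_response are all assumptions for illustration; the proposal does not specify how errors map to HTTP statuses, and the classes here are reduced stand-ins.

```python
class NexusError(Exception):
    """Reduced stand-in exposing only to_dict."""

    def to_dict(self):
        return {"error": type(self).__name__, "message": str(self)}


class NexusFileNotFoundError(NexusError): ...
class PermissionDeniedError(NexusError): ...
class ValidationError(NexusError): ...


# Hypothetical mapping; real status codes are a project decision.
STATUS_BY_TYPE = {
    NexusFileNotFoundError: 404,
    PermissionDeniedError: 403,
    ValidationError: 400,
    NexusError: 500,
}


def to_api_response(exc):
    """Return (status, body), using the most specific mapped class."""
    if not isinstance(exc, NexusError):
        # Never leak details of unexpected errors to API clients.
        return 500, {"error": "InternalError", "message": "Internal server error"}
    for cls in type(exc).__mro__:
        status = STATUS_BY_TYPE.get(cls)
        if status is not None:
            return status, exc.to_dict()
    return 500, exc.to_dict()


print(to_api_response(NexusFileNotFoundError("missing: /a.txt")))
```

Walking the MRO means a new subclass automatically picks up its parent's status code until it gets an entry of its own, which is another benefit of the hierarchy.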
Priority
🟡 Medium - This improves debugging and monitoring capabilities.
Labels
architecture, error-handling, observability, p2