# Non-Functional Requirements (Quality Attributes)
## Terminology: Quality Attributes vs. Non-Functional Requirements
The Carnegie Mellon University Software Engineering Institute (CMU SEI), in their seminal work "Software Architecture in Practice" by Bass, Clements, and Kazman, argues for using the term "quality attributes" rather than "non-functional requirements." Their reasoning centers on the observation that the term "non-functional" is misleading because all requirements, by definition, serve some function within the system context.
Key arguments from CMU SEI research:
- All requirements have function: Every requirement, whether it specifies performance, security, or usability criteria, serves a functional purpose in meeting user and business needs
- Precision in terminology: "Quality attributes" more accurately describes these requirements as measurable properties that determine system quality
- Architectural significance: Quality attributes directly influence architectural decisions and trade-offs, making them central to system design
- Stakeholder communication: The term "quality attributes" better communicates the value and importance of these requirements to non-technical stakeholders
Industry adoption: While both terms are used interchangeably in practice, leading software architecture practitioners and frameworks increasingly favor "quality attributes" for the reasons outlined above.
Note: This document uses both terms interchangeably to align with common industry usage while recognizing the CMU SEI perspective on preferred terminology.
## Overview
Non-functional requirements (NFRs) define how a system performs rather than what it does. They are critical for ensuring systems meet business expectations for performance, reliability, security, and user experience.
## Core NFR Categories
### 1. Performance Requirements
#### Response Time
- API Response Time: < 200ms for 95th percentile
- Page Load Time: < 3 seconds for web applications
- Database Query Time: < 100ms for simple queries
- Batch Processing: Define acceptable processing windows
#### Throughput
- Transactions per Second (TPS): Define peak and sustained rates
- Concurrent Users: Maximum simultaneous active users
- Data Processing Rate: Records/messages processed per hour
- API Rate Limits: Requests per minute/hour per client
#### Resource Utilization
- CPU Usage: < 70% under normal load
- Memory Usage: < 80% of allocated memory
- Network Bandwidth: Define limits for data transfer
- Storage I/O: Define IOPS requirements
### 2. Reliability and Availability
#### Availability Targets
| Service Tier | Uptime SLA | Downtime per Year | Downtime per Month | Downtime per Week |
|---|---|---|---|---|
| Mission Critical (Five Nines) | 99.999% | 5.26 minutes | 25.9 seconds | 6 seconds |
| Critical | 99.99% | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| High | 99.9% | 8.77 hours | 43.2 minutes | 10.1 minutes |
| Standard | 99.5% | 1.83 days | 3.6 hours | 50.4 minutes |
| Basic | 99% | 3.65 days | 7.2 hours | 1.68 hours |
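The downtime budgets above follow directly from the SLA percentage. A minimal sketch (Python, assuming a 365-day year and a 30-day month, which matches the table's rounding) that converts an uptime target into an allowed downtime budget:

```python
def downtime_budget(uptime_pct: float) -> dict:
    """Allowed downtime in minutes per year/month/week for a given uptime SLA."""
    unavailability = 1 - uptime_pct / 100
    minutes = {"year": 365 * 24 * 60, "month": 30 * 24 * 60, "week": 7 * 24 * 60}
    return {period: total * unavailability for period, total in minutes.items()}

for sla in (99.999, 99.99, 99.9, 99.5, 99.0):
    budget = downtime_budget(sla)
    print(f"{sla}% -> {budget['year']:.2f} min/yr, "
          f"{budget['month']:.2f} min/mo, {budget['week']:.2f} min/wk")
```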
#### System Availability Calculations
Understanding how to calculate total system availability is critical for designing reliable architectures. The methodology varies based on system topology:
**1. Series Configuration (Single Critical Path):** When components are arranged in series, all must function for the system to work. The system is only as reliable as its weakest component.
Formula: A_system = A₁ × A₂ × A₃ × ... × Aₙ
Example: Web application with App Service (99.99%), SQL Database (99.95%), and Redis Cache (99.9%)
A_system = 0.9999 × 0.9995 × 0.999 = 0.9984 = 99.84%
**2. Parallel Configuration (Independent Paths):** When components provide redundant functionality, the system remains available if at least one component functions.
Formula: A_system = 1 - (1 - A₁) × (1 - A₂) × ... × (1 - Aₙ)
Example: Load balancer with two web servers (99.9% each)
A_system = 1 - (1 - 0.999) × (1 - 0.999) = 1 - 0.001² = 99.9999%
**3. Mixed Configuration (Series + Parallel):** Real systems often combine both patterns.
Formula: A_system = A_series × [1 - (1 - A_parallel1) × (1 - A_parallel2)]
Example: Gateway (99.95%) with two redundant databases (99.9% each)
A_system = 0.9995 × [1 - (1 - 0.999)²] = 0.9995 × 0.999999 ≈ 0.99950 = 99.95%
**4. Multi-Region Availability:** For systems deployed across multiple regions, assuming regions fail independently.
Formula: A_multi = 1 - (1 - A_single)^R
Where R = number of regions
Example: Single-region availability of 99.95% across 2 regions
A_multi = 1 - (1 - 0.9995)² = 1 - 0.0005² = 99.999975%
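The four formulas can be captured as small helpers and used to reproduce the examples above. A minimal sketch in Python (function names are illustrative, not from any particular library):

```python
from math import prod

def series(*availabilities: float) -> float:
    """All components are on the critical path; every one must be up."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """Redundant components; the system is up if at least one is up."""
    return 1 - prod(1 - a for a in availabilities)

def multi_region(single_region: float, regions: int) -> float:
    """Identical, independently failing deployments across several regions."""
    return 1 - (1 - single_region) ** regions

print(series(0.9999, 0.9995, 0.999))           # ~0.9984     (series example)
print(parallel(0.999, 0.999))                  # ~0.999999   (parallel example)
print(series(0.9995, parallel(0.999, 0.999)))  # ~0.9995     (mixed example)
print(multi_region(0.9995, 2))                 # ~0.99999975 (multi-region example)
```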
#### Composite SLO Calculation Guidelines
- Identify Critical Path: Only include services that could cause total system failure
- Consider Dependencies: Account for all components in the user request flow
- Factor in External Dependencies: Include third-party services, APIs, and infrastructure
- Account for Planned Maintenance: Budget for scheduled downtime
- Include Human Factors: Account for operational errors and deployment risks
#### Practical Calculation Example
Scenario: E-commerce application with the following components:
- Azure Front Door: 99.99%
- App Service (2 instances): 99.95% each
- SQL Database (with failover): 99.99%
- Azure Storage: 99.99%
- External payment API: 99.9%
Calculation:
App Service HA = 1 - (1 - 0.9995)² = 99.999975%
System = 0.9999 × 0.99999975 × 0.9999 × 0.9999 × 0.999
System ≈ 0.9987 = 99.87% availability
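The same composite can be cross-checked with a short standalone script (component SLAs taken from the scenario above); note how the external payment API, the weakest link, dominates the result:

```python
# Availability of each component in the user request path (from the scenario above).
front_door   = 0.9999
app_service  = 1 - (1 - 0.9995) ** 2   # two redundant instances in parallel
sql_database = 0.9999
storage      = 0.9999
payment_api  = 0.999                   # external dependency with the lowest SLA

system = front_door * app_service * sql_database * storage * payment_api
print(f"Composite availability: {system:.4%}")  # ~99.87%, dominated by the payment API
```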
#### Recovery Requirements
- Recovery Time Objective (RTO): Maximum acceptable downtime
    - Mission Critical: < 1 hour
    - Business Critical: < 4 hours
    - Important: < 24 hours
- Recovery Point Objective (RPO): Maximum acceptable data loss
    - Mission Critical: < 15 minutes
    - Business Critical: < 1 hour
    - Important: < 24 hours
- Mean Time to Recovery (MTTR): Average time to restore service
- Mean Time Between Failures (MTBF): Average operational time between failures (see the sketch after this list)
- Backup Frequency: Daily, weekly, or real-time replication
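MTBF and MTTR also determine steady-state availability through the standard relationship A = MTBF / (MTBF + MTTR), as sketched below with illustrative numbers:

```python
def availability_from_mtbf_mttr(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example: a failure roughly every 30 days (720 h) with a 1-hour average recovery.
print(f"{availability_from_mtbf_mttr(720, 1):.4%}")  # ~99.86%
```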
#### Error Handling
- Error Rate: < 0.1% of requests should result in errors
- Graceful Degradation: Define fallback behaviors for component failures
- Circuit Breaker Patterns: Prevent cascade failures and allow recovery
- Retry Logic: Exponential backoff strategies with jitter (sketched after this list)
- Bulkhead Isolation: Isolate critical resources to prevent resource exhaustion
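As a concrete illustration of the retry guidance above, here is a minimal exponential-backoff-with-full-jitter wrapper (not tied to any specific retry library; the operation being retried is a placeholder):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a callable with exponential backoff and full jitter between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure to the caller
            # Exponential backoff capped at max_delay, with full jitter to avoid
            # synchronized retry storms across clients.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            time.sleep(delay)

# Usage (hypothetical operation):
# result = retry_with_backoff(lambda: call_payment_api(order_id))
```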
### 3. Scalability Requirements
#### Horizontal Scaling
- Auto-scaling Triggers: CPU, memory, or queue depth thresholds (see the example after this list)
- Scaling Speed: Time to add/remove instances
- Load Distribution: Even distribution across instances
- State Management: Stateless application design
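One common way to turn a utilization threshold into a concrete scaling decision is the proportional rule popularized by the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current_instances × current_metric / target_metric). A minimal sketch with illustrative limits:

```python
import math

def desired_instances(current: int, current_cpu_pct: float, target_cpu_pct: float = 70.0,
                      min_instances: int = 2, max_instances: int = 20) -> int:
    """Proportional scale-out/in decision from average CPU utilization."""
    desired = math.ceil(current * current_cpu_pct / target_cpu_pct)
    return max(min_instances, min(max_instances, desired))

print(desired_instances(current=4, current_cpu_pct=90))  # -> 6 instances
```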
#### Vertical Scaling
- Resource Limits: Maximum CPU, memory, storage per instance
- Scaling Constraints: Hardware or platform limitations
- Cost Considerations: Performance vs. cost optimization
#### Data Scaling
- Database Sharding: Strategy for horizontal data distribution
- Read Replicas: Number and geographic distribution
- Caching Strategy: Redis, CDN, application-level caching
- Data Archiving: Long-term storage and retrieval strategies
### 4. Security Requirements
#### Authentication & Authorization
- Multi-Factor Authentication: Required for admin access
- Session Management: Timeout periods, secure tokens
- Role-Based Access Control: Principle of least privilege
- API Security: OAuth 2.0, rate limiting, API keys
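API rate limits are frequently enforced with a token-bucket algorithm: tokens refill at the permitted request rate and a request is rejected when the bucket is empty. A minimal single-process sketch (a production limiter would usually keep its state in a shared store such as Redis):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate=100 / 60, capacity=10)  # ~100 requests/minute, burst of 10
print(limiter.allow())  # True while tokens remain
```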
#### Data Protection
- Encryption Standards: AES-256 for data at rest, TLS 1.3 for data in transit (see the sketch after this list)
- Key Management: Azure Key Vault or similar HSM solutions
- Data Classification: Public, internal, confidential, restricted
- Data Retention: Compliance with GDPR, CCPA requirements
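For application-level encryption at rest, an authenticated mode such as AES-256-GCM is the usual choice. A minimal sketch using the `cryptography` package (assumed to be available; in practice the key would come from Azure Key Vault or another HSM-backed store rather than being generated inline):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # in practice, fetched from Key Vault/HSM
aesgcm = AESGCM(key)

nonce = os.urandom(12)                      # 96-bit nonce, unique per encryption
ciphertext = aesgcm.encrypt(nonce, b"sensitive payload", b"record-id-123")
plaintext = aesgcm.decrypt(nonce, ciphertext, b"record-id-123")
assert plaintext == b"sensitive payload"
```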
#### Monitoring & Auditing
- Audit Logging: All security events logged and retained
- Real-time Monitoring: Security incident detection and alerting
- Compliance Reporting: Automated compliance status reporting
- Vulnerability Management: Regular scanning and remediation
### 5. Usability and User Experience
#### User Interface
- Accessibility: WCAG 2.1 AA compliance
- Cross-browser Support: Chrome, Firefox, Safari, Edge
- Mobile Responsiveness: Support for tablets and smartphones
- Internationalization: Multi-language and locale support
#### User Experience
- Task Completion Rate: > 95% for primary user workflows
- User Error Rate: < 5% of user actions result in errors
- Learning Curve: New users productive within defined timeframe
- Help and Documentation: Context-sensitive help available
### 6. Maintainability and Operability
#### Code Quality
- Code Coverage: > 80% test coverage for critical paths
- Cyclomatic Complexity: Keep methods under complexity threshold
- Technical Debt: Regular refactoring and cleanup schedules
- Documentation: Up-to-date API documentation and runbooks
#### Deployment and Operations
- Deployment Frequency: Support for frequent, automated deployments
- Rollback Time: < 5 minutes to rollback failed deployments
- Configuration Management: Environment-specific configurations
- Monitoring and Alerting: Comprehensive observability stack
## NFR Definition Process
### 1. Stakeholder Requirements Gathering
- Business Stakeholders: Performance expectations, SLA requirements
- End Users: User experience and accessibility needs
- Operations Team: Monitoring, maintenance, and support requirements
- Security Team: Compliance and security control requirements
### 2. Requirements Analysis and Prioritization
Use the MoSCoW method for prioritization:
- Must Have: Critical for business operation
- Should Have: Important for user satisfaction
- Could Have: Nice to have if resources allow
- Won't Have: Explicitly out of scope
### 3. Quantitative Requirements Definition
Transform qualitative requirements into measurable criteria:
- "Fast response times" → "API responses < 200ms for 95th percentile"
- "Highly available" → "99.9% uptime SLA with < 4 hours monthly downtime"
- "Secure system" → "Zero tolerance for data breaches, PCI DSS compliance"
### 4. NFR Testing Strategy
#### Performance Testing
- Load Testing: Normal expected load using JMeter or Azure Load Testing
- Stress Testing: Beyond normal capacity to find breaking points
- Spike Testing: Sudden load increases and system recovery
- Volume Testing: Large amounts of data processing
#### Security Testing
- Penetration Testing: Simulated attacks by security professionals
- Vulnerability Scanning: Automated scanning for known vulnerabilities
- Authentication Testing: Multi-factor authentication and session management
- Data Protection Testing: Encryption and access control validation
#### Reliability Testing
- Chaos Engineering: Deliberate failure injection (Netflix Chaos Monkey)
- Disaster Recovery Testing: Full system recovery procedures
- Backup and Restore Testing: Data recovery verification
- Failover Testing: High availability configuration validation
### 5. Monitoring and Measurement
#### Key Performance Indicators (KPIs)
- Response Time Percentiles: P50, P95, P99 response times (see the sketch after this list)
- Error Rates: 4xx and 5xx HTTP error percentages
- Availability Metrics: Uptime percentage over time periods
- Resource Utilization: CPU, memory, disk, network usage
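Percentile KPIs can be derived from raw latency samples with nothing more than the standard library; in practice the APM tool reports them directly. A minimal sketch with illustrative data:

```python
import statistics

# Example latency samples in milliseconds (in practice, exported from the APM tool).
latencies_ms = [120, 135, 150, 180, 210, 95, 160, 300, 145, 170]

# n=100 yields the 99 percentile cut points P1..P99; 'inclusive' interpolates
# within the observed samples rather than extrapolating past them.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"P50={p50:.0f} ms  P95={p95:.0f} ms  P99={p99:.0f} ms")
print("Meets the < 200 ms P95 target:", p95 < 200)
```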
#### Alerting Thresholds
- Warning Levels: 80% of NFR thresholds (see the sketch after this list)
- Critical Levels: 95% of NFR thresholds
- Escalation Procedures: Automated escalation paths
- On-call Rotation: 24/7 support for critical systems
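Applied mechanically, the 80%/95% rule derives alert levels from each upper-bound NFR limit (latency ceilings, CPU/memory caps). A small illustrative sketch:

```python
def alert_thresholds(nfr_limit: float, warning_fraction=0.80, critical_fraction=0.95):
    """Derive warning/critical alert levels for an upper-bound NFR (e.g. latency, CPU %)."""
    return {"warning": nfr_limit * warning_fraction, "critical": nfr_limit * critical_fraction}

print(alert_thresholds(200))  # 200 ms P95 latency NFR -> warn at 160 ms, critical at 190 ms
print(alert_thresholds(70))   # 70% CPU NFR -> warn at 56%, critical at 66.5%
```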
## NFR Documentation Template
```markdown
# Non-Functional Requirements: [System Name]
## Performance Requirements
| Requirement | Target | Measurement Method | Priority |
| ----------------- | ------------------ | ---------------------- | --------- |
| API Response Time | < 200ms (P95) | Application monitoring | Must Have |
| Concurrent Users | 1,000 simultaneous | Load testing | Must Have |
## Availability Requirements
| Requirement | Target | Measurement Method | Priority |
| ------------- | -------- | ------------------------- | --------- |
| System Uptime | 99.9% | Infrastructure monitoring | Must Have |
| RTO | < 1 hour | Disaster recovery testing | Must Have |
## Security Requirements
| Requirement | Target | Measurement Method | Priority |
| --------------- | ----------------------------------- | ------------------ | --------- |
| Data Encryption | AES-256 at rest, TLS 1.3 in transit | Security scanning | Must Have |
| Authentication | MFA for admin access | Security audit | Must Have |
## Testing Approach
- Performance: JMeter load tests, Azure Load Testing
- Security: OWASP ZAP scanning, penetration testing
- Reliability: Chaos engineering, disaster recovery drills
## Monitoring Strategy
- APM: Application Performance Monitoring (New Relic, Datadog)
- Infrastructure: Azure Monitor, CloudWatch
- Security: SIEM integration, security event correlation
## Acceptance Criteria
[Define specific criteria that must be met before release]
```
## Integration with Development Process
### Planning Phase
- Define NFRs during epic and feature planning
- Include NFR acceptance criteria in user stories
- Estimate effort for NFR implementation and testing
### Development Phase
- Implement monitoring and instrumentation code
- Include NFR-focused unit and integration tests
- Conduct regular performance profiling during development
### Testing Phase
- Execute comprehensive NFR test suites
- Performance baseline establishment and regression testing
- Security scanning and penetration testing
### Release Phase
- NFR sign-off required before production deployment
- Production monitoring setup and validation
- Post-deployment NFR verification and tuning
## Tools and Resources
### Performance Testing
- Azure Load Testing: Cloud-based load testing service
- JMeter: Open-source performance testing tool
- k6: Developer-centric performance testing tool
- LoadRunner: Enterprise performance testing platform
### Monitoring and Observability
- Azure Application Insights: Application performance monitoring
- Datadog: Full-stack monitoring platform
- New Relic: Application performance monitoring
- Prometheus + Grafana: Open-source monitoring stack
### Security Testing
- OWASP ZAP: Web application security scanner
- Burp Suite: Web vulnerability scanner
- Nessus: Vulnerability assessment tool
- Qualys: Cloud security and compliance platform
## Common Pitfalls and Best Practices
### Pitfalls to Avoid
- Vague Requirements: "System should be fast" instead of specific metrics
- Late Definition: Defining NFRs after development begins
- No Testing Strategy: Assuming NFRs will be met without validation
- Ignoring Trade-offs: Not considering cost vs. performance implications
### Best Practices
- Start Early: Define NFRs during initial requirements gathering
- Be Specific: Use quantifiable, measurable criteria
- Test Continuously: Include NFR testing in CI/CD pipelines
- Monitor Production: Continuously validate NFRs in production
- Review Regularly: Update NFRs as business needs evolve