Monitoring Overview
Monitoring and observability are crucial for maintaining healthy, performant applications. This guide covers the monitoring tools, metrics, and best practices for Solverhood applications.
Monitoring Stack
Core Components
- Application Performance Monitoring (APM): New Relic, Datadog, or Sentry
- Infrastructure Monitoring: Prometheus with Grafana
- Logging: ELK Stack (Elasticsearch, Logstash, Kibana) or CloudWatch
- Error Tracking: Sentry for error monitoring and alerting
- Health Checks: Custom health check endpoints
Recommended Tools
- New Relic: Full-stack observability platform
- Prometheus + Grafana: Metrics collection and visualization
- Sentry: Error tracking and performance monitoring
- CloudWatch: AWS-native monitoring and logging
- Datadog: Comprehensive monitoring platform
Key Metrics to Monitor
Application Metrics
- Response Time: Average, 95th percentile, 99th percentile
- Throughput: Requests per second (RPS)
- Error Rate: Percentage of failed requests
- Availability: Uptime percentage
- Memory Usage: Heap usage, garbage collection
- CPU Usage: CPU utilization percentage
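To make these metrics concrete: the alert rules and Grafana panels later in this guide query series named http_requests_total and http_request_duration_seconds. One way to produce them from an Express app with prom-client is a small middleware like the sketch below; the metric and label names are illustrative and should match whatever your dashboards and alerts expect.

const express = require('express');
const client = require('prom-client');
const app = express();
// Counter feeds throughput and error-rate queries; histogram feeds latency percentiles
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});
app.use((req, res, next) => {
  const endTimer = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const route = req.route ? req.route.path : req.path;
    httpRequestsTotal.inc({ method: req.method, route, status: res.statusCode });
    endTimer({ method: req.method, route });
  });
  next();
});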
Business Metrics
- User Activity: Daily/Monthly Active Users (DAU/MAU)
- Feature Usage: Most used features and endpoints
- Conversion Rates: User journey completion rates
- Revenue Metrics: If applicable to your business
Infrastructure Metrics
- Server Resources: CPU, memory, disk usage
- Database Performance: Query execution time, connection pool usage
- Network: Bandwidth usage, latency
- Cache Performance: Hit/miss ratios
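Cache hit/miss ratios are usually easiest to capture as a pair of counters incremented around cache lookups, with the ratio computed at query time in Prometheus. A minimal sketch, assuming prom-client and a Redis-style client named cache:

const client = require('prom-client');
const cacheHits = new client.Counter({ name: 'cache_hits_total', help: 'Cache hits' });
const cacheMisses = new client.Counter({ name: 'cache_misses_total', help: 'Cache misses' });
// `cache` is assumed to be a Redis-style client with async get/set
async function getWithMetrics(key, loadFn) {
  const cached = await cache.get(key);
  if (cached !== null) {
    cacheHits.inc();
    return JSON.parse(cached);
  }
  cacheMisses.inc();
  const value = await loadFn();
  await cache.set(key, JSON.stringify(value));
  return value;
}
// Hit ratio in PromQL:
//   rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))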
Implementation
1. Health Check Endpoints
Implement comprehensive health checks:
const express = require('express');
const app = express();
// Basic health check
app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    version: process.env.APP_VERSION || '1.0.0',
  });
});
// Detailed health check
app.get('/health/detailed', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalServices: await checkExternalServices(),
  };
  const isHealthy = Object.values(checks).every((check) => check.status === 'healthy');
  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks,
  });
});
async function checkDatabase() {
  const start = Date.now();
  try {
    await db.query('SELECT 1'); // `db` is the application's database client
    return { status: 'healthy', responseTime: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: error.message };
  }
}
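The detailed endpoint above also calls checkRedis and checkExternalServices, which are not shown. They follow the same pattern as checkDatabase; a sketch of the Redis check, assuming an ioredis- or node-redis-style client named redis:

async function checkRedis() {
  const start = Date.now();
  try {
    await redis.ping(); // both ioredis and node-redis expose ping()
    return { status: 'healthy', responseTime: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: error.message };
  }
}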
2. Application Performance Monitoring
Set up APM with New Relic:
# Install the New Relic agent
npm install newrelic
// newrelic.js configuration
exports.config = {
  app_name: ['Solverhood App'],
  license_key: process.env.NEW_RELIC_LICENSE_KEY,
  logging: {
    level: 'info'
  },
  distributed_tracing: {
    enabled: true
  },
  transaction_tracer: {
    enabled: true,
    transaction_threshold: 5, // seconds; transactions slower than this are traced
    record_sql: 'obfuscated',
    stack_trace_threshold: 0.5
  }
};
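The agent only instruments modules that are loaded after it, so require it before anything else in your entry point:

// server.js (or whatever your entry point is)
require('newrelic'); // must be loaded before express and other instrumented modules
const express = require('express');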
3. Structured Logging
Implement structured logging with Winston:
const winston = require('winston');
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: { service: 'solverhood-api' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});
// Add console transport in development
if (process.env.NODE_ENV !== 'production') {
  logger.add(
    new winston.transports.Console({
      format: winston.format.simple(),
    })
  );
}
// Usage example
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
  userAgent: req.get('User-Agent'),
});
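To correlate all log lines belonging to a single request, a common pattern is to attach a request ID and emit one summary line per response. A minimal sketch using the logger above; the header name and field names are illustrative choices, not part of the guide's setup:

const crypto = require('crypto');
app.use((req, res, next) => {
  req.id = req.get('X-Request-Id') || crypto.randomUUID();
  const start = Date.now();
  res.on('finish', () => {
    logger.info('Request completed', {
      requestId: req.id,
      method: req.method,
      path: req.path,
      status: res.statusCode,
      durationMs: Date.now() - start,
    });
  });
  next();
});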
4. Error Tracking
Set up Sentry for error tracking:
const Sentry = require('@sentry/node');
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 1.0, // lower this in production to control event volume
  integrations: [new Sentry.Integrations.Http({ tracing: true }), new Sentry.Integrations.Express({ app })],
});
// The request handler must be registered before any routes
app.use(Sentry.Handlers.requestHandler());
// Routes go here, e.g. a test route that triggers an error
app.get('/debug-sentry', function mainHandler(req, res) {
  throw new Error('My first Sentry error!');
});
// The error handler must come after all routes
app.use(Sentry.Handlers.errorHandler());
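Unhandled errors reach Sentry through the error handler. For errors you catch and recover from, report them explicitly; processPayment below is a hypothetical operation used only for illustration:

app.post('/api/payments', async (req, res) => {
  try {
    await processPayment(req.body); // hypothetical operation
    res.status(200).json({ ok: true });
  } catch (error) {
    // Report the handled error with extra context, then respond gracefully
    Sentry.withScope((scope) => {
      scope.setTag('feature', 'payments');
      Sentry.captureException(error);
    });
    res.status(502).json({ error: 'Payment provider unavailable' });
  }
});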
Alerting
1. Alert Rules
Define alerting rules for critical metrics:
# prometheus/rules/alerts.yml
groups:
  - name: solverhood-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value }} errors per second'
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High response time detected'
          description: '95th percentile response time is {{ $value }} seconds'
      - alert: DatabaseConnectionHigh
        expr: pg_stat_database_numbackends > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: 'High database connections'
          description: 'Database has {{ $value }} active connections'
2. Notification Channels
Configure notification channels:
// notifications/slack.js
const { WebClient } = require('@slack/web-api');
const slack = new WebClient(process.env.SLACK_BOT_TOKEN);
async function sendSlackAlert(alert) {
  try {
    await slack.chat.postMessage({
      channel: process.env.SLACK_ALERT_CHANNEL,
      text: `🚨 Alert: ${alert.summary}`,
      blocks: [
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*${alert.summary}*\n${alert.description}`,
          },
        },
        {
          type: 'context',
          elements: [
            {
              type: 'mrkdwn',
              text: `Severity: ${alert.labels.severity} | Environment: ${process.env.NODE_ENV}`,
            },
          ],
        },
      ],
    });
  } catch (error) {
    console.error('Failed to send Slack alert:', error);
  }
}
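If Prometheus alerts are routed through your own service instead of a native Slack receiver, a small webhook endpoint can translate Alertmanager's payload into the shape sendSlackAlert expects. This is a sketch; the /alerts/webhook path and the reuse of the Express app are assumptions, not part of the guide's setup:

// notifications/webhook.js — assumes the Express `app` and sendSlackAlert from above
app.use(express.json());
app.post('/alerts/webhook', async (req, res) => {
  // Alertmanager posts a JSON body containing an `alerts` array
  const alerts = req.body.alerts || [];
  for (const a of alerts) {
    await sendSlackAlert({
      summary: a.annotations.summary,
      description: a.annotations.description,
      labels: a.labels,
    });
  }
  res.status(200).json({ received: alerts.length });
});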
Dashboards
1. Grafana Dashboard
Create comprehensive dashboards:
{
  "dashboard": {
    "title": "Solverhood Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
2. Business Metrics Dashboard
Track business-critical metrics:
// metrics/business.js
const { register, Counter, Histogram } = require('prom-client');
// Business metrics
const userRegistrations = new Counter({
  name: 'user_registrations_total',
  help: 'Total number of user registrations',
  labelNames: ['source'],
});
const projectCreations = new Counter({
  name: 'project_creations_total',
  help: 'Total number of projects created',
  labelNames: ['user_type'],
});
const apiUsage = new Histogram({
  name: 'api_requests_duration_seconds',
  help: 'API request duration',
  labelNames: ['endpoint', 'method'],
  buckets: [0.1, 0.5, 1, 2, 5],
});
// Usage in application
app.post('/api/users', async (req, res) => {
  const timer = apiUsage.startTimer();
  try {
    const user = await createUser(req.body);
    userRegistrations.inc({ source: req.body.source || 'direct' });
    timer({ endpoint: '/api/users', method: 'POST' });
    res.status(201).json(user);
  } catch (error) {
    timer({ endpoint: '/api/users', method: 'POST' });
    res.status(500).json({ error: error.message });
  }
});
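For Prometheus to scrape these counters and histograms, the application also has to expose the registry. A conventional /metrics endpoint using the register imported above:

// Exposition endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});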
Log Analysis
1. Log Aggregation
Set up centralized logging:
// logging/elasticsearch.js
const { Client } = require('@elastic/elasticsearch');
const client = new Client({
  node: process.env.ELASTICSEARCH_URL,
  auth: {
    username: process.env.ELASTICSEARCH_USERNAME,
    password: process.env.ELASTICSEARCH_PASSWORD,
  },
});
async function logToElasticsearch(logEntry) {
  try {
    await client.index({
      index: `solverhood-logs-${new Date().toISOString().split('T')[0]}`,
      body: {
        timestamp: new Date().toISOString(),
        level: logEntry.level,
        message: logEntry.message,
        service: logEntry.service,
        ...logEntry.meta,
      },
    });
  } catch (error) {
    console.error('Failed to log to Elasticsearch:', error);
  }
}
2. Log Search and Analysis
Create Kibana dashboards for log analysis:
{
  "dashboard": {
    "title": "Application Logs Analysis",
    "panels": [
      {
        "title": "Log Volume by Level",
        "type": "visualization",
        "visState": {
          "type": "pie",
          "aggs": [
            {
              "type": "count",
              "schema": "metric"
            },
            {
              "type": "terms",
              "schema": "segment",
              "params": {
                "field": "level.keyword"
              }
            }
          ]
        }
      }
    ]
  }
}
Performance Optimization
1. Database Monitoring
Monitor database performance:
-- Slow query analysis (requires the pg_stat_statements extension;
-- on PostgreSQL 13+ these columns are named total_exec_time / mean_exec_time)
SELECT
  query,
  calls,
  total_time,
  mean_time,
  rows
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Connection monitoring
SELECT
  datname,
  numbackends,
  xact_commit,
  xact_rollback
FROM pg_stat_database;
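To run these checks continuously rather than ad hoc, a small scheduled job can push the results through the structured logger. A sketch assuming a node-postgres-style db.query that resolves to { rows } and the Winston logger from earlier:

// monitoring/slow-queries.js — assumes `db` and `logger` from the sections above
async function logSlowQueries() {
  try {
    const { rows } = await db.query(
      'SELECT query, calls, mean_time FROM pg_stat_statements ORDER BY mean_time DESC LIMIT 10'
    );
    logger.info('Top slow queries', { queries: rows });
  } catch (error) {
    logger.error('Slow query check failed', { error: error.message });
  }
}
// Check once an hour
setInterval(logSlowQueries, 60 * 60 * 1000);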
2. Memory Monitoring
Track memory usage:
// monitoring/memory.js
const v8 = require('v8');
function logMemoryUsage() {
  const usage = process.memoryUsage();
  const heapStats = v8.getHeapStatistics();
  logger.info('Memory usage', {
    rss: usage.rss,
    heapUsed: usage.heapUsed,
    heapTotal: usage.heapTotal,
    external: usage.external,
    heapSizeLimit: heapStats.heap_size_limit,
    totalAvailableSize: heapStats.total_available_size,
  });
}
// Log memory usage every 5 minutes
setInterval(logMemoryUsage, 5 * 60 * 1000);
Best Practices
1. Monitoring Strategy
- Start Simple: Begin with basic health checks and error tracking
- Gradual Enhancement: Add more sophisticated monitoring over time
- Focus on Business Impact: Monitor metrics that directly affect users
- Automate Everything: Use automated alerting and response
2. Alert Management
- Avoid Alert Fatigue: Set appropriate thresholds and cooldown periods (see the cooldown sketch after this list)
- Escalation Policies: Define clear escalation procedures
- Runbooks: Create documentation for common issues
- Post-Incident Reviews: Learn from incidents to improve monitoring
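For Prometheus-driven alerts, Alertmanager's group_wait, group_interval, and repeat_interval settings are the standard way to control noise. If you also send alerts directly from application code (as with the sendSlackAlert helper earlier), a simple in-process cooldown is one option; the 15-minute window below is an illustrative value:

// notifications/throttle.js — assumes sendSlackAlert from the Notification Channels section
const lastSent = new Map();
const COOLDOWN_MS = 15 * 60 * 1000;
async function sendAlertWithCooldown(alert) {
  const key = `${alert.summary}:${alert.labels.severity}`;
  const now = Date.now();
  if (lastSent.has(key) && now - lastSent.get(key) < COOLDOWN_MS) {
    return; // the same alert fired recently; skip to avoid noise
  }
  lastSent.set(key, now);
  await sendSlackAlert(alert);
}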
3. Data Retention
- Log Retention: Define retention policies for different log types (a rotation sketch follows this list)
- Metrics Storage: Plan for long-term metrics storage
- Cost Management: Monitor costs of monitoring infrastructure
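At the application level, one common way to enforce retention for the Winston file transports shown earlier is a rotating transport such as winston-daily-rotate-file; the 30-day window and 50 MB size cap here are example values, not Solverhood requirements:

const DailyRotateFile = require('winston-daily-rotate-file');
// Rotate the combined log daily and keep 30 days of compressed history
logger.add(
  new DailyRotateFile({
    filename: 'combined-%DATE%.log',
    datePattern: 'YYYY-MM-DD',
    zippedArchive: true,
    maxSize: '50m',
    maxFiles: '30d',
  })
);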
Next Steps
- Error Tracking Setup - Coming soon
- Performance Monitoring - Coming soon
- Log Management - Coming soon
- Alert Configuration - Coming soon