Monitoring Overview

Monitoring and observability are crucial for maintaining healthy, performant applications. This guide covers the monitoring tools, metrics, and best practices for Solverhood applications.

Monitoring Stack

Core Components

  • Application Performance Monitoring (APM): New Relic, Datadog, or Sentry
  • Infrastructure Monitoring: Prometheus with Grafana
  • Logging: ELK Stack (Elasticsearch, Logstash, Kibana) or CloudWatch
  • Error Tracking: Sentry for error monitoring and alerting
  • Health Checks: Custom health check endpoints

Recommended Tools

  • New Relic: Full-stack observability platform
  • Prometheus + Grafana: Metrics collection and visualization
  • Sentry: Error tracking and performance monitoring
  • CloudWatch: AWS-native monitoring and logging
  • Datadog: Comprehensive monitoring platform

Key Metrics to Monitor

Application Metrics

  • Response Time: Average, 95th percentile, 99th percentile
  • Throughput: Requests per second (RPS)
  • Error Rate: Percentage of failed requests
  • Availability: Uptime percentage
  • Memory Usage: Heap usage, garbage collection
  • CPU Usage: CPU utilization percentage
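
Percentile response times can be computed from raw duration samples. A minimal sketch using the nearest-rank method (the function name and sample data are illustrative; in production these values usually come from a histogram metric rather than raw samples):

```javascript
// Compute the p-th percentile of duration samples (nearest-rank method).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Example: response times in milliseconds for 10 requests
const times = [120, 85, 95, 300, 110, 90, 105, 2500, 100, 115];
console.log(percentile(times, 50)); // 105  (median)
console.log(percentile(times, 95)); // 2500 (one slow outlier dominates the tail)
```

This is why 95th/99th percentiles matter: the average of the sample above is ~362 ms, which hides the 2.5 s outlier that one in twenty users actually experiences.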

Business Metrics

  • User Activity: Daily/Monthly Active Users (DAU/MAU)
  • Feature Usage: Most used features and endpoints
  • Conversion Rates: User journey completion rates
  • Revenue Metrics: If applicable to your business
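
DAU can be derived from raw activity events by counting distinct users per day. A minimal sketch (the event shape `{ userId, timestamp }` is an assumption for illustration; real pipelines usually aggregate this in the analytics store):

```javascript
// Count distinct active users per calendar day from raw activity events.
function dailyActiveUsers(events) {
  const byDay = new Map();
  for (const { userId, timestamp } of events) {
    const day = timestamp.slice(0, 10); // ISO timestamp -> YYYY-MM-DD
    if (!byDay.has(day)) byDay.set(day, new Set());
    byDay.get(day).add(userId);
  }
  return Object.fromEntries([...byDay].map(([day, users]) => [day, users.size]));
}

const events = [
  { userId: 'u1', timestamp: '2024-05-01T09:00:00Z' },
  { userId: 'u2', timestamp: '2024-05-01T10:00:00Z' },
  { userId: 'u1', timestamp: '2024-05-01T18:00:00Z' }, // same user, same day: not double-counted
  { userId: 'u1', timestamp: '2024-05-02T08:00:00Z' },
];
console.log(dailyActiveUsers(events)); // { '2024-05-01': 2, '2024-05-02': 1 }
```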

Infrastructure Metrics

  • Server Resources: CPU, memory, disk usage
  • Database Performance: Query execution time, connection pool usage
  • Network: Bandwidth usage, latency
  • Cache Performance: Hit/miss ratios
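
The hit/miss ratio is simple to track in-process. A minimal sketch (the class name is illustrative; Redis and most cache libraries also expose these counters natively):

```javascript
// In-process tracker for cache hit/miss ratio.
class CacheStats {
  constructor() {
    this.hits = 0;
    this.misses = 0;
  }
  recordHit() { this.hits += 1; }
  recordMiss() { this.misses += 1; }
  hitRatio() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

// Usage: record the outcome of each cache lookup
const stats = new CacheStats();
stats.recordHit();
stats.recordHit();
stats.recordHit();
stats.recordMiss();
console.log(stats.hitRatio()); // 0.75
```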

Implementation

1. Health Check Endpoints

Implement comprehensive health checks:

const express = require('express');
const app = express();

// Basic health check
app.get('/health', (req, res) => {
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    version: process.env.APP_VERSION || '1.0.0',
  });
});

// Detailed health check
app.get('/health/detailed', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalServices: await checkExternalServices(),
  };

  const isHealthy = Object.values(checks).every((check) => check.status === 'healthy');

  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    timestamp: new Date().toISOString(),
    checks,
  });
});

async function checkDatabase() {
  const start = Date.now();
  try {
    await db.query('SELECT 1');
    return { status: 'healthy', responseTime: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: error.message };
  }
}
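
Each dependency check repeats the same timing and error-handling boilerplate. One way to normalize them is a generic wrapper; a minimal sketch (the helper name and the 2-second default timeout are illustrative, not part of any library):

```javascript
// Wrap any async dependency check with duration measurement and a timeout,
// so every entry in the /health/detailed response has a consistent shape.
async function timedCheck(fn, timeoutMs = 2000) {
  const start = Date.now();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('check timed out')), timeoutMs);
  });
  try {
    await Promise.race([fn(), timeout]);
    return { status: 'healthy', responseTime: Date.now() - start };
  } catch (error) {
    return { status: 'unhealthy', error: error.message, responseTime: Date.now() - start };
  } finally {
    clearTimeout(timer); // avoid leaking the timer once the check settles
  }
}

// Usage: checkDatabase could become timedCheck(() => db.query('SELECT 1'))
```

The timeout matters: a health endpoint that hangs while waiting on a dead dependency is itself unhealthy, and load balancers will treat a slow answer as no answer.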

2. Application Performance Monitoring

Set up APM with New Relic:

# Install New Relic
npm install newrelic

// newrelic.js configuration
// Note: require('newrelic') must be the very first line of your app's entry point.
exports.config = {
  app_name: ['Solverhood App'],
  license_key: process.env.NEW_RELIC_LICENSE_KEY,
  logging: {
    level: 'info',
  },
  distributed_tracing: {
    enabled: true,
  },
  transaction_tracer: {
    enabled: true,
    transaction_threshold: 5,
    record_sql: 'obfuscated',
    stack_trace_threshold: 0.5,
  },
};

3. Structured Logging

Implement structured logging with Winston:

const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: { service: 'solverhood-api' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});

// Add console transport in development
if (process.env.NODE_ENV !== 'production') {
  logger.add(
    new winston.transports.Console({
      format: winston.format.simple(),
    })
  );
}

// Usage example
logger.info('User logged in', {
  userId: user.id,
  email: user.email,
  ip: req.ip,
  userAgent: req.get('User-Agent'),
});

4. Error Tracking

Set up Sentry for error tracking:

const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 1.0,
  integrations: [new Sentry.Integrations.Http({ tracing: true }), new Sentry.Integrations.Express({ app })],
});

// The request handler must be the first middleware on the app
app.use(Sentry.Handlers.requestHandler());

// Routes go here; this one deliberately throws to verify the integration
app.get('/debug-sentry', function mainHandler(req, res) {
  throw new Error('My first Sentry error!');
});

// The error handler must be registered after all routes
app.use(Sentry.Handlers.errorHandler());

Alerting

1. Alert Rules

Define alerting rules for critical metrics:

# prometheus/rules/alerts.yml
groups:
  - name: solverhood-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value }} errors per second'

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High response time detected'
          description: '95th percentile response time is {{ $value }} seconds'

      - alert: DatabaseConnectionHigh
        expr: pg_stat_database_numbackends > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: 'High database connections'
          description: 'Database has {{ $value }} active connections'

2. Notification Channels

Configure notification channels:

// notifications/slack.js
const { WebClient } = require('@slack/web-api');

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

async function sendSlackAlert(alert) {
  try {
    await slack.chat.postMessage({
      channel: process.env.SLACK_ALERT_CHANNEL,
      text: `🚨 Alert: ${alert.summary}`,
      blocks: [
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*${alert.summary}*\n${alert.description}`,
          },
        },
        {
          type: 'context',
          elements: [
            {
              type: 'mrkdwn',
              text: `Severity: ${alert.labels.severity} | Environment: ${process.env.NODE_ENV}`,
            },
          ],
        },
      ],
    });
  } catch (error) {
    console.error('Failed to send Slack alert:', error);
  }
}

Dashboards

1. Grafana Dashboard

Create comprehensive dashboards:

{
  "dashboard": {
    "title": "Solverhood Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}

2. Business Metrics Dashboard

Track business-critical metrics:

// metrics/business.js
const { register, Counter, Histogram } = require('prom-client');

// Business metrics
const userRegistrations = new Counter({
  name: 'user_registrations_total',
  help: 'Total number of user registrations',
  labelNames: ['source'],
});

const projectCreations = new Counter({
  name: 'project_creations_total',
  help: 'Total number of projects created',
  labelNames: ['user_type'],
});

const apiUsage = new Histogram({
  name: 'api_requests_duration_seconds',
  help: 'API request duration',
  labelNames: ['endpoint', 'method'],
  buckets: [0.1, 0.5, 1, 2, 5],
});

// Expose the metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Usage in application
app.post('/api/users', async (req, res) => {
  const timer = apiUsage.startTimer({ endpoint: '/api/users', method: 'POST' });

  try {
    const user = await createUser(req.body);
    userRegistrations.inc({ source: req.body.source || 'direct' });
    res.status(201).json(user);
  } catch (error) {
    res.status(500).json({ error: error.message });
  } finally {
    timer(); // record the duration on both success and failure
  }
});

Log Analysis

1. Log Aggregation

Set up centralized logging:

// logging/elasticsearch.js
const { Client } = require('@elastic/elasticsearch');

const client = new Client({
  node: process.env.ELASTICSEARCH_URL,
  auth: {
    username: process.env.ELASTICSEARCH_USERNAME,
    password: process.env.ELASTICSEARCH_PASSWORD,
  },
});

async function logToElasticsearch(logEntry) {
  try {
    await client.index({
      index: `solverhood-logs-${new Date().toISOString().split('T')[0]}`,
      body: {
        timestamp: new Date().toISOString(),
        level: logEntry.level,
        message: logEntry.message,
        service: logEntry.service,
        ...logEntry.meta,
      },
    });
  } catch (error) {
    console.error('Failed to log to Elasticsearch:', error);
  }
}

2. Log Search and Analysis

Create Kibana dashboards for log analysis:

{
  "dashboard": {
    "title": "Application Logs Analysis",
    "panels": [
      {
        "title": "Log Volume by Level",
        "type": "visualization",
        "visState": {
          "type": "pie",
          "aggs": [
            {
              "type": "count",
              "schema": "metric"
            },
            {
              "type": "terms",
              "schema": "segment",
              "params": {
                "field": "level.keyword"
              }
            }
          ]
        }
      }
    ]
  }
}

Performance Optimization

1. Database Monitoring

Monitor database performance:

-- Slow query analysis (requires the pg_stat_statements extension;
-- on PostgreSQL 13+ the columns are named total_exec_time / mean_exec_time)
SELECT
  query,
  calls,
  total_time,
  mean_time,
  rows
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Connection monitoring
SELECT
  datname,
  numbackends,
  xact_commit,
  xact_rollback
FROM pg_stat_database;

2. Memory Monitoring

Track memory usage:

// monitoring/memory.js
const v8 = require('v8');
const logger = require('./logger'); // assumes the Winston logger is exported from a shared module

function logMemoryUsage() {
  const usage = process.memoryUsage();
  const heapStats = v8.getHeapStatistics();

  logger.info('Memory usage', {
    rss: usage.rss,
    heapUsed: usage.heapUsed,
    heapTotal: usage.heapTotal,
    external: usage.external,
    heapSizeLimit: heapStats.heap_size_limit,
    totalAvailableSize: heapStats.total_available_size,
  });
}

// Log memory usage every 5 minutes
setInterval(logMemoryUsage, 5 * 60 * 1000);

Best Practices

1. Monitoring Strategy

  • Start Simple: Begin with basic health checks and error tracking
  • Gradual Enhancement: Add more sophisticated monitoring over time
  • Focus on Business Impact: Monitor metrics that directly affect users
  • Automate Everything: Use automated alerting and response

2. Alert Management

  • Avoid Alert Fatigue: Set appropriate thresholds and cooldown periods
  • Escalation Policies: Define clear escalation procedures
  • Runbooks: Create documentation for common issues
  • Post-Incident Reviews: Learn from incidents to improve monitoring
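
A cooldown period can be enforced in application code when a full alertmanager is not in place. A minimal sketch (the helper name and the 5-minute window are illustrative; Prometheus Alertmanager handles this natively via grouping and repeat intervals):

```javascript
// Suppress repeat alerts for the same key within a cooldown window.
function createAlertGate(cooldownMs) {
  const lastSent = new Map(); // alert key -> timestamp of last sent alert

  return function shouldSend(alertKey, now = Date.now()) {
    const last = lastSent.get(alertKey);
    if (last !== undefined && now - last < cooldownMs) return false;
    lastSent.set(alertKey, now);
    return true;
  };
}

const gate = createAlertGate(5 * 60 * 1000); // 5-minute cooldown
console.log(gate('HighErrorRate', 0));       // true  (first alert fires)
console.log(gate('HighErrorRate', 60000));   // false (suppressed: within cooldown)
console.log(gate('HighErrorRate', 400000));  // true  (cooldown elapsed)
```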

3. Data Retention

  • Log Retention: Define retention policies for different log types
  • Metrics Storage: Plan for long-term metrics storage
  • Cost Management: Monitor costs of monitoring infrastructure
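
With date-suffixed indices like the `solverhood-logs-YYYY-MM-DD` pattern used above, retention enforcement reduces to deciding which indices are past their window. A minimal sketch (the 30-day window and function name are illustrative; Elasticsearch ILM policies are the usual production mechanism):

```javascript
const RETENTION_DAYS = 30;

// Report whether a dated log index is older than the retention window.
function isExpired(indexName, now = new Date(), retentionDays = RETENTION_DAYS) {
  const match = indexName.match(/^solverhood-logs-(\d{4}-\d{2}-\d{2})$/);
  if (!match) return false; // not a dated log index; leave it alone
  const ageMs = now - new Date(match[1]);
  return ageMs > retentionDays * 24 * 60 * 60 * 1000;
}

console.log(isExpired('solverhood-logs-2024-01-01', new Date('2024-05-01'))); // true
console.log(isExpired('solverhood-logs-2024-04-30', new Date('2024-05-01'))); // false
```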

Next Steps

  • Error Tracking Setup - Coming soon
  • Performance Monitoring - Coming soon
  • Log Management - Coming soon
  • Alert Configuration - Coming soon

Resources