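# Observability Documentation (Crawlful Hub)

> **Purpose**: Observability design document for Crawlful Hub. It describes how the system is made observable so that issues can be diagnosed and the system optimized.
>
> **Last updated**: 2026-03-18
>
> **Highest-priority reference**: [Service_Design.md](./Service_Design.md)

---

## 1. Observability Overview

### 1.1 Definition

Observability is the ability to understand a system's internal state from its external outputs, such as logs, metrics, and traces.

### 1.2 Importance

Good observability makes it possible to:

- Locate and resolve issues quickly
- Predict and prevent system failures
- Optimize system performance
- Improve system reliability
- Reduce operating costs

### 1.3 Core Components

- **Business logs**: record the details of business operations
- **Distributed tracing**: follows a request along its complete path through the system
- **Metrics monitoring**: tracks the system's technical and business metrics

---

## 2. Business Logs

### 2.1 Definition

A business log records the details of a business operation: who performed it, when, the operation type, the object it acted on, and the result.

### 2.2 Implementation

#### 2.2.1 Log Levels

- **DEBUG**: detailed debugging information
- **INFO**: general information
- **WARN**: warnings
- **ERROR**: errors
- **FATAL**: fatal errors

#### 2.2.2 Log Format

```json
{
  "timestamp": "2026-03-18T10:00:00Z",
  "level": "INFO",
  "service": "ProductService",
  "method": "updatePrice",
  "traceId": "1234567890",
  "tenantId": "tenant-001",
  "shopId": "shop-001",
  "businessType": "TOC",
  "message": "Updating price for product 123 to 99.99",
  "data": {
    "productId": "123",
    "oldPrice": 89.99,
    "newPrice": 99.99,
    "roi": 1.5
  }
}
```

#### 2.2.3 Logging Frameworks

**Recommended**:

- Node.js: winston, bunyan
- Java: log4j, logback
- Python: logging, structlog

**Example**:

```typescript
// Node.js/winston example
import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.json(),
  transports: [
    new winston.transports.Console(),
    // Errors only
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    // Everything at or above the configured level
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Usage
logger.info('Updating price for product', {
  productId: '123',
  oldPrice: 89.99,
  newPrice: 99.99,
  traceId: '1234567890',
  tenantId: 'tenant-001',
  shopId: 'shop-001',
  businessType: 'TOC'
});
```

### 2.3 Best Practices

- **Structured logs**: use JSON so that logs are easy to query and analyze
- **Unified log format**: all services use the same log format
- **Required fields**: timestamp, level, service name, method name, trace ID, tenant ID, shop ID, business type
- **Appropriate log levels**: choose the level according to the importance of the message
- **Log rotation**: rotate log files regularly so they do not grow without bound
- **Log storage**: store and analyze logs with tools such as ELK Stack or Splunk

---

## 3. Distributed Tracing

### 3.1 Definition

Distributed tracing follows a request along its complete path through the system, including every service and operation it passes through.

### 3.2 Implementation

#### 3.2.1 Trace ID

**Definition**: a unique identifier that ties together all operations belonging to the same request.

**Implementation**:

- Generate a unique `traceId`
- Create the `traceId` when the request starts
- Pass the `traceId` to every related operation

**Example**:

```typescript
import type { NextFunction, Request, Response } from 'express';
import { v4 as uuidv4 } from 'uuid';

// Middleware: reuse the incoming x-trace-id header or generate a new traceId
const traceMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const header = req.headers['x-trace-id'];
  const traceId = (Array.isArray(header) ? header[0] : header) || uuidv4();
  (req as Request & { traceId?: string }).traceId = traceId;
  res.setHeader('x-trace-id', traceId);
  next();
};

// Usage (`app`, `logger`, and `Product` come from the surrounding application)
app.use(traceMiddleware);

// Inside a service class: accept the traceId and include it in every log entry
async updatePrice(productId: string, price: number, traceId: string): Promise<Product> {
  logger.info('Updating price', { traceId, productId, price });
  // Business logic
  return product;
}
```

#### 3.2.2 Distributed Tracing Systems

**Recommended**:

- Jaeger
- Zipkin
- OpenTelemetry

**Example**:

```typescript
// OpenTelemetry example (written against the 1.x JS SDK; provider setup differs in 2.x)
import { context, trace } from '@opentelemetry/api';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';

// Initialize the tracer provider; the service name is set on the resource
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'product-service' })
});
const exporter = new JaegerExporter();
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

const tracer = trace.getTracer('product-service');

// Inside a service class: create a span for the operation
async updatePrice(productId: string, price: number): Promise<Product> {
  const span = tracer.startSpan('updatePrice');
  try {
    // Business logic
    const product = await this.productRepository.findById(productId);

    // Create a child span by passing a context that carries the parent span
    const parentContext = trace.setSpan(context.active(), span);
    const calculateRoiSpan = tracer.startSpan('calculateROI', undefined, parentContext);
    const roi = await this.pricingService
      .calculateROI(product, price)
      .finally(() => calculateRoiSpan.end());

    product.price = price;
    product.roi = roi;
    const updatedProduct = await this.productRepository.save(product);
    return updatedProduct;
  } finally {
    // End the parent span whether the update succeeded or failed
    span.end();
  }
}
```

### 3.3 Best Practices

- **Unified trace ID**: all services use the same trace ID format
- **Complete traces**: cover the full path of a request, including every service and operation
- **Appropriate spans**: create spans for important operations
- **Context information**: attach business-relevant context to spans
- **Monitoring integration**: integrate trace data with the monitoring system

---

## 4. Metrics Monitoring

### 4.1 Definition

Metrics monitoring tracks quantitative indicators of the system, such as response time, call count, and error rate.

### 4.2 Implementation

#### 4.2.1 Core Metrics

- **Response time**: time taken to process a request
- **Call count**: number of calls to a service
- **Error rate**: proportion of requests that fail
- **Resource usage**: CPU, memory, disk, and network usage
- **Business metrics**: order volume, sales, profit margin, and similar

#### 4.2.2 Monitoring Systems

**Recommended**:

- Prometheus
- Grafana
- Datadog
- New Relic

**Example**:

```typescript
// Prometheus example using prom-client with Express
import express from 'express';
import prometheus from 'prom-client';

const app = express();

// Define metrics
const requestCounter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

const requestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'endpoint'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Middleware: record count and duration once the response has finished
const metricsMiddleware = (req: express.Request, res: express.Response, next: express.NextFunction) => {
  const start = process.hrtime();
  res.on('finish', () => {
    const [wholeSeconds, nanoseconds] = process.hrtime(start);
    const seconds = wholeSeconds + nanoseconds / 1e9;
    requestCounter.inc({ method: req.method, endpoint: req.path, status: res.statusCode });
    requestDuration.observe({ method: req.method, endpoint: req.path }, seconds);
  });
  next();
};

// Usage
app.use(metricsMiddleware);

// register.metrics() returns a Promise in prom-client 13 and later, so await it
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});
```

### 4.3 Best Practices

- **Define key metrics**: derive the key metrics from business requirements
- **Set sensible alerts**: configure alerts for important metrics
- **Visualize**: use tools such as Grafana to visualize metrics
- **Review regularly**: analyze metric data regularly to find problems and optimization opportunities
- **Include business metrics**: combine business metrics with technical metrics

---

## 5. Observability Integration

### 5.1 Correlating Logs, Traces, and Metrics

- **Shared traceId**: use the same `traceId` to link logs, traces, and metrics
- **Unified context**: include the same context information in all observability data
- **Integration platform**: bring all observability data together on platforms such as ELK Stack and Grafana

### 5.2 Monitoring Dashboards

**Recommended dashboards**:

- **System health dashboard**: overall health of the system
- **Service performance dashboard**: performance metrics for each service
- **Business metrics dashboard**: business-related metrics
- **Alert dashboard**: current alert status

### 5.3 Alerting Strategy

- **Sensible thresholds**: set alert thresholds according to business requirements
- **Tiered alerts**: assign alert levels according to the severity of the problem
- **Alert notifications**: send notifications by email, SMS, Slack, and similar channels
- **Alert suppression**: prevent alert storms

---

## 6. Best Practices

### 6.1 Design Principles

- **Observability first**: consider observability from the system design stage
- **Unified standards**: all services follow the same observability standards
- **Moderate collection**: collect enough data, but do not over-collect
- **Data retention**: define a sensible data retention policy

### 6.2 Implementation Recommendations

- **Use open-source tools**: for example ELK Stack, Prometheus, Grafana, and Jaeger
- **Automate deployment**: automate the deployment and configuration of observability tools
- **Run regular drills**: rehearse troubleshooting and recovery procedures regularly
- **Optimize continuously**: keep refining the observability setup based on real-world experience

---

## 7. Related Documents

- [Service_Design.md](./Service_Design.md)
- [Data_Consistency.md](./Data_Consistency.md)
- [Architecture_Overview.md](./Architecture_Overview.md)

---

*This document is based on the service design document. Last updated: 2026-03-18*

---

## 8. Product Center Metrics

> **Design principle**: cover the core business metrics for product management, pricing strategy, SKU mapping, and inventory synchronization

### 8.1 Product Management Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|----------|----------|----------|----------|----------|
| **Total SPUs** | COUNT(cf_spu) | cf_spu table | Daily | - |
| **Total SKUs** | COUNT(cf_sku) | cf_sku table | Daily | - |
| **Total Listings** | COUNT(cf_platform_listing) | cf_platform_listing table | Daily | - |
| **SKU variant density** | SKU count / SPU count | cf_sku, cf_spu | Weekly | Alert if > 10 |
| **Mapping coverage** | Mapped SKUs / Total SKUs × 100% | cf_sku_mapping | Daily | Alert if < 80% |
| **Listing active rate** | ACTIVE Listings / Total Listings × 100% | cf_platform_listing | Daily | Alert if < 70% |

### 8.2 Price Management Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|----------|----------|----------|----------|----------|
| **Price strategy coverage** | Listings with a strategy applied / Total Listings × 100% | cf_price_strategy, cf_platform_listing | Daily | Alert if < 60% |
| **Price anomaly rate** | Listings priced more than 50% above or below the baseline price / Total Listings × 100% | cf_platform_listing, cf_sku | Daily | Alert if > 5% |
| **Profit margin distribution** | Listing count per profit margin bucket | cf_platform_listing | Daily | Alert if B2C < 20% or B2B < 15% |
| **Price sync success rate** | Successful syncs / Total syncs × 100% | Sync logs | Hourly | Alert if < 95% |
| **AI repricing adoption rate** | Price changes that adopted an AI suggestion / Total AI suggestions × 100% | AI decision logs | Weekly | - |

### 8.3 Inventory Sync Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|----------|----------|----------|----------|----------|
| **Inventory sync latency** | Platform inventory update time - system inventory change time | Sync logs | Real time | Alert if > 5 minutes |
| **Inventory discrepancy rate** | SKUs whose platform and system inventory differ / Total SKUs × 100% | Inventory snapshots | Hourly | Alert if > 2% |
| **Oversell-risk SKUs** | SKUs where platform inventory > system available inventory | Inventory comparison | Real time | Alert if > 0 |
| **Low-stock SKUs** | SKUs where inventory < safety stock | cf_sku | Daily | - |
| **Out-of-stock SKUs** | SKUs where inventory = 0 | cf_sku | Real time | - |

### 8.4 Permission Management Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|----------|----------|----------|----------|----------|
| **Authorization active rate** | ACTIVE authorizations / Total authorizations × 100% | cf_shop_authorization | Daily | Alert if < 80% |
| **Authorizations expiring soon** | Authorizations that expire within 7 days | cf_shop_authorization | Daily | Alert if > 0 |
| **Authorization failure rate** | Failed authorizations / Total authorization attempts × 100% | Authorization logs | Hourly | Alert if > 10% |
| **API quota usage** | Used quota / Total quota × 100% | cf_shop_authorization | Hourly | Alert if > 80% |

### 8.5 Authorization Management Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|----------|----------|----------|----------|----------|
| **Shop authorization coverage** | Authorized shops / Total shops × 100% | cf_shop, cf_shop_authorization | Daily | Alert if < 90% |
| **OAuth authorization success rate** | Successful OAuth attempts / Total OAuth attempts × 100% | Authorization logs | Hourly | Alert if < 95% |
| **Agent authorization success rate** | Successful Agent attempts / Total Agent attempts × 100% | Authorization logs | Hourly | Alert if < 90% |
| **Token refresh success rate** | Successful token refreshes / Total refresh attempts × 100% | Authorization logs | Hourly | Alert if < 95% |

### 8.6 Core Business Metrics

| Metric | Formula | Data source | Frequency | Notes |
|----------|----------|----------|----------|------|
| **Cross-platform product coverage** | SKUs with Listings on ≥ 2 platforms / Total SKUs | cf_platform_listing | Weekly | Measures multi-platform operating capability |
| **Average profit margin** | SUM(profit) / SUM(sales) × 100% | Order data | Daily | Core profitability metric |
| **Product conversion rate** | Listings with orders / Total Listings × 100% | Order data | Weekly | Measures product competitiveness |
| **AI repricing effect** | Profit margin after repricing - profit margin before repricing | AI decision logs | Weekly | Measures the effect of AI repricing |

### 8.7 Prometheus Metric Definitions

```typescript
import { Counter, Gauge, Histogram } from 'prom-client';

// Product management metrics
const spuTotalGauge = new Gauge({
  name: 'product_spu_total',
  help: 'Total number of SPU',
  labelNames: ['tenant_id', 'status']
});

const skuTotalGauge = new Gauge({
  name: 'product_sku_total',
  help: 'Total number of SKU',
  labelNames: ['tenant_id', 'status']
});

const listingTotalGauge = new Gauge({
  name: 'product_listing_total',
  help: 'Total number of Platform Listing',
  labelNames: ['tenant_id', 'platform', 'status']
});

// Price management metrics
const priceSyncCounter = new Counter({
  name: 'price_sync_total',
  help: 'Total number of price sync operations',
  labelNames: ['tenant_id', 'platform', 'status']
});

const priceSyncDuration = new Histogram({
  name: 'price_sync_duration_seconds',
  help: 'Duration of price sync operations',
  labelNames: ['tenant_id', 'platform'],
  buckets: [0.1, 0.5, 1, 2, 5, 10]
});

// Inventory sync metrics
const inventorySyncCounter = new Counter({
  name: 'inventory_sync_total',
  help: 'Total number of inventory sync operations',
  labelNames: ['tenant_id', 'platform', 'status']
});

const inventoryDiscrepancyGauge = new Gauge({
  name: 'inventory_discrepancy_count',
  help: 'Number of SKUs with inventory discrepancy',
  labelNames: ['tenant_id', 'platform']
});

// Authorization management metrics
const authStatusGauge = new Gauge({
  name: 'shop_auth_status',
  help: 'Shop authorization status (1=active, 0=inactive)',
  labelNames: ['tenant_id', 'shop_id', 'platform', 'auth_type']
});

const authExpiryGauge = new Gauge({
  name: 'shop_auth_expiry_seconds',
  help: 'Seconds until shop authorization expires',
  labelNames: ['tenant_id', 'shop_id', 'platform']
});
```

### 8.8 Alert Rule Configuration

Apart from `shop_auth_expiry_seconds`, the metrics referenced in the expressions below (`product_mapping_coverage_ratio`, `price_anomaly_ratio`, `inventory_sync_delay_seconds`, `oversell_risk_sku_count`, `auth_failure_rate`) are not defined in 8.7. They must be exported by the services or derived through recording rules before these alerts can fire.

```yaml
# Prometheus alerting rules
groups:
  - name: product_center_alerts
    rules:
      # Product management alerts
      - alert: LowMappingCoverage
        expr: product_mapping_coverage_ratio < 0.8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SKU mapping coverage is too low"
          description: "SKU mapping coverage is below 80%, current value: {{ $value }}"

      # Price management alerts
      - alert: HighPriceAnomalyRate
        expr: price_anomaly_ratio > 0.05
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Price anomaly rate is too high"
          description: "Price anomaly rate is above 5%, current value: {{ $value }}"

      # Inventory sync alerts
      - alert: InventorySyncDelay
        expr: inventory_sync_delay_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Inventory sync latency is too high"
          description: "Inventory sync latency exceeds 5 minutes, current value: {{ $value }} seconds"

      - alert: OversellRisk
        expr: oversell_risk_sku_count > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Oversell risk detected"
          description: "{{ $value }} SKUs are at risk of overselling"

      # Authorization management alerts
      - alert: AuthExpiringSoon
        expr: shop_auth_expiry_seconds < 604800
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Shop authorization is about to expire"
          description: "Shop authorization will expire within 7 days"

      - alert: HighAuthFailureRate
        expr: auth_failure_rate > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Authorization failure rate is too high"
          description: "Authorization failure rate is above 10%, current value: {{ $value }}"
```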
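
The thresholds in the 8.8 alert rules can be checked outside Prometheus, for example in unit tests for the job that computes the ratio metrics from raw counts. The following is a minimal sketch under stated assumptions: the helper names `ratio` and `breachedRules` and the `RULES` table are hypothetical and not part of the Crawlful Hub codebase. Only the metric names and threshold values are copied from the rules in 8.8, and the `for:` durations are not modeled.

```typescript
// Sketch only: helper names are hypothetical. Metric names and thresholds
// are copied from the alert expressions in section 8.8.

type Direction = 'below' | 'above';

interface ThresholdRule {
  metric: string;       // metric name used in the alert expression
  direction: Direction; // breach when the value is below or above the threshold
  threshold: number;    // ratio in [0, 1] or an absolute count
}

const RULES: ThresholdRule[] = [
  { metric: 'product_mapping_coverage_ratio', direction: 'below', threshold: 0.8 },
  { metric: 'price_anomaly_ratio', direction: 'above', threshold: 0.05 },
  { metric: 'oversell_risk_sku_count', direction: 'above', threshold: 0 },
  { metric: 'auth_failure_rate', direction: 'above', threshold: 0.1 },
];

// Returns part / total, or null when the total is zero, so that a tenant
// with no data yet is not reported as having 0% coverage.
function ratio(part: number, total: number): number | null {
  return total > 0 ? part / total : null;
}

// Returns the names of all rules breached by the given samples, in RULES order.
// Missing or null samples are skipped rather than treated as breaches.
function breachedRules(samples: Record<string, number | null | undefined>): string[] {
  return RULES.filter((rule) => {
    const value = samples[rule.metric];
    if (value === null || value === undefined) return false;
    return rule.direction === 'below' ? value < rule.threshold : value > rule.threshold;
  }).map((rule) => rule.metric);
}

// Usage with made-up counts
const breached = breachedRules({
  product_mapping_coverage_ratio: ratio(700, 1000), // 0.7, below 0.8
  price_anomaly_ratio: ratio(10, 1000),             // 0.01, not above 0.05
  oversell_risk_sku_count: 2,                       // above 0
  auth_failure_rate: ratio(0, 0),                   // null, skipped
});
console.log(breached);
```

With these sample counts the result contains `product_mapping_coverage_ratio` and `oversell_risk_sku_count`. Returning `null` for an empty denominator is one possible design choice; exporting `NaN` from a gauge would have a similar effect in Prometheus, because comparisons against `NaN` never match.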