# Observability Documentation (Crawlful Hub)

> **Purpose**: Observability design document for Crawlful Hub. It defines how the system exposes its internal state so that problems can be diagnosed and the system optimized.
>
> **Last updated**: 2026-03-18
>
> **Highest-priority reference**: [Service_Design.md](./Service_Design.md)

---

## 1. Observability Overview

### 1.1 Definition

Observability is the ability to understand a system's internal state from its external outputs, such as logs, metrics, and traces.

### 1.2 Why It Matters

Good observability makes it possible to:
- Locate and resolve problems quickly
- Predict and prevent system failures
- Optimize system performance
- Improve system reliability
- Reduce operating costs

### 1.3 Core Components

- **Business logs**: record the details of business operations
- **Distributed tracing**: follows a request along its complete path through the system
- **Metrics monitoring**: tracks quantitative indicators of system and business behavior

---

## 2. Business Logs

### 2.1 Definition

A business log records the details of a business operation: who performed it, when, the operation type, the target object, and the result.

### 2.2 Implementation

#### 2.2.1 Log Levels

- **DEBUG**: detailed debugging information
- **INFO**: general information
- **WARN**: warnings
- **ERROR**: errors
- **FATAL**: fatal errors
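
The level list above implies an ordering from least to most severe. A minimal sketch of threshold filtering under that assumption follows; `shouldLog` and `parseLogLevel` are hypothetical helper names, not part of any logging library.

```typescript
// Severity order follows the list above, from least to most severe.
const LOG_LEVELS = ['DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

// Returns true when `level` is at least as severe as `threshold`.
function shouldLog(level: LogLevel, threshold: LogLevel): boolean {
  return LOG_LEVELS.indexOf(level) >= LOG_LEVELS.indexOf(threshold);
}

// Hypothetical helper: parses an environment-style string, falling back to INFO
// when the value is missing or unknown.
function parseLogLevel(raw: string | undefined): LogLevel {
  const upper = (raw ?? '').toUpperCase();
  const found = LOG_LEVELS.find((l) => l === upper);
  return found ?? 'INFO';
}
```

With a threshold of `INFO`, `DEBUG` entries are dropped while `WARN`, `ERROR`, and `FATAL` pass through.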

#### 2.2.2 Log Format

```json
{
  "timestamp": "2026-03-18T10:00:00Z",
  "level": "INFO",
  "service": "ProductService",
  "method": "updatePrice",
  "traceId": "1234567890",
  "tenantId": "tenant-001",
  "shopId": "shop-001",
  "businessType": "TOC",
  "message": "Updating price for product 123 to 99.99",
  "data": {
    "productId": "123",
    "oldPrice": 89.99,
    "newPrice": 99.99,
    "roi": 1.5
  }
}
```
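
The JSON shape above can be captured as a type so that every service emits the same fields. This is a sketch only: the interface name `BusinessLogEntry` and the helper `formatLogEntry` are assumptions, and the field list simply mirrors the example.

```typescript
// Field names mirror the JSON example above.
interface BusinessLogEntry {
  timestamp: string; // ISO 8601, UTC
  level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL';
  service: string;
  method: string;
  traceId: string;
  tenantId: string;
  shopId: string;
  businessType: string;
  message: string;
  data?: Record<string, unknown>;
}

// Hypothetical helper: stamps the entry with a timestamp and serializes it
// as a single JSON line. `now` is injectable to keep the function testable.
function formatLogEntry(
  entry: Omit<BusinessLogEntry, 'timestamp'>,
  now: Date = new Date()
): string {
  const full: BusinessLogEntry = { timestamp: now.toISOString(), ...entry };
  return JSON.stringify(full);
}
```

Note that `Date.prototype.toISOString` always includes milliseconds, so the emitted timestamp reads `2026-03-18T10:00:00.000Z` rather than the shorter form in the example.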

#### 2.2.3 Logging Frameworks

**Recommended**:
- Node.js: winston, bunyan
- Java: log4j, logback
- Python: logging, structlog

**Example**:
```typescript
// Node.js / winston example
import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  // timestamp() adds the `timestamp` field that the log format in 2.2.2 requires
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Usage: pass context as structured metadata instead of interpolating it into the message
logger.info('Updating price for product', {
  productId: '123',
  oldPrice: 89.99,
  newPrice: 99.99,
  traceId: '1234567890',
  tenantId: 'tenant-001',
  shopId: 'shop-001',
  businessType: 'TOC'
});
```
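
### 2.3 Best Practices

- **Structured logs**: emit JSON so that logs are easy to query and analyze
- **Uniform format**: all services use the same log format
- **Required fields**: timestamp, level, service name, method name, trace ID, tenant ID, shop ID, business type, and so on
- **Appropriate levels**: choose the level that matches the importance of the message
- **Log rotation**: rotate log files regularly so that no single file grows too large
- **Log storage**: store and analyze logs with tools such as ELK Stack or Splunk

---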

## 3. Distributed Tracing

### 3.1 Definition

Distributed tracing follows a request along its complete path through the system, covering every service and operation it touches.

### 3.2 Implementation

#### 3.2.1 Trace ID

**Definition**: a unique identifier that links all operations belonging to the same request.

**Implementation**:
- Generate a unique `traceId`
- Create the `traceId` when the request starts
- Pass the `traceId` to every related operation

**Example**:
```typescript
import type { Request, Response, NextFunction } from 'express';
import { v4 as uuidv4 } from 'uuid';

// Middleware: reuse the caller's trace ID when present, otherwise generate one.
// `app`, `logger`, `Product`, and `ProductRepository` come from the surrounding application.
const traceMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const incoming = req.headers['x-trace-id'];
  // A repeated header arrives as string[]; use the first value
  const traceId = (Array.isArray(incoming) ? incoming[0] : incoming) || uuidv4();
  (req as Request & { traceId?: string }).traceId = traceId;
  res.setHeader('x-trace-id', traceId);
  next();
};

// Usage
app.use(traceMiddleware);

// In a service class: accept the traceId and include it in every log entry
class ProductService {
  constructor(private readonly productRepository: ProductRepository) {}

  async updatePrice(productId: string, price: number, traceId: string): Promise<Product> {
    logger.info('Updating price', { traceId, productId, price });
    // Business logic
    const product = await this.productRepository.findById(productId);
    return product;
  }
}
```
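
Passing `traceId` as an explicit parameter through every call, as the example above does, works but grows tedious in deep call chains. One possible alternative is to carry it implicitly with Node's `AsyncLocalStorage`. The sketch below assumes a Node.js runtime; `runWithTrace`, `currentTraceId`, and `updatePriceStep` are hypothetical names.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

interface TraceContext {
  traceId: string;
}

const traceStorage = new AsyncLocalStorage<TraceContext>();

// Runs `fn` inside a trace context; reuses an incoming ID or generates a new one.
function runWithTrace<T>(incomingTraceId: string | undefined, fn: () => T): T {
  const traceId = incomingTraceId || randomUUID();
  return traceStorage.run({ traceId }, fn);
}

// Returns the trace ID of the current async call chain, or undefined outside one.
function currentTraceId(): string | undefined {
  return traceStorage.getStore()?.traceId;
}

// A nested async step reads the ID without receiving it as a parameter.
async function updatePriceStep(): Promise<string | undefined> {
  await new Promise((resolve) => setTimeout(resolve, 1));
  return currentTraceId();
}
```

In an Express application, the middleware would wrap `next()` in `runWithTrace(...)`, and the logger would call `currentTraceId()` when building each entry.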

#### 3.2.2 Distributed Tracing Systems

**Recommended**:
- Jaeger
- Zipkin
- OpenTelemetry

**Example**:
```typescript
// Using OpenTelemetry (written against the 1.x JS SDK)
import { context, trace, SpanStatusCode } from '@opentelemetry/api';
import { Resource } from '@opentelemetry/resources';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

// Initialize the tracer provider; the service name belongs to the resource, not the exporter
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'product-service' })
});
const exporter = new JaegerExporter();

// SDK 2.x removed addSpanProcessor; there, pass `spanProcessors` to the constructor instead
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

// Create spans
const tracer = trace.getTracer('product-service');

class ProductService {
  constructor(
    private readonly productRepository: ProductRepository,
    private readonly pricingService: PricingService
  ) {}

  async updatePrice(productId: string, price: number): Promise<Product> {
    const span = tracer.startSpan('updatePrice');
    // A child span is linked to its parent through a context that carries the parent span
    const parentContext = trace.setSpan(context.active(), span);

    try {
      // Business logic
      const product = await this.productRepository.findById(productId);

      // Create a child span and make sure it ends even if the call throws
      const calculateRoiSpan = tracer.startSpan('calculateROI', undefined, parentContext);
      let roi: number;
      try {
        roi = await this.pricingService.calculateROI(product, price);
      } finally {
        calculateRoiSpan.end();
      }

      product.price = price;
      product.roi = roi;
      const updatedProduct = await this.productRepository.save(product);

      return updatedProduct;
    } catch (err) {
      // Record the failure on the span before rethrowing
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  }
}
```

### 3.3 Best Practices

- **Uniform trace ID**: all services use the same trace ID format
- **Complete traces**: trace the full path of a request, including every service and operation
- **Meaningful spans**: create spans for important operations
- **Contextual attributes**: attach business context to spans
- **Monitoring integration**: feed trace data into the monitoring system
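
This document does not prescribe a trace ID format. One widely used option is the W3C Trace Context `traceparent` header, which OpenTelemetry also propagates by default. The sketch below assumes that format and handles version `00` only; the function names are illustrative.

```typescript
interface TraceParent {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceParent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.parentId}-${tp.sampled ? '01' : '00'}`;
}

// Returns null when the header is malformed or carries an all-zero ID.
function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const parentId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // The least significant bit of trace-flags is the "sampled" flag.
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Agreeing on one header format lets services written in different languages join the same trace without custom glue code.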

---

## 4. Metrics Monitoring

### 4.1 Definition

Metrics monitoring tracks quantitative indicators of the system, such as response time, call count, and error rate.

### 4.2 Implementation

#### 4.2.1 Core Metrics

- **Response time**: how long a request takes to process
- **Call count**: how many times a service is called
- **Error rate**: the proportion of requests that fail
- **Resource usage**: CPU, memory, disk, network, and other resources
- **Business metrics**: order volume, sales revenue, profit margin, and so on
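
To make the first three metrics concrete, the sketch below derives an error rate and a response-time percentile from raw request samples. It assumes that only 5xx responses count as errors and uses the nearest-rank percentile method; both choices, and all names, are illustrative.

```typescript
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

// Error rate: share of requests with a 5xx status.
// Assumption: 4xx responses are client errors and are not counted as failures.
function errorRate(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return errors / samples.length;
}

// Nearest-rank percentile of response time; p is in (0, 100].
function percentile(samples: RequestSample[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}
```

In production these values are usually computed by the monitoring backend from counters and histograms rather than from raw samples held in memory.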

#### 4.2.2 Monitoring Systems

**Recommended**:
- Prometheus
- Grafana
- Datadog
- New Relic

**Example**:
```typescript
// Using Prometheus through prom-client
import express, { Request, Response, NextFunction } from 'express';
import { Counter, Histogram, register } from 'prom-client';

const app = express();

// Define metrics
const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'endpoint'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Middleware: record count and duration once the response has been sent
const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
  const start = process.hrtime.bigint();

  res.on('finish', () => {
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    // Prefer the route template (e.g. /products/:id) so label cardinality stays bounded
    const endpoint = req.route?.path ?? req.path;

    requestCounter.inc({
      method: req.method,
      endpoint,
      status: res.statusCode
    });

    requestDuration.observe({
      method: req.method,
      endpoint
    }, seconds);
  });

  next();
};

// Usage
app.use(metricsMiddleware);
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  // register.metrics() returns a Promise in prom-client 13 and later
  res.end(await register.metrics());
});
```
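
### 4.3 Best Practices

- **Define key metrics**: derive the key metrics from business requirements
- **Set sensible alerts**: configure alerts for the important metrics
- **Visualize**: use tools such as Grafana to visualize metrics
- **Review regularly**: analyze metric data periodically to find problems and optimization opportunities
- **Combine business and technical metrics**: view business metrics alongside technical ones

---

## 5. Observability Integration

### 5.1 Correlating Logs, Traces, and Metrics

- **Shared `traceId`**: use the same `traceId` to link logs, traces, and metrics
- **Uniform context**: include the same context fields in all observability data
- **Integration platform**: bring all observability data together on platforms such as ELK Stack and Grafana

### 5.2 Monitoring Dashboards

**Recommended dashboards**:
- **System health dashboard**: shows the overall health of the system
- **Service performance dashboard**: shows performance metrics for each service
- **Business metrics dashboard**: shows business-related metrics
- **Alert dashboard**: shows the current alert status

### 5.3 Alerting Strategy

- **Sensible thresholds**: set alert thresholds according to business needs
- **Tiered alerts**: assign alert levels according to the severity of the problem
- **Notifications**: send alerts through email, SMS, Slack, and similar channels
- **Alert suppression**: avoid alert storms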
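
Tiered alerts and suppression can be illustrated with a small in-memory sketch. It assumes two severities and a fixed suppression window per alert name; the class `AlertSuppressor` and the `CHANNELS` routing table are hypothetical, and real deployments would typically rely on the alerting tool's own grouping and inhibition features.

```typescript
type Severity = 'warning' | 'critical';

interface Alert {
  name: string;
  severity: Severity;
  firedAtMs: number;
}

// Hypothetical routing table: which channels receive which severity.
const CHANNELS: Record<Severity, string[]> = {
  warning: ['email'],
  critical: ['email', 'sms', 'slack'],
};

// Suppresses repeats of the same alert name inside a fixed window.
class AlertSuppressor {
  private readonly windowMs: number;
  private readonly lastSent = new Map<string, number>();

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns the channels to notify, or an empty list when the alert is suppressed.
  route(alert: Alert): string[] {
    const previous = this.lastSent.get(alert.name);
    if (previous !== undefined && alert.firedAtMs - previous < this.windowMs) {
      return [];
    }
    this.lastSent.set(alert.name, alert.firedAtMs);
    return CHANNELS[alert.severity];
  }
}
```

Suppressed alerts do not reset the window, so a continuously firing alert is re-sent once per window rather than silenced forever.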

---

## 6. Best Practices

### 6.1 Design Principles

- **Observability first**: consider observability from the system design stage
- **Uniform standards**: all services follow the same observability standards
- **Measured collection**: collect enough data, but no more than is needed
- **Data retention**: define a sensible data retention policy
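
One common way to keep collection measured is to sample traces at a fixed ratio. The sketch below assumes deterministic sampling keyed on the trace ID, so that every service makes the same keep-or-drop decision for a given request; the function names and the use of SHA-256 are illustrative choices, not something this document mandates.

```typescript
import { createHash } from 'node:crypto';

// Maps a trace ID to a stable number in [0, 1) using the first 4 bytes
// of its SHA-256 digest.
function traceIdToUnitInterval(traceId: string): number {
  const digest = createHash('sha256').update(traceId).digest();
  return digest.readUInt32BE(0) / 0x100000000;
}

// Keeps a trace when its hashed position falls below the sampling ratio.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  return traceIdToUnitInterval(traceId) < ratio;
}
```

Because the decision depends only on the trace ID, a trace is either recorded in full across all services or not at all, which avoids partial traces.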

### 6.2 Implementation Recommendations

- **Use open-source tools**: for example ELK Stack, Prometheus, Grafana, and Jaeger
- **Automate deployment**: automate the deployment and configuration of observability tooling
- **Run regular drills**: rehearse troubleshooting and recovery procedures regularly
- **Keep optimizing**: refine the observability setup as real-world conditions change

---

## 7. Related Documents

- [Service_Design.md](./Service_Design.md)
- [Data_Consistency.md](./Data_Consistency.md)
- [Architecture_Overview.md](./Architecture_Overview.md)

---

*This document is based on the service design document. Last updated: 2026-03-18*

---

## 8. Product Center Metrics

> **Design principle**: cover the core business metrics for product management, pricing strategy, SKU mapping, and inventory synchronization

### 8.1 Product Management Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|--------|---------|-------------|-----------|-----------------|
| **Total SPUs** | COUNT(cf_spu) | cf_spu table | Daily | - |
| **Total SKUs** | COUNT(cf_sku) | cf_sku table | Daily | - |
| **Total listings** | COUNT(cf_platform_listing) | cf_platform_listing table | Daily | - |
| **SKU variant density** | SKU count / SPU count | cf_sku, cf_spu | Weekly | >10 |
| **Mapping coverage** | Mapped SKUs / Total SKUs × 100% | cf_sku_mapping | Daily | <80% |
| **Listing active rate** | Listings in ACTIVE status / Total listings × 100% | cf_platform_listing | Daily | <70% |
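
The ratio metrics in the table can be computed from plain counts. The following sketch assumes the counts have already been queried from the tables named above; the type and metric names are illustrative, and the thresholds are the ones listed in the table.

```typescript
interface ProductCounts {
  spuTotal: number;
  skuTotal: number;
  mappedSkuTotal: number;
  listingTotal: number;
  activeListingTotal: number;
}

interface MetricResult {
  name: string;
  value: number;
  alert: boolean;
}

// Guards against division by zero; an empty denominator yields 0.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function productManagementMetrics(c: ProductCounts): MetricResult[] {
  const density = ratio(c.skuTotal, c.spuTotal);
  const mappingPct = ratio(c.mappedSkuTotal, c.skuTotal) * 100;
  const activePct = ratio(c.activeListingTotal, c.listingTotal) * 100;
  return [
    { name: 'sku_variant_density', value: density, alert: density > 10 },
    { name: 'mapping_coverage_pct', value: mappingPct, alert: mappingPct < 80 },
    { name: 'listing_active_rate_pct', value: activePct, alert: activePct < 70 },
  ];
}
```

Note that with this sketch an empty catalog reports 0% coverage and therefore raises the coverage alerts; whether that is desirable is a product decision.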

### 8.2 Pricing Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|--------|---------|-------------|-----------|-----------------|
| **Pricing strategy coverage** | Listings with a strategy applied / Total listings × 100% | cf_price_strategy, cf_platform_listing | Daily | <60% |
| **Price anomaly rate** | Listings priced more than ±50% away from the baseline price / Total listings × 100% | cf_platform_listing, cf_sku | Daily | >5% |
| **Profit margin distribution** | Listing count per profit margin bucket | cf_platform_listing | Daily | B2C <20% or B2B <15% |
| **Price sync success rate** | Successful syncs / Total syncs × 100% | Sync logs | Hourly | <95% |
| **AI repricing adoption rate** | Repricings that adopted the AI suggestion / Total AI suggestions × 100% | AI decision logs | Weekly | - |
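
As an illustration of the price anomaly rate, the sketch below flags listings whose price deviates from the baseline by more than 50%. It assumes the baseline is a per-SKU reference price joined onto each listing and that a missing or non-positive baseline should be flagged for review; the names are hypothetical.

```typescript
interface ListingPrice {
  listingId: string;
  price: number;
  baselinePrice: number; // assumption: the SKU's reference price
}

// A listing is anomalous when its price deviates from the baseline by more
// than `tolerance` (0.5 means ±50%). A deviation of exactly 50% is not flagged.
function isPriceAnomalous(l: ListingPrice, tolerance = 0.5): boolean {
  if (l.baselinePrice <= 0) return true; // no usable baseline: flag for review
  return Math.abs(l.price - l.baselinePrice) / l.baselinePrice > tolerance;
}

// Share of anomalous listings, in percent.
function priceAnomalyRatePct(listings: ListingPrice[]): number {
  if (listings.length === 0) return 0;
  const anomalous = listings.filter((l) => isPriceAnomalous(l)).length;
  return (anomalous / listings.length) * 100;
}
```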

### 8.3 Inventory Sync Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|--------|---------|-------------|-----------|-----------------|
| **Inventory sync latency** | Platform inventory update time − System inventory change time | Sync logs | Real time | >5 minutes |
| **Inventory discrepancy rate** | SKUs whose platform and system inventory differ / Total SKUs × 100% | Inventory snapshots | Hourly | >2% |
| **SKUs at oversell risk** | SKUs where platform inventory > system available inventory | Inventory comparison | Real time | >0 |
| **Low-stock SKUs** | SKUs where inventory < safety stock | cf_sku | Daily | - |
| **Out-of-stock SKUs** | SKUs where inventory = 0 | cf_sku | Real time | - |
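
The count-based inventory metrics can be derived from one snapshot row per SKU. This sketch assumes each row carries both the platform quantity and the system's available quantity, and that "inventory" in the low-stock and out-of-stock formulas refers to the system's available quantity; field names are illustrative.

```typescript
interface InventorySnapshot {
  skuId: string;
  platformQty: number;
  systemAvailableQty: number;
  safetyStock: number;
}

interface InventoryMetrics {
  discrepancyRatePct: number;
  oversellRiskCount: number;
  lowStockCount: number;
  outOfStockCount: number;
}

function inventoryMetrics(rows: InventorySnapshot[]): InventoryMetrics {
  const total = rows.length;
  const discrepancies = rows.filter((r) => r.platformQty !== r.systemAvailableQty).length;
  return {
    discrepancyRatePct: total === 0 ? 0 : (discrepancies / total) * 100,
    // The platform advertises more stock than the system can actually fulfil.
    oversellRiskCount: rows.filter((r) => r.platformQty > r.systemAvailableQty).length,
    lowStockCount: rows.filter((r) => r.systemAvailableQty < r.safetyStock).length,
    outOfStockCount: rows.filter((r) => r.systemAvailableQty === 0).length,
  };
}
```

Oversell risk is a strict subset of the discrepancies, which is why it warrants the tighter ">0" threshold.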

### 8.4 Permission Management Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|--------|---------|-------------|-----------|-----------------|
| **Authorization active rate** | Authorizations in ACTIVE status / Total authorizations × 100% | cf_shop_authorization | Daily | <80% |
| **Authorizations expiring soon** | Authorizations that expire within 7 days | cf_shop_authorization | Daily | >0 |
| **Authorization failure rate** | Failed authorizations / Total authorization attempts × 100% | Authorization logs | Hourly | >10% |
| **API quota usage** | Quota used / Total quota × 100% | cf_shop_authorization | Hourly | >80% |
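
The "expiring soon" metric is a time-window filter. The sketch below assumes each authorization row exposes a status and an expiry timestamp; the status values other than `ACTIVE` and all names are assumptions made for illustration.

```typescript
interface ShopAuthorization {
  shopId: string;
  // Assumption: only ACTIVE appears in the source; the other values are illustrative.
  status: 'ACTIVE' | 'EXPIRED' | 'REVOKED';
  expiresAt: Date;
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000; // 604800 seconds

// Active authorizations that expire strictly within the next 7 days.
function expiringSoon(auths: ShopAuthorization[], now: Date): ShopAuthorization[] {
  return auths.filter((a) => {
    const remainingMs = a.expiresAt.getTime() - now.getTime();
    return a.status === 'ACTIVE' && remainingMs > 0 && remainingMs < SEVEN_DAYS_MS;
  });
}

// Share of authorizations in ACTIVE status, in percent.
function activeRatePct(auths: ShopAuthorization[]): number {
  if (auths.length === 0) return 0;
  return (auths.filter((a) => a.status === 'ACTIVE').length / auths.length) * 100;
}
```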

### 8.5 Authorization Management Metrics

| Metric | Formula | Data source | Frequency | Alert threshold |
|--------|---------|-------------|-----------|-----------------|
| **Shop authorization coverage** | Authorized shops / Total shops × 100% | cf_shop, cf_shop_authorization | Daily | <90% |
| **OAuth authorization success rate** | OAuth successes / OAuth attempts × 100% | Authorization logs | Hourly | <95% |
| **Agent authorization success rate** | Agent successes / Agent attempts × 100% | Authorization logs | Hourly | <90% |
| **Token refresh success rate** | Successful token refreshes / Refresh attempts × 100% | Authorization logs | Hourly | <95% |

### 8.6 Core Business Metrics

| Metric | Formula | Data source | Frequency | Notes |
|--------|---------|-------------|-----------|-------|
| **Cross-platform product coverage** | SKUs with listings on ≥2 platforms / Total SKUs | cf_platform_listing | Weekly | Measures multi-platform operating capability |
| **Average profit margin** | SUM(profit) / SUM(sales revenue) × 100% | Order data | Daily | Core profitability metric |
| **Product conversion rate** | Listings with orders / Total listings × 100% | Order data | Weekly | Measures product competitiveness |
| **AI repricing effect** | Profit margin after repricing − Profit margin before repricing | AI decision logs | Weekly | Measures the effect of AI repricing |
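
Two of these formulas are easy to get subtly wrong, so a short sketch may help. It assumes order lines carry revenue and profit, and listings carry a SKU ID and a platform name; all type and function names are illustrative.

```typescript
interface OrderLine {
  revenue: number;
  profit: number;
}

interface Listing {
  skuId: string;
  platform: string;
}

// Revenue-weighted margin: SUM(profit) / SUM(revenue) × 100.
// This differs from the plain mean of per-order margins.
function averageProfitMarginPct(orders: OrderLine[]): number {
  const revenue = orders.reduce((sum, o) => sum + o.revenue, 0);
  const profit = orders.reduce((sum, o) => sum + o.profit, 0);
  return revenue === 0 ? 0 : (profit / revenue) * 100;
}

// Share of SKUs that have listings on at least two distinct platforms.
function crossPlatformCoverage(listings: Listing[], skuTotal: number): number {
  const platformsBySku = new Map<string, Set<string>>();
  for (const l of listings) {
    const platforms = platformsBySku.get(l.skuId) ?? new Set<string>();
    platforms.add(l.platform);
    platformsBySku.set(l.skuId, platforms);
  }
  let multiPlatformSkus = 0;
  platformsBySku.forEach((platforms) => {
    if (platforms.size >= 2) multiPlatformSkus += 1;
  });
  return skuTotal === 0 ? 0 : multiPlatformSkus / skuTotal;
}
```

Counting distinct platforms per SKU matters: two listings of the same SKU on one platform do not make it cross-platform.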

### 8.7 Prometheus Metric Definitions

```typescript
import { Counter, Gauge, Histogram } from 'prom-client';

// Product management metrics
const spuTotalGauge = new Gauge({
  name: 'product_spu_total',
  help: 'Total number of SPU',
  labelNames: ['tenant_id', 'status']
});

const skuTotalGauge = new Gauge({
  name: 'product_sku_total',
  help: 'Total number of SKU',
  labelNames: ['tenant_id', 'status']
});

const listingTotalGauge = new Gauge({
  name: 'product_listing_total',
  help: 'Total number of Platform Listing',
  labelNames: ['tenant_id', 'platform', 'status']
});

// Pricing metrics
const priceSyncCounter = new Counter({
  name: 'price_sync_total',
  help: 'Total number of price sync operations',
  labelNames: ['tenant_id', 'platform', 'status']
});

const priceSyncDuration = new Histogram({
  name: 'price_sync_duration_seconds',
  help: 'Duration of price sync operations',
  labelNames: ['tenant_id', 'platform'],
  buckets: [0.1, 0.5, 1, 2, 5, 10]
});

// Inventory sync metrics
const inventorySyncCounter = new Counter({
  name: 'inventory_sync_total',
  help: 'Total number of inventory sync operations',
  labelNames: ['tenant_id', 'platform', 'status']
});

const inventoryDiscrepancyGauge = new Gauge({
  name: 'inventory_discrepancy_count',
  help: 'Number of SKUs with inventory discrepancy',
  labelNames: ['tenant_id', 'platform']
});

// Authorization management metrics
const authStatusGauge = new Gauge({
  name: 'shop_auth_status',
  help: 'Shop authorization status (1=active, 0=inactive)',
  labelNames: ['tenant_id', 'shop_id', 'platform', 'auth_type']
});

const authExpiryGauge = new Gauge({
  name: 'shop_auth_expiry_seconds',
  help: 'Seconds until shop authorization expires',
  labelNames: ['tenant_id', 'shop_id', 'platform']
});
```

### 8.8 Alert Rule Configuration

```yaml
# Prometheus alerting rules
# Note: apart from shop_auth_expiry_seconds, the metrics referenced below
# (for example product_mapping_coverage_ratio) are not among the definitions
# in section 8.7 and must be exported separately.
groups:
  - name: product_center_alerts
    rules:
      # Product management alerts
      - alert: LowMappingCoverage
        expr: product_mapping_coverage_ratio < 0.8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SKU mapping coverage is too low"
          description: "SKU mapping coverage is below 80%, current value: {{ $value }}"

      # Pricing alerts
      - alert: HighPriceAnomalyRate
        expr: price_anomaly_ratio > 0.05
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Price anomaly rate is too high"
          description: "Price anomaly rate exceeds 5%, current value: {{ $value }}"

      # Inventory sync alerts
      - alert: InventorySyncDelay
        expr: inventory_sync_delay_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Inventory sync latency is too high"
          description: "Inventory sync latency exceeds 5 minutes, current value: {{ $value }} seconds"

      - alert: OversellRisk
        expr: oversell_risk_sku_count > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Oversell risk detected"
          description: "{{ $value }} SKUs are at risk of overselling"

      # Authorization management alerts
      - alert: AuthExpiringSoon
        expr: shop_auth_expiry_seconds < 604800
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Shop authorization is about to expire"
          description: "The shop authorization will expire within 7 days"

      - alert: HighAuthFailureRate
        expr: auth_failure_rate > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Authorization failure rate is too high"
          description: "Authorization failure rate exceeds 10%, current value: {{ $value }}"
```