基于Olist巴西电商10万+订单数据(2016-2018)的完整数据分析,涵盖客户价值、留存分析、满意度预测及落地规划。


一、数据探索与质量评估

1.1 数据加载与概览

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from operator import attrgetter
import os

# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# 动态获取项目路径
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
PROJECT_ROOT = os.path.dirname(SCRIPT_DIR)
DATA_PATH = os.path.join(PROJECT_ROOT, 'data/')

# 加载所有数据表
orders = pd.read_csv(DATA_PATH + 'olist_orders_dataset.csv')
order_items = pd.read_csv(DATA_PATH + 'olist_order_items_dataset.csv')
customers = pd.read_csv(DATA_PATH + 'olist_customers_dataset.csv')
payments = pd.read_csv(DATA_PATH + 'olist_order_payments_dataset.csv')
reviews = pd.read_csv(DATA_PATH + 'olist_order_reviews_dataset.csv')
products = pd.read_csv(DATA_PATH + 'olist_products_dataset.csv')
sellers = pd.read_csv(DATA_PATH + 'olist_sellers_dataset.csv')
geolocation = pd.read_csv(DATA_PATH + 'olist_geolocation_dataset.csv')
category_translation = pd.read_csv(DATA_PATH + 'product_category_name_translation.csv')

数据表概览(按行数降序):

1
2
3
4
5
6
7
8
9
10
            数据表      行数  列数  内存(MB)
geolocation 1000163 5 145.20
order_items 112650 7 39.43
payments 103886 5 17.81
orders 99441 8 58.97
customers 99441 5 29.62
reviews 99224 7 42.75
products 32951 9 6.79
sellers 3095 4 0.66
category_translation 71 2 0.01

关键观察

  • geolocation表最大(100万行):包含邮编的经纬度信息
  • order_items > orders(112,650 vs 99,441):13.4%的订单包含多个商品
  • reviews覆盖率99.8%:评价数据完整度高

1.2 数据质量检查

缺失值分析:

1
2
3
4
5
6
7
8
9
10
11
12
def analyze_missing(df, name):
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
'字段': missing.index,
'缺失数量': missing.values,
'缺失比例(%)': missing_pct.values
})
return missing_df[missing_df['缺失数量'] > 0].sort_values('缺失比例(%)', ascending=False)

# orders表缺失值
analyze_missing(orders, 'orders')
数据表 字段 缺失数 缺失率 缺失原因分析
orders order_delivered_customer_date 2,965 2.98% 未送达订单正常缺失(shipped/canceled等)
orders order_delivered_carrier_date 1,783 1.79% 未发货订单正常缺失
orders order_approved_at 160 0.16% 未审批订单
products product_category_name 610 1.85% 数据采集问题
reviews review_comment_title 87,656 88.34% 用户只打分不写标题
reviews review_comment_message 58,247 58.70% 用户只打分不写评论

处理方案

  • orders表缺失:业务逻辑导致的正常缺失,保留
  • products表类别缺失:用’unknown’标记
  • reviews评论缺失:保留空值作为”沉默用户”标识

异常值分析:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 价格统计
price_stats = order_items['price'].describe()
Q1 = price_stats['25%']
Q3 = price_stats['75%']
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = order_items[(order_items['price'] < lower_bound) | (order_items['price'] > upper_bound)]

print(f"价格统计: count={len(order_items)}, mean=${order_items['price'].mean():.2f}")
print(f"IQR检测到异常值: {len(outliers):,}个 ({len(outliers)/len(order_items)*100:.2f}%)")
print(f"正常范围: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"最高价格: ${order_items['price'].max():.2f}")
print(f"最低价格: ${order_items['price'].min():.2f}")

结果:

1
2
3
4
5
价格统计: count=112,650, mean=$120.65
IQR检测到异常值: 8,427个 (7.48%)
正常范围: [-102.60, 277.40]
最高价格: $6,735.00
最低价格: $0.85

处理方案:采用Winsorization(缩尾处理),保留1%-99%分位数,极端值截断而非删除


二、特征工程

2.1 时间特征

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# 时间戳转换
datetime_cols = ['order_purchase_timestamp', 'order_approved_at',
'order_delivered_carrier_date', 'order_delivered_customer_date',
'order_estimated_delivery_date']
for col in datetime_cols:
orders[col] = pd.to_datetime(orders[col])

# 提取时间特征
delivered_orders = orders[orders['order_status'] == 'delivered'].copy()
delivered_orders['purchase_hour'] = delivered_orders['order_purchase_timestamp'].dt.hour
delivered_orders['purchase_dayofweek'] = delivered_orders['order_purchase_timestamp'].dt.dayofweek
delivered_orders['purchase_month'] = delivered_orders['order_purchase_timestamp'].dt.month
delivered_orders['is_weekend'] = delivered_orders['purchase_dayofweek'].isin([5, 6]).astype(int)

def get_time_period(hour):
if 6 <= hour < 12: return 'morning'
elif 12 <= hour < 18: return 'afternoon'
elif 18 <= hour < 22: return 'evening'
else: return 'night'

delivered_orders['purchase_period'] = delivered_orders['purchase_hour'].apply(get_time_period)

2.2 配送特征

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
delivered_orders['delivery_days'] = (
delivered_orders['order_delivered_customer_date'] -
delivered_orders['order_purchase_timestamp']
).dt.days

delivered_orders['estimated_delivery_days'] = (
delivered_orders['order_estimated_delivery_date'] -
delivered_orders['order_purchase_timestamp']
).dt.days

delivered_orders['is_delayed'] = (
delivered_orders['order_delivered_customer_date'] >
delivered_orders['order_estimated_delivery_date']
).astype(int)

delivered_orders['delay_days'] = (
delivered_orders['order_delivered_customer_date'] -
delivered_orders['order_estimated_delivery_date']
).dt.days.clip(lower=0)

print(f"平均配送天数: {delivered_orders['delivery_days'].mean():.1f}天")
print(f"延迟订单比例: {delivered_orders['is_delayed'].mean()*100:.2f}%")
print(f"平均延迟天数: {delivered_orders[delivered_orders['is_delayed']==1]['delay_days'].mean():.1f}天")

结果:

1
2
3
平均配送天数: 12.1天
延迟订单比例: 8.11%
平均延迟天数: 8.9天

三、RFM客户价值分析

3.1 RFM指标计算

1
2
3
4
5
6
7
8
9
10
analysis_date = delivered_orders['order_purchase_timestamp'].max() + timedelta(days=1)

rfm = delivered_orders.groupby('customer_unique_id').agg({
'order_purchase_timestamp': lambda x: (analysis_date - x.max()).days, # R
'order_id': 'nunique', # F
'payment_value': 'sum' # M
}).reset_index()

rfm.columns = ['customer_id', 'recency', 'frequency', 'monetary']
print(rfm[['recency', 'frequency', 'monetary']].describe())

RFM统计描述:

指标 count mean std min 25% 50% 75% max
recency 96,096 237.9 152.6 0 128 223 334 727
frequency 96,096 1.03 0.23 1 1 1 1 24
monetary 96,096 165.85 284.6 0.85 56.8 107.8 178.5 13,664

3.2 关键发现:复购率极低

发现96.4%的客户只购买过一次(frequency中位数=1),平均购买频率仅1.03次

根因分析

  • 巴西电商市场特性:消费者习惯比价,平台忠诚度低
  • 缺乏会员锁定机制:无积分体系,迁移成本为零
  • 物流体验差:平均配送12天,用户不愿再次等待

3.3 客户分层结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# RFM评分
rfm['R_score'] = pd.qcut(rfm['recency'], 5, labels=[5,4,3,2,1]).astype(int)
rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1,2,3,4,5]).astype(int)
rfm['M_score'] = pd.qcut(rfm['monetary'], 5, labels=[1,2,3,4,5]).astype(int)

def segment_customers(row):
r, f, m = row['R_score'], row['F_score'], row['M_score']
if r >= 4 and f >= 4 and m >= 4: return 'Champions'
elif r >= 3 and f >= 3 and m >= 3: return 'Loyal Customers'
elif r >= 4 and f <= 2: return 'New Customers'
elif r <= 2 and f >= 3 and m >= 3: return 'At Risk'
elif r <= 2 and f >= 4 and m >= 4: return 'Cannot Lose Them'
elif r <= 2 and f <= 2: return 'Hibernating'
else: return 'Others'

rfm['segment'] = rfm.apply(segment_customers, axis=1)
segment_counts = rfm['segment'].value_counts()

客户分层统计:

客户群体 数量 占比 平均最近购买天数 平均消费金额 特征
Others 37,082 38.6% 237.5 $148.2 普通客户
Loyal Customers 14,211 14.8% 108.6 $298.4 忠诚客户
New Customers 14,980 15.6% 23.5 $87.3 新客户
At Risk 13,162 13.7% 398.4 $278.6 高风险流失
Champions 6,480 6.7% 24.8 $738.2 冠军客户
Cannot Lose Them 371 0.4% 412.6 $1,284.3 紧急挽留

3.4 关键发现:高价值客户大量流失风险

发现14.1%的高价值客户面临流失(At Risk 13.7% + Cannot Lose Them 0.4%),按曾经客单价估算潜在GMV损失约$3.9M

根因分析

  • 缺乏流失预警机制:客户超过238天(平均购买周期)未复购,无自动干预
  • 无分层挽回策略:对所有流失客户统一对待,未按价值分层

3.5 业务落地方案

针对低复购率(96.4%只买1次):

  1. 首单后7天触达:黄金复购窗口期,推送满$50减$10优惠券
  2. 会员积分体系:消费$1积1分,积分抵扣运费(针对巴西高运费痛点)
  3. 品类交叉推荐:首购手机→推荐配件,提升复购动机

针对高价值客户流失(14.1% At Risk):

  1. Cannot Lose Them(371人):高管直接邮件+专属50%折扣+优先客服
  2. At Risk(13,162人):自动触发”回归礼包”邮件+15%折扣
  3. 流失原因调研:1分钟问卷,了解流向竞品原因

四、Cohort留存分析

4.1 留存率计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
customer_first = delivered_orders.groupby('customer_unique_id')['order_purchase_timestamp'].min().reset_index()
customer_first.columns = ['customer_id', 'first_purchase']
customer_first['cohort'] = customer_first['first_purchase'].dt.to_period('M')

df_with_cohort = delivered_orders.merge(
customer_first[['customer_id', 'cohort']],
left_on='customer_unique_id', right_on='customer_id', how='left'
)
df_with_cohort['order_period'] = df_with_cohort['order_purchase_timestamp'].dt.to_period('M')

customer_activity = df_with_cohort.groupby(['customer_id', 'cohort', 'order_period']).size().reset_index()
cohort_data = customer_activity.groupby(['cohort', 'order_period'])['customer_id'].nunique().reset_index()
cohort_data.columns = ['cohort', 'order_period', 'active_customers']

cohort_sizes = customer_first.groupby('cohort')['customer_id'].nunique().reset_index()
cohort_sizes.columns = ['cohort', 'cohort_size']

cohort_data = cohort_data.merge(cohort_sizes, on='cohort')
cohort_data['period'] = (cohort_data['order_period'] - cohort_data['cohort']).apply(attrgetter('n'))
cohort_data['retention_rate'] = cohort_data['active_customers'] / cohort_data['cohort_size']

cohort_matrix = cohort_data.pivot_table(
index='cohort', columns='period', values='retention_rate', aggfunc='mean'
)

4.2 关键发现:留存率远低于行业均值

Cohort留存率矩阵(部分):

Cohort 0月 1月 2月 3月 6月 12月
2016-09 100% 3.2% 0.8% 0.4% 0.1% -
2017-06 100% 5.8% 1.2% 0.5% 0.2% 0.1%
2017-12 100% 6.2% 1.5% 0.4% 0.1% -
平均 100% 5.5% 0.9% 0.3% 0.1% 0.05%

关键指标对比:

指标 实际留存率 行业基准 差距
首月留存率 5.5% 20-30% -75%
三月留存率 0.3% 5-10% -95%
六月留存率 0.1% 2-5% -98%

根因分析

  • 留存曲线呈”断崖式”下跌,非健康产品的”微笑曲线”
  • 物流体验差是核心:12天配送+8%延迟率,首单体验即劝退
  • 缺乏习惯培养机制:无定期触达,用户遗忘平台

4.3 业务落地方案

短期(1-3个月):

  • 首单免运费:降低新客决策门槛,预期新客增长20%,首月留存5.5%→8%
  • 延迟预警:模型预测+自动邮件+补偿券,预期投诉率降30%

中期(3-6个月):

  • 周三会员日:固定折扣日,培养购买习惯
  • 会员积分体系:$1积1分抵运费,预期复购率提升至10%

长期(6-12个月):

  • Olist Plus订阅:$9.9/月享免运费+专属折扣,锁定高价值用户
  • 全国前置仓网络:覆盖Top10城市,80%订单3-5天送达,留存率达15%行业均值

五、机器学习模型

5.1 满意度预测

特征工程:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# 满意度预测特征
feature_cols = [
'actual_delivery_days', 'estimated_delivery_days', 'delivery_diff',
'is_delayed', 'price', 'freight_value', 'freight_ratio',
'payment_installments', 'payment_type_encoded',
'purchase_hour', 'purchase_dayofweek', 'purchase_month',
'product_name_lenght', 'product_description_lenght',
'product_photos_qty', 'product_weight_g'
]

X = satisfaction_df[feature_cols].fillna(0)
y = satisfaction_df['is_satisfied'] # review_score >= 4

# 训练模型
from sklearn.model_selection import train_test_split rom sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import xgboost as xgb

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'XGBoost': xgb.XGBClassifier(random_state=42, eval_metric='logloss')
}

results = {}
for name, model in models.items():
model.fit(X_train_scaled, y_train)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
results[name] = auc
print(f"{name}: AUC = {auc:.4f}")

模型性能:

模型 AUC 状态
XGBoost 0.74 ✅ 最佳
Gradient Boosting 0.73 -
Random Forest 0.72 -
Logistic Regression 0.70 -

特征重要性Top5:

排名 特征 重要性 洞察
1 actual_delivery_days 35% 配送天数是最大影响因素
2 is_delayed 28% 是否延迟至关重要
3 delivery_diff 15% 实际vs预计差距
4 freight_value 12% 运费金额
5 price 8% 价格仅8%,非主因

**配送相关特征合计78%**,远超价格因素。

5.2 关键发现:配送体验决定满意度

发现:在巴西市场,配送体验(配送天数、是否延迟、配送差异)决定78%的满意度,价格因素仅占8%。

核心洞察:用户愿意为快配送付费,而非仅追求低价。

5.3 业务落地方案

  1. 动态预计时间:基于库存位置、目的地实时计算,避免过度承诺
  2. 前置仓建设:圣保罗试点,目标覆盖20%订单,配送缩短至5天
  3. 延迟主动补偿:预测延迟概率>0.7的订单,提前24小时通知+自动返$5优惠券

六、整体落地路线图

P0(0-3个月,快速见效)

项目 动作 预期效果
首单免运费 新客首单运费减免 新客增长20%,首月留存5.5%→8%
延迟预警 模型预测+自动邮件+补偿券 投诉率降30%
At Risk挽回 13,162人邮件campaign 挽回3-5%,即400-650人复购

P1(3-6个月,体系建设)

项目 动作 预期效果
会员积分 $1积1分,抵运费 复购率10%
前置仓试点 圣保罗建仓 试点区域配送5天,满意度+10%
RFM自动化 分层自动运营 LTV提升15%

P2(6-12个月,战略升级)

项目 动作 预期效果
Olist Plus $9.9/月订阅会员 锁定高价值客,稳定收入流
全国前置仓 覆盖Top10城市 80%订单3-5天,留存15%(行业均值)
个性化推荐 基于购买历史 客单价+20%,转化率+8%

七、技术实现

技术栈:pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost, joblib

代码结构

1
2
3
4
5
6
7
src/
├── 01_data_exploration.py # 数据探索
├── 02_data_quality.py # 质量检查
├── 03_data_cleaning.py # 清洗+特征工程
├── 04_rfm_analysis.py # RFM客户分层
├── 05_cohort_analysis.py # 留存分析
└── 05_machine_learning.py # 满意度+延迟预测

八、复盘

亮点

  • 完整走完六步法:业务理解→数据探索→清洗→分析→建模→落地
  • 验证了假设:配送体验决定78%满意度,非价格因素
  • 输出可落地:分P0/P1/P2的路线图,非空谈

待优化

  • 未做NLP:reviews评论文本未挖掘具体痛点
  • 未做时间序列:销售趋势预测缺失
  • 缺少A/B测试设计:方案验证环节待补

下一步

  • NLP情感分析提取用户抱怨关键词
  • Prophet做销量预测
  • 设计A/B实验验证方案效果