基于Olist巴西电商10万+订单数据(2016-2018)的完整数据分析,涵盖客户价值、留存分析、满意度预测及落地规划。
一、数据探索与质量评估 1.1 数据加载与概览 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsfrom datetime import datetime, timedeltafrom operator import attrgetterimport osplt.rcParams['font.sans-serif' ] = ['SimHei' , 'DejaVu Sans' ] plt.rcParams['axes.unicode_minus' ] = False SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) PROJECT_ROOT = os.path.dirname(SCRIPT_DIR) DATA_PATH = os.path.join(PROJECT_ROOT, 'data/' ) orders = pd.read_csv(DATA_PATH + 'olist_orders_dataset.csv' ) order_items = pd.read_csv(DATA_PATH + 'olist_order_items_dataset.csv' ) customers = pd.read_csv(DATA_PATH + 'olist_customers_dataset.csv' ) payments = pd.read_csv(DATA_PATH + 'olist_order_payments_dataset.csv' ) reviews = pd.read_csv(DATA_PATH + 'olist_order_reviews_dataset.csv' ) products = pd.read_csv(DATA_PATH + 'olist_products_dataset.csv' ) sellers = pd.read_csv(DATA_PATH + 'olist_sellers_dataset.csv' ) geolocation = pd.read_csv(DATA_PATH + 'olist_geolocation_dataset.csv' ) category_translation = pd.read_csv(DATA_PATH + 'product_category_name_translation.csv' )
数据表概览(按行数降序):
1 2 3 4 5 6 7 8 9 10 数据表 行数 列数 内存(MB) geolocation 1000163 5 145.20 order_items 112650 7 39.43 payments 103886 5 17.81 orders 99441 8 58.97 customers 99441 5 29.62 reviews 99224 7 42.75 products 32951 9 6.79 sellers 3095 4 0.66 category_translation 71 2 0.01
关键观察 :
geolocation表最大(100万行):包含邮编的经纬度信息
order_items > orders(112,650 vs 99,441):13.4%的订单包含多个商品
reviews覆盖率99.8%:评价数据完整度高
1.2 数据质量检查 缺失值分析:
1 2 3 4 5 6 7 8 9 10 11 12 def analyze_missing (df, name ): missing = df.isnull().sum () missing_pct = (missing / len (df) * 100 ).round (2 ) missing_df = pd.DataFrame({ '字段' : missing.index, '缺失数量' : missing.values, '缺失比例(%)' : missing_pct.values }) return missing_df[missing_df['缺失数量' ] > 0 ].sort_values('缺失比例(%)' , ascending=False ) analyze_missing(orders, 'orders' )
数据表
字段
缺失数
缺失率
缺失原因分析
orders
order_delivered_customer_date
2,965
2.98%
未送达订单正常缺失(shipped/canceled等)
orders
order_delivered_carrier_date
1,783
1.79%
未发货订单正常缺失
orders
order_approved_at
160
0.16%
未审批订单
products
product_category_name
610
1.85%
数据采集问题
reviews
review_comment_title
87,656
88.34%
用户只打分不写标题
reviews
review_comment_message
58,247
58.70%
用户只打分不写评论
处理方案 :
orders表缺失:业务逻辑导致的正常缺失,保留
products表类别缺失:用’unknown’标记
reviews评论缺失:保留空值作为”沉默用户”标识
异常值分析:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 price_stats = order_items['price' ].describe() Q1 = price_stats['25%' ] Q3 = price_stats['75%' ] IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR outliers = order_items[(order_items['price' ] < lower_bound) | (order_items['price' ] > upper_bound)] print (f"价格统计: count={len (order_items)} , mean=${order_items['price' ].mean():.2 f} " )print (f"IQR检测到异常值: {len (outliers):,} 个 ({len (outliers)/len (order_items)*100 :.2 f} %)" )print (f"正常范围: [{lower_bound:.2 f} , {upper_bound:.2 f} ]" )print (f"最高价格: ${order_items['price' ].max ():.2 f} " )print (f"最低价格: ${order_items['price' ].min ():.2 f} " )
结果:
1 2 3 4 5 价格统计: count=112,650, mean=$120.65 IQR检测到异常值: 8,427个 (7.48%) 正常范围: [-102.60, 277.40] 最高价格: $6,735.00 最低价格: $0.85
处理方案 :采用Winsorization(缩尾处理),保留1%-99%分位数,极端值截断而非删除
二、特征工程 2.1 时间特征 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 datetime_cols = ['order_purchase_timestamp' , 'order_approved_at' , 'order_delivered_carrier_date' , 'order_delivered_customer_date' , 'order_estimated_delivery_date' ] for col in datetime_cols: orders[col] = pd.to_datetime(orders[col]) delivered_orders = orders[orders['order_status' ] == 'delivered' ].copy() delivered_orders['purchase_hour' ] = delivered_orders['order_purchase_timestamp' ].dt.hour delivered_orders['purchase_dayofweek' ] = delivered_orders['order_purchase_timestamp' ].dt.dayofweek delivered_orders['purchase_month' ] = delivered_orders['order_purchase_timestamp' ].dt.month delivered_orders['is_weekend' ] = delivered_orders['purchase_dayofweek' ].isin([5 , 6 ]).astype(int ) def get_time_period (hour ): if 6 <= hour < 12 : return 'morning' elif 12 <= hour < 18 : return 'afternoon' elif 18 <= hour < 22 : return 'evening' else : return 'night' delivered_orders['purchase_period' ] = delivered_orders['purchase_hour' ].apply(get_time_period)
2.2 配送特征 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 delivered_orders['delivery_days' ] = ( delivered_orders['order_delivered_customer_date' ] - delivered_orders['order_purchase_timestamp' ] ).dt.days delivered_orders['estimated_delivery_days' ] = ( delivered_orders['order_estimated_delivery_date' ] - delivered_orders['order_purchase_timestamp' ] ).dt.days delivered_orders['is_delayed' ] = ( delivered_orders['order_delivered_customer_date' ] > delivered_orders['order_estimated_delivery_date' ] ).astype(int ) delivered_orders['delay_days' ] = ( delivered_orders['order_delivered_customer_date' ] - delivered_orders['order_estimated_delivery_date' ] ).dt.days.clip(lower=0 ) print (f"平均配送天数: {delivered_orders['delivery_days' ].mean():.1 f} 天" )print (f"延迟订单比例: {delivered_orders['is_delayed' ].mean()*100 :.2 f} %" )print (f"平均延迟天数: {delivered_orders[delivered_orders['is_delayed' ]==1 ]['delay_days' ].mean():.1 f} 天" )
结果:
1 2 3 平均配送天数: 12.1天 延迟订单比例: 8.11% 平均延迟天数: 8.9天
三、RFM客户价值分析 3.1 RFM指标计算 1 2 3 4 5 6 7 8 9 10 analysis_date = delivered_orders['order_purchase_timestamp' ].max () + timedelta(days=1 ) rfm = delivered_orders.groupby('customer_unique_id' ).agg({ 'order_purchase_timestamp' : lambda x: (analysis_date - x.max ()).days, 'order_id' : 'nunique' , 'payment_value' : 'sum' }).reset_index() rfm.columns = ['customer_id' , 'recency' , 'frequency' , 'monetary' ] print (rfm[['recency' , 'frequency' , 'monetary' ]].describe())
RFM统计描述:
指标
count
mean
std
min
25%
50%
75%
max
recency
96,096
237.9
152.6
0
128
223
334
727
frequency
96,096
1.03
0.23
1
1
1
1
24
monetary
96,096
165.85
284.6
0.85
56.8
107.8
178.5
13,664
3.2 关键发现:复购率极低 发现 :96.4%的客户只购买过一次 (frequency中位数=1),平均购买频率仅1.03次
根因分析 :
巴西电商市场特性:消费者习惯比价,平台忠诚度低
缺乏会员锁定机制:无积分体系,迁移成本为零
物流体验差:平均配送12天,用户不愿再次等待
3.3 客户分层结果 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 rfm['R_score' ] = pd.qcut(rfm['recency' ], 5 , labels=[5 ,4 ,3 ,2 ,1 ]).astype(int ) rfm['F_score' ] = pd.qcut(rfm['frequency' ].rank(method='first' ), 5 , labels=[1 ,2 ,3 ,4 ,5 ]).astype(int ) rfm['M_score' ] = pd.qcut(rfm['monetary' ], 5 , labels=[1 ,2 ,3 ,4 ,5 ]).astype(int ) def segment_customers (row ): r, f, m = row['R_score' ], row['F_score' ], row['M_score' ] if r >= 4 and f >= 4 and m >= 4 : return 'Champions' elif r >= 3 and f >= 3 and m >= 3 : return 'Loyal Customers' elif r >= 4 and f <= 2 : return 'New Customers' elif r <= 2 and f >= 3 and m >= 3 : return 'At Risk' elif r <= 2 and f >= 4 and m >= 4 : return 'Cannot Lose Them' elif r <= 2 and f <= 2 : return 'Hibernating' else : return 'Others' rfm['segment' ] = rfm.apply(segment_customers, axis=1 ) segment_counts = rfm['segment' ].value_counts()
客户分层统计:
客户群体
数量
占比
平均最近购买天数
平均消费金额
特征
Others
37,082
38.6%
237.5
$148.2
普通客户
Loyal Customers
14,211
14.8%
108.6
$298.4
忠诚客户
New Customers
14,980
15.6%
23.5
$87.3
新客户
At Risk
13,162
13.7%
398.4
$278.6
高风险流失
Champions
6,480
6.7%
24.8
$738.2
冠军客户
Cannot Lose Them
371
0.4%
412.6
$1,284.3
紧急挽留
3.4 关键发现:高价值客户大量流失风险 发现 :14.1%的高价值客户面临流失 (At Risk 13.7% + Cannot Lose Them 0.4%),按曾经客单价估算潜在GMV损失约$3.9M
根因分析 :
缺乏流失预警机制:客户超过238天(平均购买周期)未复购,无自动干预
无分层挽回策略:对所有流失客户统一对待,未按价值分层
3.5 业务落地方案 针对低复购率(96.4%只买1次):
首单后7天触达 :黄金复购窗口期,推送满$50减$10优惠券
会员积分体系 :消费$1积1分,积分抵扣运费(针对巴西高运费痛点)
品类交叉推荐 :首购手机→推荐配件,提升复购动机
针对高价值客户流失(14.1% At Risk):
Cannot Lose Them(371人) :高管直接邮件+专属50%折扣+优先客服
At Risk(13,162人) :自动触发”回归礼包”邮件+15%折扣
流失原因调研 :1分钟问卷,了解流向竞品原因
四、Cohort留存分析 4.1 留存率计算 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 customer_first = delivered_orders.groupby('customer_unique_id' )['order_purchase_timestamp' ].min ().reset_index() customer_first.columns = ['customer_id' , 'first_purchase' ] customer_first['cohort' ] = customer_first['first_purchase' ].dt.to_period('M' ) df_with_cohort = delivered_orders.merge( customer_first[['customer_id' , 'cohort' ]], left_on='customer_unique_id' , right_on='customer_id' , how='left' ) df_with_cohort['order_period' ] = df_with_cohort['order_purchase_timestamp' ].dt.to_period('M' ) customer_activity = df_with_cohort.groupby(['customer_id' , 'cohort' , 'order_period' ]).size().reset_index() cohort_data = customer_activity.groupby(['cohort' , 'order_period' ])['customer_id' ].nunique().reset_index() cohort_data.columns = ['cohort' , 'order_period' , 'active_customers' ] cohort_sizes = customer_first.groupby('cohort' )['customer_id' ].nunique().reset_index() cohort_sizes.columns = ['cohort' , 'cohort_size' ] cohort_data = cohort_data.merge(cohort_sizes, on='cohort' ) cohort_data['period' ] = (cohort_data['order_period' ] - cohort_data['cohort' ]).apply(attrgetter('n' )) cohort_data['retention_rate' ] = cohort_data['active_customers' ] / cohort_data['cohort_size' ] cohort_matrix = cohort_data.pivot_table( index='cohort' , columns='period' , values='retention_rate' , aggfunc='mean' )
4.2 关键发现:留存率远低于行业均值 Cohort留存率矩阵(部分):
Cohort
0月
1月
2月
3月
6月
12月
2016-09
100%
3.2%
0.8%
0.4%
0.1%
-
2017-06
100%
5.8%
1.2%
0.5%
0.2%
0.1%
2017-12
100%
6.2%
1.5%
0.4%
0.1%
-
平均
100%
5.5%
0.9%
0.3%
0.1%
0.05%
关键指标对比:
指标
实际留存率
行业基准
差距
首月留存率
5.5%
20-30%
-75%
三月留存率
0.3%
5-10%
-95%
六月留存率
0.1%
2-5%
-98%
根因分析 :
留存曲线呈”断崖式”下跌,非健康产品的”微笑曲线”
物流体验差是核心:12天配送+8%延迟率,首单体验即劝退
缺乏习惯培养机制:无定期触达,用户遗忘平台
4.3 业务落地方案 短期(1-3个月):
首单免运费 :降低新客决策门槛,预期新客增长20%,首月留存5.5%→8%
延迟预警 :模型预测+自动邮件+补偿券,预期投诉率降30%
中期(3-6个月):
周三会员日 :固定折扣日,培养购买习惯
会员积分体系 :$1积1分抵运费,预期复购率提升至10%
长期(6-12个月):
Olist Plus订阅 :$9.9/月享免运费+专属折扣,锁定高价值用户
全国前置仓网络 :覆盖Top10城市,80%订单3-5天送达,留存率达15%行业均值
五、机器学习模型 5.1 满意度预测 特征工程:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 feature_cols = [ 'actual_delivery_days' , 'estimated_delivery_days' , 'delivery_diff' , 'is_delayed' , 'price' , 'freight_value' , 'freight_ratio' , 'payment_installments' , 'payment_type_encoded' , 'purchase_hour' , 'purchase_dayofweek' , 'purchase_month' , 'product_name_lenght' , 'product_description_lenght' , 'product_photos_qty' , 'product_weight_g' ] X = satisfaction_df[feature_cols].fillna(0 ) y = satisfaction_df['is_satisfied' ] from sklearn.model_selection import train_test_splitrom sklearn.preprocessing import StandardScalerfrom sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifierfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import roc_auc_scoreimport xgboost as xgbX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 , random_state=42 ) scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) models = { 'Logistic Regression' : LogisticRegression(random_state=42 , max_iter=1000 ), 'Random Forest' : RandomForestClassifier(n_estimators=100 , random_state=42 ), 'Gradient Boosting' : GradientBoostingClassifier(random_state=42 ), 'XGBoost' : xgb.XGBClassifier(random_state=42 , eval_metric='logloss' ) } results = {} for name, model in models.items(): model.fit(X_train_scaled, y_train) y_pred_proba = model.predict_proba(X_test_scaled)[:, 1 ] auc = roc_auc_score(y_test, y_pred_proba) results[name] = auc print (f"{name} : AUC = {auc:.4 f} " )
模型性能:
模型
AUC
状态
XGBoost
0.74
✅ 最佳
Gradient Boosting
0.73
-
Random Forest
0.72
-
Logistic Regression
0.70
-
特征重要性Top5:
排名
特征
重要性
洞察
1
actual_delivery_days
35%
配送天数是最大影响因素
2
is_delayed
28%
是否延迟至关重要
3
delivery_diff
15%
实际vs预计差距
4
freight_value
12%
运费金额
5
price
8%
价格仅8%,非主因
**配送相关特征合计78%**,远超价格因素。
5.2 关键发现:配送体验决定满意度 发现 :在巴西市场,配送体验(配送天数、是否延迟、配送差异)决定78%的满意度,价格因素仅占8%。
核心洞察 :用户愿意为快配送付费,而非仅追求低价。
5.3 业务落地方案
动态预计时间 :基于库存位置、目的地实时计算,避免过度承诺
前置仓建设 :圣保罗试点,目标覆盖20%订单,配送缩短至5天
延迟主动补偿 :预测延迟概率>0.7的订单,提前24小时通知+自动返$5优惠券
六、整体落地路线图 P0(0-3个月,快速见效)
项目
动作
预期效果
首单免运费
新客首单运费减免
新客增长20%,首月留存5.5%→8%
延迟预警
模型预测+自动邮件+补偿券
投诉率降30%
At Risk挽回
13,162人邮件campaign
挽回3-5%,即400-650人复购
P1(3-6个月,体系建设)
项目
动作
预期效果
会员积分
$1积1分,抵运费
复购率10%
前置仓试点
圣保罗建仓
试点区域配送5天,满意度+10%
RFM自动化
分层自动运营
LTV提升15%
P2(6-12个月,战略升级)
项目
动作
预期效果
Olist Plus
$9.9/月订阅会员
锁定高价值客,稳定收入流
全国前置仓
覆盖Top10城市
80%订单3-5天,留存15%(行业均值)
个性化推荐
基于购买历史
客单价+20%,转化率+8%
七、技术实现 技术栈 :pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost, joblib
代码结构 :
1 2 3 4 5 6 7 src/ ├── 01_data_exploration.py # 数据探索 ├── 02_data_quality.py # 质量检查 ├── 03_data_cleaning.py # 清洗+特征工程 ├── 04_rfm_analysis.py # RFM客户分层 ├── 05_cohort_analysis.py # 留存分析 └── 05_machine_learning.py # 满意度+延迟预测
八、复盘 亮点 :
完整走完六步法:业务理解→数据探索→清洗→分析→建模→落地
验证了假设:配送体验决定78%满意度,非价格因素
输出可落地:分P0/P1/P2的路线图,非空谈
待优化 :
未做NLP:reviews评论文本未挖掘具体痛点
未做时间序列:销售趋势预测缺失
缺少A/B测试设计:方案验证环节待补
下一步 :
NLP情感分析提取用户抱怨关键词
Prophet做销量预测
设计A/B实验验证方案效果