RevoU Python Project
Project Summary
Improving the Understanding of Python function to help read, gather, manipulate, and analyze the data and generate useful insights to drive an informed decision-making process, and this 3 main things that I did in this project :
- Conducted data cleaning and preparation such as removing irrelevant values, handling missing data, removing duplicates, handling outlier, string manipulation,time series manipulation, combine & merge datasets and feature engineering.
- Conducted exploratory data such as data aggregation, descriptive statistic, analyzing using visualization
- Conducted Cluster Analysis such as created user segmentation using k-means clustering, cluster interpretation, recommendation based on cluster result
Project Files
For a more comprehensive analysis and visualization, please open the project files.Project Background, Data scope, Dataset & Situation
Project Background
Python is the most challenging yet also exciting data programming languages. In this assignment we practiced python skills such as data cleaning and exploratory data analysis using python code. Also, we practiced more advanced python skills such as user segmentation using cluster analysis. We used Google Collab as a python notebook tools.
Data Scope
For this assignment we used data from kaggle and provided by olist, brazilian e-commerce company Customers Dataset. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.
Dataset
1. Order_dataset
Provided by RevoU faculty drive as csv. order_dataset consist of datetime summary of their transaction in company.
Order Data Definition :
- order_id : Order's unique identifier.
- customer_id : Customer's unique identifier
- order_status : Customer's order status
- order_purchase_timestamp : Customer's timestamp order
- order_approved_at : Customer's approved order time
- order_delivered_carrier_date : Time of Order's delivered to carrier
- order_delivered_customer_date : Time of Order's delivered to customer.
2. Customers_dataset
Provided by RevoU faculty drive as csv which consist of order id.The Customer Data consist of payment summary of company's customers.
Customer Data Definition:
- customer_id : Customer's unique identifier
- customer_unique_id : Customer's unique unique identifier
- customer_zip_code_prefix : Customer's zip postal code
- customer_city : Customer's city
- customer_state : Customer's state
3. Order_payments_dataset
Provided by RevoU faculty drive as csv which consist of customer id. The Order Paymentconsist of company's customers data summary.
Order Payment Data Definition:
- order_id : Order's unique identifier
- payment_sequential : Customer's payment sequential
- payment_type : Customer's payment type
- payment_installments: Customer's payment installments
- payment_value : Customer's payment value
Situation
Gustavo just started an e-commerce startup based in Brazil that recently opened an online website to sell their product. Fortunately, Gustavo is launcing their website when the covid-19 hits and making them grow faster than ever. However, Gustavo is still not using targeted marketing which hurts their marketing budget as only a fraction of their user comes back to their website. Gustavo needs your help to increase their marketing conversion rate by doing more targeted marketing using customer segmentation so that it will not hurt their budget.
Data Analysis
Note : Only Descriptive Analytics,EDA & Visualization,Customer Segmentation shown to simplify the analysis explanation. For further detail Check Google Colab File in Project file section above
Descriptive Analytics
payment_instalment's data has 2.73 point away from its mean, and payment_value's data has 177 point away from its
mean
payment_instalment is 1, and Q1 of payment_value is 61.82
105.01
payment_instalment is 4, and Q3 of payment_value is 175.78
value is more than 1 which mean the chart will be highly skewed. Payment_value's column has positive 4.22 which
mean the chart is positive highly skewed
number is from the mean and from every other number in the set. Payment_instalment's data has 7.45 point, normaly
std is square root of var, the var point is below std square of payment_installment which mean the data will be narrow
mean and each other. Payment_value's data has 31,328.15 point, its far exceed std square of payment_value which
mean the data will spread far from mean and distribution chart will be more broad
peaked (Leptokurtic), and Payment_value's data has positive 25.15 point which mean the distribution chart data would
be so peaked (Leptokurtic)
which mean the data has greater the level of dispersion around the mean. Payment_value's has 114.53%, its more then
100%, and its simply means that the standard deviation exceed the mean value
Exploratory Data Analysis (EDA)
Monthly total order
The Number of Order Per Month Histogram tell us that the total order month by month is gradualy growing from October 2016 and reaching its most peak point at November 2017. November 2017 is recorded too as the highest growth rate month. But after reach its peak, the growth in number of orders per moonth tends to stagnate till the last recorded data on August 2018.
Busiest day of week
The Busiest Day of Week Histogram tell us that trend orders mostly occur in the Midweek (Midweek is the middle of the week that roughly the period from Tuesday to Thursday), with Tuesday setting a record with the day with the most orders
Percentage payment type users
According to the Payment Type pie chart, Gustavo's customers are dominated by credit cards users with 75.9% (or 73,896 total users) following by boleto with 19.9% (or 19,376 total users), voucher 2.69% (or 2,615 total users), and debit card 1.54% (or 1,501 total users)
Delivery Performance
The data below is an alert information for Mr. Gustavo to prepare preventive measures to avoid the anger of customers whose orders are late. The delay data can be used for evaluation, which courier has problems or is often late, restructuring inventory warehouse, and workflow
According to the Delivery Performance pie chart below, Gustavo's performance recorded 7.91% or 7114 order late during these almost 2 year dataset record
Late Orders
Filled line chart below describe delays in delivery orders often occur during peak season periods
Monthly sales growth
In the sales growth chart below, there is one point that can be highlighted for evaluation, in December 2017, where sales fell -22.32% from the previous month. Gustavo and the team can dig up data on what happened that month and prepare preventive steps (such as a new year promo campaign, flash sale, bundling, or preparing special products) so that it doesn't happen again in December 2018.
Customer Segmentation
Segmentation by Customer Geography
Pareto chart above explain 80/20 rules, that 80% of customer comes from 20% states (6 cities states) from total 27 state, that is Sao Paulo (SP), Rio de Janeiro (RJ), MInas Gerais (MG), Rio Grande do Sul (RS), Parana (PR), Santa Catarina (SC), Bahia (BA). Using 80/20 rules simply means that most of Mr.Gustavo business around 80% comes from customers from 6 states above.
It can be use to remind Mr.Gustavo about his time management when managing marketing. Mr.Gustavo should focus his time on areas where 80% of customers are there (marketing campaign to stabilize or increase retention should be focused on 20% states above)
On the other hand, Mr.Gustavo can use this chart for insight to boosting advertising or pushing his marketing team to penetrate market and increase sales outside of those states, if he has more marketing funds.
Segmentation by Customer Behaviour
The elbow method show 4 clusters is the optimal cluster number,and The silhouette analysis also show 2 clusters is the optimal cluster number
we use 4 cluster form elbow to se 4 cluster to make it more varied
On spender type above we can see that Gustavo's customer is dominated by 59.2% low spender, folowing by lower-mid spender with 26.2%, upper-mid spender 11.4%, and high spender 3.13%.In terms of advertising/campaign needs, while focusing outside 80% state which dominate the order, Gustavo can also focusing target to specific demographics by their spending type customers in each states
Recommendation
- The market predominantly populated with middle class, using discount or promo would be helpful to boost sales.
- The real sultans also have significant amount of population, and these people tend to be more loyal and come back to buy more. Increasing benefit or creating customer loyalty program campaign would be helpful to retain these customers since they are can be quite a money maker.