RevoU Python Project

Project Summary

Improving the Understanding of Python function to help read, gather, manipulate, and analyze the data and generate useful insights to drive an informed decision-making process, and this 3 main things that I did in this project :

Conducted data cleaning and preparation such as removing irrelevant values, handling missing data, removing duplicates, handling outlier, string manipulation,time series manipulation, combine & merge datasets and feature engineering.
Conducted exploratory data such as data aggregation, descriptive statistic, analyzing using visualization
Conducted Cluster Analysis such as created user segmentation using k-means clustering, cluster interpretation, recommendation based on cluster result

Project Files

For a more comprehensive analysis and visualization, please open the project files.

Project Background, Data scope, Dataset & Situation

Project Background

Python is the most challenging yet also exciting data programming languages. In this assignment we practiced python skills such as data cleaning and exploratory data analysis using python code. Also, we practiced more advanced python skills such as user segmentation using cluster analysis. We used Google Collab as a python notebook tools.

Data Scope

For this assignment we used data from kaggle and provided by olist, brazilian e-commerce company Customers Dataset. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

Dataset

1. Order_dataset

Provided by RevoU faculty drive as csv. order_dataset consist of datetime summary of their transaction in company.
Order Data Definition :

order_id : Order's unique identifier.
customer_id : Customer's unique identifier
order_status : Customer's order status
order_purchase_timestamp : Customer's timestamp order
order_approved_at : Customer's approved order time
order_delivered_carrier_date : Time of Order's delivered to carrier
order_delivered_customer_date : Time of Order's delivered to customer.

2. Customers_dataset

Provided by RevoU faculty drive as csv which consist of order id.The Customer Data consist of payment summary of company's customers.
Customer Data Definition:

customer_id : Customer's unique identifier
customer_unique_id : Customer's unique unique identifier
customer_zip_code_prefix : Customer's zip postal code
customer_city : Customer's city
customer_state : Customer's state

3. Order_payments_dataset

Provided by RevoU faculty drive as csv which consist of customer id. The Order Paymentconsist of company's customers data summary.
Order Payment Data Definition:

order_id : Order's unique identifier
payment_sequential : Customer's payment sequential
payment_type : Customer's payment type
payment_installments: Customer's payment installments
payment_value : Customer's payment value

Situation

Gustavo just started an e-commerce startup based in Brazil that recently opened an online website to sell their product. Fortunately, Gustavo is launcing their website when the covid-19 hits and making them grow faster than ever. However, Gustavo is still not using targeted marketing which hurts their marketing budget as only a fraction of their user comes back to their website. Gustavo needs your help to increase their marketing conversion rate by doing more targeted marketing using customer segmentation so that it will not hurt their budget.

Data Analysis

Note : Only Descriptive Analytics,EDA & Visualization,Customer Segmentation shown to simplify the analysis explanation. For further detail Check Google Colab File in Project file section above

Descriptive Analytics

count : it contains the total amount of data in the column

mean : it contains mean of data in the column

std          : standard deviation is a way to measure how far the average value lies from the mean. On the data above means that
                      payment_instalment's data has 2.73 point away from its mean, and payment_value's data has 177 point away from its
                      mean

min : it contain minimum value of data in the column

max : it contain maximum value of data in the column

25% : 25% percentile is Q1 in the column, it contain the middle value of data between minimum value and median. Q1 for
payment_instalment is 1, and Q1 of payment_value is 61.82

50% : 50% percentile is the median of data in the column. Median for payment_instalment is 2, and Median for payment_value
105.01

75% : 75% percentile is Q3 in the column, it contain the middle value of data between median and maximum value. Q3 for
payment_instalment is 4, and Q3 of payment_value is 175.78

skew       : payment_instalment's column has skew value 1.64, its positive value so it will be a positive skewness chart, and the
                      value is more than 1 which mean the chart will be highly skewed. Payment_value's column has positive 4.22 which
                      mean the chart is positive highly skewed

var          : Its talking about the spread between numbers in a data set. Variance is a measurement to determine how far each
                      number is from the mean and from every other number in the set. Payment_instalment's data has 7.45 point, normaly
                      std is square root of var, the var point is below std square of payment_installment which mean the data will be narrow
                      mean and each other. Payment_value's data has 31,328.15 point, its far exceed std square of payment_value which
                      mean the data will spread far from mean and distribution chart will be more broad

kurt         : Payment_instalment's has positive 2.77 point of kurtosis, its exceed +1 which mean the distribution chart data would be
                      peaked (Leptokurtic), and Payment_value's data has positive 25.15 point which mean the distribution chart data would
                      be so peaked (Leptokurtic)

0 : Its contain mode on each column

CV          : coefficient of variation is the ratio between the standard deviation and the mean. Payment_instalment's has 91.86% cov
                      which mean the data has greater the level of dispersion around the mean. Payment_value's has 114.53%, its more then
                      100%, and its simply means that the standard deviation exceed the mean value

range : its contain the range of data on each column

Exploratory Data Analysis (EDA)

Monthly total order

The Number of Order Per Month Histogram tell us that the total order month by month is gradualy growing from October 2016 and reaching its most peak point at November 2017. November 2017 is recorded too as the highest growth rate month. But after reach its peak, the growth in number of orders per moonth tends to stagnate till the last recorded data on August 2018.

Busiest day of week

The Busiest Day of Week Histogram tell us that trend orders mostly occur in the Midweek (Midweek is the middle of the week that roughly the period from Tuesday to Thursday), with Tuesday setting a record with the day with the most orders

Percentage payment type users

According to the Payment Type pie chart, Gustavo's customers are dominated by credit cards users with 75.9% (or 73,896 total users) following by boleto with 19.9% (or 19,376 total users), voucher 2.69% (or 2,615 total users), and debit card 1.54% (or 1,501 total users)

Delivery Performance

The data below is an alert information for Mr. Gustavo to prepare preventive measures to avoid the anger of customers whose orders are late. The delay data can be used for evaluation, which courier has problems or is often late, restructuring inventory warehouse, and workflow
According to the Delivery Performance pie chart below, Gustavo's performance recorded 7.91% or 7114 order late during these almost 2 year dataset record

Late Orders

Filled line chart below describe delays in delivery orders often occur during peak season periods

Monthly sales growth

In the sales growth chart below, there is one point that can be highlighted for evaluation, in December 2017, where sales fell -22.32% from the previous month. Gustavo and the team can dig up data on what happened that month and prepare preventive steps (such as a new year promo campaign, flash sale, bundling, or preparing special products) so that it doesn't happen again in December 2018.

Customer Segmentation

Segmentation by Customer Geography

Pareto chart above explain 80/20 rules, that 80% of customer comes from 20% states (6 cities states) from total 27 state, that is Sao Paulo (SP), Rio de Janeiro (RJ), MInas Gerais (MG), Rio Grande do Sul (RS), Parana (PR), Santa Catarina (SC), Bahia (BA). Using 80/20 rules simply means that most of Mr.Gustavo business around 80% comes from customers from 6 states above.
It can be use to remind Mr.Gustavo about his time management when managing marketing. Mr.Gustavo should focus his time on areas where 80% of customers are there (marketing campaign to stabilize or increase retention should be focused on 20% states above)
On the other hand, Mr.Gustavo can use this chart for insight to boosting advertising or pushing his marketing team to penetrate market and increase sales outside of those states, if he has more marketing funds.

Segmentation by Customer Behaviour

The elbow method show 4 clusters is the optimal cluster number,and The silhouette analysis also show 2 clusters is the optimal cluster number
we use 4 cluster form elbow to se 4 cluster to make it more varied

On spender type above we can see that Gustavo's customer is dominated by 59.2% low spender, folowing by lower-mid spender with 26.2%, upper-mid spender 11.4%, and high spender 3.13%.In terms of advertising/campaign needs, while focusing outside 80% state which dominate the order, Gustavo can also focusing target to specific demographics by their spending type customers in each states

Recommendation

The market predominantly populated with middle class, using discount or promo would be helpful to boost sales.
The real sultans also have significant amount of population, and these people tend to be more loyal and come back to buy more. Increasing benefit or creating customer loyalty program campaign would be helpful to retain these customers since they are can be quite a money maker.