RevoU Python Project


Project Summary

Improving the Understanding of Python function to help read, gather, manipulate, and analyze the data and generate useful insights to drive an informed decision-making process, and this 3 main things that I did in this project :

  1. Conducted data cleaning and preparation such as removing irrelevant values, handling missing data, removing duplicates, handling outlier, string manipulation,time series manipulation, combine & merge datasets and feature engineering.
  2. Conducted exploratory data such as data aggregation, descriptive statistic, analyzing using visualization
  3. Conducted Cluster Analysis such as created user segmentation using k-means clustering, cluster interpretation, recommendation based on cluster result

Project Files

For a more comprehensive analysis and visualization, please open the project files.

Project Background, Data scope, Dataset & Situation

Project Background

Python is the most challenging yet also exciting data programming languages. In this assignment we practiced python skills such as data cleaning and exploratory data analysis using python code. Also, we practiced more advanced python skills such as user segmentation using cluster analysis. We used Google Collab as a python notebook tools.

Data Scope

For this assignment we used data from kaggle and provided by olist, brazilian e-commerce company Customers Dataset. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

Dataset

1. Order_dataset

Provided by RevoU faculty drive as csv. order_dataset consist of datetime summary of their transaction in company.
Order Data Definition :

  1. order_id             : Order's unique identifier.
  2. customer_id           : Customer's unique identifier
  3. order_status           : Customer's order status
  4. order_purchase_timestamp     : Customer's timestamp order
  5. order_approved_at        : Customer's approved order time
  6. order_delivered_carrier_date     : Time of Order's delivered to carrier
  7. order_delivered_customer_date   : Time of Order's delivered to customer.

2. Customers_dataset

Provided by RevoU faculty drive as csv which consist of order id.The Customer Data consist of payment summary of company's customers.
Customer Data Definition:

  1. customer_id         : Customer's unique identifier
  2. customer_unique_id    : Customer's unique unique identifier
  3. customer_zip_code_prefix  : Customer's zip postal code
  4. customer_city       : Customer's city
  5. customer_state       : Customer's state

3. Order_payments_dataset

Provided by RevoU faculty drive as csv which consist of customer id. The Order Paymentconsist of company's customers data summary.
Order Payment Data Definition:

  1. order_id                     : Order's unique identifier
  2. payment_sequential  : Customer's payment sequential
  3. payment_type            : Customer's payment type
  4. payment_installments: Customer's payment installments
  5. payment_value           : Customer's payment value

Situation

Gustavo just started an e-commerce startup based in Brazil that recently opened an online website to sell their product. Fortunately, Gustavo is launcing their website when the covid-19 hits and making them grow faster than ever. However, Gustavo is still not using targeted marketing which hurts their marketing budget as only a fraction of their user comes back to their website. Gustavo needs your help to increase their marketing conversion rate by doing more targeted marketing using customer segmentation so that it will not hurt their budget.

Data Analysis

Note : Only Descriptive Analytics,EDA & Visualization,Customer Segmentation shown to simplify the analysis explanation. For further detail Check Google Colab File in Project file section above

Descriptive Analytics


  • count      : it contains the total amount of data in the column
  • mean      : it contains mean of data in the column
  • std          : standard deviation is a way to measure how far the average value lies from the mean. On the data above means that
                          payment_instalment's data has 2.73 point away from its mean, and payment_value's data has 177 point away from its
                          mean
  • min         : it contain minimum value of data in the column
  • max        : it contain maximum value of data in the column
  • 25%        : 25% percentile is Q1 in the column, it contain the middle value of data between minimum value and median. Q1 for
                          payment_instalment is 1, and Q1 of payment_value is 61.82
  • 50%        : 50% percentile is the median of data in the column. Median for payment_instalment is 2, and Median for payment_value
                         105.01
  • 75%        : 75% percentile is Q3 in the column, it contain the middle value of data between median and maximum value. Q3 for
                          payment_instalment is 4, and Q3 of payment_value is 175.78
  • skew       : payment_instalment's column has skew value 1.64, its positive value so it will be a positive skewness chart, and the
                          value is more than 1 which mean the chart will be highly skewed. Payment_value's column has positive 4.22 which
                          mean the chart is positive highly skewed
  • var          : Its talking about the spread between numbers in a data set. Variance is a measurement to determine how far each
                          number is from the mean and from every other number in the set. Payment_instalment's data has 7.45 point, normaly
                          std is square root of var, the var point is below std square of payment_installment which mean the data will be narrow
                          mean and each other. Payment_value's data has 31,328.15 point, its far exceed std square of payment_value which
                          mean the data will spread far from mean and distribution chart will be more broad
  • kurt         : Payment_instalment's has positive 2.77 point of kurtosis, its exceed +1 which mean the distribution chart data would be
                          peaked (Leptokurtic), and Payment_value's data has positive 25.15 point which mean the distribution chart data would
                          be so peaked (Leptokurtic)
  • 0             : Its contain mode on each column
  • CV          : coefficient of variation is the ratio between the standard deviation and the mean. Payment_instalment's has 91.86% cov
                          which mean the data has greater the level of dispersion around the mean. Payment_value's has 114.53%, its more then
                          100%, and its simply means that the standard deviation exceed the mean value
  • range     : its contain the range of data on each column
  • Exploratory Data Analysis (EDA)

    Monthly total order



    The Number of Order Per Month Histogram tell us that the total order month by month is gradualy growing from October 2016 and reaching its most peak point at November 2017. November 2017 is recorded too as the highest growth rate month. But after reach its peak, the growth in number of orders per moonth tends to stagnate till the last recorded data on August 2018.

    Busiest day of week



    The Busiest Day of Week Histogram tell us that trend orders mostly occur in the Midweek (Midweek is the middle of the week that roughly the period from Tuesday to Thursday), with Tuesday setting a record with the day with the most orders

    Percentage payment type users



    According to the Payment Type pie chart, Gustavo's customers are dominated by credit cards users with 75.9% (or 73,896 total users) following by boleto with 19.9% (or 19,376 total users), voucher 2.69% (or 2,615 total users), and debit card 1.54% (or 1,501 total users)

    Delivery Performance



    The data below is an alert information for Mr. Gustavo to prepare preventive measures to avoid the anger of customers whose orders are late. The delay data can be used for evaluation, which courier has problems or is often late, restructuring inventory warehouse, and workflow
    According to the Delivery Performance pie chart below, Gustavo's performance recorded 7.91% or 7114 order late during these almost 2 year dataset record

    Late Orders



    Filled line chart below describe delays in delivery orders often occur during peak season periods

    Monthly sales growth



    In the sales growth chart below, there is one point that can be highlighted for evaluation, in December 2017, where sales fell -22.32% from the previous month. Gustavo and the team can dig up data on what happened that month and prepare preventive steps (such as a new year promo campaign, flash sale, bundling, or preparing special products) so that it doesn't happen again in December 2018.

    Customer Segmentation

    Segmentation by Customer Geography



    Pareto chart above explain 80/20 rules, that 80% of customer comes from 20% states (6 cities states) from total 27 state, that is Sao Paulo (SP), Rio de Janeiro (RJ), MInas Gerais (MG), Rio Grande do Sul (RS), Parana (PR), Santa Catarina (SC), Bahia (BA). Using 80/20 rules simply means that most of Mr.Gustavo business around 80% comes from customers from 6 states above.
    It can be use to remind Mr.Gustavo about his time management when managing marketing. Mr.Gustavo should focus his time on areas where 80% of customers are there (marketing campaign to stabilize or increase retention should be focused on 20% states above)
    On the other hand, Mr.Gustavo can use this chart for insight to boosting advertising or pushing his marketing team to penetrate market and increase sales outside of those states, if he has more marketing funds.

    Segmentation by Customer Behaviour


    The elbow method show 4 clusters is the optimal cluster number,and The silhouette analysis also show 2 clusters is the optimal cluster number
    we use 4 cluster form elbow to se 4 cluster to make it more varied


    On spender type above we can see that Gustavo's customer is dominated by 59.2% low spender, folowing by lower-mid spender with 26.2%, upper-mid spender 11.4%, and high spender 3.13%.In terms of advertising/campaign needs, while focusing outside 80% state which dominate the order, Gustavo can also focusing target to specific demographics by their spending type customers in each states

    Recommendation

    1. The market predominantly populated with middle class, using discount or promo would be helpful to boost sales.
    2. The real sultans also have significant amount of population, and these people tend to be more loyal and come back to buy more. Increasing benefit or creating customer loyalty program campaign would be helpful to retain these customers since they are can be quite a money maker.