RevoU Spreadsheet & Statistics Project
Project Summary
Improving the Understanding of Spreadsheets to do data cleaning, Exploratory Data Analysis (EDA) and generate useful insights to drive an informed decision-making process, and this 5 main things that I did in this project :
- Cleaning the dataset such as parsing/split data, removing duplicates & null values, data formating
- Using various function in spreadsheet that related for analysis such as particularly filter, sort function, pivot, vlookup, importrange, index-match, etc
- Performed statistical analysis such as statistical description and data distribution analysis also using various type of charts.
- Performed Exploratory Data Analysis (EDA) such as univariate and multivariate analysis to get meaningful insights.
- Performed advanced statistical analysis such as linear regression,corelation analysis and hypothesis testing.
Project Files
For a more comprehensive analysis and visualization, please open the project files.Project background & Dataset
Project background
As a data analyst, we are expected to be able to do some work related to spreadsheets such as data cleaning and EDA. Therefore, we want to test our level of comprehension by doing this assignment. This will help us hone our spreadsheet skill as a data analyst.
Dataset 1| Property Listings in Kuala Lumpur
Provided on Kaggle, This dataset Originally from tabular result of scraping a property listing website for properties for sale in Kuala Lumpur, Malaysia. Only the overview page was scraped so individual property details are scarce.
Dataset 2 | Brazilian E-Commerce Public Dataset by Olist
Provided on Kaggle,This is a Brazilian ecommerce public dataset of orders made at Olist Store. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers.
on this project we only use 3 tab of brazilian ecommerce dataset (order_datasets , customers_datasets , order_payments_datasets)
Dataset 3 | A/B Test Result Dataset
Provided on Kaggle, This is a classified e-commerce company data. The dataset is about test result from an experimental landing page. The dataset includes user id, timestamp, group, and converted.
CASE 1
1. Clean up the data by removing records with empty price, splitting price column to get numbers only with number format.
We used filter and split text into column functions.
2. Share some insight that we can get using univariate analysis.
This graph was created to see the number of types of furnishings with the highest percentage. It can be concluded that Partly furnished occupies the top position with 2653 listing (60,2%), followed by Fully Furnished with 1249 listings(28,4%) and undefined with 560 listings.
This graph was created to see the location of the property with the highest percentage. It can be concluded that Mont Kiara, KL occupies the top position with 710 listings, followed by KLCC, KL with 593 and Desa ParkCity with 353.
This graph is made to see the number of property types with the highest listings. It can be concluded that condominiums occupy the top position with 1171 listings, followed by serviced residences with 673 listings and condominiums (corners) with 602 listings.
CASE 2
3. Combine orders dataset, customers dataset, and order payment dataset into 1 sheet.
In combining datasets, the vlookup
and importrange
functions in the spreadsheet are used.
4. Based on the combined data above, do some data cleaning.
- We check whether there are a duplicate record and clean them using coditional formatting
- We check whether there are null/empty values and remove them using sort and filter function.
5. Do descriptive analysis to clean the data by spotting outliers.
-The minimum payment value is 0 and the maximum is 789.685 which results in a payment value range of 789.685. This value is quite large when compared to the average payment value of 329. The mean > median > mode indicates a positive skew. The standard deviation is 11.403 Kurtosis -371,510 shows distribution that relative flat
we spotted border of outlier if we use constanta 1.5 for the lower outlier -200, and the upper 327, and if we use constanta 3 for the lower outlier -463, and the upper 590
-The minimum value for actual delivery time is 0 days and the maximum is 166 days which results in an actual delivery time range of 166 days. This value is quite large when compared to the average payment value of 12 days. The mean > median > mode indicates a positive skew. The standard deviation is 10. Kurtosis -3.206 showed peaked data compared to normal distribution.
we spotted border of outlier if we use constanta 1.5 for the lower outlier -16, and the upper 28, and if we use constanta 3 for the lower outlier -39, and the upper 1
6. Create basic charts from the order data above.
The graph shows a comparison between the number of orders vs the current month from October 2016 to August 2018. Over time the number of orders has increased until it reaches the highest number of orders in November 2017 of 372 orders.Then slowly went down to August 2018.
The graph shows the comparison between the number of orders vs the day on which the order was made. This chart is made to see Daily Trends/peak days where buyers place orders. It can be concluded that the peak day is Tuesday, where the total number of orders reached 791 orders.
The graph shows the comparison between the number of orders vs. payment type. This graph is made to see what payment types are most often used by buyers. The most obtained were credit cards with a percentage of 75.4%, followed by boleto with a percentage of 19.7%, vouchers at 3.6% and finally debit cards at 1%.
7. Explore what insight that we can get using bivariate or multivariate analysis.
The graph shows the comparison between the number of payments per state/country. State SP is in the highest payment with a total payment of 278,658. Followed by State RJ, MG, RS and so on. 80% of payments are generated by State SP to DF, so that stakeholders can focus on taking action on that State.
There is a very weak linear correlation between payment value and payment installments.
8. Find variables that corellates with each other using scatterplot, and make linear regression between variables.
From the scatter plot above, even though have 0.85 correlation the scatter plot type is heteroscedasticity because it doesn't have equal variance between two variable
From the scatter plot above, we can calculate values and predict the exceeded values using the equation straight line formula, y=mx+c. For example, when the Service Price value is at 500, the Payment Value will be 1.06*500+13.3 = 543,3
9.
An e-commerce company is revamping a landing page after various analysis and research, and want to experiment whether the new page will give better conversion rate before rolling out to wilder audience.
Given the experiment results from the control and experimental group we have a hypothesis that the new page (treatment group) will give a better conversion rate.
- Test whether hypothesis is proven or not using z-test.
- Determine if the test reject or fail to reject the hyphotesis.
In the experiment that have been carried out, P-value result is at 0.3, the value is higher than Significance level (>0.05) and indicates strong evidence for the null hypothesis. This means we fail to reject the null hypothesis