A comprehensive data analysis project exploring transactional patterns, customer behaviour, and product performance using real-world e-commerce data. This project demonstrates advanced data cleaning, feature engineering, and exploratory data analysis (EDA) to extract actionable business insights.
- Filename: `purchase_data.csv`
- Rows: 541,909 transactions
- Columns: 8 (InvoiceNo, InvoiceDate, Quantity, UnitPrice, StockCode, Description, CustomerID, Country)
- Timeframe: Approximately 2 years
- Geography: 38 countries
- Invoice-level granularity
- Customer-specific insights
- Product descriptions and unit prices
- Date and time stamps for time-series analysis
- Missing `CustomerID` replaced with "Anonymous"
- `Description` cleaned and title-cased
- Negative or zero `Quantity` and `UnitPrice` values removed
- `InvoiceDate` standardized using European format
- Removed cancelled transactions (`InvoiceNo` starting with "C")
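The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_transactions` is hypothetical, "European format" is assumed to mean day-first dates, and whitespace stripping is assumed to be part of cleaning `Description`.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw transactions frame (sketch)."""
    df = df.copy()

    # Drop cancelled transactions: InvoiceNo values starting with "C".
    invoice = df["InvoiceNo"].astype(str)
    df = df[~invoice.str.startswith("C")]

    # Keep only rows where both Quantity and UnitPrice are strictly positive.
    df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)].copy()

    # Replace missing CustomerID with the "Anonymous" label.
    # Casting to object first avoids mixing strings into a float column.
    df["CustomerID"] = df["CustomerID"].astype("object").fillna("Anonymous")

    # Trim whitespace (an assumption) and title-case the descriptions.
    df["Description"] = df["Description"].astype("string").str.strip().str.title()

    # Parse dates day-first (assumed meaning of "European format");
    # values that cannot be parsed become NaT instead of raising.
    df["InvoiceDate"] = pd.to_datetime(
        df["InvoiceDate"], dayfirst=True, errors="coerce"
    )

    return df.reset_index(drop=True)
```

A typical call would be `clean_transactions(pd.read_csv("purchase_data.csv"))`; any encoding or dtype options the file needs are not stated in this README.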
- Revenue: `Quantity * UnitPrice`
- Temporal Columns: Month, Quarter, Season
- Product Categories: Generated via clustering of `Description`
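The revenue and temporal columns listed above could be derived along these lines. This is a sketch under assumptions: `add_features` and `SEASON_BY_MONTH` are hypothetical names, `InvoiceDate` is assumed to be already parsed as a datetime column, and the month-to-season mapping uses Northern Hemisphere meteorological seasons, which this README does not specify.

```python
import pandas as pd

# Assumed mapping: meteorological seasons in the Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add Revenue, Month, Quarter, and Season columns (sketch)."""
    df = df.copy()

    # Line-level revenue, as defined in the list above.
    df["Revenue"] = df["Quantity"] * df["UnitPrice"]

    # Temporal columns taken from the parsed invoice timestamp.
    df["Month"] = df["InvoiceDate"].dt.month
    df["Quarter"] = df["InvoiceDate"].dt.quarter
    df["Season"] = df["Month"].map(SEASON_BY_MONTH)

    return df
```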
- Detected revenue spikes in July and October 2011
- Stable transaction count, but revenue increased due to high-value purchases
- "Gifts & Decorations" leads revenue across all months
- "Bags & Accessories" and "Household Essentials" show consistent growth
- September peaks tied to seasonal demand for gifts
- Stable product lines identified for year-round promotions
- High-value outliers and loyal customer clusters identified
- Cross-category spending drives higher customer value
- Represent significant revenue (e.g., £640,000+ in "Gifts & Decorations")
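One way to check the observation that transaction counts stayed stable while revenue rose is to compare monthly revenue against the number of distinct invoices. The sketch below assumes a cleaned frame with a datetime `InvoiceDate`; the function name `monthly_summary` and the output column names are hypothetical.

```python
import pandas as pd


def monthly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Revenue, distinct invoice count, and revenue per invoice by month."""
    monthly = (
        df.assign(
            Revenue=df["Quantity"] * df["UnitPrice"],
            Period=df["InvoiceDate"].dt.to_period("M"),
        )
        .groupby("Period")
        .agg(Revenue=("Revenue", "sum"), Invoices=("InvoiceNo", "nunique"))
    )

    # A rising value here with a flat invoice count points to
    # higher-value purchases rather than more transactions.
    monthly["RevenuePerInvoice"] = monthly["Revenue"] / monthly["Invoices"]
    return monthly
```

Months where `Revenue` jumps while `Invoices` stays roughly level are the ones worth inspecting for high-value orders.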
- Lack of demographic data (age, income, etc.)
- Outliers affect distribution, even after removal
- Country-level analysis only; no finer geographic granularity
- Product grouping relies on unsupervised text clustering
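To illustrate the last limitation, here is a sketch of what unsupervised text clustering of `Description` might look like. The README does not say which algorithm was used, so TF-IDF features with k-means are an assumption, as are the function name `cluster_descriptions` and the default of five clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_descriptions(
    descriptions: pd.Series, n_clusters: int = 5, random_state: int = 0
) -> pd.Series:
    """Return an integer cluster label for each unique description (sketch)."""
    # Cluster unique, non-missing descriptions so repeated products
    # do not dominate the fit.
    unique = pd.Series(descriptions.dropna().astype(str).unique())

    # Represent each description as a TF-IDF vector over its words.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(unique)

    # Group the vectors with k-means; labels are arbitrary integers.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(vectors)

    return pd.Series(labels, index=unique.to_numpy(), name="Cluster")
```

The labels can be attached with `df["Description"].map(...)`. They carry no meaning on their own: readable names such as "Gifts & Decorations" would have to be assigned by inspecting each cluster, and the grouping shifts with `n_clusters` and the random seed, which is why it is listed as a limitation.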
- Add demographic fields to enable deeper segmentation
- Use supervised models (e.g. Random Forest, XGBoost) for predictive insights
- Integrate external data (holiday calendars, marketing spend)
- Build interactive dashboards for real-time decision-making
- Pandas, NumPy – data wrangling
- Matplotlib, Seaborn – visualisation
- Scikit-learn – clustering and potential model integration
- Jupyter Notebook – analysis environment
This project is released for academic and non-commercial use. Attribution is appreciated.
Exploratory Data Analyst | E-Commerce Insights | Python Enthusiast