An agentic, "best-in-class" statistical analysis system built with Agno, Azure OpenAI, and a high-performance data backbone (Polars/Dask/Pandas). This project demonstrates how AI agents can automate rigorous scientific research workflows, from data profiling to executive-level reporting.
Note
This project is a Proof of Concept (POC). It represents a foundational idea for an autonomous researcher and can be expanded for specific domain-expert applications (e.g., Clinical Trials, Market Research, A/B Testing at Scale).
Traditional statistical software (SPSS, SAS, R) either requires deep technical expertise or is too rigid for rapid business inquiry. This project was built to bridge that gap:
- Lower the Barrier: Allow anyone to ask complex business questions in natural language.
- Ensure Rigor: Automate the "boring but critical" parts of statistics (normality checks, variance testing, sample size calculation) to prevent common human errors.
- Handle Big Data: Enable analyzing millions of rows on a laptop without specialized infrastructure.
Powered by a specialized Agno Team, the system divides the workload:
- 🔍 Data Analyst: Proactively profiles the schema and identifies the "Ground Truth" columns needed.
- ⚖️ Stats Executor: The "Brain" that chooses and runs the right test (T-Test, ANOVA, Chi-Square, Mann-Whitney) only after validating mathematical assumptions (e.g., Shapiro-Wilk for normality).
- 📝 Insights Reporter: Translates the "Greek and Math" into a professional business verdict with actionable recommendations.
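The Stats Executor's "validate first, then choose" behavior can be sketched as a small decision function. This is a minimal illustration, not the project's actual code: `AssumptionReport` and `choose_test` are hypothetical names, and the rule for three or more non-normal groups is an assumption, since none of the four tests listed above covers that case.

```python
from dataclasses import dataclass


@dataclass
class AssumptionReport:
    """Hypothetical summary of the checks run before a test is chosen."""
    outcome_is_categorical: bool  # True when comparing counts or proportions
    n_groups: int                 # number of groups being compared
    all_groups_normal: bool       # e.g. every group passed Shapiro-Wilk


def choose_test(report: AssumptionReport) -> str:
    """Return one of the four listed tests, or a message the agent can relay."""
    if report.n_groups < 2:
        return "unsupported: need at least two groups to compare"
    if report.outcome_is_categorical:
        return "chi-square"
    if report.n_groups == 2:
        # Fall back to the rank-based test when normality is rejected.
        return "t-test" if report.all_groups_normal else "mann-whitney"
    if report.all_groups_normal:
        return "anova"
    # Assumption: three or more non-normal groups fall outside the listed tests.
    return "unsupported: no listed test fits three or more non-normal groups"


print(choose_test(AssumptionReport(False, 2, False)))
```

Returning a plain string for the unsupported cases lets the agent explain the limitation instead of running a test whose assumptions are violated.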
Instead of arbitrary limits, the system uses a Dynamic Sampling Engine:
- Formula-Driven: Uses Cochran’s Formula (99.9% Confidence, 0.5% Margin of Error) to calculate the minimum sample size needed for a representative slice of the population.
- Big Data Ready: Handles files up to 2 GB using Dask and Polars while keeping the UI responsive.
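For reference, Cochran's formula with the parameters quoted above can be worked through in a few lines of standard-library Python. This is a sketch under stated assumptions: the function name `cochran_sample_size`, the worst-case proportion `p = 0.5`, and the finite population correction are illustrative choices, not confirmed details of the project's sampling engine.

```python
import math
from statistics import NormalDist


def cochran_sample_size(population: int, confidence: float = 0.999,
                        margin: float = 0.005, p: float = 0.5) -> int:
    """Sample size from Cochran's formula with a finite population correction."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Two-sided critical value, about 3.29 for 99.9% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Infinite-population sample size: n0 = z^2 * p * (1 - p) / e^2.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # The correction shrinks n0 when the dataset itself is small.
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


# Rows to sample from a 10-million-row dataset.
print(cochran_sample_size(10_000_000))
```

With these defaults the uncorrected size is roughly 108,000 rows, so for datasets far larger than that the sample stays near this figure no matter how many rows the file contains.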
Every statistical tool is "hardened" with a validation layer:
- Defensive Programming: Detects empty groups, zero-variance data, and insufficient samples before execution.
- Self-Healing: Provides human-readable error messages to the agent, allowing it to pivot or explain data limitations.
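A validation layer of this kind might look like the sketch below. The function name `validate_groups`, the `min_size` threshold of 3, and the message wording are all hypothetical; the point is only that problems come back as readable text rather than as exceptions raised from inside a statistical routine.

```python
def validate_groups(groups: dict[str, list[float]], min_size: int = 3) -> list[str]:
    """Return human-readable problems; an empty list means the data can be tested."""
    problems = []
    if len(groups) < 2:
        problems.append("Need at least two groups to compare.")
    for name, values in groups.items():
        if not values:
            problems.append(f"Group '{name}' is empty.")
        elif len(values) < min_size:
            problems.append(
                f"Group '{name}' has {len(values)} values; at least {min_size} are required."
            )
        elif len(set(values)) == 1:
            # Identical values give zero variance, which breaks most tests.
            problems.append(f"Group '{name}' has zero variance (all values are identical).")
    return problems


issues = validate_groups({"control": [4.1, 3.9, 4.3], "variant": [5.0, 5.0, 5.0]})
print(issues or "ok")
```

Because the result is a list of sentences, the agent can pass them straight into its reply or use them to decide on a different analysis.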
- Agentic Framework: Agno
- LLM Backend: Azure OpenAI (GPT-4o)
- Data Processing: Polars (Fast), Dask (Parallel/Large), Pandas (Standard)
- Scientific Suite: SciPy, statsmodels, Pingouin (Modern Stats)
- UI Foundation: Streamlit (Premium Custom CSS)
- Environment: Python 3.12 + uv package manager
- Python 3.12
- Azure OpenAI Deployment (GPT-4o recommended)
- Clone the repository.
- Create a .env file with your credentials:
AZURE_OPENAI_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=your_endpoint
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
- Install dependencies:
uv pip install -r requirements.txt
- Run the app:
streamlit run src/app.py
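If the app fails to reach Azure OpenAI on startup, a quick check for unset credentials can help. The sketch below assumes the .env values have already been loaded into the process environment (how the app loads them is not shown here); `REQUIRED_SETTINGS` and `missing_settings` are hypothetical names.

```python
import os

# The four variables named in the .env step above.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_settings(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings: " + ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```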
- Business Leaders: Get instant, scientifically valid answers to "what-if" questions.
- Data Teams: Automate the initial "sanity check" phase of research.
- AI Researchers: Explore agentic patterns for complex scientific tool-calling.
This POC can be expanded into a "Researcher-in-a-Box":
- Automated Visualization: Dynamically generating charts based on the result of the test.
- Correlation Matrix Exploration: Autonomously identifying high-impact variables.
- Advanced ML Integration: Predicting future trends based on the discovered hypothesis.
Built with ❤️ for Robust Statistical Autonomy.