Welcome to the Data Cleaning & Insight Generation Project! This project focuses on working with the Kaggle Data Science Survey (2017–2021), a real-world dataset filled with responses from thousands of data professionals worldwide. The goal is to clean messy survey data, handle missing values, encode categorical responses, and generate meaningful insights about respondent behavior and preferences. By transforming the raw survey into a structured dataset, we enable deeper analysis and interactive visualizations that uncover trends in the global data science community.
Every year, Kaggle conducts a global survey of data scientists, covering their tools, programming languages, education, experience, and career aspirations.
In this project, we focused on:
- Cleaning and preprocessing survey responses (handling missing values, duplicates, and inconsistent formatting)
- Applying label encoding/mapping for categorical variables
- Extracting insights on respondent demographics, education, salary, and tool usage
- Building multiple visualizations (pie, bar, scatter, line, box, heatmap, etc.)
- Generating a summary report & dashboard of the top 5 insights

This project transforms raw survey data into a clear and structured analysis of the data science landscape.
- Import, clean, and preprocess the Kaggle survey dataset
- Handle missing values, duplicates, and categorical responses
- Encode categorical variables using label encoding/mapping
- Create rich visualizations to showcase respondent patterns
- Extract top insights on demographics, career paths, and tool adoption
- Summarize findings in a PDF report & dashboard
- Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Plotly, Scikit-learn
- Analysis Methods: Data Cleaning | Categorical Encoding | Descriptive Analytics | Insight Generation
- Visualizations: Pie Charts | Bar Charts | Scatter Plots | Line Charts | Boxplots | Heatmaps | Histograms | KPI summaries

The Kaggle Data Science Survey (2017–2021) dataset includes responses from thousands of professionals, covering:
- Demographics (age, gender, country, education)
- Career & Job Titles
- Salary Segments & Experience Levels
- Tools, Programming Languages, and Platforms Used
- Aspirations, Challenges, and Industry Trends
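
Because the survey spans several yearly releases, one plausible first step is to stack the yearly tables into a single frame tagged with a `year` column. The sketch below assumes each year has already been loaded into a pandas DataFrame. The helper name `combine_years` and the toy columns are hypothetical and are not taken from the project's notebook.

```python
import pandas as pd


def combine_years(frames: dict[int, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-year survey tables and tag each row with its survey year.

    Questions differ between years, so a column that is missing in a
    given year is simply left empty (NaN) for that year's rows.
    """
    tagged = [df.assign(year=year) for year, df in sorted(frames.items())]
    return pd.concat(tagged, ignore_index=True, sort=False)


# Toy frames for illustration only (not real survey rows).
frames = {
    2018: pd.DataFrame({"country": ["India"]}),
    2017: pd.DataFrame({"country": ["Brazil", "India"], "tool": ["Python", "R"]}),
}
combined = combine_years(frames)
print(combined)
```

Keeping the year as an ordinary column makes later year-over-year comparisons a plain `groupby("year")`.
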
- Loaded the survey dataset into Python (Pandas)
- Removed duplicates and handled missing values
- Normalized column names and responses
- Applied label encoding for categorical variables
- Analyzed demographics (country, education, gender)
- Explored salary vs. experience distributions
- Identified most popular tools, languages, and platforms
- Compared trends across multiple years
- Created 12+ visualizations: pie, scatter, line, box, heatmap, etc.
- Built a summary dashboard of top 5 insights
- Exported a PDF report summarizing key findings
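
The cleaning and encoding steps listed above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function name `clean_survey`, the `"Unknown"` fill label, and the toy columns are assumptions made for illustration. It uses pandas category codes as a lightweight stand-in for scikit-learn's `LabelEncoder`.

```python
import pandas as pd


def clean_survey(df: pd.DataFrame, categorical_cols: list[str]) -> pd.DataFrame:
    """Illustrative cleaning pass over a survey table.

    `categorical_cols` must use the normalized (snake_case) column names.
    """
    out = df.copy()

    # Normalize column names: trim, lowercase, whitespace -> underscores.
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )

    # Normalize responses: trim stray whitespace so "India " equals "India".
    for col in categorical_cols:
        out[col] = out[col].str.strip()

    # Remove exact duplicate rows (after normalization).
    out = out.drop_duplicates().reset_index(drop=True)

    # Make missing answers explicit instead of leaving NaN.
    out[categorical_cols] = out[categorical_cols].fillna("Unknown")

    # Label-encode: each distinct answer gets an integer code,
    # assigned in sorted order of the answer labels.
    for col in categorical_cols:
        out[f"{col}_code"] = out[col].astype("category").cat.codes

    return out


# Toy rows for illustration only (not real survey data).
raw = pd.DataFrame(
    {
        " Country ": ["India", "India ", "Brazil", None],
        "Education Level": ["Master's", "Master's", "Bachelor's", "Doctoral"],
    }
)
cleaned = clean_survey(raw, ["country", "education_level"])
print(cleaned)
```

Writing the codes to new `*_code` columns keeps the original labels available for charts, where readable category names matter more than integers.
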
- Python dominates as the most widely used language
- Most respondents hold graduate or postgraduate degrees
- Salary distribution skews towards early-career professionals
- Machine learning libraries and frameworks such as TensorFlow and scikit-learn are widely adopted
- The global data science community is rapidly growing
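
Findings of this kind usually come from simple descriptive aggregations. The sketch below shows two such aggregations under stated assumptions: multi-select questions are stored as one column per option (a cell is filled when the respondent ticked that option), and salary is available as a numeric column. The column names, the helper `option_shares`, and all values are hypothetical toy data, not survey results.

```python
import pandas as pd


def option_shares(df: pd.DataFrame, option_cols: list[str]) -> pd.Series:
    """Share of respondents who selected each option of a multi-select question."""
    counts = df[option_cols].notna().sum()
    return (counts / len(df)).sort_values(ascending=False)


# Toy rows for illustration only (not real survey data).
survey = pd.DataFrame(
    {
        "lang_python": ["Python", "Python", None, "Python"],
        "lang_r": [None, "R", "R", None],
        "lang_sql": ["SQL", None, None, None],
        "experience": ["0-1", "0-1", "5-10", "5-10"],
        "salary_usd": [10000, 20000, 60000, 80000],
    }
)

# Which language option was ticked by the largest share of respondents?
shares = option_shares(survey, ["lang_python", "lang_r", "lang_sql"])
print(shares)

# Median salary per experience band, a basis for a salary vs. experience chart.
median_salary = survey.groupby("experience")["salary_usd"].median()
print(median_salary)
```

The median is used rather than the mean because salary answers tend to be skewed, and a few very high values would otherwise dominate each band.
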
- Cleaned Dataset → survey_cleaned.csv
- Python Notebook/Script → survey_analysis.ipynb / .py
- Insights Report → survey_report.pdf
- Visualizations → Charts & Dashboard

This project demonstrates how data cleaning and visualization can transform raw survey responses into actionable insights about the data science community. By analyzing the Kaggle survey, we gain a deeper understanding of the tools, skills, and aspirations shaping the future of data science.













