Exploratory Data Analysis (EDA) is a fundamental step in the data analysis process. It helps data analysts and data scientists understand the dataset’s structure, identify patterns, detect anomalies, and prepare data for further modeling or machine learning.
In this blog post, we’ll cover key EDA steps with hands-on examples using Python (Pandas, Seaborn, Matplotlib) and Power BI to visualize and analyze data.
1. Understanding the Data
Before diving into analysis, it’s essential to understand the dataset’s structure.
Example (Python): Checking Dataset Structure
import pandas as pd
# Load dataset
df = pd.read_csv("data.csv")
# View first 5 rows
print(df.head())
# Check column types and non-null counts (info() prints directly and returns None)
df.info()
# Summary statistics
print(df.describe())
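Beyond head(), info(), and describe(), it can help to collect the per-column facts in a single table. The following is a minimal sketch on a small made-up DataFrame. The column names, the values, and the `overview` helper are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Small made-up dataset standing in for data.csv (illustrative values only)
df_demo = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["Pune", "Delhi", "Delhi", None, "Delhi"],
    "income": [52000.0, 61000.0, 58000.0, 75000.0, 61000.0],
})

def overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, missing count, distinct count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })

summary = overview(df_demo)
print(summary)
```

A table like this makes it easy to spot columns that are mostly empty or that hold a single constant value before any plotting starts.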
Example (Power BI): Checking Dataset Structure
- Load Data: Import your dataset into Power BI.
- View Fields: Open the Data View to explore columns and data types.
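- Summary Statistics: Set a field's summarization (Average, Minimum, Maximum, Standard deviation) in a visual, or create a DAX calculated table (SUMMARIZECOLUMNS returns a table, so it cannot be defined as a measure):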
Summary_Stats = SUMMARIZECOLUMNS(
"Average", AVERAGE(Table[Column]),
"Min", MIN(Table[Column]),
"Max", MAX(Table[Column]),
"StdDev", STDEV.P(Table[Column])
)
2. Handling Missing Data
Missing values can bias summary statistics and cause many models to fail, so decide explicitly whether to impute them or remove the affected rows.
Example (Python): Handling Missing Values
# Check for missing values
print(df.isnull().sum())
# Fill missing numerical values with mean
df['Column'] = df['Column'].fillna(df['Column'].mean())
# Drop rows that still have missing values in any column
df = df.dropna()
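Mean imputation is sensitive to extreme values and does not apply to text columns. One common alternative, sketched here on made-up data, is to use the median for numeric columns and the most frequent value for categorical ones. The `price` and `segment` columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one extreme price (200.0) and one missing value per column
df_demo = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 11.0, 200.0],
    "segment": ["A", "B", "B", None, "B"],
})

# Share of missing values per column (0.0 to 1.0)
missing_share = df_demo.isna().mean()
print(missing_share)

df_imputed = df_demo.copy()

# Numeric column: the median is less affected by the extreme value than the mean
df_imputed["price"] = df_imputed["price"].fillna(df_imputed["price"].median())

# Categorical column: fill with the most frequent value (the mode)
most_common = df_imputed["segment"].mode().iloc[0]
df_imputed["segment"] = df_imputed["segment"].fillna(most_common)

print(df_imputed)
```

Here the mean of the observed prices is pulled far above the typical value by the single 200.0 entry, while the median stays close to the bulk of the data.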
Example (Power BI): Handling Missing Data
- Use Power Query Editor → “Transform” → “Replace Values”
- Use a DAX calculated column for mean imputation:
Column_Filled = IF(ISBLANK(Table[Column]), AVERAGE(Table[Column]), Table[Column])
3. Univariate Analysis (Single Variable Analysis)
Analyzing one variable at a time shows its distribution: center, spread, skew, and unusual values.
Example (Python): Visualizing Univariate Data
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
sns.histplot(df['Column'], bins=30, kde=True)
plt.show()
# Box Plot
sns.boxplot(x=df['Column'])
plt.show()
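The plots above can be complemented with a few numbers. The sketch below is a rough illustration on a synthetic right-skewed sample (standing in for df['Column'], which is an assumption): it reports quantiles and skewness, and shows how a log transform reduces the skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed sample (log-normal); not real data
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="Column")

# Quantiles describe the shape without assuming symmetry
quantiles = values.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# Positive skewness indicates a long right tail
skewness = values.skew()

# log1p compresses the right tail, which usually lowers the skewness
log_skewness = np.log1p(values).skew()

print(quantiles)
print(f"skewness: {skewness:.2f}, after log1p: {log_skewness:.2f}")
```

A mean well above the median, together with a clearly positive skewness, is a typical sign that a histogram on the raw scale will hide most of the detail near zero.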
Example (Power BI): Visualizing Univariate Data
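- Histogram: Create bins for the variable (right-click the field → New group, with Bin as the group type), then use a Clustered Column Chart with the binned field on the X-axis and a count of rows on the Y-axis.
- Box Plot: Import the Violin & Box Plot custom visual from the Power BI marketplace (AppSource).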
4. Bivariate Analysis (Two Variable Analysis)
This step helps analyze relationships between two variables.
Example (Python): Correlation Between Two Numerical Variables
# Scatter Plot
sns.scatterplot(x=df['Feature1'], y=df['Feature2'])
plt.show()
# Correlation
print(df[['Feature1', 'Feature2']].corr())
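The default corr() reports Pearson correlation, which measures linear association only. As a small sketch on synthetic data (the exponential relationship is an assumption chosen for illustration), comparing it with Spearman rank correlation shows how a perfectly monotonic but curved relationship can look weaker than it is:

```python
import numpy as np
import pandas as pd

# Synthetic data: Feature2 grows exponentially with Feature1 (monotonic, not linear)
x = np.arange(1, 51, dtype=float)
df_demo = pd.DataFrame({"Feature1": x, "Feature2": np.exp(x / 5.0)})

# Pearson: strength of the linear relationship
pearson = df_demo["Feature1"].corr(df_demo["Feature2"])

# Spearman: Pearson correlation computed on the ranks of the values
spearman = df_demo["Feature1"].rank().corr(df_demo["Feature2"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

When the two numbers differ a lot, the scatter plot is worth a second look for curvature or outliers before the relationship is summarized with a single coefficient.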
Example (Power BI): Scatter Plot & Correlation
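- Use the Scatter Chart visual to plot the relationship between the two fields.
- Calculate correlation with the Correlation coefficient quick measure (New quick measure → Mathematical operations). DAX has no built-in CORREL function, so a formula such as CORREL(Table[Feature1], Table[Feature2]) returns an error.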
5. Multivariate Analysis (Multiple Variables)
Looking at three or more variables together reveals interactions and redundancy that pairwise views can miss.
Example (Python): Pair Plot & PCA
# Pair Plot
sns.pairplot(df[['Feature1', 'Feature2', 'Feature3']])
plt.show()
# Principal Component Analysis (PCA); standardize the features first if they are on different scales
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df[['Feature1', 'Feature2', 'Feature3']])
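PCA is driven by variance, so a feature measured in large units can dominate the components. The sketch below is not scikit-learn's implementation; it is a NumPy-only illustration of PCA via singular value decomposition on synthetic data, assuming three made-up features where Feature2 is on a scale about 1000 times larger than the others.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# Synthetic features: Feature1 and Feature2 share a signal, Feature3 is independent
df_demo = pd.DataFrame({
    "Feature1": base + rng.normal(scale=0.1, size=n),
    "Feature2": 1000 * base + rng.normal(scale=100, size=n),
    "Feature3": rng.normal(size=n),
})

def explained_ratio(matrix: np.ndarray) -> np.ndarray:
    """Share of total variance captured by each principal component."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return singular_values**2 / np.sum(singular_values**2)

# Centered only: the large-scale feature dominates the first component
centered = (df_demo - df_demo.mean()).to_numpy()
ratio_raw = explained_ratio(centered)

# Standardized: every feature has mean 0 and standard deviation 1
standardized = ((df_demo - df_demo.mean()) / df_demo.std(ddof=0)).to_numpy()
ratio_scaled = explained_ratio(standardized)

# Project the standardized data onto the first two components
_, _, vt = np.linalg.svd(standardized, full_matrices=False)
scores_2d = standardized @ vt[:2].T

print("centered only:", ratio_raw.round(4))
print("standardized: ", ratio_scaled.round(4))
```

With scikit-learn, the same effect is usually achieved by applying StandardScaler before PCA.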
Example (Power BI): Multivariate Analysis
- Use a Table visual with multiple measures to compare variables side by side.
- Create a correlation matrix with a heatmap custom visual from the Power BI marketplace.
6. Detecting Outliers
Outliers can distort analysis and need careful handling.
Example (Python): Outlier Detection with IQR & Z-score
# Using IQR (Interquartile Range)
Q1 = df['Column'].quantile(0.25)
Q3 = df['Column'].quantile(0.75)
IQR = Q3 - Q1
# Filtering out outliers
df_no_outliers = df[(df['Column'] >= (Q1 - 1.5 * IQR)) & (df['Column'] <= (Q3 + 1.5 * IQR))]
# Using Z-score
from scipy.stats import zscore
df['Z_Score'] = zscore(df['Column'])
df_filtered = df[df['Z_Score'].abs() < 3]
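The two rules do not always agree. In small samples a single extreme value inflates the standard deviation, so its own Z-score can stay under 3. The sketch below uses ten made-up values to flag outliers with both rules instead of dropping rows right away; the population standard deviation (ddof=0) is used to match scipy's zscore default.

```python
import pandas as pd

# Ten made-up values with one obvious extreme (100)
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 100], name="Column", dtype=float)

# IQR rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier_iqr = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Z-score rule: flag values at least 3 standard deviations from the mean
z = (s - s.mean()) / s.std(ddof=0)
is_outlier_z = z.abs() >= 3

flags = pd.DataFrame({
    "value": s,
    "z": z.round(2),
    "iqr_flag": is_outlier_iqr,
    "z_flag": is_outlier_z,
})
print(flags)
```

Keeping the flags as columns makes it possible to inspect the suspicious rows before deciding whether they are errors or genuine extreme observations.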
Example (Power BI): Handling Outliers
- Create a new column for Z-score using DAX:
Z_Score = (Table[Column] - AVERAGE(Table[Column])) / STDEV.P(Table[Column])
- Keep only records with a Z-score between -3 and +3 by applying a visual-level filter on the new column.
7. Checking Correlations & Relationships
Example (Python): Heatmap for Correlations
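# Correlation heatmap (numeric_only=True skips text columns, which otherwise raise an error in pandas 2.x)
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')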
plt.show()
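With many columns a heatmap gets hard to read, so it can be useful to list the strongest pairs as well. This is a minimal sketch on synthetic data; the column names `a`, `b`, `c`, and `label` are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)

# Synthetic data: b is almost a multiple of a, c is independent, label is text
df_demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=n),
    "c": rng.normal(size=n),
    "label": ["x", "y", "z"] * 100,
})

# Correlation matrix over numeric columns only
corr = df_demo.corr(numeric_only=True)

# One entry per unordered pair of columns (no self-pairs, no mirrored duplicates)
pairs = pd.Series({(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)})

# Order pairs by absolute correlation, strongest first
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```

Pairs near +1 or -1 often point to redundant features, which is worth knowing before fitting models that are sensitive to multicollinearity.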
Example (Power BI): Correlation Heatmap
- Use the Matrix visual with conditional formatting (background color) to mimic a heatmap.
- Alternatively, import a heatmap custom visual from the Power BI marketplace.
8. Feature Engineering & Data Transformation
Feature engineering creates new variables from existing ones so that models can pick up patterns the raw columns do not expose.
Example (Python): Creating New Features
# Creating a new column
df['New_Feature'] = df['Feature1'] * df['Feature2']
# Encoding categorical variables
df = pd.get_dummies(df, columns=['Category_Column'], drop_first=True)
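A few other transformations come up often alongside products and dummy variables. The sketch below, on a made-up four-row DataFrame, adds a ratio feature that guards against division by zero and a log-transformed feature, then applies the same one-hot encoding. The `Ratio` and `Log_Feature1` column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data reusing the placeholder column names from this post
df_demo = pd.DataFrame({
    "Feature1": [2.0, 4.0, 6.0, 8.0],
    "Feature2": [1.0, 2.0, 0.0, 4.0],
    "Category_Column": ["A", "B", "C", "A"],
})

# Ratio feature: treat a zero denominator as missing instead of producing inf
df_demo["Ratio"] = df_demo["Feature1"] / df_demo["Feature2"].replace(0, np.nan)

# Log transform: log1p(x) = log(1 + x), defined at x = 0
df_demo["Log_Feature1"] = np.log1p(df_demo["Feature1"])

# One-hot encoding; drop_first=True omits the first category (A) as the baseline
encoded = pd.get_dummies(df_demo, columns=["Category_Column"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first category avoids perfectly collinear dummy columns, which matters for linear models but is usually unnecessary for tree-based ones.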
Example (Power BI): Creating New Features
Use DAX Calculated Columns:
New_Column = Table[Feature1] * Table[Feature2]
9. Detecting Data Quality Issues
Example (Python): Handling Duplicates & Inconsistencies
# Checking duplicates
print(df.duplicated().sum())
# Removing duplicates
df = df.drop_duplicates()
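Exact duplicate checks miss rows that differ only in case or stray whitespace. As a rough sketch on made-up records (the `customer` and `city` columns are assumptions), normalizing text before the check exposes these near-duplicates:

```python
import pandas as pd

# Made-up records: the same customers entered with different case and spacing
df_demo = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", "BOB", "Carol"],
    "city": ["Pune", "Pune", "Delhi", "Delhi", "Pune"],
})

# Exact comparison treats "Alice" and "alice " as different values
exact_dupes = df_demo.duplicated().sum()

# Normalize text: trim whitespace and lowercase before comparing
clean = df_demo.copy()
clean["customer"] = clean["customer"].str.strip().str.lower()

normalized_dupes = clean.duplicated().sum()
deduped = clean.drop_duplicates()

print(f"exact duplicates: {exact_dupes}, after normalization: {normalized_dupes}")
print(deduped)
```

Which columns to normalize, and how aggressively, depends on the data; names and free-text categories are the usual candidates.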
Example (Power BI): Handling Duplicates
- In Power Query Editor, select the relevant columns and choose Home → Remove Rows → Remove Duplicates.
Tools for EDA
- Python: Pandas, NumPy, Seaborn, Matplotlib, Plotly
- Power BI & Tableau: Visual Analytics
- Excel: Pivot Tables, Charts
Conclusion
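EDA is a critical step before applying machine learning or predictive modeling. Working through these steps (structure, missing values, distributions, relationships, outliers, and data quality) surfaces problems early and keeps later modeling decisions grounded in the data.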
Want a hands-on Power BI dashboard for EDA? Let me know in the comments!