Real Estate Market Analysis Project

Project Overview

Data Science

Project Objectives

Analyze pricing patterns across 2.2+ million real estate records
Build predictive models to estimate property values
Identify key factors that influence real estate prices
Provide market insights for different geographic regions
Compare machine learning model performance for price prediction

Property Records: 2.2M+
Features Analyzed: 12
Best Model R² Score: 45.7%
Mean Absolute Error: $256K

Methodology

ML Pipeline

📊 Data Processing

Dataset Characteristics

Original dataset contained 2,226,382 records with 12 features including price, location, property details, and historical sales data.

Significant missing data in key features: 25% missing house_size, 23% missing bathrooms, 22% missing bedrooms.
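A minimal sketch of how the missing-value shares above could be measured and how incomplete rows might be filtered out. It assumes a pandas DataFrame with the column names mentioned above. The sample rows are made up, and dropping every row that lacks a key feature is an assumption, since the exact filtering rule is not stated here.

```python
import pandas as pd

# Tiny made-up sample; column names follow the features named above.
df = pd.DataFrame({
    "price": [250000, 410000, 180000, 920000],
    "house_size": [1400.0, None, 1100.0, 3200.0],
    "bedrooms": [3.0, 4.0, None, 5.0],
    "bathrooms": [2.0, 3.0, 1.0, None],
})

# Share of missing values per column, as a percentage.
missing_pct = df.isna().mean().mul(100).round(1)

# Assumed filter: keep only rows where all key features are present.
key_features = ["house_size", "bedrooms", "bathrooms"]
complete = df.dropna(subset=key_features)

print(missing_pct)
print(f"{len(complete)} of {len(df)} rows kept")
```

Filtering this way trades sample size for completeness; imputing the missing values would be an alternative, with its own risks for a price model.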

🔧 Feature Engineering

New Features Created

Price per square foot calculation for standardized comparison

Total rooms combination (bedrooms + bathrooms)

Property category classification (Budget, Mid-range, High-end, Luxury)
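The three engineered features above could be derived roughly as follows. This is a sketch on made-up rows, assuming pandas. The price cut-offs used for the four categories are hypothetical, because the project's actual thresholds are not given.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "price": [150000, 420000, 780000, 2400000],
    "house_size": [1000, 2000, 2600, 4800],
    "bedrooms": [2, 3, 4, 6],
    "bathrooms": [1, 2, 3, 5],
})

# Price per square foot for standardized comparison.
df["price_per_sqft"] = df["price"] / df["house_size"]

# Total rooms as bedrooms plus bathrooms.
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]

# Hypothetical cut-offs; the real category boundaries are not stated.
bins = [0, 200_000, 600_000, 1_500_000, float("inf")]
labels = ["Budget", "Mid-range", "High-end", "Luxury"]
df["category"] = pd.cut(df["price"], bins=bins, labels=labels)

print(df[["price_per_sqft", "total_rooms", "category"]])
```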

🤖 Model Training

Machine Learning Approach

Compared Linear Regression vs Random Forest models

Used 80/20 train-test split on 936,955 properties from top 10 states

Feature importance analysis to identify key pricing factors
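The training steps above can be sketched with scikit-learn as shown below. The data is synthetic (generated in the snippet, not the project's records), the two-feature setup is a simplification, and the random seeds are arbitrary. Only the 80/20 split and the default Random Forest configuration mirror what is described here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a mildly non-linear price signal plus noise.
rng = np.random.default_rng(0)
n = 2000
house_size = rng.uniform(600, 4000, n)
bathrooms = rng.integers(1, 5, n)
price = (150 * house_size + 40_000 * bathrooms
         + 0.02 * house_size**2 + rng.normal(0, 50_000, n))

feature_names = ["house_size", "bathrooms"]
X = np.column_stack([house_size, bathrooms])

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),  # default hyperparameters
}

# Fit each model and score it on the held-out 20%.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }

# Impurity-based importances from the fitted forest; they sum to 1.
importances = dict(zip(feature_names, models["Random Forest"].feature_importances_))

print(results)
print(importances)
```

Scores on this synthetic data say nothing about the real dataset; the point is only the shape of the comparison.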

Analysis Results

Findings

🏠 Market Distribution

Key Findings

Mid-range properties (48%) dominate the market with 761K+ units

Luxury segment represents only 9% of inventory

Strong middle-market presence indicates healthy market diversity

📍 Geographic Distribution

Market Concentration

Top 3 states (CA, FL, TX) account for 26% of all properties

California leads with 190K+ properties, followed by Florida (182K) and Texas (158K)

🔍 Feature Correlations

Price Correlation Analysis

Bathrooms and house size show the strongest correlation with price (0.49 each)

Price per square foot correlation: 0.42

Bedroom count has weaker correlation (0.27)
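For reference, the Pearson coefficient behind figures like these can be computed as in the sketch below. The helper name `pearson` and the sample values are illustrative and not taken from the project.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only.
prices = [200_000, 350_000, 300_000, 600_000, 450_000]
bathrooms = [1, 2, 2, 4, 2]
r = pearson(prices, bathrooms)
print(f"correlation: {r:.2f}")
```

The coefficient ranges from -1 to 1 and measures linear association only, so a moderate value such as 0.49 leaves room for non-linear effects that a tree-based model can pick up.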

⚡ Model Performance

Algorithm Comparison

Random Forest outperformed Linear Regression

R²: 45.7% vs 39.8% (about 15% relative improvement, or 5.9 percentage points)

MAE: $256K vs $310K (about 17% lower error)
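The percentages above are relative changes against the Linear Regression baseline. The short check below reproduces them from the reported scores; the helper `relative_gain` is a hypothetical name introduced for illustration.

```python
def relative_gain(new, old, higher_is_better=True):
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    change = (new - old) / old
    return change if higher_is_better else -change

# R²: higher is better.
r2_gain = relative_gain(0.457, 0.398)

# MAE: lower is better, so a drop counts as a gain.
mae_gain = relative_gain(256_000, 310_000, higher_is_better=False)

# Absolute R² gap in percentage points, for comparison.
r2_points = (0.457 - 0.398) * 100

print(f"R² relative gain: {r2_gain:.1%}")    # 14.8%
print(f"MAE relative gain: {mae_gain:.1%}")  # 17.4%
print(f"R² gap: {r2_points:.1f} points")     # 5.9 points
```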

Market Insights

Analysis

🌟 Feature Importance Analysis

Primary Value Drivers

House size accounts for 44% of the Random Forest's feature importance, making it the most important factor

Bathroom count contributes 33%, the second most important

Together, these two features account for 77% of the model's total feature importance

Geographic Premiums

California location carries about 8% of the model's feature importance, the largest share of any state

New York location contributes about 3%

Other states show minimal individual impact on pricing

📊 State-by-State Performance Analysis

State	Property Count	Median Price	Average Price	Price Gap (Average vs Median)	Market Type
California	190,055	$699,000	$953,475	+36%	Premium
Washington	52,243	$550,000	$692,372	+26%	High-Value
New York	67,081	$389,000	$790,544	+103%	Luxury
Florida	182,543	$369,000	$575,310	+56%	Growth
Arizona	56,124	$419,900	$530,975	+26%	Stable
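The Price Gap column is consistent with the average price expressed as a percentage above the median. The sketch below recomputes it from the median and average values in the table; the function name `price_gap_pct` is illustrative.

```python
# (median price, average price) per state, copied from the table above.
rows = {
    "California": (699_000, 953_475),
    "Washington": (550_000, 692_372),
    "New York": (389_000, 790_544),
    "Florida": (369_000, 575_310),
    "Arizona": (419_900, 530_975),
}

def price_gap_pct(median, average):
    """Average price as a percentage above the median, rounded to a whole percent."""
    return round((average - median) / median * 100)

gaps = {state: price_gap_pct(m, a) for state, (m, a) in rows.items()}

for state, gap in gaps.items():
    print(f"{state}: +{gap}%")
```

A large positive gap means a small number of very expensive properties pull the average well above the typical (median) price.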

Data Science Insights

Model Limitations: An R² of 45.7% means the model explains less than half of price variation, so additional features are needed for better predictions

Geographic Patterns: Large gaps between median and average prices suggest significant high-priced outliers in premium markets (NY: +103%, FL: +56%)

Feature Engineering: House size and bathroom count are the most predictive features, so model improvement should focus on architectural features

Technical Implementation

Code

Key Technical Details

Data preprocessing: missing values in key features (22-25% per feature) handled by filtering records
One-hot encoding applied to categorical state variables for ML compatibility
Random Forest hyperparameters: default configuration with feature importance extraction
Validation approach: single 80/20 train-test split on 937K filtered records (no k-fold cross-validation)
Performance metrics: R² score and Mean Absolute Error for model comparison
Feature correlation analysis using Pearson correlation coefficient
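As an illustration of the one-hot encoding step listed above, the sketch below uses `pandas.get_dummies` on a made-up frame. The column name `state` and the `state_` prefix are assumptions about how the data is laid out.

```python
import pandas as pd

# Made-up rows; "state" is the assumed name of the categorical column.
df = pd.DataFrame({
    "state": ["California", "Florida", "Texas", "California"],
    "house_size": [1800, 1500, 2200, 950],
})

# One indicator column per state, encoded as 0/1 integers.
encoded = pd.get_dummies(df, columns=["state"], prefix="state", dtype=int)

print(encoded)
```

Each state becomes its own 0/1 column, which is why per-state feature importances can be read off the fitted Random Forest.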