Real Estate Market Analysis Project

Machine Learning Analysis of 2.2M+ Property Records Across the United States

Dataset Size: 2.2M+ Records
Model R² Score: 45.7%
States Analyzed: 15
Analysis Type: Predictive ML
Tags: Python, Pandas, Scikit-learn, Random Forest, Data Visualization

Project Overview

Data Science

Project Objectives

  • Analyze pricing patterns across more than 2.2 million real estate records
  • Build predictive models to estimate property values
  • Identify key factors that influence real estate prices
  • Provide market insights for different geographic regions
  • Compare machine learning model performance for price prediction

Property Records: 2.2M+
Features Analyzed: 12
Best Model R² Score: 45.7%
Mean Absolute Error: $256K

Methodology

ML Pipeline

📊 Data Processing

Dataset Characteristics

The original dataset contains 2,226,382 records with 12 features, including price, location, property details, and historical sales data.

Key features have substantial missing data: house_size is missing in 25% of records, bathrooms in 23%, and bedrooms in 22%.
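
A minimal sketch of how per-column missing-value shares like these can be measured with pandas. The column names follow the ones mentioned above; the tiny frame, the `missing_share` helper, and the `dropna` filtering step are illustrative assumptions, not the project's actual code or data.

```python
import numpy as np
import pandas as pd

# Made-up example rows; only the column names come from the write-up.
listings = pd.DataFrame({
    "price": [250_000, 410_000, 1_200_000, 95_000],
    "bedrooms": [3, np.nan, 5, 2],
    "bathrooms": [2, 2, np.nan, 1],
    "house_size": [1500, np.nan, 4200, np.nan],
})


def missing_share(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


missing_pct = missing_share(listings)
print(missing_pct.round(1))

# One possible filtering strategy (an assumption): keep only rows that are
# complete in the key features used for modeling.
complete = listings.dropna(subset=["bedrooms", "bathrooms", "house_size"])
print(f"{len(complete)} of {len(listings)} rows are complete")
```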

🔧 Feature Engineering

New Features Created

Price per square foot calculation for standardized comparison

Total rooms combination (bedrooms + bathrooms)

Property category classification (Budget, Mid-range, High-end, Luxury)
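
The three engineered features above could be derived roughly as follows. This is a sketch under stated assumptions: the column names (`price`, `house_size`, `bedrooms`, `bathrooms`) and especially the price cut-offs for the four categories are hypothetical, since the write-up does not state the thresholds that were used.

```python
import pandas as pd


def add_engineered_features(df):
    """Add price per square foot, total rooms, and a price-based category."""
    out = df.copy()
    out["price_per_sqft"] = out["price"] / out["house_size"]
    out["total_rooms"] = out["bedrooms"] + out["bathrooms"]
    # Hypothetical cut-offs in dollars; the real thresholds are not given.
    out["category"] = pd.cut(
        out["price"],
        bins=[0, 200_000, 500_000, 1_000_000, float("inf")],
        labels=["Budget", "Mid-range", "High-end", "Luxury"],
    )
    return out


# Made-up sample rows for illustration.
sample = pd.DataFrame({
    "price": [150_000, 350_000, 750_000, 2_000_000],
    "bedrooms": [2, 3, 4, 6],
    "bathrooms": [1, 2, 3, 5],
    "house_size": [1000, 1750, 2500, 5000],
})
features = add_engineered_features(sample)
print(features[["price_per_sqft", "total_rooms", "category"]])
```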

🤖 Model Training

Machine Learning Approach

Compared Linear Regression vs Random Forest models

Used 80/20 train-test split on 936,955 properties from top 10 states

Feature importance analysis to identify key pricing factors
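
The comparison described above can be sketched with scikit-learn as shown below. The data here is synthetic and generated only for illustration, so the scores it prints say nothing about the real dataset; the `random_state` values and the `evaluate` helper are likewise assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the project's records).
rng = np.random.default_rng(0)
n_rows = 2_000
house_size = rng.uniform(500, 5000, n_rows)
bathrooms = rng.integers(1, 6, n_rows)
price = 100 * house_size + 40_000 * bathrooms**2 + rng.normal(0, 50_000, n_rows)
X = np.column_stack([house_size, bathrooms])

# 80/20 train-test split, as described in the write-up.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)


def evaluate(model):
    """Fit on the training split and score on the held-out split."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "r2": r2_score(y_test, predictions),
        "mae": mean_absolute_error(y_test, predictions),
    }


results = {
    "linear_regression": evaluate(LinearRegression()),
    "random_forest": evaluate(RandomForestRegressor(random_state=42)),
}
for name, scores in results.items():
    print(f"{name}: R2={scores['r2']:.3f}, MAE={scores['mae']:,.0f}")
```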

Analysis Results

Findings

🏠 Market Distribution

Key Findings

Mid-range properties (48%) dominate the market with 761K+ units

Luxury segment represents only 9% of inventory

Strong middle-market presence indicates healthy market diversity

📍 Geographic Distribution

Market Concentration

Top 3 states (CA, FL, TX) account for 26% of all properties

California leads with 190K+ properties, followed by Florida (182K) and Texas (158K)

🔍 Feature Correlations

Price Correlation Analysis

Bathrooms and house size show the strongest correlation with price (0.49 each)

Price per square foot correlation with price: 0.42

Bedroom count has a weaker correlation with price (0.27)
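
For reference, Pearson correlations against price can be computed in pandas along these lines. The frame below is made up for illustration (its values will not reproduce the coefficients reported above), and the column names are assumptions.

```python
import pandas as pd

# Made-up rows; house_size is exactly proportional to price in this toy
# example, so its correlation with price comes out at 1.0.
homes = pd.DataFrame({
    "price": [100_000, 200_000, 300_000, 400_000, 500_000],
    "house_size": [1000, 2000, 3000, 4000, 5000],
    "bedrooms": [2, 3, 3, 4, 3],
})

# Pearson correlation of every feature with price, strongest first.
price_corr = (
    homes.corr(method="pearson")["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(price_corr.round(2))
```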

⚡ Model Performance

Algorithm Comparison

Random Forest outperformed Linear Regression

R² improvement: 45.7% vs 39.8% (5.9 percentage points, about 15% better in relative terms)

MAE improvement: $256K vs $310K (about 17% lower error)
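
The percentages in parentheses are relative changes between the two models. A short check of that arithmetic, using only the figures reported above (the `relative_change` helper is just an illustrative name):

```python
def relative_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# R² values in percent: Random Forest vs Linear Regression.
r2_gain = relative_change(45.7, 39.8)
# MAE in thousands of dollars: Random Forest vs Linear Regression.
mae_change = relative_change(256, 310)

print(f"R2 relative gain: {r2_gain:.1f}%")
print(f"MAE relative change: {mae_change:.1f}%")
```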

Market Insights

Analysis

🌟 Feature Importance Analysis

Primary Value Drivers

House size accounts for 44% of the Random Forest's feature importance, making it the most important factor

Bathroom count contributes 33%, the second most important

Together, these two features account for 77% of the model's total feature importance
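
Figures like these typically come from a fitted forest's `feature_importances_` attribute, whose values sum to 1 across all features. The sketch below shows the extraction on synthetic data; the feature names, the `noise` column, and the data-generating formula are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which house_size dominates the target by construction.
rng = np.random.default_rng(1)
n_rows = 1_000
inputs = pd.DataFrame({
    "house_size": rng.uniform(500, 5000, n_rows),
    "bathrooms": rng.integers(1, 6, n_rows),
    "noise": rng.normal(size=n_rows),
})
target = (
    200 * inputs["house_size"]
    + 50_000 * inputs["bathrooms"]
    + rng.normal(0, 20_000, n_rows)
)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(inputs, target)

# Impurity-based importances, one per feature, ranked from largest to smallest.
importances = pd.Series(
    forest.feature_importances_, index=inputs.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Because the importances are shares of a total, they describe how much the model relies on each feature rather than how much of the price variation each feature explains.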

Geographic Premiums

California location adds 8% premium to property values

New York contributes 3% location premium

Other states show minimal individual impact on pricing

📊 State-by-State Performance Analysis

State        Property Count   Median Price   Average Price   Price Gap   Market Type
California   190,055          $699,000       $953,475        +36%        Premium
Washington   52,243           $550,000       $692,372        +26%        High-Value
New York     67,081           $389,000       $790,544        +103%       Luxury
Florida      182,543          $369,000       $575,310        +56%        Growth
Arizona      56,124           $419,900       $530,975        +26%        Stable
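
The Price Gap column is consistent with the average price expressed relative to the median price. A quick check using the figures from the table (the `price_gap_pct` helper name is illustrative):

```python
# (median price, average price) per state, taken from the table above.
state_prices = {
    "California": (699_000, 953_475),
    "Washington": (550_000, 692_372),
    "New York": (389_000, 790_544),
    "Florida": (369_000, 575_310),
    "Arizona": (419_900, 530_975),
}


def price_gap_pct(median, average):
    """Percentage by which the average price exceeds the median price."""
    return (average - median) / median * 100


gaps = {
    state: round(price_gap_pct(median, average))
    for state, (median, average) in state_prices.items()
}
print(gaps)
```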

Data Science Insights

Model Limitations: An R² of 45.7% means the model explains less than half of the price variation; additional features are needed for better predictions

Geographic Patterns: Large median-to-average price gaps suggest high-priced outliers are pulling the average above the median in premium markets (NY: +103%, FL: +56%)

Feature Engineering: House size and bathroom count are the most predictive features; focus on architectural features for model improvement

Technical Implementation

Code

Key Technical Details

  • Data preprocessing handled the 22-25% missing values in house_size, bathrooms, and bedrooms by filtering records
  • One-hot encoding applied to categorical state variables for ML compatibility
  • Random Forest hyperparameters: default configuration with feature importance extraction
  • Validation approach: single 80/20 train-test split on 937K filtered records (no k-fold cross-validation)
  • Performance metrics: R² score and Mean Absolute Error for model comparison
  • Feature correlation analysis using Pearson correlation coefficient
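
As an illustration of the one-hot encoding step listed above, `pandas.get_dummies` turns a categorical state column into one indicator column per state. The frame and column names below are made up for the example and are not taken from the project code.

```python
import pandas as pd

# Made-up rows with a categorical state column.
raw = pd.DataFrame({
    "state": ["California", "Florida", "Texas", "California"],
    "house_size": [1800, 1500, 2200, 2600],
})

# Replace the state column with 0/1 indicator columns, one per state.
encoded = pd.get_dummies(raw, columns=["state"], prefix="state", dtype=int)
print(encoded)
```

Each row ends up with exactly one state indicator set to 1, which lets models such as Linear Regression and Random Forest consume the state as numeric input.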