Machine Learning Analysis of 2.2M+ Property Records Across the United States
Original dataset contained 2,226,382 records with 12 features including price, location, property details, and historical sales data.
Significant missing data in key features: 25% missing house_size, 23% missing bathrooms, 22% missing bedrooms.
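The missing-data shares above are the fraction of null values per column. A minimal sketch of that calculation, assuming the records sit in a pandas DataFrame; the column names `bedrooms` and `bathrooms` and all values in the toy frame are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the property records (values are made up for illustration).
df = pd.DataFrame({
    "price": [350000, 425000, 199000, 780000],
    "bedrooms": [3, None, 2, 4],
    "bathrooms": [2, 2, None, 3],
    "house_size": [1500, None, 900, None],
})

# Share of missing values per column, expressed as a percentage.
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

On the full dataset the same expression would report the per-feature percentages quoted above.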
Price per square foot calculation for standardized comparison
Total rooms combination (bedrooms + bathrooms)
Property category classification (Budget, Mid-range, High-end, Luxury)
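The three engineered features listed above could be derived along these lines. This is a sketch under stated assumptions: the column names are hypothetical, and the price cutoffs for the four categories are invented for illustration because the source does not state them.

```python
import pandas as pd

# Toy records; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "price": [150000, 420000, 900000, 2500000],
    "house_size": [1000, 1750, 2400, 5000],  # square feet
    "bedrooms": [2, 3, 4, 6],
    "bathrooms": [1, 2, 3, 5],
})

# Price per square foot for standardized comparison.
df["price_per_sqft"] = df["price"] / df["house_size"]

# Total rooms as bedrooms + bathrooms.
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]

# Property category. These cutoffs are ASSUMED; the source does not give them.
bins = [0, 200_000, 750_000, 1_500_000, float("inf")]
labels = ["Budget", "Mid-range", "High-end", "Luxury"]
df["category"] = pd.cut(df["price"], bins=bins, labels=labels)

print(df[["price_per_sqft", "total_rooms", "category"]])
```

Rows with a missing or zero `house_size` would need to be filtered or imputed before the division.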
Compared Linear Regression vs Random Forest models
Used 80/20 train-test split on 936,955 properties from top 10 states
Feature importance analysis to identify key pricing factors
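The modeling steps above can be sketched with scikit-learn. The data here is synthetic and only stands in for the real records; the feature set, the hyperparameters, and the random seeds are assumptions, so the printed scores will not match the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real records): price depends on house size
# and, non-linearly, on bathroom count, plus noise.
rng = np.random.default_rng(0)
n = 2000
house_size = rng.uniform(500, 5000, n)
bathrooms = rng.integers(1, 6, n)
price = 150 * house_size + 40_000 * bathrooms**2 + rng.normal(0, 50_000, n)
X = np.column_stack([house_size, bathrooms])
feature_names = ["house_size", "bathrooms"]

# 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }
    print(f"{name}: R2={results[name]['r2']:.3f}, MAE=${results[name]['mae']:,.0f}")

# Impurity-based feature importances from the fitted forest (they sum to 1).
importances = dict(zip(feature_names, models["Random Forest"].feature_importances_))
print(importances)
```

Fixing `random_state` in both the split and the forest keeps the comparison reproducible between runs.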
Mid-range properties (48%) dominate the market with 761K+ units
Luxury segment represents only 9% of inventory
Strong middle-market presence indicates healthy market diversity
Top 3 states (CA, FL, TX) account for 26% of all properties
California leads with 190K+ properties, followed by Florida (182K) and Texas (158K)
Bathrooms and house size show the strongest correlation with price (0.49 each)
Price per square foot correlation with price: 0.42
Bedroom count has a weaker correlation with price (0.27)
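Correlations like the ones above are typically Pearson coefficients between each feature and price. A minimal sketch, assuming a pandas DataFrame with hypothetical column names; the toy values are made up and will not reproduce the reported coefficients.

```python
import pandas as pd

# Toy records; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "price": [200000, 350000, 500000, 650000, 900000],
    "house_size": [900, 1600, 1500, 2600, 3000],
    "bathrooms": [1, 2, 2, 3, 3],
    "bedrooms": [2, 3, 2, 4, 3],
})

# Pearson correlation of each feature with price, strongest first.
corr_with_price = (
    df.corr(method="pearson")["price"].drop("price").sort_values(ascending=False)
)
print(corr_with_price.round(2))
```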
Random Forest outperformed Linear Regression
R²: 45.7% vs 39.8% (5.9 percentage points higher, about 15% relative improvement)
MAE: $256K vs $310K (about 17% lower error)
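The percentages in parentheses are relative changes between the two models. The arithmetic can be checked with a few lines; the helper name `relative_change` is introduced here for illustration only.

```python
def relative_change(new: float, old: float) -> float:
    """Return the relative change from old to new, as a percentage."""
    return (new - old) / old * 100


# Figures reported above.
r2_lr, r2_rf = 39.8, 45.7            # R² in percent
mae_lr, mae_rf = 310_000, 256_000    # MAE in dollars

r2_gain = relative_change(r2_rf, r2_lr)            # higher R² is better
mae_reduction = -relative_change(mae_rf, mae_lr)   # lower MAE is better

print(f"R2 relative gain: {r2_gain:.1f}%")              # 14.8%
print(f"MAE relative reduction: {mae_reduction:.1f}%")  # 17.4%
```

In absolute terms the R² gain is 5.9 percentage points, which is why the relative figure should be read alongside the raw scores.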
House size accounts for 44% of the Random Forest's feature importance - most important factor
Bathroom count contributes 33% - second most important
Together, these two features account for 77% of the model's feature importance
California location accounts for about 8% of feature importance, the largest location effect
New York location contributes about 3%
Other states show minimal individual impact on pricing
State | Property Count | Median Price | Average Price | Price Gap (Average vs Median) | Market Type |
---|---|---|---|---|---|
California | 190,055 | $699,000 | $953,475 | +36% | Premium |
Washington | 52,243 | $550,000 | $692,372 | +26% | High-Value |
New York | 67,081 | $389,000 | $790,544 | +103% | Luxury |
Florida | 182,543 | $369,000 | $575,310 | +56% | Growth |
Arizona | 56,124 | $419,900 | $530,975 | +26% | Stable |
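
The Price Gap column is the percentage by which the average price exceeds the median. A short sketch that recomputes it from the table's own figures; the function name `price_gap_pct` is introduced here for illustration.

```python
# (median price, average price) per state, copied from the table above.
prices = {
    "California": (699_000, 953_475),
    "Washington": (550_000, 692_372),
    "New York": (389_000, 790_544),
    "Florida": (369_000, 575_310),
    "Arizona": (419_900, 530_975),
}


def price_gap_pct(median: float, average: float) -> int:
    """Percentage by which the average exceeds the median, rounded."""
    return round((average - median) / median * 100)


gaps = {state: price_gap_pct(med, avg) for state, (med, avg) in prices.items()}
for state, gap in gaps.items():
    print(f"{state}: +{gap}%")
```

A large positive gap means the average is pulled up by a small number of very expensive listings, which is why the median is the more robust summary for skewed markets.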
Model Limitations: An R² of 45.7% means the model explains less than half of the price variance - additional features are needed for better predictions
Geographic Patterns: Large median-to-average price gaps suggest significant outliers in premium markets (NY: +103%, FL: +56%)
Feature Engineering: House size and bathroom count are most predictive - focus on architectural features for model improvement