Description
Hi there,
In the California Housing documentation, one reads:
The 8 input features are the following:
MedInc: median income in block group
HouseAge: median house age in block group
AveRooms: average number of rooms per household
AveBedrms: average number of bedrooms per household
Population: block group population
AveOccup: average number of household members
Latitude: block group latitude
Longitude: block group longitude
However, loading the dataset does not yield anything resembling that order, and the correct order is weirdly difficult to find (the source website seems to be unresponsive as well...)
Code to reproduce:
import keras
(train_data, train_targets), (test_data, test_targets) = keras.datasets.california_housing.load_data(version="small")
# Longitude, Latitude, HouseAge, Population, MedInc seems ok, but not
# AveBedrms, AveRooms, AveOccup...
column_names = [
'Longitude', 'Latitude', 'HouseAge', 'AveBedrms', 'AveRooms',
'Population','AveOccup', 'MedInc',
]
for name in column_names:
    print(f"{name:>12}", end="")
print()
print("-" * 96)
for row in range(20):
    for col in range(8):
        x = train_data[row, col]
        print(f"{x:12.4f}", end="")
    print()

Output:
   Longitude    Latitude    HouseAge   AveBedrms    AveRooms  Population    AveOccup      MedInc
------------------------------------------------------------------------------------------------
   -122.2400     37.7300     21.0000   7031.0000   1249.0000   2930.0000   1235.0000      4.5213
   -122.2800     37.8500     48.0000   2063.0000    484.0000   1054.0000    466.0000      2.2625
   -122.2900     37.8200      2.0000    158.0000     43.0000     94.0000     57.0000      2.5625
   -122.2900     37.8100     46.0000    935.0000    297.0000    582.0000    277.0000      0.7286
   -122.1800     37.7600     37.0000   1575.0000    358.0000    933.0000    320.0000      2.2917
   -122.2300     37.7900     48.0000   1696.0000    396.0000   1481.0000    343.0000      2.0375
   -122.2800     37.8400     52.0000    729.0000    160.0000    395.0000    155.0000      1.6875
   -122.2800     37.8900     52.0000   2315.0000    408.0000    835.0000    369.0000      4.5893
   -122.2500     37.8100     29.0000   4656.0000   1414.0000   2304.0000   1250.0000      2.4912
   -122.2200     37.8100     52.0000   2927.0000    402.0000   1021.0000    380.0000      8.1564
   -122.2700     37.7700     52.0000   1710.0000    481.0000    849.0000    457.0000      2.7115
   -122.2700     37.8800     52.0000   3360.0000    648.0000   1232.0000    621.0000      4.2813
   -122.1800     37.7700     27.0000    909.0000    236.0000    396.0000    157.0000      2.0786
   -122.2600     37.8800     52.0000   2255.0000    410.0000    823.0000    377.0000      5.7979
   -122.1700     37.7400     46.0000    769.0000    183.0000    693.0000    178.0000      2.2500
   -122.2900     37.8700     50.0000   1829.0000    536.0000   1129.0000    516.0000      2.6684
   -122.2700     37.8700     35.0000   3218.0000   1108.0000   1675.0000   1000.0000      1.7464
   -122.2100     37.8000     38.0000   2254.0000    535.0000    951.0000    487.0000      3.0812
   -122.2400     37.8000     52.0000    996.0000    228.0000    731.0000    228.0000      2.2697
   -122.2800     37.8700     52.0000   1233.0000    300.0000    571.0000    292.0000      2.2788
Some of those columns don't look quite like what you get here either. For instance, four columns have numbers in the hundreds or thousands (fine for MedInc and Population, but not for AveBedrms, AveRooms or AveOccup)...
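One way to test a guess about those four large-valued columns is sketched below. It assumes, without confirmation from the Keras documentation, that columns 3 to 6 hold raw block-group totals (total rooms, total bedrooms, population, households) rather than per-household averages. The names `ASSUMED_COLUMNS` and `derive_averages` are hypothetical, and the three rows are copied from the output above. If the guess is right, dividing by the household count should give averages of a plausible size.

```python
# First three rows of the printed output above, copied verbatim.
rows = [
    [-122.24, 37.73, 21.0, 7031.0, 1249.0, 2930.0, 1235.0, 4.5213],
    [-122.28, 37.85, 48.0, 2063.0, 484.0, 1054.0, 466.0, 2.2625],
    [-122.29, 37.82, 2.0, 158.0, 43.0, 94.0, 57.0, 2.5625],
]

# ASSUMPTION (not confirmed by the documentation): columns 3-6 are raw
# block-group totals, not per-household averages.
ASSUMED_COLUMNS = [
    "Longitude", "Latitude", "HouseAge", "TotalRooms",
    "TotalBedrooms", "Population", "Households", "MedInc",
]


def derive_averages(row):
    """Compute per-household averages from the assumed raw totals."""
    record = dict(zip(ASSUMED_COLUMNS, row))
    households = record["Households"]
    return {
        "AveRooms": record["TotalRooms"] / households,
        "AveBedrms": record["TotalBedrooms"] / households,
        "AveOccup": record["Population"] / households,
    }


derived = [derive_averages(row) for row in rows]
for averages in derived:
    print("  ".join(f"{name}={value:.3f}" for name, value in averages.items()))
```

If the derived values come out as a handful of rooms, roughly one bedroom, and a few occupants per household, that supports the raw-totals guess, but it does not prove the column order.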
What is the correct order of the features? And could the list in the documentation reflect that (or even better: maybe the dataset object could have a method returning the features list)?
Here is a Colab notebook where I compare the two datasets, but sorting out the last three columns is still unsolved.
Thanks in advance for your help!