
Feature request: California Housing Dataset, could the documentation reflect the order of the features? #21886

@jchwenger


Hi there,

In the California Housing documentation, one reads:

The 8 input features are the following:

  • MedInc: median income in block group
  • HouseAge: median house age in block group
  • AveRooms: average number of rooms per household
  • AveBedrms: average number of bedrooms per household
  • Population: block group population
  • AveOccup: average number of household members
  • Latitude: block group latitude
  • Longitude: block group longitude

However, loading the dataset does not yield anything resembling that order, and the actual order is weirdly difficult to find (the source website seems to be unresponsive as well...)

Code to reproduce:

import keras
(train_data, train_targets), (test_data, test_targets) = keras.datasets.california_housing.load_data(version="small")

# Longitude, Latitude, HouseAge, Population, MedInc seems ok, but not
# AveBedrms, AveRooms, AveOccup...
column_names = [
    'Longitude', 'Latitude',  'HouseAge', 'AveBedrms',  'AveRooms',
    'Population','AveOccup', 'MedInc',
]

for name in column_names:
    print(f"{name:>12}", end="")
print()
print("-" * 96)
for row in range(20):
    for col in range(8):
        x = train_data[row, col]
        print(f"{x:12.4f}", end="")
    print()

Output:

   Longitude    Latitude    HouseAge   AveBedrms    AveRooms  Population    AveOccup      MedInc
------------------------------------------------------------------------------------------------
   -122.2400     37.7300     21.0000   7031.0000   1249.0000   2930.0000   1235.0000      4.5213
   -122.2800     37.8500     48.0000   2063.0000    484.0000   1054.0000    466.0000      2.2625
   -122.2900     37.8200      2.0000    158.0000     43.0000     94.0000     57.0000      2.5625
   -122.2900     37.8100     46.0000    935.0000    297.0000    582.0000    277.0000      0.7286
   -122.1800     37.7600     37.0000   1575.0000    358.0000    933.0000    320.0000      2.2917
   -122.2300     37.7900     48.0000   1696.0000    396.0000   1481.0000    343.0000      2.0375
   -122.2800     37.8400     52.0000    729.0000    160.0000    395.0000    155.0000      1.6875
   -122.2800     37.8900     52.0000   2315.0000    408.0000    835.0000    369.0000      4.5893
   -122.2500     37.8100     29.0000   4656.0000   1414.0000   2304.0000   1250.0000      2.4912
   -122.2200     37.8100     52.0000   2927.0000    402.0000   1021.0000    380.0000      8.1564
   -122.2700     37.7700     52.0000   1710.0000    481.0000    849.0000    457.0000      2.7115
   -122.2700     37.8800     52.0000   3360.0000    648.0000   1232.0000    621.0000      4.2813
   -122.1800     37.7700     27.0000    909.0000    236.0000    396.0000    157.0000      2.0786
   -122.2600     37.8800     52.0000   2255.0000    410.0000    823.0000    377.0000      5.7979
   -122.1700     37.7400     46.0000    769.0000    183.0000    693.0000    178.0000      2.2500
   -122.2900     37.8700     50.0000   1829.0000    536.0000   1129.0000    516.0000      2.6684
   -122.2700     37.8700     35.0000   3218.0000   1108.0000   1675.0000   1000.0000      1.7464
   -122.2100     37.8000     38.0000   2254.0000    535.0000    951.0000    487.0000      3.0812
   -122.2400     37.8000     52.0000    996.0000    228.0000    731.0000    228.0000      2.2697
   -122.2800     37.8700     52.0000   1233.0000    300.0000    571.0000    292.0000      2.2788

Some of those columns don't look quite like what you get here either, for instance four columns have numbers in the hundreds/thousands (ok for MedInc and Population, but not for AveBedrms, AveRooms or AveOccup)...
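One way to probe this is to test a hypothesis about what the large-valued columns actually hold. The sketch below assumes, without confirmation from the documentation, that columns 3 to 6 are raw per-block totals (total rooms, total bedrooms, population, households) rather than averages. Under that assumption, dividing by the household count should give values in the range one would expect for AveRooms, AveBedrms and AveOccup. The index constants and the helper name `per_household_averages` are hypothetical, and the rows are copied from the output above so the snippet runs without downloading anything.

```python
# First three rows of train_data, copied from the output above.
rows = [
    [-122.24, 37.73, 21.0, 7031.0, 1249.0, 2930.0, 1235.0, 4.5213],
    [-122.28, 37.85, 48.0, 2063.0, 484.0, 1054.0, 466.0, 2.2625],
    [-122.29, 37.82, 2.0, 158.0, 43.0, 94.0, 57.0, 2.5625],
]

# ASSUMPTION (not confirmed by the docs): columns 3..6 are raw totals
# per block group, in this order.
TOTAL_ROOMS, TOTAL_BEDROOMS, POPULATION, HOUSEHOLDS = 3, 4, 5, 6


def per_household_averages(row):
    """Derive per-household averages from the assumed raw totals."""
    households = row[HOUSEHOLDS]
    return {
        "AveRooms": row[TOTAL_ROOMS] / households,
        "AveBedrms": row[TOTAL_BEDROOMS] / households,
        "AveOccup": row[POPULATION] / households,
    }


derived = [per_household_averages(row) for row in rows]
for averages in derived:
    print("  ".join(f"{name}={value:7.3f}" for name, value in averages.items()))
```

If the derived values land in a plausible range for rooms, bedrooms and occupants per household, that supports the hypothesis, but it does not prove the column order.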

What is the correct order of the features? And could the list in the documentation reflect that (or even better: maybe the dataset object could have a method returning the features list)?
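To illustrate the kind of helper meant here, a minimal sketch follows. It is purely hypothetical: `ASSUMED_FEATURE_NAMES` and `label_row` are not part of Keras, and the names in the tuple are a guess at the column order that still needs to be confirmed, not the documented one.

```python
from typing import Dict, Sequence, Tuple

# HYPOTHETICAL: Keras does not expose this. The order below is a guess
# based on the value ranges in the output, not a confirmed fact.
ASSUMED_FEATURE_NAMES: Tuple[str, ...] = (
    "longitude",
    "latitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income",
)


def label_row(
    row: Sequence[float],
    names: Sequence[str] = ASSUMED_FEATURE_NAMES,
) -> Dict[str, float]:
    """Pair each value in a data row with its (assumed) feature name."""
    if len(row) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(row)}")
    return dict(zip(names, row))


# First row of train_data, copied from the output above.
first_row = [-122.24, 37.73, 21.0, 7031.0, 1249.0, 2930.0, 1235.0, 4.5213]
labeled = label_row(first_row)
for name, value in labeled.items():
    print(f"{name:>20}: {value:12.4f}")
```

Having the names shipped alongside the data, in whatever form, would remove the guesswork about which column is which.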

Here is a Colab notebook where I compare the two datasets, but sorting out the last three columns is still unsolved.

Thanks in advance for your help!

Labels: type:docs (Need to modify the documentation)
