Data Preprocessing

Note

To apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting.

Reading the dataset

import os

# Write a tiny example dataset to ../data/house_tiny.csv
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
import pandas as pd

# pandas reads every CSV entry with the value NA as a special NaN (not a number) value
data = pd.read_csv(data_file)
data
   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000
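
Before handling the missing entries it helps to see how many there are. The following is a minimal sketch, assuming the same four rows are held in an in-memory string (`csv_text` and `missing_per_column` are names introduced here for illustration) rather than read from `house_tiny.csv`:

```python
import io

import pandas as pd

# Same contents as house_tiny.csv, kept in memory for illustration.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

data = pd.read_csv(io.StringIO(csv_text))

# isna() marks each NaN entry; summing per column counts them.
missing_per_column = data.isna().sum()
print(missing_per_column)

# NumRooms is parsed as float64 because NaN has no integer representation.
print(data.dtypes)
```

Here `NumRooms` has two missing entries, `RoofType` has three, and `Price` has none.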

Missing values

# integer-location based indexing: the first two columns are inputs, the last is the target
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs
   NumRooms RoofType
0       NaN      NaN
1       2.0      NaN
2       4.0    Slate
3       NaN      NaN
# for categorical input fields, we can treat NaN as a category
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs
   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True
# for missing numerical values, we can replace the NaN entries with the mean value
inputs = inputs.fillna(inputs.mean())
inputs
   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True
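
The value 3.0 comes from averaging only the observed entries, since `mean()` skips NaN by default: (2 + 4) / 2 = 3. A small sketch on the `NumRooms` column alone (the names `rooms`, `mean_rooms`, and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

rooms = pd.Series([np.nan, 2.0, 4.0, np.nan], name="NumRooms")

# mean() ignores NaN entries, so only 2.0 and 4.0 are averaged.
mean_rooms = rooms.mean()

# Replace each NaN with that mean; observed values are left unchanged.
filled = rooms.fillna(mean_rooms)
print(filled.tolist())
```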

Conversion to the tensor format

import torch
# to_numpy(dtype=float) yields float64 arrays, so both tensors are float64
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y
(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
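
The tensors above are `float64` because they inherit the dtype of the NumPy arrays, while PyTorch's default floating-point dtype is `float32`. If single precision is preferred, one option is to request it at construction time. The sketch below assumes the input values are typed in directly as a NumPy array (`values`, `X64`, and `X32` are illustrative names):

```python
import numpy as np
import torch

# The preprocessed inputs from above, written out as a float64 NumPy array.
values = np.array([[3., 0., 1.],
                   [2., 0., 1.],
                   [4., 1., 0.],
                   [3., 0., 1.]])

# Without an explicit dtype, the tensor keeps NumPy's float64.
X64 = torch.tensor(values)

# Passing dtype=torch.float32 converts to single precision.
X32 = torch.tensor(values, dtype=torch.float32)

print(X64.dtype, X32.dtype)
```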