Data Preprocessing
Note
To apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting.
Reading the dataset
import os
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
import pandas as pd
# pandas replaces every CSV entry whose value is NA with the special NaN (not a number) value
data = pd.read_csv(data_file)
data
| | NumRooms | RoofType | Price |
|---|---|---|---|
| 0 | NaN | NaN | 127500 |
| 1 | 2.0 | NaN | 106000 |
| 2 | 4.0 | Slate | 178100 |
| 3 | NaN | NaN | 140000 |
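Before deciding how to handle the missing entries, it can help to count them per column. The following is a minimal sketch, not part of the original walkthrough: it rebuilds the same four-row table from an in-memory string (an assumption made so the snippet runs without touching the file system), and the name `missing_per_column` is purely illustrative.

```python
import io

import pandas as pd

# Same contents as house_tiny.csv, held in memory for a self-contained example
csv_text = '''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000'''

data = pd.read_csv(io.StringIO(csv_text))

# isna() marks each missing cell; summing column-wise counts them
missing_per_column = data.isna().sum()
print(missing_per_column)
```

Here `RoofType` has the most missing entries, which is why treating "missing" as its own category is a reasonable choice for that column.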
Missing values
# iloc selects by integer position: the first two columns are the inputs, the last is the target
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs
| | NumRooms | RoofType |
|---|---|---|
| 0 | NaN | NaN |
| 1 | 2.0 | NaN |
| 2 | 4.0 | Slate |
| 3 | NaN | NaN |
# for categorical input fields, we can treat NaN as a category
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs
| | NumRooms | RoofType_Slate | RoofType_nan |
|---|---|---|---|
| 0 | NaN | False | True |
| 1 | 2.0 | False | True |
| 2 | 4.0 | True | False |
| 3 | NaN | False | True |
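To see what `dummy_na=True` contributes, the sketch below encodes the `RoofType` column with and without it. The small frame `roof` is rebuilt by hand for illustration, and the variable names are assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the RoofType column from the table above
roof = pd.DataFrame({'RoofType': [np.nan, np.nan, 'Slate', np.nan]})

# Without dummy_na, rows with a missing roof type get no indicator at all
without_na = pd.get_dummies(roof)

# With dummy_na=True, an extra indicator column flags the missing rows
with_na = pd.get_dummies(roof, dummy_na=True)

print(list(without_na.columns))
print(list(with_na.columns))
```

With the extra column, every row has exactly one active indicator, so the fact that a value was missing is kept as information rather than silently encoded as all zeros.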
# for missing numerical values, we can replace the NaN entries with the mean value
inputs = inputs.fillna(inputs.mean())
inputs
| | NumRooms | RoofType_Slate | RoofType_nan |
|---|---|---|---|
| 0 | 3.0 | False | True |
| 1 | 2.0 | False | True |
| 2 | 4.0 | True | False |
| 3 | 3.0 | False | True |
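The value 3.0 in the filled rows is the mean of the two observed room counts, (2 + 4) / 2. The sketch below checks this on a hand-built copy of the `NumRooms` column; the names `rooms`, `mean_rooms`, and `filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the NumRooms column from the table above
rooms = pd.Series([np.nan, 2.0, 4.0, np.nan], name='NumRooms')

# Series.mean() ignores NaN by default (skipna=True), so this is (2 + 4) / 2
mean_rooms = rooms.mean()

# fillna leaves observed values untouched and replaces only the NaN cells
filled = rooms.fillna(mean_rooms)

print(mean_rooms)
print(filled.tolist())
```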
Conversion to the tensor format
import torch
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y
(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
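Because `to_numpy(dtype=float)` yields 64-bit floats, the tensors above are `torch.float64`, whereas PyTorch's default floating-point type is `float32`. If single precision is preferred, one option is to pass `dtype=torch.float32` explicitly. This is a sketch, with the values copied by hand from the output above and the names `X32` and `y32` chosen for illustration.

```python
import numpy as np
import torch

# Values copied by hand from the imputed inputs and targets shown above
inputs_np = np.array([[3., 0., 1.],
                      [2., 0., 1.],
                      [4., 1., 0.],
                      [3., 0., 1.]])
targets_np = np.array([127500., 106000., 178100., 140000.])

# Request single precision explicitly instead of inheriting NumPy's float64
X32 = torch.tensor(inputs_np, dtype=torch.float32)
y32 = torch.tensor(targets_np, dtype=torch.float32)

print(X32.dtype, X32.shape, y32.shape)
```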