Data Preprocessing

Note

To apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the pandas library can do much of the heavy lifting.

Reading the dataset

import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

import pandas as pd

# read_csv parses every CSV entry equal to NA as the special NaN (not a number) value
data = pd.read_csv(data_file)
data

	NumRooms	RoofType	Price
0	NaN	NaN	127500
1	2.0	NaN	106000
2	4.0	Slate	178100
3	NaN	NaN	140000
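Before handling the NaN entries, it can help to count them per column. The following is a minimal sketch, not part of the original walkthrough: it reads the same CSV text from an in-memory buffer (the names `csv_text` and `missing_per_column` are illustrative) and relies on `NA` being one of the strings that `read_csv` treats as missing by default.

```python
import io

import pandas as pd

# Same contents as house_tiny.csv, kept in memory so the sketch is self-contained
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

data = pd.read_csv(io.StringIO(csv_text))

# isna() marks each missing entry; summing the marks counts them per column
missing_per_column = data.isna().sum()
print(missing_per_column)
```

A column that contains NaN cannot stay an integer column under the default settings, which is why `NumRooms` is displayed as 2.0 and 4.0 rather than 2 and 4.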

Missing values

# iloc indexes by integer position: the first two columns are the inputs, the last column is the target
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs

	NumRooms	RoofType
0	NaN	NaN
1	2.0	NaN
2	4.0	Slate
3	NaN	NaN
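Position-based slicing depends on the column order of the file. As a hedged alternative, the sketch below selects the same columns by name and checks that both approaches agree on this dataset. The variable names are illustrative, and the CSV text is repeated so the block runs on its own.

```python
import io

import pandas as pd

csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# Selection by integer position, as in the walkthrough
inputs_by_position = data.iloc[:, 0:2]
targets_by_position = data.iloc[:, 2]

# Selection by column name, which keeps working if the columns are reordered
inputs_by_name = data[['NumRooms', 'RoofType']]
targets_by_name = data['Price']

# equals() treats NaN entries in the same location as equal
print(inputs_by_name.equals(inputs_by_position))
print(targets_by_name.equals(targets_by_position))
```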

# for categorical input fields, we can treat NaN as a category
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs

	NumRooms	RoofType_Slate	RoofType_nan
0	NaN	False	True
1	2.0	False	True
2	4.0	True	False
3	NaN	False	True
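To see what `dummy_na=True` contributes, the sketch below encodes the same inputs with and without it. It is only an illustration under the assumption that the inputs match the table above. The frame is rebuilt by hand, and `with_na` and `without_na` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the inputs shown above
inputs = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan],
    'RoofType': [np.nan, np.nan, 'Slate', np.nan],
})

# With dummy_na=True, missing categories get their own indicator column
with_na = pd.get_dummies(inputs, dummy_na=True)

# Without it, rows with a missing category have no indicator set at all
without_na = pd.get_dummies(inputs)

print(list(with_na.columns))
print(list(without_na.columns))
```

Only the categorical `RoofType` column is expanded. The numerical `NumRooms` column passes through unchanged, so its NaN entries still need separate handling.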

# for missing numerical values, we can replace the NaN entries with the mean value
inputs = inputs.fillna(inputs.mean())
inputs

	NumRooms	RoofType_Slate	RoofType_nan
0	3.0	False	True
1	2.0	False	True
2	4.0	True	False
3	3.0	False	True
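The fill value 3.0 is the mean of the two observed room counts, 2 and 4, because `mean()` skips NaN entries by default. The sketch below reproduces this step on a hand-built copy of the encoded inputs and, as one possible variation rather than the method used above, restricts the fill to the `NumRooms` column by passing a dictionary to `fillna`.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the one-hot encoded inputs shown above
inputs = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan],
    'RoofType_Slate': [False, False, True, False],
    'RoofType_nan': [True, True, False, True],
})

# mean() ignores NaN by default, so only the values 2.0 and 4.0 are averaged
room_mean = inputs['NumRooms'].mean()

# A dict maps column names to fill values, leaving other columns untouched
filled = inputs.fillna({'NumRooms': room_mean})
print(filled)
```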

Conversion to the tensor format

import torch
# to_numpy(dtype=float) casts every column, including the boolean indicators, to float64
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
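The tensors above are float64 because they inherit the dtype of the NumPy arrays, while PyTorch's default floating-point dtype is float32. If the data will feed a model with float32 parameters, one option is to request that dtype during conversion. This is a minimal sketch on hand-copied values (`features`, `prices`, `X32`, and `y32` are illustrative names), not a step from the original walkthrough.

```python
import numpy as np
import torch

# Hand-copied values matching the tensors printed above
features = np.array([[3., 0., 1.],
                     [2., 0., 1.],
                     [4., 1., 0.],
                     [3., 0., 1.]])
prices = np.array([127500., 106000., 178100., 140000.])

# Passing dtype to torch.tensor converts while copying the data
X32 = torch.tensor(features, dtype=torch.float32)

# from_numpy shares memory with the array; .to() then makes a float32 copy
y32 = torch.from_numpy(prices).to(torch.float32)

print(X32.dtype, y32.dtype)
```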