The MovieLens Dataset

Note

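MovieLens is a dataset of user ratings of movies, and it is one of the most commonly used datasets in recommender systems.
We will use a MovieLens dataset that has already undergone basic preprocessing.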

Converting to a tf dataset

import os
import tensorflow as tf  # needed by get_dataset below, which calls tf.data
from tensorflow import keras

#@save
def get_movielens_path(file_name):
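    # Get the path of a file in the MovieLens dataset.
    # The prefix below points at a local SparrowRecSys checkout; adjust it for your machine.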
    url_prefix = "file:///Users/facer/IdeaProjects/SparrowRecSys/src/main/resources/webroot/sampledata"
    return keras.utils.get_file(file_name, os.path.join(url_prefix, file_name))
import pandas as pd

# Inspect the dataset format
df = pd.read_csv(get_movielens_path("trainingSamples.csv"))
df
movieId userId rating timestamp label releaseYear movieGenre1 movieGenre2 movieGenre3 movieRatingCount ... userRatingCount userAvgReleaseYear userReleaseYearStddev userAvgRating userRatingStddev userGenre1 userGenre2 userGenre3 userGenre4 userGenre5
0 1 15555 3.0 900953740 0 1995 Adventure Animation Children 10759 ... 92 1992 8.98 3.86 0.74 Drama Comedy Thriller Action Crime
1 1 25912 3.5 1111631768 1 1995 Adventure Animation Children 10759 ... 21 1988 14.09 3.48 1.28 Action Comedy Romance Adventure Thriller
2 1 29912 3.0 866820360 0 1995 Adventure Animation Children 10759 ... 4 1995 0.50 3.00 0.00 NaN NaN NaN NaN NaN
3 10 17686 0.5 1195555011 0 1995 Action Adventure Thriller 6330 ... 35 1992 8.35 2.97 1.48 Comedy Drama Adventure Action Thriller
4 104 20158 4.0 1155357691 1 1996 Comedy NaN NaN 3954 ... 81 1991 8.70 3.60 0.72 Thriller Drama Action Crime Adventure
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
88822 968 26865 3.0 854092232 0 1968 Horror Sci-Fi Thriller 1824 ... 94 1991 12.23 3.35 0.85 Drama Thriller Comedy Crime Romance
88823 968 8507 2.0 974709061 0 1968 Horror Sci-Fi Thriller 1824 ... 5 1994 0.89 2.00 1.00 NaN NaN NaN NaN NaN
88824 969 16689 5.0 857854044 1 1951 Adventure Comedy Romance 2380 ... 97 1992 9.95 3.53 0.82 Drama Comedy Crime Romance Thriller
88825 969 26460 2.0 1250279576 0 1951 Adventure Comedy Romance 2380 ... 55 1990 11.78 2.73 1.42 Thriller Crime Drama Comedy Sci-Fi
88826 970 3033 2.0 1272394603 0 1953 Adventure Comedy Crime 98 ... 100 1985 17.64 3.67 0.89 Drama Romance Comedy Thriller Crime

88827 rows × 27 columns
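
Judging only from the sample rows printed above, label looks like a binarized rating: every row shown with a rating of 3.5 or higher has label 1, and every row below that has label 0. The threshold is inferred from those ten rows and is not stated by the dataset, so the sketch below treats it as an assumption; rating_to_label and ASSUMED_THRESHOLD are hypothetical names.

```python
# Assumed rule, inferred from the ten rows printed above: label = 1 if rating >= 3.5.
ASSUMED_THRESHOLD = 3.5

def rating_to_label(rating, threshold=ASSUMED_THRESHOLD):
    """Binarize a rating into a 0/1 label (hypothetical helper)."""
    return 1 if rating >= threshold else 0

# (rating, label) pairs copied from the DataFrame output above.
shown_rows = [(3.0, 0), (3.5, 1), (3.0, 0), (0.5, 0), (4.0, 1),
              (3.0, 0), (2.0, 0), (5.0, 1), (2.0, 0), (2.0, 0)]

# Rows where the assumed rule disagrees with the stored label.
mismatches = [(r, l) for r, l in shown_rows if rating_to_label(r) != l]
print(mismatches)
```

An empty list means the assumed rule agrees with every row shown. It says nothing about the remaining rows of the file.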

#@save
def get_dataset(file_path):
    # load sample as tf dataset
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=12,
        label_name='label',
        na_value="0",
        num_epochs=1,
        ignore_errors=True)
    return dataset
#@save
def load_movielens():
    # Training dataset
    train_dataset = get_dataset(get_movielens_path("trainingSamples.csv"))
    # Test dataset
    test_dataset = get_dataset(get_movielens_path("testSamples.csv"))
    return train_dataset, test_dataset
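
With label_name set, each element of the dataset built by get_dataset is a (features, labels) pair: features maps column names to a batch of values, and labels holds the matching values of the label column. The sketch below is a rough pure-Python analogue of that batch layout only, so the shape can be seen without TensorFlow. It is not how make_csv_dataset is implemented: it does not shuffle, infer column types, substitute na_value, or return tensors. iter_batches is a hypothetical name, and the sample rows are abridged from the output shown earlier.

```python
import csv
import io
from itertools import islice

def iter_batches(csv_file, batch_size=12, label_name="label"):
    """Yield (features, labels) batches from an open CSV file.

    features maps each remaining column name to a list of string values,
    one per row in the batch; labels lists the label column as ints.
    """
    reader = csv.DictReader(csv_file)
    while True:
        rows = list(islice(reader, batch_size))
        if not rows:
            return
        # Split the label column off, as label_name does in get_dataset.
        labels = [int(row.pop(label_name)) for row in rows]
        features = {name: [row[name] for row in rows] for name in rows[0]}
        yield features, labels

# Three rows abridged from the sample output above (most columns omitted).
sample = io.StringIO(
    "movieId,userId,rating,label\n"
    "1,15555,3.0,0\n"
    "1,25912,3.5,1\n"
    "104,20158,4.0,1\n"
)
batches = list(iter_batches(sample, batch_size=2))
first_features, first_labels = batches[0]
print(first_features["userId"], first_labels)
```

The last batch may hold fewer rows than batch_size, which is why the three sample rows produce one batch of two and one batch of one.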

Tip

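Functions and classes that will be reused are marked with #@save at the top. They are saved in rec.py, so any other place that needs them only has to run: import rec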
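
The text does not say how the marked code ends up in rec.py. One possible way, shown here purely as an illustration, is to scan the notebook source for #@save lines and copy the top-level block that follows each one. collect_saved_blocks is a hypothetical helper, not part of this project or of any library.

```python
def collect_saved_blocks(source):
    """Return the top-level blocks that directly follow a '#@save' line."""
    lines = source.splitlines()
    blocks = []
    i = 0
    while i < len(lines):
        if lines[i].strip() == "#@save" and i + 1 < len(lines):
            start = i + 1
            end = start + 1
            # The block runs until the next non-blank line with no indentation.
            while end < len(lines) and (
                not lines[end].strip() or lines[end][0].isspace()
            ):
                end += 1
            blocks.append("\n".join(lines[start:end]).rstrip())
            i = end
        else:
            i += 1
    return blocks

# A made-up notebook source used only to exercise the helper.
notebook_source = '''import os

#@save
def get_path(name):
    return os.path.join("data", name)
#@save
def load():
    return get_path("a.csv")

print(load())
'''
blocks = collect_saved_blocks(notebook_source)
print(len(blocks))
```

This sketch copies only the marked blocks, so the imports they rely on (os in the sample) would still have to be written into rec.py separately.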