MovieLens Dataset
Note
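MovieLens is a dataset of users' movie ratings and one of the most widely used datasets for recommender systems.
We will use a MovieLens dataset that has already undergone basic preprocessing.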
Converting to a tf dataset
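from tensorflow import keras

#@save
def get_movielens_path(file_name):
    # Return the local path of a file in the MovieLens sample data.
    # keras.utils.get_file fetches the file into the Keras cache if it is not there yet.
    # Note: url_prefix points to a local SparrowRecSys checkout; adjust it for your machine.
    url_prefix = "file:///Users/facer/IdeaProjects/SparrowRecSys/src/main/resources/webroot/sampledata"
    # Join with "/" instead of os.path.join so the URL stays valid on every OS.
    return keras.utils.get_file(file_name, f"{url_prefix}/{file_name}")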
import pandas as pd

# Inspect the format of the dataset
df = pd.read_csv(get_movielens_path("trainingSamples.csv"))
df
| | movieId | userId | rating | timestamp | label | releaseYear | movieGenre1 | movieGenre2 | movieGenre3 | movieRatingCount | ... | userRatingCount | userAvgReleaseYear | userReleaseYearStddev | userAvgRating | userRatingStddev | userGenre1 | userGenre2 | userGenre3 | userGenre4 | userGenre5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15555 | 3.0 | 900953740 | 0 | 1995 | Adventure | Animation | Children | 10759 | ... | 92 | 1992 | 8.98 | 3.86 | 0.74 | Drama | Comedy | Thriller | Action | Crime |
1 | 1 | 25912 | 3.5 | 1111631768 | 1 | 1995 | Adventure | Animation | Children | 10759 | ... | 21 | 1988 | 14.09 | 3.48 | 1.28 | Action | Comedy | Romance | Adventure | Thriller |
2 | 1 | 29912 | 3.0 | 866820360 | 0 | 1995 | Adventure | Animation | Children | 10759 | ... | 4 | 1995 | 0.50 | 3.00 | 0.00 | NaN | NaN | NaN | NaN | NaN |
3 | 10 | 17686 | 0.5 | 1195555011 | 0 | 1995 | Action | Adventure | Thriller | 6330 | ... | 35 | 1992 | 8.35 | 2.97 | 1.48 | Comedy | Drama | Adventure | Action | Thriller |
4 | 104 | 20158 | 4.0 | 1155357691 | 1 | 1996 | Comedy | NaN | NaN | 3954 | ... | 81 | 1991 | 8.70 | 3.60 | 0.72 | Thriller | Drama | Action | Crime | Adventure |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88822 | 968 | 26865 | 3.0 | 854092232 | 0 | 1968 | Horror | Sci-Fi | Thriller | 1824 | ... | 94 | 1991 | 12.23 | 3.35 | 0.85 | Drama | Thriller | Comedy | Crime | Romance |
88823 | 968 | 8507 | 2.0 | 974709061 | 0 | 1968 | Horror | Sci-Fi | Thriller | 1824 | ... | 5 | 1994 | 0.89 | 2.00 | 1.00 | NaN | NaN | NaN | NaN | NaN |
88824 | 969 | 16689 | 5.0 | 857854044 | 1 | 1951 | Adventure | Comedy | Romance | 2380 | ... | 97 | 1992 | 9.95 | 3.53 | 0.82 | Drama | Comedy | Crime | Romance | Thriller |
88825 | 969 | 26460 | 2.0 | 1250279576 | 0 | 1951 | Adventure | Comedy | Romance | 2380 | ... | 55 | 1990 | 11.78 | 2.73 | 1.42 | Thriller | Crime | Drama | Comedy | Sci-Fi |
88826 | 970 | 3033 | 2.0 | 1272394603 | 0 | 1953 | Adventure | Comedy | Crime | 98 | ... | 100 | 1985 | 17.64 | 3.67 | 0.89 | Drama | Romance | Comedy | Thriller | Crime |
88827 rows × 27 columns
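The preview above contains NaN entries in several genre columns. As a minimal sketch of how missing values could be counted, the snippet below uses a small hand-built frame that copies three rows and a few columns from the preview. The name `sample` and the choice of columns are only for illustration; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Rows 0, 2 and 4 of the preview above, restricted to a few columns.
sample = pd.DataFrame({
    "movieId": [1, 1, 104],
    "userId": [15555, 29912, 20158],
    "label": [0, 0, 1],
    "movieGenre2": ["Animation", "Animation", np.nan],
    "userGenre1": ["Drama", np.nan, "Thriller"],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of samples whose label is 1.
positive_rate = sample["label"].mean()
print(positive_rate)
```

Running the same two checks on the full frame gives a quick view of which features need a default value and how balanced the labels are.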
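import tensorflow as tf

#@save
def get_dataset(file_path):
    # Load the CSV samples as a tf dataset of (features, label) batches
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=12,
        label_name='label',   # this column is split off as the label
        na_value="0",         # also treat the string "0" as a missing value
        num_epochs=1,         # iterate over the file exactly once
        ignore_errors=True)   # skip records that fail to parse
    return dataset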
#@save
def load_movielens():
    # Training dataset
    train_dataset = get_dataset(get_movielens_path("trainingSamples.csv"))
    # Test dataset
    test_dataset = get_dataset(get_movielens_path("testSamples.csv"))
    return train_dataset, test_dataset
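To see what such a dataset yields, it can help to pull a single batch from it. The following is a minimal sketch that does not depend on the sample files: it writes a tiny CSV with a hypothetical subset of the columns (`tiny_samples.csv` and its four rows are made up for this illustration) and reads it with the same arguments as `get_dataset`, except for a smaller batch size and `shuffle=False` to keep the order fixed.

```python
import os
import tempfile

import tensorflow as tf

# Write a tiny CSV with a hypothetical subset of the columns shown earlier.
rows = [
    "movieId,userId,rating,label,movieGenre1",
    "1,15555,3.0,0,Adventure",
    "1,25912,3.5,1,Adventure",
    "10,17686,0.5,0,Action",
    "104,20158,4.0,1,Comedy",
]
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "tiny_samples.csv")
with open(csv_path, "w", newline="\n") as f:
    f.write("\n".join(rows) + "\n")

# Same arguments as get_dataset, but with batch_size=2 and no shuffling.
tiny_dataset = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=2,
    label_name="label",
    na_value="0",
    num_epochs=1,
    shuffle=False,
    ignore_errors=True)

# Each element is a (features, labels) pair;
# features maps each remaining column name to a tensor of batch values.
features, labels = next(iter(tiny_dataset))
feature_names = sorted(features.keys())
print(feature_names)
print(labels.numpy())
```

The label column is removed from the feature dictionary and returned separately, which is the shape Keras expects when such a dataset is passed to `model.fit`.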
Tip
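Functions and classes that will be reused are marked with #@save at the top. They are saved in rec.py, so any other place that needs them only has to run: import rec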
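The text does not show how the marked blocks end up in rec.py. Purely as an illustration, the sketch below shows one way such a collection step could work on plain source text: it copies every top-level `def` or `class` block that directly follows a `#@save` line. The helper name `collect_saved_blocks` and the rule for where a block ends are assumptions, not the project's actual tooling.

```python
def collect_saved_blocks(source: str) -> str:
    """Return the top-level def/class blocks that directly follow a '#@save' line.

    Hypothetical helper: a block is assumed to end at the next non-blank
    line that is not indented.
    """
    lines = source.splitlines()
    saved = []
    i = 0
    while i < len(lines):
        if lines[i].strip() == "#@save" and i + 1 < len(lines):
            block = [lines[i + 1]]  # the def/class header line
            i += 2
            # Keep following lines while they are blank or indented.
            while i < len(lines) and (not lines[i].strip() or lines[i][0].isspace()):
                block.append(lines[i])
                i += 1
            saved.append("\n".join(block).rstrip())
        else:
            i += 1
    return "\n\n\n".join(saved) + "\n" if saved else ""


# Made-up notebook source with two marked blocks and some unmarked code.
notebook_source = '''\
import os

#@save
def add_one(x):
    return x + 1

print(add_one(1))

#@save
class Box:
    size = 3
'''

module_text = collect_saved_blocks(notebook_source)
print(module_text)
```

Note that this sketch copies only the marked blocks, so any imports they rely on would still have to be added to the generated module separately.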