{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MovieLens Dataset\n", "\n", "```{note}\n", "MovieLens is a dataset of user ratings for movies, and it is among the most widely used datasets for recommender systems.\n", "We will use a version of MovieLens that has already been through basic preprocessing.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting to a tf dataset" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "from tensorflow import keras\n", "\n", "#@save\n", "def get_movielens_path(file_name):\n", "    # Return the local (cached) path of a file in the MovieLens dataset\n", "    url_prefix = \"file:///Users/facer/IdeaProjects/SparrowRecSys/src/main/resources/webroot/sampledata\"\n", "    return keras.utils.get_file(file_name, os.path.join(url_prefix, file_name))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>movieId</th><th>userId</th><th>rating</th><th>timestamp</th><th>label</th><th>releaseYear</th><th>movieGenre1</th><th>movieGenre2</th><th>movieGenre3</th><th>movieRatingCount</th><th>...</th><th>userRatingCount</th><th>userAvgReleaseYear</th><th>userReleaseYearStddev</th><th>userAvgRating</th><th>userRatingStddev</th><th>userGenre1</th><th>userGenre2</th><th>userGenre3</th><th>userGenre4</th><th>userGenre5</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>15555</td><td>3.0</td><td>900953740</td><td>0</td><td>1995</td><td>Adventure</td><td>Animation</td><td>Children</td><td>10759</td><td>...</td><td>92</td><td>1992</td><td>8.98</td><td>3.86</td><td>0.74</td><td>Drama</td><td>Comedy</td><td>Thriller</td><td>Action</td><td>Crime</td></tr>\n",
"<tr><th>1</th><td>1</td><td>25912</td><td>3.5</td><td>1111631768</td><td>1</td><td>1995</td><td>Adventure</td><td>Animation</td><td>Children</td><td>10759</td><td>...</td><td>21</td><td>1988</td><td>14.09</td><td>3.48</td><td>1.28</td><td>Action</td><td>Comedy</td><td>Romance</td><td>Adventure</td><td>Thriller</td></tr>\n",
"<tr><th>2</th><td>1</td><td>29912</td><td>3.0</td><td>866820360</td><td>0</td><td>1995</td><td>Adventure</td><td>Animation</td><td>Children</td><td>10759</td><td>...</td><td>4</td><td>1995</td><td>0.50</td><td>3.00</td><td>0.00</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>10</td><td>17686</td><td>0.5</td><td>1195555011</td><td>0</td><td>1995</td><td>Action</td><td>Adventure</td><td>Thriller</td><td>6330</td><td>...</td><td>35</td><td>1992</td><td>8.35</td><td>2.97</td><td>1.48</td><td>Comedy</td><td>Drama</td><td>Adventure</td><td>Action</td><td>Thriller</td></tr>\n",
"<tr><th>4</th><td>104</td><td>20158</td><td>4.0</td><td>1155357691</td><td>1</td><td>1996</td><td>Comedy</td><td>NaN</td><td>NaN</td><td>3954</td><td>...</td><td>81</td><td>1991</td><td>8.70</td><td>3.60</td><td>0.72</td><td>Thriller</td><td>Drama</td><td>Action</td><td>Crime</td><td>Adventure</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>88822</th><td>968</td><td>26865</td><td>3.0</td><td>854092232</td><td>0</td><td>1968</td><td>Horror</td><td>Sci-Fi</td><td>Thriller</td><td>1824</td><td>...</td><td>94</td><td>1991</td><td>12.23</td><td>3.35</td><td>0.85</td><td>Drama</td><td>Thriller</td><td>Comedy</td><td>Crime</td><td>Romance</td></tr>\n",
"<tr><th>88823</th><td>968</td><td>8507</td><td>2.0</td><td>974709061</td><td>0</td><td>1968</td><td>Horror</td><td>Sci-Fi</td><td>Thriller</td><td>1824</td><td>...</td><td>5</td><td>1994</td><td>0.89</td><td>2.00</td><td>1.00</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>88824</th><td>969</td><td>16689</td><td>5.0</td><td>857854044</td><td>1</td><td>1951</td><td>Adventure</td><td>Comedy</td><td>Romance</td><td>2380</td><td>...</td><td>97</td><td>1992</td><td>9.95</td><td>3.53</td><td>0.82</td><td>Drama</td><td>Comedy</td><td>Crime</td><td>Romance</td><td>Thriller</td></tr>\n",
"<tr><th>88825</th><td>969</td><td>26460</td><td>2.0</td><td>1250279576</td><td>0</td><td>1951</td><td>Adventure</td><td>Comedy</td><td>Romance</td><td>2380</td><td>...</td><td>55</td><td>1990</td><td>11.78</td><td>2.73</td><td>1.42</td><td>Thriller</td><td>Crime</td><td>Drama</td><td>Comedy</td><td>Sci-Fi</td></tr>\n",
"<tr><th>88826</th><td>970</td><td>3033</td><td>2.0</td><td>1272394603</td><td>0</td><td>1953</td><td>Adventure</td><td>Comedy</td><td>Crime</td><td>98</td><td>...</td><td>100</td><td>1985</td><td>17.64</td><td>3.67</td><td>0.89</td><td>Drama</td><td>Romance</td><td>Comedy</td><td>Thriller</td><td>Crime</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>88827 rows × 27 columns</p>\n",
"</div>
" ], "text/plain": [ " movieId userId rating timestamp label releaseYear movieGenre1 \\\n", "0 1 15555 3.0 900953740 0 1995 Adventure \n", "1 1 25912 3.5 1111631768 1 1995 Adventure \n", "2 1 29912 3.0 866820360 0 1995 Adventure \n", "3 10 17686 0.5 1195555011 0 1995 Action \n", "4 104 20158 4.0 1155357691 1 1996 Comedy \n", "... ... ... ... ... ... ... ... \n", "88822 968 26865 3.0 854092232 0 1968 Horror \n", "88823 968 8507 2.0 974709061 0 1968 Horror \n", "88824 969 16689 5.0 857854044 1 1951 Adventure \n", "88825 969 26460 2.0 1250279576 0 1951 Adventure \n", "88826 970 3033 2.0 1272394603 0 1953 Adventure \n", "\n", " movieGenre2 movieGenre3 movieRatingCount ... userRatingCount \\\n", "0 Animation Children 10759 ... 92 \n", "1 Animation Children 10759 ... 21 \n", "2 Animation Children 10759 ... 4 \n", "3 Adventure Thriller 6330 ... 35 \n", "4 NaN NaN 3954 ... 81 \n", "... ... ... ... ... ... \n", "88822 Sci-Fi Thriller 1824 ... 94 \n", "88823 Sci-Fi Thriller 1824 ... 5 \n", "88824 Comedy Romance 2380 ... 97 \n", "88825 Comedy Romance 2380 ... 55 \n", "88826 Comedy Crime 98 ... 100 \n", "\n", " userAvgReleaseYear userReleaseYearStddev userAvgRating \\\n", "0 1992 8.98 3.86 \n", "1 1988 14.09 3.48 \n", "2 1995 0.50 3.00 \n", "3 1992 8.35 2.97 \n", "4 1991 8.70 3.60 \n", "... ... ... ... \n", "88822 1991 12.23 3.35 \n", "88823 1994 0.89 2.00 \n", "88824 1992 9.95 3.53 \n", "88825 1990 11.78 2.73 \n", "88826 1985 17.64 3.67 \n", "\n", " userRatingStddev userGenre1 userGenre2 userGenre3 userGenre4 \\\n", "0 0.74 Drama Comedy Thriller Action \n", "1 1.28 Action Comedy Romance Adventure \n", "2 0.00 NaN NaN NaN NaN \n", "3 1.48 Comedy Drama Adventure Action \n", "4 0.72 Thriller Drama Action Crime \n", "... ... ... ... ... ... 
\n", "88822 0.85 Drama Thriller Comedy Crime \n", "88823 1.00 NaN NaN NaN NaN \n", "88824 0.82 Drama Comedy Crime Romance \n", "88825 1.42 Thriller Crime Drama Comedy \n", "88826 0.89 Drama Romance Comedy Thriller \n", "\n", " userGenre5 \n", "0 Crime \n", "1 Thriller \n", "2 NaN \n", "3 Thriller \n", "4 Adventure \n", "... ... \n", "88822 Romance \n", "88823 NaN \n", "88824 Thriller \n", "88825 Sci-Fi \n", "88826 Crime \n", "\n", "[88827 rows x 27 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Inspect the dataset format\n", "df = pd.read_csv(get_movielens_path(\"trainingSamples.csv\"))\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "\n", "#@save\n", "def get_dataset(file_path):\n", "    # Load the CSV samples as a batched tf.data.Dataset;\n", "    # the 'label' column becomes the target of each (features, label) batch\n", "    dataset = tf.data.experimental.make_csv_dataset(\n", "        file_path,\n", "        batch_size=12,\n", "        label_name='label',\n", "        na_value=\"0\",\n", "        num_epochs=1,\n", "        ignore_errors=True)\n", "    return dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "#@save\n", "def load_movielens():\n", "    # Training dataset\n", "    train_dataset = get_dataset(get_movielens_path(\"trainingSamples.csv\"))\n", "    # Test dataset\n", "    test_dataset = get_dataset(get_movielens_path(\"testSamples.csv\"))\n", "    return train_dataset, test_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{tip}\n", "Functions and classes that are reused elsewhere are marked with #@save at the top. They are saved to rec.py, so any other notebook that needs them only has to run: import rec\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }