Datasets

lightfm.datasets.movielens.fetch_movielens(data_home=None, indicator_features=True, genre_features=False, min_rating=0.0, download_if_missing=True)

Fetch the Movielens 100k dataset.

The dataset contains 100,000 interactions from about 1,000 users on about 1,700 movies, and is described in detail in its README.

Parameters:
  • data_home (path, optional) – Path to the directory in which the downloaded data should be placed. Defaults to ~/lightfm_data/.
  • indicator_features (bool, optional) – Use an [n_items, n_items] identity matrix for item features. When True with genre_features, indicator and genre features are concatenated into a single feature matrix of shape [n_items, n_items + n_genres].
  • genre_features (bool, optional) – Use a [n_items, n_genres] matrix for item features. When True with indicator_features, indicator and genre features are concatenated into a single feature matrix of shape [n_items, n_items + n_genres].
  • min_rating (float, optional) – Minimum rating to include in the interaction matrix.
  • download_if_missing (bool, optional) – Download the data if not present. Raises an IOError if False and data is missing.
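
The concatenation described for indicator_features and genre_features can be sketched as follows. This is a minimal illustration using dense NumPy arrays, with made-up sizes and genre flags; it is not LightFM's own implementation, which returns a sparse CSR matrix.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_items, n_genres = 4, 3

# Indicator features: one unique column per item (identity matrix).
indicator = np.eye(n_items, dtype=np.float32)

# Genre features: one column per genre; entry (i, g) is 1 if item i has genre g.
# These flags are made up for the example.
genres = np.array(
    [[1, 0, 0],
     [0, 1, 1],
     [1, 0, 1],
     [0, 0, 1]],
    dtype=np.float32,
)

# With both options enabled, the two blocks sit side by side.
item_features = np.hstack([indicator, genres])

print(item_features.shape)  # (4, 7), i.e. [n_items, n_items + n_genres]
```

The first n_items columns identify each item individually, while the remaining n_genres columns are shared by all items of the same genre.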

Notes

The return value is a dictionary containing the following keys:

Returns:
  • train (sp.coo_matrix of shape [n_users, n_items]) – Contains training set interactions.
  • test (sp.coo_matrix of shape [n_users, n_items]) – Contains testing set interactions.
  • item_features (sp.csr_matrix of shape [n_items, n_item_features]) – Contains item features.
  • item_feature_labels (np.array of strings of shape [n_item_features,]) – Labels of item features.
  • item_labels (np.array of strings of shape [n_items,]) – Items’ titles.
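
A short usage sketch for the function documented above. The helper names `describe` and `load_movielens` are hypothetical, and the `min_rating=4.0` threshold is only an illustrative choice. The first call to `load_movielens` downloads the data unless it is already present in data_home.

```python
def describe(data):
    """Return the shape of each value in the returned dictionary."""
    return {key: value.shape for key, value in data.items()}


def load_movielens(min_rating=4.0):
    """Fetch MovieLens 100k with both indicator and genre features."""
    # Imported here so that `describe` can be used without LightFM installed.
    from lightfm.datasets.movielens import fetch_movielens

    return fetch_movielens(
        indicator_features=True,
        genre_features=True,
        min_rating=min_rating,  # illustrative threshold, not a recommendation
    )
```

Calling `describe(load_movielens())` lists the shapes of train, test, item_features, item_feature_labels, and item_labels, which is a quick way to check that the feature matrix has the expected number of columns.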

lightfm.datasets.stackexchange.fetch_stackexchange(dataset, test_set_fraction=0.2, min_training_interactions=1, data_home=None, indicator_features=True, tag_features=False, download_if_missing=True)

Fetch a dataset from the StackExchange network.

The datasets contain users answering questions: an interaction is defined as a user answering a given question.

The following datasets from the StackExchange network are available:

  • CrossValidated: From stats.stackexchange.com. Approximately 9000 users, 72000 questions, and 70000 answers.
  • StackOverflow: From stackoverflow.com. Approximately 1.3M users, 11M questions, and 18M answers.

Parameters:
  • dataset (string, one of ('crossvalidated', 'stackoverflow')) – The part of the StackExchange network for which to fetch the dataset.
  • test_set_fraction (float, optional) – The fraction of the dataset used for testing. Splitting into the train and test set is done in a time-based fashion: all interactions before a certain time are in the train set and all interactions after that time are in the test set.
  • min_training_interactions (int, optional) – Only include users with this number of interactions in the training set.
  • data_home (path, optional) – Path to the directory in which the downloaded data should be placed. Defaults to ~/lightfm_data/.
  • indicator_features (bool, optional) – Use an [n_items, n_items] identity matrix for item features. When True with tag_features, indicator and tag features are concatenated into a single feature matrix of shape [n_items, n_items + n_tags].
  • tag_features (bool, optional) – Use a [n_items, n_tags] matrix for item features. When True with indicator_features, indicator and tag features are concatenated into a single feature matrix of shape [n_items, n_items + n_tags].
  • download_if_missing (bool, optional) – Download the data if not present. Raises an IOError if False and data is missing.
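
The time-based split described for test_set_fraction can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `time_based_split` and the timestamps are hypothetical, and the cutoff is chosen by counting interactions, which may differ from how LightFM picks its cutoff time.

```python
import numpy as np


def time_based_split(timestamps, test_set_fraction=0.2):
    """Return boolean masks (train, test) for a chronological split."""
    timestamps = np.asarray(timestamps)
    # Order interactions from oldest to newest.
    order = np.argsort(timestamps, kind="stable")
    n_test = int(round(len(timestamps) * test_set_fraction))
    test_mask = np.zeros(len(timestamps), dtype=bool)
    # The newest n_test interactions form the test set.
    if n_test > 0:
        test_mask[order[-n_test:]] = True
    return ~test_mask, test_mask


# Hypothetical answer timestamps (arbitrary units), one per interaction.
timestamps = [50, 10, 40, 20, 30]
train_mask, test_mask = time_based_split(timestamps, test_set_fraction=0.2)
print(train_mask.tolist())  # [False, True, True, True, True]
print(test_mask.tolist())   # [True, False, False, False, False]
```

Every interaction in the test set is at least as recent as every interaction in the train set, so evaluation never uses future answers to predict past ones.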

Notes

The return value is a dictionary containing the following keys:

Returns:
  • train (sp.coo_matrix of shape [n_users, n_items]) – Contains training set interactions.
  • test (sp.coo_matrix of shape [n_users, n_items]) – Contains testing set interactions.
  • item_features (sp.csr_matrix of shape [n_items, n_item_features]) – Contains item features.
  • item_feature_labels (np.array of strings of shape [n_item_features,]) – Labels of item features.
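
A short usage sketch for the function documented above. The helper names `split_sizes` and `load_crossvalidated` are hypothetical. The first call to `load_crossvalidated` downloads the data unless it is already present in data_home.

```python
def split_sizes(data):
    """Return the number of stored interactions in the train and test matrices."""
    return {"train": data["train"].nnz, "test": data["test"].nnz}


def load_crossvalidated(test_set_fraction=0.2):
    """Fetch the CrossValidated dataset with indicator and tag features."""
    # Imported here so that `split_sizes` can be used without LightFM installed.
    from lightfm.datasets.stackexchange import fetch_stackexchange

    return fetch_stackexchange(
        "crossvalidated",
        test_set_fraction=test_set_fraction,
        indicator_features=True,
        tag_features=True,
    )
```

Calling `split_sizes(load_crossvalidated())` gives the number of interactions on each side of the split, which can be compared against the requested test_set_fraction.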