DeRDaVa’s User Guide#

The objective of the derdava package is to perform data valuation in machine learning (ML), through which we know the value or worth of each data source. To start data valuation, we need to initialize the following:

Data sources: Each data source can be either a single data point, or a smaller dataset. We require data sources to be a dictionary:

data_sources = { 0: (X_0, y_0), 1: (X_1, y_1) }

You can also generate random data sources from one of the built-in datasets (see derdava.dataset.load_dataset()). For example,

from derdava.data_source import generate_random_data_sources
from derdava.dataset import load_dataset

X, y = load_dataset('phoneme')
data_sources = generate_random_data_sources(X, y, num_of_data_sources=10)

ML model: You need to load built-in models:

from derdava.model_utility import model_knn

model = model_knn

Now we can start performing data valuation. Follow the steps below:

Create a model utility function of class ModelUtilityFunction. For example,

from derdava.model_utility import IClassificationModel

model_utility_function = IClassificationModel(model, data_sources, X_test, y_test)

If you are using DeRDaVa, you need to create a CoalitionProbability to tell you the staying probability of each coalition:

from derdava.coalition_probability import IndependentCoalitionProbability

staying_probabilities = { i: 0.5 for i in range(10) }
coalition_probability = IndependentCoalitionProbability(staying_probabilities)

We can finally do data valuation:

from derdava.data_valuation import ValuableModel

support = tuple(range(10))
valuable_model = ValuableModel(support, model_utility_function)
shapley_values = valuable_model.valuate(data_valuation_function='shapley')
zot_mcmc_beta_16_1_values = valuable_model.valuate(data_valuation_function='012-mcmc robust beta', alpha=16, beta=4, tolerance=1.005)

Please refer to the documentation for more details on each submodules.