DeRDaVa’s User Guide

The derdava package performs data valuation in machine learning (ML), which quantifies the value, or worth, of each data source. Before starting data valuation, initialize the following:

  • Data sources: Each data source can be either a single data point or a small dataset. Data sources must be given as a dictionary that maps each source's index to its (X, y) pair:

    data_sources = { 0: (X_0, y_0), 1: (X_1, y_1) }
    

    You can also generate random data sources from one of the built-in datasets (see derdava.dataset.load_dataset()). For example,

    from derdava.data_source import generate_random_data_sources
    from derdava.dataset import load_dataset
    
    X, y = load_dataset('phoneme')
    data_sources = generate_random_data_sources(X, y, num_of_data_sources=10)
    
  • ML model: Load one of the built-in models:

    from derdava.model_utility import model_knn
    
    model = model_knn
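
The dictionary layout described for data sources above can be sketched without the package. The helper below, split_into_data_sources, is hypothetical and is not part of derdava; it only illustrates one way to partition a dataset into the { index: (X_i, y_i) } shape, with plain Python lists standing in for the arrays a real dataset would use. The built-in generate_random_data_sources may split the data differently.

```python
import random


def split_into_data_sources(X, y, num_of_data_sources, seed=0):
    """Hypothetical helper: randomly partition (X, y) into a dict of data sources."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility

    data_sources = {}
    for i in range(num_of_data_sources):
        # Deal the shuffled indices out round-robin so sizes differ by at most one.
        chunk = indices[i::num_of_data_sources]
        data_sources[i] = ([X[j] for j in chunk], [y[j] for j in chunk])
    return data_sources


# Toy dataset: 12 points with 2 features each and a binary label.
X = [[float(k), float(k % 3)] for k in range(12)]
y = [k % 2 for k in range(12)]

data_sources = split_into_data_sources(X, y, num_of_data_sources=3)
```

Each key is a data source index and each value is that source's own (X_i, y_i) pair, so every data point belongs to exactly one source.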
    

Now we can start performing data valuation. Follow the steps below:

  • Create a model utility function of class ModelUtilityFunction. For example, with X_test and y_test being a held-out test set that you supply:

    from derdava.model_utility import IClassificationModel
    
    model_utility_function = IClassificationModel(model, data_sources, X_test, y_test)
    
  • If you are using DeRDaVa, you also need to create a CoalitionProbability, which gives the staying probability of each coalition:

    from derdava.coalition_probability import IndependentCoalitionProbability
    
    staying_probabilities = { i: 0.5 for i in range(10) }
    coalition_probability = IndependentCoalitionProbability(staying_probabilities)
    
  • Finally, perform data valuation:

    from derdava.data_valuation import ValuableModel
    
    support = tuple(range(10))
    valuable_model = ValuableModel(support, model_utility_function)
    shapley_values = valuable_model.valuate(data_valuation_function='shapley')
    zot_mcmc_beta_16_4_values = valuable_model.valuate(data_valuation_function='012-mcmc robust beta', alpha=16, beta=4, tolerance=1.005)
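
To make the two quantities in the steps above more concrete, here is a minimal sketch in plain Python that does not use derdava. It assumes that "independent" means each data source stays independently with its own probability, so the probability that exactly a given coalition stays is a product over all sources, and it computes textbook exact Shapley values for a toy utility. The names coalition_staying_probability, exact_shapley_values, and toy_utility are hypothetical; this is not the package's implementation.

```python
from itertools import combinations
from math import factorial


def coalition_staying_probability(coalition, staying_probabilities):
    """Probability that exactly `coalition` stays, assuming independent sources."""
    prob = 1.0
    for i, p in staying_probabilities.items():
        prob *= p if i in coalition else (1.0 - p)
    return prob


def all_coalitions(support):
    """Yield every subset of `support` as a frozenset."""
    for r in range(len(support) + 1):
        for members in combinations(support, r):
            yield frozenset(members)


def exact_shapley_values(support, utility):
    """Standard Shapley value: weighted average of marginal contributions."""
    n = len(support)
    values = {}
    for i in support:
        others = [j for j in support if j != i]
        total = 0.0
        for r in range(n):  # r = size of the coalition that i joins
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for members in combinations(others, r):
                s = frozenset(members)
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


def toy_utility(coalition):
    """Hypothetical utility with diminishing returns in coalition size."""
    return len(coalition) ** 0.5


support = (0, 1, 2)
staying_probabilities = {i: 0.5 for i in support}

coalition_probs = {
    c: coalition_staying_probability(c, staying_probabilities)
    for c in all_coalitions(support)
}
toy_shapley = exact_shapley_values(support, toy_utility)
```

Exact enumeration visits every coalition, so it is only practical for a handful of data sources; that cost is one reason sampling-based valuation functions, such as the MCMC variant named in the call above, are useful for larger supports.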
    

Please refer to the documentation for more details on each submodule.