Suppose you have a machine learning dataset for training, where only a few data items have a positive label (class = 1), but all the other data items are unlabeled and could be either negative (class = 0) or positive. This is called a positive and unlabeled learning (PUL) problem. PUL problems often appear in medical scenarios (only a few patients are diagnosed as class 1, all others are unknown) and in security scenarios.

To make sense of PUL data and use it to train a prediction model, you must somehow use the information contained in the data to make intelligent guesses about the labels for the unlabeled items. In particular, identifying unlabeled items that are almost certainly class 0 is called "finding reliable negatives".

This is a very difficult problem. I've experimented with dozens of schemes for identifying reliable negatives in PUL data. The bottom line is that all techniques have many hyperparameters and results can vary wildly.

For my experiments, I set up a synthetic dataset of 200 items of employee information. The data looks like:

  -2   0.39   0   0   1   0.5120   0   1   0
   1   0.24   1   0   0   0.2950   0   0   1
  -2   0.36   1   0   0   0.4450   0   1   0
  -2   0.50   0   1   0   0.5650   0   1   0
  -2   0.19   0   0   1   0.3270   1   0   0
  . . .

The first column is the class label, introvert or extrovert, encoded as 1 = positive = extrovert (20 items) and -2 = unlabeled (180 items). The goal of PUL is to intelligently guess 0 = negative or 1 = positive for as many of the unlabeled items as possible.

The other columns in the dataset are employee age (normalized by dividing by 100), city (one of three, one-hot encoded), annual income (normalized by dividing by $100,000), and job-type (one of three, one-hot encoded).

The dataset was artificially constructed so that even-numbered items [0], [2], [4], etc., are actually class 0 = negative, and odd-numbered items [1], [3], [5], etc., are actually class 1 = positive. This allows the PUL system to measure its accuracy. In a non-demo PUL scenario, you won't know the true class labels.
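As a minimal sketch of how this data might be loaded (the file name employee_pul_200.txt and the whitespace-separated layout are my assumptions), NumPy works fine, and the even/odd construction means the true classes can be recovered from the item indices for scoring purposes:

  import numpy as np

  # load the 200-item PUL dataset -- 9 whitespace-separated
  # columns: label, age, city (3), income, job-type (3);
  # the file name is hypothetical
  data = np.loadtxt("employee_pul_200.txt", dtype=np.float32)

  pul_labels = data[:, 0].astype(np.int64)  # 1 = positive, -2 = unlabeled
  features = data[:, 1:]                    # the 8 predictor columns

  # true classes are known only because of the artificial
  # even = 0 / odd = 1 construction -- used for scoring only
  true_labels = np.arange(len(data)) % 2    # [0, 1, 0, 1, ...]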

My latest exploration used this approach:

  create a dataset with all 20 known positive items
    and 20 items with random inputs marked as negative

  use dataset to train a binary classifier (where
    the output is a p-value between 0 and 1)

  scan dataset to find min p-score and max p-score
    for the 20 known positive items

  loop each item of the PUL data
    feed item to binary classifier and
      compute the p-score
    if label = 1 then
      it's a known positive, continue
    else if p-score less-than min_p_score * 0.9
      mark this item as a reliable negative class 0
    else if p-score greater-than max_p_score * 0.9
      mark this item as a reliable positive class 1
    else
      not enough evidence so leave as unlabeled
    end-if
  end-loop
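Here is a minimal runnable PyTorch sketch of this scheme, continuing from the loading snippet above. The 8-(10-10)-1 architecture, tanh activation, full-batch SGD, epoch count, learning rate, and uniform-random inputs for the fake negative items are all my illustrative assumptions -- each is one of the hyperparameters that matters:

  import numpy as np
  import torch as T

  device = T.device("cpu")

  class Net(T.nn.Module):
    # 8-(10-10)-1 binary classifier; architecture is my choice
    def __init__(self):
      super().__init__()
      self.hid1 = T.nn.Linear(8, 10)
      self.hid2 = T.nn.Linear(10, 10)
      self.oupt = T.nn.Linear(10, 1)

    def forward(self, x):
      z = T.tanh(self.hid1(x))
      z = T.tanh(self.hid2(z))
      return T.sigmoid(self.oupt(z))   # p-score in (0, 1)

  def find_reliable(features, pul_labels):
    # features: (200, 8) float32 ndarray; pul_labels: 1 or -2
    pos_x = features[pul_labels == 1]          # 20 known positives
    rnd_x = np.random.uniform(0.0, 1.0,        # fake negatives --
      size=pos_x.shape).astype(np.float32)     # random inputs
    train_x = T.tensor(np.vstack((pos_x, rnd_x)), dtype=T.float32)
    train_y = T.tensor([[1.0]] * len(pos_x) + [[0.0]] * len(rnd_x))

    net = Net().to(device)
    loss_fn = T.nn.BCELoss()
    opt = T.optim.SGD(net.parameters(), lr=0.01)   # lr is a guess
    net.train()
    for epoch in range(1000):                  # full-batch training
      opt.zero_grad()
      loss = loss_fn(net(train_x), train_y)
      loss.backward()
      opt.step()

    net.eval()
    with T.no_grad():
      pos_p = net(T.tensor(pos_x)).numpy().ravel()
      all_p = net(T.tensor(features)).numpy().ravel()
    min_p, max_p = pos_p.min(), pos_p.max()

    new_labels = pul_labels.copy()
    for i in range(len(features)):
      if pul_labels[i] == 1:
        continue                               # known positive
      elif all_p[i] < min_p * 0.9:
        new_labels[i] = 0                      # reliable negative
      elif all_p[i] > max_p * 0.9:
        new_labels[i] = 1                      # reliable positive
      # else: not enough evidence, leave as -2
    return new_labels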

Once you have examined the PUL data and identified reliable negatives (and new reliable positives), you can either 1.) repeat the process with the updated dataset, or 2.) toss out the unlabeled items and then use the dataset to train a prediction model.
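Continuing with the names from the sketches above, option 2 is just a Boolean mask, and the even/odd construction makes it possible to score the guesses:

  new_labels = find_reliable(features, pul_labels)

  # option 2: discard items still marked -2; what remains is an
  # ordinary binary classification training set
  keep = new_labels != -2
  final_x = features[keep]
  final_y = new_labels[keep]

  # demo-only scoring: compare the guesses against the known
  # even/odd true classes (unavailable in a real PUL scenario)
  guessed = (new_labels != -2) & (pul_labels != 1)
  num_correct = np.sum(new_labels[guessed] == true_labels[guessed])
  print(f"{num_correct} of {np.sum(guessed)} guesses correct")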

The ideas are conceptually very simple, but implementation is tricky. My results were quite satisfactory -- but they depend on over a dozen hyperparameters (batch_size, optimization algorithm, learning rate, NN architecture, weight initialization algorithm, and so on).

Interesting topic.



Here are three cars made in 1970 that routinely show up in Internet searches for "ugliest cars of the 70s" and so they'd be labeled class 1 = positive (ugly). But I would assign a class label of class 0 = not ugly to all three. Left: AMC Javelin AMX (a competitor to the Ford Mustang of the time). Center: Datsun (Nissan) 510 in front of Univ. of Calif. at Irvine, which was under construction at the time. I had this model of car and went to UCI when it was still under construction. Right: AMC Pacer. A weird but appealing (to me) car with a passenger side door that was 4 inches longer than the driver side door!


Code (PyTorch) below. Long.