Test TracIn's effectiveness in text classification #8

Description

@yanan1116

Hello,

I adapted the code from https://github.com/frederick0329/TracIn/blob/master/imagenet/resnet50_imagenet_proponents_opponents.ipynb
for text classification.

The primary goal of my task is to rank the training samples by their positive or negative impact on a clean validation set. The metric can be either accuracy or cross-entropy loss, so the setup is quite straightforward. The training set has 100-200 samples and the validation set has no more than 100, so this is a low-data regime.

The validation set contains no labelling errors; it is clean.

The labels include politics, business, tech, entertainment, etc. It is a public news topic classification task: AG NEWS.

As for the classifier, similar to your ResNet in the image example, I use CMLM from TensorFlow Hub to encode every sample as a 1024-dimensional sentence embedding. The classifier itself is therefore quite simple: a single-layer network.
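To make the setup concrete, here is a minimal sketch of how a checkpoint-based TracIn score could be computed for a single softmax layer on top of fixed embeddings. It is not the code from either linked file. It assumes a bias-free linear layer trained with cross-entropy, in which case the loss gradient with respect to the weights is the outer product of `softmax(Wx) - one_hot(y)` and `x`, so the gradient dot product between two examples factorises into two small dot products. The names `output_error`, `tracin_score`, and `influence_on_validation` are hypothetical, as is the representation of checkpoints as lists of weight rows.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def output_error(weights, x, label):
    # weights: one row per class (assumed bias-free linear layer).
    # Returns softmax(Wx) - one_hot(label), the gradient of the
    # cross-entropy loss with respect to the logits.
    probs = softmax([dot(row, x) for row in weights])
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]

def tracin_score(checkpoints, learning_rates, train_example, val_example):
    # Sum over checkpoints of lr * <grad_W loss(train), grad_W loss(val)>.
    # For a linear softmax layer this dot product equals
    # <error_train, error_val> * <x_train, x_val>.
    x_tr, y_tr = train_example
    x_va, y_va = val_example
    score = 0.0
    for weights, lr in zip(checkpoints, learning_rates):
        e_tr = output_error(weights, x_tr, y_tr)
        e_va = output_error(weights, x_va, y_va)
        score += lr * dot(e_tr, e_va) * dot(x_tr, x_va)
    return score

def influence_on_validation(checkpoints, learning_rates, train_example, val_set):
    # Aggregate influence of one training example over the whole validation set.
    return sum(
        tracin_score(checkpoints, learning_rates, train_example, v)
        for v in val_set
    )
```

Under these assumptions, a training example whose embedding and label agree with a validation example gets a positive score (a proponent), and the same embedding with a flipped label gets a negative one (an opponent). Sorting training examples by `influence_on_validation` gives the ranking described above.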

Here is my implementation:
https://github.com/yananchen1989/topic_classification_augmentation/blob/main/cmlm_proponents_opponents.py

Finally, I use AUC to test effectiveness: a high AUC indicates that correctly labelled samples get higher influence scores, while wrongly labelled samples get lower, negative scores.
However, the AUC is only 0.55, which is close to chance (0.5). Quite woeful.
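For reference, the evaluation described here can be sketched as a rank-based AUC, where the positive class is "correctly labelled" and the score is the aggregated influence on the validation set. This is an illustrative, dependency-free version and not the code in the linked script; the function name `roc_auc` and the `is_clean` flag list are assumptions introduced for the example.

```python
def roc_auc(scores, is_clean):
    # Probability that a randomly chosen clean sample has a higher
    # influence score than a randomly chosen mislabelled sample
    # (ties count as half), which equals the area under the ROC curve.
    clean = [s for s, c in zip(scores, is_clean) if c]
    noisy = [s for s, c in zip(scores, is_clean) if not c]
    if not clean or not noisy:
        raise ValueError("need at least one clean and one mislabelled sample")
    wins = 0.0
    for c in clean:
        for n in noisy:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(clean) * len(noisy))
```

One thing worth checking with a helper like this is the sign convention: if the scores were negated somewhere along the way, an informative ranking would show up as an AUC well below 0.5 rather than near it.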

I am not sure whether there is a bug in my implementation or whether I am not using TracIn appropriately.
