OK, I have an sklearn classifier and now I want to extend it. Is it possible?

April 2, 2020

Our product contains a scikit-learn classifier (sklearn.linear_model.SGDClassifier, to be precise). We deliver this trained model to our customers, but in some cases the classes in the model differ from the classes the customer wants to detect.

In this article I’m going to build a news classifier that splits a news feed into three classes: “rec.sport.baseball”, “talk.politics.guns” and “comp.sys.ibm.pc.hardware”. These are standard classes from the 20 newsgroups dataset. Then I’ll extend the model so that it can also classify “sci.electronics” news.

We have some constraints:

We deliver models to our customers’ hardware
We have no way to get the customers’ documents to train the model
There are several document types (*.docx, *.pdf, *.jpg, etc.). Before training the model we extract text from these documents, and this process takes too much time
In some cases the customer doesn’t have enough samples for a particular document class (for example, only 10 contracts or 10 invoices)
We can’t send our dataset to new customers because of the NDA

There are several ways to accomplish this task (illustrated here with the news example).

Train a new classifier with new articles about electronics

This is the simplest solution, but we can’t use it because of the third and fourth constraints (too much time before the model can be tested, and not enough examples of non-electronics documents).

Send existing documents for “rec.sport.baseball”, “talk.politics.guns” and “comp.sys.ibm.pc.hardware” topics to the customer

We could solve the fourth problem by sending our dataset of politics, sports and hardware texts. But this runs into the fifth constraint (the NDA).

Encrypt existing documents for “rec.sport.baseball”, “talk.politics.guns” and “comp.sys.ibm.pc.hardware” topics and then send them to the customer

There are two different ways to do this. Let’s agree that we use some fast hash function, because we need to encrypt large texts and customers must not be able to decrypt them.

Decrypt texts on the fly and then train the model

The main problem here is that our texts will sit in memory as plain text. Anyone can dump the memory and then read the whole dataset.

Another problem is that the customer can generate hashes for thousands of words and find out which word corresponds to which hash in the dataset.

Encrypt the customer’s documents the same way and then train the model on encrypted texts

This approach looks like a solution, but some points need to be checked. First of all, the problem of decryption by generating thousands of hashes still exists here.
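
The dictionary attack described above can be sketched in a few lines. This is only an illustration under assumptions: SHA-256 from the standard library stands in for the unspecified "fast hash function", and `hash_token`, `hash_text` and the tiny word list are hypothetical names and data, not part of the product.

```python
import hashlib


def hash_token(token: str) -> str:
    """Hash one word. SHA-256 is an assumption; any unkeyed hash behaves the same way."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:16]


def hash_text(text: str) -> str:
    """Replace every whitespace-separated word with its hash."""
    return " ".join(hash_token(token) for token in text.lower().split())


# What the vendor would ship instead of the plain text.
shipped = hash_text("The pitcher threw a fastball")

# What a curious customer can do: hash a word list and invert the mapping.
word_list = ["a", "ball", "fastball", "gun", "pitcher", "the", "threw"]
lookup = {hash_token(word): word for word in word_list}
recovered = " ".join(lookup.get(h, "?") for h in shipped.split())

print(recovered)
```

A keyed hash would block this particular lookup, but the key would then have to live on the customer’s hardware to hash their own documents, so the same concern is likely to come back in another form.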

Generate texts from the existing model and train a new model on those texts

There are also two cases here, both based on the existing classifier and vectorizer (TfidfVectorizer in our case).

Generate documents for each existing class in the model, add a new class with new texts, and then train a new model on this new dataset

The simplified algorithm looks like this:

train_samples = []

for document_class in classifier.classes_:
    # words that are most specific to this class
    words = get_specific_words_from_vectorizer_and_classifier(document_class)
    samples = generate_N_samples_for_class(words, document_class)
    train_samples.extend(samples)
train_samples.extend(new_samples)

model.fit(train_samples)

In our case the F1 score was lower than when we trained the model on real texts. But in this case we don’t need to send our dataset to the customers, so this solution looks nice.
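
One way to read the pseudocode above is sketched below with NumPy. It rests on assumptions that the text does not confirm: "specific words" are taken to be the `top_k` features with the largest weight in each class's row of `coef_`, and documents are built by sampling those words uniformly. The function name, its parameters and the toy weights are all hypothetical.

```python
import numpy as np


def generate_samples_per_class(coef, feature_names, classes,
                               n_samples=5, n_words=8, top_k=3, seed=0):
    """Build synthetic documents for every class from its highest-weighted words.

    coef          -- array of shape (n_classes, n_features), like a linear model's coef_
    feature_names -- the vocabulary, ordered by column index
    classes       -- class labels, ordered like the rows of coef
    """
    rng = np.random.default_rng(seed)
    feature_names = np.asarray(feature_names)
    texts, labels = [], []
    for row, label in zip(np.asarray(coef), classes):
        # indices of the top_k largest weights for this class
        top_words = feature_names[np.argsort(row)[-top_k:]]
        for _ in range(n_samples):
            texts.append(" ".join(rng.choice(top_words, size=n_words)))
            labels.append(label)
    return texts, labels


# Toy stand-ins for the vectorizer vocabulary and classifier.coef_ (made-up numbers).
feature_names = ["bios", "scsi", "motherboard", "bat", "inning", "pitcher",
                 "rifle", "firearm", "ammo"]
classes = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball", "talk.politics.guns"]
coef = np.array([
    [0.7, 0.9, 0.8, -0.4, -0.3, -0.2, -0.6, -0.5, -0.1],
    [-0.3, -0.2, -0.1, 0.9, 0.8, 0.7, -0.5, -0.6, -0.4],
    [-0.2, -0.3, -0.1, -0.5, -0.7, -0.6, 0.8, 0.9, 0.7],
])

texts, labels = generate_samples_per_class(coef, feature_names, classes)
print(labels[0], "->", texts[0])
```

The generated texts would then be combined with the real documents of the new class and passed to a fresh vectorizer and classifier.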

Generate documents from the model and train new binary classifier: existing class vs new class

In this case we have to create a copy of the existing classifier model and update some of its parameters:

train_samples = []
words = get_all_words_from_vectorizer_and_model()
# generated documents for all existing classes share one label, "Other"
samples = generate_samples_for_existing_classes_and_save_them_as_an_other_class(words)
train_samples.extend(samples)
train_samples.extend(new_samples_with_new_class_name)
new_model.fit(train_samples)
clone_model = copy(model)
clone_model.update_params_from_model(new_model)

This article is about this approach.

Generate documents from the model and train new binary classifier: existing class vs new class

We’ll work with the default 20 newsgroups dataset from the scikit-learn library, so we need some imports for our task:

from typing import List, Tuple

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import clone as clone_model

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from stop_words import get_stop_words

np.random.seed(0)

So we need to download and prepare the dataset:

twenty_newsgroup = fetch_20newsgroups(data_home='./', subset='all')
subj_id_subj_title = dict(zip(range(len(twenty_newsgroup.target_names)), twenty_newsgroup.target_names))

twenty_newsgroup_df = pd.DataFrame()
twenty_newsgroup_df['Text'] = twenty_newsgroup.data
twenty_newsgroup_df['Target'] = twenty_newsgroup.target

# replace subject's id with the subject's name
twenty_newsgroup_df['Target'] = twenty_newsgroup_df['Target'].replace(subj_id_subj_title)

# filter dataset
source_cat = ['rec.sport.baseball', 'talk.politics.guns', 'comp.sys.ibm.pc.hardware']
added_cat = ['sci.electronics']  # a list, so that "+" and isin() below work

filtered = twenty_newsgroup_df.copy()
filtered = filtered[filtered['Target'].isin(source_cat + added_cat)]

# prepare initial and extended (additional) datasets
initial_df = filtered[filtered['Target'].isin(source_cat)]
additional_df = filtered[filtered['Target'].isin(added_cat)]

initial_train, initial_test = train_test_split(initial_df, test_size=0.3, random_state=0)
additional_train, additional_test = train_test_split(additional_df, test_size=0.3, random_state=0)

Here we have four datasets: train and test datasets for initial data and train and test datasets for additional data.

Now we can create a new pipeline with TfidfVectorizer and SGDClassifier:

vectorizer = TfidfVectorizer(
    use_idf=True, 
    min_df=10,
    max_features=100000,
    ngram_range=(1, 3),
    stop_words=get_stop_words('en'),
    norm='l2')
clf = SGDClassifier(loss='modified_huber', alpha=0.0001, penalty='l2', max_iter=500, random_state=0)

initial_pipeline = Pipeline([
        ('vect', vectorizer),
        ('clf', clf)
    ])

initial_pipeline.fit(initial_train['Text'], initial_train['Target'])

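print('Classification report:')
# Note: classification_report expects (y_true, y_pred). This call passes the
# predictions first, so in the output the precision and recall columns are
# swapped and "support" counts predicted labels.
print(classification_report(initial_pipeline.predict(initial_test['Text']), initial_test['Target']))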

print('\nROC-AUC Score: {}'.format(roc_auc_score(
            initial_test['Target'], 
            initial_pipeline.predict_proba(initial_test['Text']), 
            multi_class='ovo')))

The output will look like this:

Classification report:
                          precision    recall  f1-score   support

comp.sys.ibm.pc.hardware       0.99      0.98      0.99       294
      rec.sport.baseball       0.98      0.98      0.98       308
      talk.politics.guns       0.98      0.99      0.98       264

                accuracy                           0.98       866
               macro avg       0.98      0.99      0.99       866
            weighted avg       0.99      0.98      0.98       866


ROC-AUC Score: 0.9994522360572858

And here we come to the main part of the article, adding a new class to the existing classifier:

def generate_texts(tokens: List[str]) -> List[str]:
    texts = [' '.join(x) for x in np.random.choice(tokens, (2000, 100))]
    return texts

def extend_classifer(new_df: pd.DataFrame, initial_pipeline: Pipeline) -> Pipeline:
    initial_vect = initial_pipeline.named_steps['vect']
    initial_model = initial_pipeline.named_steps['clf']
    
    # Generate samples for the existing classes from the fitted TfidfVectorizer's vocabulary
    # (on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2)
    generated_texts = generate_texts(initial_vect.get_feature_names())
    
    # Create a new SGDClassifier with the same parameters as the previous one
    clf = SGDClassifier(loss='modified_huber', alpha=0.0001, penalty='l2', max_iter=500, random_state=0)
    
    # Create new dataframe with generated texts and new data
    df = pd.DataFrame({'Text': generated_texts, 'Target': ['Other'] * len(generated_texts)})
    new_df = pd.concat([new_df, df])
    
    # Transform new texts with existing fitted vectorizer
    X = initial_vect.transform(new_df['Text'])
    y = new_df['Target']

    clf.fit(X, y)
    
    # Clone initial model
    new_model = clone_model(initial_model)
    
    # Update initial classifier's attributes
    new_model.classes_ = np.append(initial_model.classes_, clf.classes_[1])
    new_model.coef_ = np.append(initial_model.coef_, clf.coef_, axis=0)
    new_model.intercept_ = np.append(initial_model.intercept_, clf.intercept_)
    
    # New pipeline with updated classifier
    return Pipeline([
            ('vect', initial_vect),
            ('clf', new_model)
        ])
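
Why can one extra row be appended like this? SGDClassifier handles multiple classes with a one-vs-all scheme: each class gets its own score, x · w_c + b_c, and the class with the highest score wins. An extra row in coef_ and an extra entry in intercept_ therefore add one more candidate without changing the old scores. The NumPy sketch below mimics this with made-up weights; `decision_scores` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np


def decision_scores(X, coef, intercept):
    """One score per (sample, class): the formula a linear model uses."""
    return X @ coef.T + intercept


# Made-up weights of an "existing" 3-class model over 4 features.
classes = np.array(["comp.sys.ibm.pc.hardware", "rec.sport.baseball", "talk.politics.guns"])
coef = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
intercept = np.array([-0.5, -0.5, -0.5])

# Made-up weights of a binary "Other vs sci.electronics" model: a single row.
new_coef = np.array([[-0.2, -0.2, -0.2, 1.0]])
new_intercept = np.array([-0.5])

# Stack them the same way as the np.append calls above.
ext_classes = np.append(classes, "sci.electronics")
ext_coef = np.append(coef, new_coef, axis=0)
ext_intercept = np.append(intercept, new_intercept)

X = np.array([
    [0.0, 1.0, 0.0, 0.0],  # looks like baseball
    [0.0, 0.0, 0.0, 1.0],  # uses only the "electronics" feature
])
old_scores = decision_scores(X, coef, intercept)
new_scores = decision_scores(X, ext_coef, ext_intercept)

print(ext_classes[new_scores.argmax(axis=1)])
```

The first three columns of the score matrix stay exactly as they were. The new row, however, was fitted against random "Other" documents and not against the three original classifiers, so its scores may not be on a comparable scale, which could be one reason the old classes lose some quality.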

Now we have to check the new pipeline:

p = extend_classifer(additional_train, initial_pipeline)

test = pd.concat([initial_test, additional_test])
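print('Classification report:')
# Note: the predictions are passed first here, while classification_report expects
# (y_true, y_pred), so the precision and recall columns in the output are swapped
# and "support" counts predicted labels.
print(classification_report(p.predict(test['Text']), test['Target']))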

print('\nROC-AUC Score: {}'.format(roc_auc_score(
            test['Target'],
            p.predict_proba(test['Text']),
            multi_class='ovo')))

Output:

Classification report:
                          precision    recall  f1-score   support

comp.sys.ibm.pc.hardware       0.78      0.87      0.82       261
      rec.sport.baseball       0.83      1.00      0.91       258
         sci.electronics       0.88      0.64      0.74       409
      talk.politics.guns       0.88      1.00      0.93       234

                accuracy                           0.84      1162
               macro avg       0.84      0.88      0.85      1162
            weighted avg       0.85      0.84      0.84      1162


ROC-AUC Score: 0.6682616497756867

Not exciting results, right?

In my real case I got more interesting results: the f1-score for all classes decreased by only 1%. So maybe it will work in your case too?

The whole Jupyter notebook can be found on my GitHub.
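
One detail may affect the ROC-AUC figure above. According to the scikit-learn documentation, for multiclass input roc_auc_score expects the columns of y_score in the sorted order of the labels, while the extended model's classes_ has "sci.electronics" appended last, after "talk.politics.guns". A possible fix, sketched below with NumPy and made-up probabilities, is to reorder the predict_proba columns before scoring. `model_classes`, `proba` and `sorted_proba` are illustrative names, and the score was not recomputed here.

```python
import numpy as np

# Class order of the extended model: the new class is appended at the end.
model_classes = np.array([
    "comp.sys.ibm.pc.hardware",
    "rec.sport.baseball",
    "talk.politics.guns",
    "sci.electronics",
])

# Made-up predict_proba output for two documents, columns in model_classes order.
proba = np.array([
    [0.1, 0.1, 0.1, 0.7],
    [0.0, 0.2, 0.8, 0.0],
])

# Column permutation that puts the classes into sorted (lexicographic) order.
order = np.argsort(model_classes)
sorted_classes = model_classes[order]
sorted_proba = proba[:, order]

print(sorted_classes)
print(sorted_proba)
```

With the real pipeline the same idea would be to take `classes = p.named_steps['clf'].classes_` and pass `p.predict_proba(test['Text'])[:, np.argsort(classes)]` as y_score.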