OK, I have a scikit-learn classifier and now I want to extend it. Is it possible?
Our product includes a scikit-learn classifier (sklearn.linear_model.SGDClassifier, to be more precise).
We deliver this trained model to our customers, but in some cases the classes in the model differ from the classes the customer wants to detect.
In this article I’m going to build a news classifier that splits a news feed into three classes: “rec.sport.baseball”, “talk.politics.guns” and “comp.sys.ibm.pc.hardware”. These are default classes from the 20 newsgroups dataset. Then I’ll extend the model so that it can also classify “sci.electronics” news.
We have some constraints:
- We deliver models to our customers’ hardware
- We have no way to get the customers’ documents to train the model
- There are several document types (*.docx, *.pdf, *.jpg, etc.), and before training the model we have to extract text from those documents, which takes too much time
- In some cases the customer doesn’t have enough samples for a particular document class (for example, only 10 contracts or 10 invoices)
- We can’t send our dataset to new customers because of an NDA
There are several ways to accomplish this task (illustrated here with the news example).
Train a new classifier with new articles about electronics
This is the simplest solution, but we can’t use it because of the third and fourth constraints (too much time before the model can be tested, and not enough examples of non-electronics documents).
Send existing documents for “rec.sport.baseball”, “talk.politics.guns” and “comp.sys.ibm.pc.hardware” topics to the customer
We can solve the fourth problem by sending our dataset of politics, sports and hardware texts. But this runs into the fifth constraint.
Encrypt existing documents for “rec.sport.baseball”, “talk.politics.guns” and “comp.sys.ibm.pc.hardware” topics and then send them to the customer
There are two different ways to do this. Let’s agree that we use some fast hash function, because we need to encrypt large texts and customers must not be able to decrypt them.
Decrypt texts on the fly and then train the model
The main problem here is that our texts will sit in memory as plain text. Anyone can dump the memory and then read the whole dataset.
Another problem is that the customer can generate hashes for thousands of words and find out which word corresponds to which hash in the dataset.
Encrypt the customer’s documents the same way and then train the model on the encrypted texts
This approach looks like a solution, but some points need to be checked. First of all, the problem of decryption by generating thousands of hashes exists here too.
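The hash-guessing problem mentioned above can be illustrated with a short sketch. It assumes the texts are protected by hashing each token with an unsalted, deterministic hash; SHA-256 and all function names here are illustrative assumptions, not the scheme from our product.

```python
import hashlib


def hash_token(token: str) -> str:
    """Hash one token with an unsalted, deterministic hash (SHA-256 is an assumption)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def hash_text(text: str) -> list:
    """Replace every whitespace-separated token of a text with its hash."""
    return [hash_token(token) for token in text.lower().split()]


def build_reverse_lookup(candidate_words) -> dict:
    """Hash a dictionary of candidate words to map hash -> word."""
    return {hash_token(word): word for word in candidate_words}


def recover(hashed_tokens, lookup) -> list:
    """Map hashes back to words wherever the dictionary contains them."""
    return [lookup.get(h, "<unknown>") for h in hashed_tokens]


# The "protected" text that would be shipped to the customer
protected = hash_text("the pitcher threw a fastball")

# The customer hashes their own word list and looks the hashes up
lookup = build_reverse_lookup(["the", "a", "pitcher", "threw", "ball"])
recovered = recover(protected, lookup)
print(recovered)
```

Every word that appears in the customer's dictionary is recovered, so with a large enough word list most of the dataset becomes readable again.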
Generate texts from existing model and train new model on that texts
There are also two cases, both based on the existing classifier and vectorizer (TfidfVectorizer in our case).
Generate documents for each existing class in the model, add a new class with new texts, and then train a new model on this new dataset
A simplified algorithm looks like this:
train_samples := []
for document_class in classifier:
    words := get_specific_words_from_vectorizer_and_classifier()
    samples := generate_N_samples_for_class()
    train_samples.append(samples)
train_samples.append(new_samples)
model.fit(train_samples)
In our case the F1 score was lower than when we train the model on real texts. But with this approach we don’t need to send our dataset to the customers, so this solution looks nice.
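The pseudocode above could be sketched roughly as follows. This is only an illustration under assumptions: "specific words" are taken to be the tokens with the largest weights in each class row of the classifier's coefficient matrix, and the toy arrays stand in for `classifier.coef_`, the vectorizer's feature names and `classifier.classes_`. The helper names are hypothetical.

```python
import numpy as np


def top_tokens_per_class(coef, feature_names, classes, top_k=2):
    """For each class row of coef, return the top_k highest-weighted tokens."""
    result = {}
    for row, label in zip(coef, classes):
        top_idx = np.argsort(row)[::-1][:top_k]  # indices of the largest weights first
        result[label] = [str(feature_names[i]) for i in top_idx]
    return result


def generate_samples(tokens, n_samples, sample_len, rng):
    """Build n_samples pseudo-documents by sampling tokens with replacement."""
    return [" ".join(rng.choice(tokens, size=sample_len)) for _ in range(n_samples)]


# Toy stand-ins (assumptions) for a fitted vectorizer and a 3-class linear classifier
feature_names = np.array(["ball", "gun", "disk", "bat", "law", "drive"])
coef = np.array([
    [0.9, -0.2, -0.1, 0.8, -0.3, -0.4],   # baseball
    [-0.1, 0.9, -0.2, -0.3, 0.7, -0.1],   # guns
    [-0.2, -0.3, 0.8, -0.1, -0.2, 0.9],   # hardware
])
classes = ["baseball", "guns", "hardware"]

rng = np.random.default_rng(0)
specific = top_tokens_per_class(coef, feature_names, classes, top_k=2)

texts, labels = [], []
for label, tokens in specific.items():
    samples = generate_samples(tokens, n_samples=3, sample_len=5, rng=rng)
    texts.extend(samples)
    labels.extend([label] * len(samples))
# texts/labels would then be concatenated with the new class's real documents
# and passed to a fresh vectorizer + classifier.
```

How many tokens to keep per class and how long the generated documents should be are tuning choices. The values here are arbitrary.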
Generate documents from the model and train new binary classifier: existing class vs new class
In this case we have to create a copy of the existing classifier model and update some of its parameters:
train_samples := []
words := get_all_words_from_vectorizer_and_model()
samples := generate_samples_for_existing_classes_and_save_them_as_an_other_class()
train_samples.append(samples)
train_samples.append(new_samples_with_new_class_name)
new_model.fit(train_samples)
clone_model := copy(model)
clone_model.update_params_from_model(new_model)
This article is about this approach.
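Why updating the parameters can work at all: a multi-class linear classifier in scikit-learn scores every class with its own linear function and predicts the class with the highest score, so a separately trained "new class vs. other" hyperplane can be appended as one more row. The sketch below shows the idea with plain NumPy; the toy weights and the `predict_ovr` helper are assumptions for illustration, not the real model.

```python
import numpy as np


def predict_ovr(X, coef, intercept, classes):
    """One-vs-rest prediction: one linear score per class, highest score wins."""
    scores = X @ coef.T + intercept
    return classes[np.argmax(scores, axis=1)]


# Existing 3-class model over 4 features (toy weights)
coef = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
intercept = np.zeros(3)
classes = np.array(["sport", "guns", "hardware"])

# Binary "electronics vs. other" model: a single hyperplane
new_coef = np.array([[0.0, 0.0, 0.0, 1.0]])
new_intercept = np.array([0.0])

# Append the new hyperplane as a fourth row and extend the class list
ext_coef = np.vstack([coef, new_coef])
ext_intercept = np.concatenate([intercept, new_intercept])
ext_classes = np.append(classes, "electronics")

X = np.array([
    [0.9, 0.1, 0.0, 0.0],   # mostly feature 0 -> sport
    [0.0, 0.1, 0.0, 0.8],   # mostly feature 3 -> electronics
])
pred = predict_ovr(X, ext_coef, ext_intercept, ext_classes)
print(pred)
```

One caveat of this design: the new row was trained against randomly generated "other" texts, so its scores are not necessarily on the same scale as the rows of the original model.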
Generate documents from the model and train new binary classifier: existing class vs new class
We’ll work with the default 20 newsgroups dataset from the scikit-learn library. So we need some imports for our task:
from typing import List, Tuple
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import clone as clone_model
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from stop_words import get_stop_words
# seed NumPy's global random generator so the generated texts are reproducible
np.random.seed(0)
So we need to download and prepare the dataset:
twenty_newsgroup = fetch_20newsgroups(data_home='./', subset='all')
subj_id_subj_title = dict(zip(range(len(twenty_newsgroup.target_names)), twenty_newsgroup.target_names))
twenty_newsgroup_df = pd.DataFrame()
twenty_newsgroup_df['Text'] = twenty_newsgroup.data
twenty_newsgroup_df['Target'] = twenty_newsgroup.target
# replace subject's id with the subject's name
twenty_newsgroup_df['Target'] = twenty_newsgroup_df['Target'].replace(subj_id_subj_title)
# filter dataset
source_cat = ['rec.sport.baseball', 'talk.politics.guns', 'comp.sys.ibm.pc.hardware']
added_cat = 'sci.electronics'
filtered = twenty_newsgroup_df.copy()
# added_cat is a single string, so wrap it in a list before combining or passing it to isin()
filtered = filtered[filtered['Target'].isin(source_cat + [added_cat])]
# prepare initial and extended (additional) datasets
initial_df = filtered[filtered['Target'].isin(source_cat)]
additional_df = filtered[filtered['Target'].isin([added_cat])]
initial_train, initial_test = train_test_split(initial_df, test_size=0.3, random_state=0)
additional_train, additional_test = train_test_split(additional_df, test_size=0.3, random_state=0)
Here we have four datasets: train and test datasets for initial data and train and test datasets for additional data.
Now we can create a new pipeline with TfidfVectorizer and SGDClassifier:
vectorizer = TfidfVectorizer(
    use_idf=True,
    min_df=10,
    max_features=100000,
    ngram_range=(1, 3),
    stop_words=get_stop_words('en'),
    norm='l2')
clf = SGDClassifier(loss='modified_huber', alpha=0.0001, penalty='l2', max_iter=500, random_state=0)
initial_pipeline = Pipeline([
    ('vect', vectorizer),
    ('clf', clf)
])
initial_pipeline.fit(initial_train['Text'], initial_train['Target'])
print('Classification report:')
# NOTE: classification_report expects (y_true, y_pred); the predictions are passed first here,
# so the precision and recall columns of the report are swapped relative to that convention
print(classification_report(initial_pipeline.predict(initial_test['Text']), initial_test['Target']))
print('\nROC-AUC Score: {}'.format(roc_auc_score(
    initial_test['Target'],
    initial_pipeline.predict_proba(initial_test['Text']),
    multi_class='ovo')))
The output will look like this:
Classification report:
                          precision    recall  f1-score   support

comp.sys.ibm.pc.hardware       0.99      0.98      0.99       294
      rec.sport.baseball       0.98      0.98      0.98       308
      talk.politics.guns       0.98      0.99      0.98       264

                accuracy                           0.98       866
               macro avg       0.98      0.99      0.99       866
            weighted avg       0.99      0.98      0.98       866

ROC-AUC Score: 0.9994522360572858
And here we come to the main part of the article: adding a new class to the existing classifier:
def generate_texts(tokens: List[str]) -> List[str]:
    # 2000 pseudo-documents of 100 tokens each, sampled uniformly from the vocabulary
    texts = [' '.join(x) for x in np.random.choice(tokens, (2000, 100))]
    return texts


def extend_classifer(new_df: pd.DataFrame, initial_pipeline: Pipeline) -> Pipeline:
    initial_vect = initial_pipeline.named_steps['vect']
    initial_model = initial_pipeline.named_steps['clf']
    # generating new samples for existing classes from fitted TfidfVectorizer
    # (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
    generated_texts = generate_texts(initial_vect.get_feature_names_out())
    # Create new SGDClassifier with same as previous parameters
    clf = SGDClassifier(loss='modified_huber', alpha=0.0001, penalty='l2', max_iter=500, random_state=0)
    # Create new dataframe with generated texts and new data
    df = pd.DataFrame({'Text': generated_texts, 'Target': ['Other'] * len(generated_texts)})
    new_df = pd.concat([new_df, df])
    # Transform new texts with existing fitted vectorizer
    X = initial_vect.transform(new_df['Text'])
    y = new_df['Target']
    clf.fit(X, y)
    # Clone initial model (clone() copies hyperparameters only, so the fitted
    # attributes are set by hand below)
    new_model = clone_model(initial_model)
    # Update initial classifier's attributes: clf.classes_ is sorted, so index 1 is
    # the new class ('Other' sorts first) and clf.coef_ is its single hyperplane
    new_model.classes_ = np.append(initial_model.classes_, clf.classes_[1])
    new_model.coef_ = np.append(initial_model.coef_, clf.coef_, axis=0)
    new_model.intercept_ = np.append(initial_model.intercept_, clf.intercept_)
    # New pipeline with updated classifier
    return Pipeline([
        ('vect', initial_vect),
        ('clf', new_model)
    ])
Now we have to check the new pipeline:
p = extend_classifer(additional_train, initial_pipeline)
test = pd.concat([initial_test, additional_test])
print('Classification report:')
# NOTE: as before, the predictions are passed first, so the precision and recall
# columns are swapped relative to classification_report's (y_true, y_pred) convention
print(classification_report(p.predict(test['Text']), test['Target']))
print('\nROC-AUC Score: {}'.format(roc_auc_score(
    test['Target'],
    p.predict_proba(test['Text']),
    multi_class='ovo')))
Output:
Classification report:
                          precision    recall  f1-score   support

comp.sys.ibm.pc.hardware       0.78      0.87      0.82       261
      rec.sport.baseball       0.83      1.00      0.91       258
         sci.electronics       0.88      0.64      0.74       409
      talk.politics.guns       0.88      1.00      0.93       234

                accuracy                           0.84      1162
               macro avg       0.84      0.88      0.85      1162
            weighted avg       0.85      0.84      0.84      1162

ROC-AUC Score: 0.6682616497756867
Not exciting results, right?
In my real case I got more interesting results: the f1-score for all classes decreased by only 1%. So maybe it will work in your case too?
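One thing that may be worth checking before reading too much into the ROC-AUC value: when no labels are given, roc_auc_score expects the columns of the score matrix in the sorted order of the labels, while the extended classifier appends “sci.electronics” at the end of classes_, after “talk.politics.guns”. A possible fix is to reorder the probability columns first. The helper below is a hypothetical sketch with toy numbers, and the score has not been recomputed here.

```python
import numpy as np


def align_proba_to_sorted_labels(proba, classes):
    """Reorder probability columns so they follow the sorted order of the class labels."""
    classes = np.asarray(classes)
    order = np.argsort(classes)
    return proba[:, order], classes[order]


# Class order as produced by appending the new class to the existing ones
classes = np.array([
    "comp.sys.ibm.pc.hardware",
    "rec.sport.baseball",
    "talk.politics.guns",
    "sci.electronics",
])
# Toy probabilities: row 0 favours sci.electronics, row 1 favours talk.politics.guns
proba = np.array([
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.7, 0.1],
])

aligned, sorted_classes = align_proba_to_sorted_labels(proba, classes)
print(sorted_classes)
print(aligned)
```

With the pipeline from this article, the aligned matrix would be computed from `p.predict_proba(test['Text'])` and `p.named_steps['clf'].classes_` and then passed to roc_auc_score in place of the raw probabilities.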
The whole Jupyter notebook can be found on my GitHub.