Trainable classifier

I usually blog only in Swedish, but from now on I will write in Swedish or English depending on the content. In this case the choice is easy, since this article is about a fantastic new technique that can help your organization find patterns in its information, which can then serve as a basis for information classification. For now, it only works for information in English.

Why and when are trainable classifiers useful?

Today we can either let the author of the information decide the correct classification/label, or we can automatically detect content such as words, phrases or expression types that determine the correct classification.

For example, a document can be classified automatically if it contains anything from predefined information types, such as credit card numbers or social security numbers, to organization-specific types, such as project names or unique identifiers of drawings or recipes. The classification can then trigger encryption or other data loss prevention techniques.
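To make the idea of pattern-based detection concrete, here is a minimal sketch in Python. It is not how Microsoft 365 implements its sensitive information types; the regular expression, the function names and the label names "Confidential" and "General" are all my own assumptions for illustration. It looks for 13 to 16 digit sequences and keeps only those that pass the Luhn checksum used by payment cards.

```python
import re

# Hypothetical pattern: 13 to 16 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counted from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit sequences in text that match the pattern and pass Luhn."""
    found = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


def classify(text: str) -> str:
    """Map the detection result to a hypothetical label name."""
    return "Confidential" if find_card_numbers(text) else "General"
```

The checksum step is there to reduce false positives: a random 16-digit order number will usually fail it, while a real card number will pass.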

But what if the organization can’t identify what is unique about certain information? For example, several projects may have worked with sensitive information about a new invention, yet their documents have no unique element in common.
The trainable classifier is meant for exactly these cases. Based on machine learning, it can identify patterns by looking at existing documents. Basically, you point out a location that contains this information. Once it has processed the files in that location, you test the result by giving it a mix of matching and non-matching documents, and you manually help the classifier reduce false positive predictions.

Let’s have a look in the compliance portal to find out more.

Trainable classifiers can be found under Data classification in the Microsoft 365 compliance portal.

Before you can start using this function, it needs to scan your content, which can take up to 2 weeks. I tried this in two different tenants: in our production tenant, which holds a lot of data, it took 8 days, and in our test tenant, with less data, it took 12 days.

When the analysis is done, you will find 6 predefined classifiers.

For instance, I see high value in the “Threat” classifier, which detects a specific category of offensive language: text containing threats to commit violence or do physical harm or damage to a person or property. One example of an action here would be to identify and block email and chat messages containing this kind of content.

Create a trainable classifier

Let’s create our own trainable classifier. The requirement is that the content is stored in SharePoint Online. The file types that are supported are listed here. You need at least 50 files; if the location includes more, the latest 500 files are the ones that will be scanned.
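The two limits above can be expressed as a small pre-check. This is only a sketch of the rule as I describe it in this article, not of what the service does internally. The names `SeedFile` and `select_seed_files` are hypothetical, and I am assuming that "latest" means most recently modified.

```python
from datetime import datetime
from typing import NamedTuple

MIN_SEED_FILES = 50      # minimum number of files, as stated in this article
MAX_SCANNED_FILES = 500  # only the latest files are scanned, as stated in this article


class SeedFile(NamedTuple):
    name: str
    modified: datetime


def select_seed_files(files: list[SeedFile]) -> list[SeedFile]:
    """Return the files that would be scanned, newest first.

    Assumes "latest" means most recently modified.
    Raises ValueError if there are fewer than MIN_SEED_FILES files.
    """
    if len(files) < MIN_SEED_FILES:
        raise ValueError(
            f"need at least {MIN_SEED_FILES} files, found {len(files)}"
        )
    newest_first = sorted(files, key=lambda f: f.modified, reverse=True)
    return newest_first[:MAX_SCANNED_FILES]
```

Running a check like this against a file listing before you point the classifier at the location saves you from waiting hours only to find out that the folder was too small.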

I couldn’t think of anything more imaginative than uploading 50+ RMS logs, whose file extension I changed to .txt so that the files could be crawled. This is just for testing and demonstration, but a real business need could be that sensitive log data needs to be identified and retained for a specific time.
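If you have many log files, changing the extension by hand gets tedious. Here is a minimal sketch of how it could be scripted before uploading. The `.log` source extension, the function name and the folder names are assumptions for illustration; adjust them to whatever your logs actually look like. The script copies the files instead of renaming them, so the originals stay untouched.

```python
import shutil
from pathlib import Path


def copy_logs_as_txt(
    source_dir: Path, target_dir: Path, source_suffix: str = ".log"
) -> list[Path]:
    """Copy every file with source_suffix into target_dir with a .txt extension.

    The original files are left unchanged. Returns the paths of the copies.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(source_dir.glob(f"*{source_suffix}")):
        if not source.is_file():
            continue
        target = target_dir / source.with_suffix(".txt").name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied


if __name__ == "__main__":
    # Hypothetical folder names, replace with your own.
    result = copy_logs_as_txt(Path("rms-logs"), Path("rms-logs-txt"))
    print(f"Copied {len(result)} files")
```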

The syntax in those two fields is very strict and needs to be as follows:

As you can see in the following picture, it can take up to 24 hours to analyze the content. You need to be patient when testing this.

…after 3 hours I was able to start testing my classifier.

I started out by creating two Word documents: one with RMS log data, and another with only some characters from a log plus other data that shouldn’t be identified as an RMS log. I uploaded these to another folder in SharePoint and added this location.

The work wasn’t done after reviewing 2 files. The portal now showed the following:

As you can read under Classifier accuracy above, it’s recommended (and also required) to test at least 200 items.
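To get a feeling for what the review step measures, here is a minimal sketch that summarizes a set of manual reviews. I don’t know exactly how the portal calculates its accuracy figure, so this uses the standard textbook definitions of accuracy and precision; the names `Review`, `Summary` and `summarize` are hypothetical.

```python
from typing import NamedTuple, Optional

MIN_REVIEWED_ITEMS = 200  # minimum number of test items mentioned in this article


class Review(NamedTuple):
    predicted_match: bool  # the classifier said "this is a match"
    is_match: bool         # the human reviewer's verdict


class Summary(NamedTuple):
    reviewed: int
    false_positives: int
    accuracy: float
    precision: Optional[float]  # None when nothing was predicted as a match
    enough_items: bool


def summarize(reviews: list[Review]) -> Summary:
    """Summarize manual reviews using standard accuracy and precision."""
    if not reviews:
        raise ValueError("no reviewed items")
    true_pos = sum(1 for r in reviews if r.predicted_match and r.is_match)
    false_pos = sum(1 for r in reviews if r.predicted_match and not r.is_match)
    correct = sum(1 for r in reviews if r.predicted_match == r.is_match)
    predicted = true_pos + false_pos
    return Summary(
        reviewed=len(reviews),
        false_positives=false_pos,
        accuracy=correct / len(reviews),
        precision=true_pos / predicted if predicted else None,
        enough_items=len(reviews) >= MIN_REVIEWED_ITEMS,
    )
```

With only a handful of reviewed items, a single wrong prediction moves these numbers a lot, which is one plausible reason for requiring a larger test set.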

I uploaded 200+ files with a mix of correct logs and other types of log files and continued with the review.
When this was done, it was possible to publish my classifier.

I now have a new classifier ready to use, which is shown together with the predefined classifiers created by Microsoft.

Start using the trainable classifier

Let’s have a look at the different types of actions that can be taken based on the trainable classifier.

It can be used as a condition for automatically applying a sensitivity label in Office apps:

The end-user experience is that, as soon as RMS log data is added to an Office document, the app recommends changing the label (or changes it automatically) and protects the file.

The classifier can be used as a condition for a Retention Label (that can retain or delete content).
For now I had to use the classic Security & Compliance portal to be able to choose a trainable classifier as a condition for a Retention Label.

I can then create a Data Loss Prevention policy based on this retention label for SharePoint, OneDrive, Teams etc., to be able to act on data at rest in Office 365.

You can find more information and examples here

The trainable classifier is another good example of a technique that can help the business identify and act on sensitive information. As always, it’s important to include the organization’s information owners and the appointed CISO or equivalent role in this work.