This page walks through the basics of setting up a natural language processing (NLP) model and onboarding it to the Arthur system to monitor language-specific performance.
The first step is to import functions from the arthurai package and establish a connection with Arthur.
# Arthur imports
from arthurai import ArthurAI
from arthurai.common.constants import InputType, OutputType, Stage, TextDelimiter

# Connect to Arthur
arthur = ArthurAI(url="https://app.arthur.ai", login="<YOUR_USERNAME_OR_EMAIL>")
Registering an NLP Model#
Each NLP model is created with a name and with input_type=InputType.NLP. Here, we register a classification model on text, specifying a text_delimiter of TextDelimiter.NOT_WORD:
arthur_nlp_model = arthur.model(
    name="NLPQuickstart",
    input_type=InputType.NLP,
    model_type=OutputType.Multiclass,
    text_delimiter=TextDelimiter.NOT_WORD,
)
OutputType values currently supported for NLP models are classification, multi-labeling, and regression.
NLP models optionally accept a text_delimiter, which specifies how a raw document is split into tokens.
If no text delimiter is provided, the default is TextDelimiter.NOT_WORD. This delimiter ignores punctuation and tokenizes text based only on the words present.
However, if your model needs to consider punctuation and other non-word text, you should choose a different delimiter to ensure that this text is processed by your NLP model.
For a full list of available text delimiters with examples, see the TextDelimiter constant documentation in our SDK reference.
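To make the effect of the delimiter concrete, the sketch below compares two tokenizations using only Python's re module. It assumes that TextDelimiter.NOT_WORD behaves like the regular expression \W+ (split on runs of non-word characters). The two patterns and the tokenize helper are illustrative assumptions, not the SDK's implementation, so check the SDK reference for the actual delimiter definitions.

```python
import re

# Assumed pattern: split on runs of non-word characters, so punctuation is dropped.
NOT_WORD_PATTERN = r"\W+"
# For comparison: split on whitespace only, so punctuation stays attached to tokens.
WHITESPACE_PATTERN = r"\s+"


def tokenize(document, pattern):
    """Split a raw document on the delimiter pattern, dropping empty tokens."""
    return [token for token in re.split(pattern, document) if token]


doc = "Here-is some text!"
word_tokens = tokenize(doc, NOT_WORD_PATTERN)
whitespace_tokens = tokenize(doc, WHITESPACE_PATTERN)
print(word_tokens)
print(whitespace_tokens)
```

With the non-word pattern, the hyphen and the exclamation mark never reach the model as tokens. With the whitespace pattern they remain part of "Here-is" and "text!". If your model relies on punctuation, that difference is the reason to pick a delimiter other than the default.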
Additionally, Arthur supports sending pre-tokenized text. For steps on registering tokens with Arthur, see our generative text walkthrough.
Formatting Reference/Inference Data#
Column names can contain only alphanumeric and underscore characters. String values in the data itself, such as the raw text, can contain any additional characters.
   text_attr                     pred_value  ground_truth  non_input_1
0  'Here-is some text'           0.1         0             0.2
1  'saying a whole lot'          0.05        0             -0.3
2  'of important things!'        0.02        1             0.7
3  'With all kinds of chars?!'   0.2         0             0.1
...
4  'But attribute/column names'  0.6         1             -0.6
5  'can only use underscore.'    0.9         1             -0.9
...
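The naming rule can be checked locally before any data is sent. The following is a minimal sketch that assumes "alphanumeric" means ASCII letters and digits. The invalid_column_names helper and the sample rows are hypothetical and are not part of the Arthur SDK.

```python
import re

# Assumed rule from the text above: only ASCII letters, digits, and underscores.
VALID_COLUMN_NAME = re.compile(r"[A-Za-z0-9_]+")


def invalid_column_names(columns):
    """Return the names containing anything other than alphanumerics or underscores."""
    return [name for name in columns if not VALID_COLUMN_NAME.fullmatch(name)]


# Rows mirror the example data; the text values themselves may contain any characters.
rows = [
    {"text_attr": "Here-is some text", "pred_value": 0.1, "ground_truth": 0, "non_input_1": 0.2},
    {"text_attr": "With all kinds of chars?!", "pred_value": 0.2, "ground_truth": 0, "non_input_1": 0.1},
]

good = invalid_column_names(rows[0].keys())
bad = invalid_column_names(["text attr", "pred-value", "ground_truth"])
print(good)
print(bad)
```

The column names from the example pass the check, while names containing a space or a hyphen are flagged. Only the names are validated, since the text values are free to contain punctuation.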
Reviewing the Model Schema#
Before you register your model with Arthur by calling arthur_model.save(), you can review the model schema to check that your data is parsed correctly.
For an NLP model, the model schema should look like this:
   name          stage            value_type         categorical  is_unique
0  text_attr     PIPELINE_INPUT   UNSTRUCTURED_TEXT  False        True
1  pred_value    PREDICTED_VALUE  FLOAT              False        False
...
2  ground_truth  GROUND_TRUTH     INTEGER            True         False
3  non_input_1   NON_INPUT_DATA   FLOAT              False        False
...
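If the schema does not match what you expect, a common cause is a column whose values have the wrong type in the source data. The sketch below is a hypothetical local check, independent of the Arthur SDK, that compares each value against the value types listed in the schema above. The mapping from schema value types to Python types is an assumption made for illustration only.

```python
# Expected value types, copied from the schema shown above.
EXPECTED_VALUE_TYPES = {
    "text_attr": "UNSTRUCTURED_TEXT",
    "pred_value": "FLOAT",
    "ground_truth": "INTEGER",
    "non_input_1": "FLOAT",
}

# Hypothetical mapping from schema value types to Python types, for this check only.
PYTHON_TYPES = {"UNSTRUCTURED_TEXT": str, "FLOAT": float, "INTEGER": int}


def find_type_mismatches(rows, expected=EXPECTED_VALUE_TYPES):
    """Return (row_index, column, value) for each value whose type differs from the expected one."""
    mismatches = []
    for index, row in enumerate(rows):
        for column, value_type in expected.items():
            if type(row[column]) is not PYTHON_TYPES[value_type]:
                mismatches.append((index, column, row[column]))
    return mismatches


rows = [
    {"text_attr": "Here-is some text", "pred_value": 0.1, "ground_truth": 0, "non_input_1": 0.2},
    # The ground truth here was loaded as a string, which would change how the column is parsed.
    {"text_attr": "saying a whole lot", "pred_value": 0.05, "ground_truth": "0", "non_input_1": -0.3},
]
mismatches = find_type_mismatches(rows)
print(mismatches)
```

Running a check like this on the reference data first makes it easier to tell whether an unexpected value_type in the schema comes from the data or from the way the attributes were registered.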
Once you have finished formatting your reference data and your model schema looks correct, you have finished registering your model and its attributes, and you are ready to complete onboarding your model.
To finish onboarding your NLP model, follow the steps below. These steps are the same for NLP models as they are for other types of models.
Save your model
Send inferences your model has made on historical data
To confirm that the inferences have been sent, you can view your model and its inferences in the Arthur dashboard.
Connect your production data and model inference pipeline to Arthur
To see an example of saving your model and sending inference data, see the Arthur Quickstart.
To see multiple examples of connecting production data and model inference pipeline to Arthur, see our Integrations.
For an overview of configuring enrichments for NLP models, see the Enrichments guide.
For a step-by-step walkthrough of setting up the explainability Enrichment for NLP models, see NLP Explainability.