Semarchy GenAI OpenAI structured enricher

The Semarchy GenAI OpenAI structured enricher extracts structured data from unstructured text to enhance data completeness and streamline the data entry process.

Plug-in ID

Semarchy GenAI OpenAI Structured Enricher - com.semarchy.engine.plugins.genai.openai.structured

Description

The GenAI OpenAI structured enricher is designed to extract structured data from unstructured text using OpenAI language models. It can generate or extract up to 20 outputs in JSON format, including strings, booleans, numbers, and dates.

Plug-in parameters

The following table lists the plug-in parameters.

Parameter name	Mandatory	Type	Description
API Key	Yes	String	Client-side API key for establishing connectivity with the OpenAI API. The API key is the only way to establish a connection to the OpenAI API.
Model	Yes	String	Language model to be used. Possible values are gpt-3.5-turbo, gpt-4, and gpt-4o.
Temperature	No	Number	Value ranging from 0 to 1 for balancing between conservative and coherent outputs (0) and creative variations (1) during text generation. Default value: 0.
Frequency Penalty	No	Number	Value between -2.0 and 2.0 used to discourage the model from repeatedly sampling the same sequences of tokens during text generation. Reasonable values typically range from 0.1 to 1. Default value: 0.
Max Tokens	No	Integer	Maximum number of tokens allowed in the generated output during text generation. Default value: 50. The total length of input tokens and generated tokens is limited by the context length of the model. Setting an insufficient number of tokens may result in a runtime error during response processing. Make sure to adjust the Max Tokens parameter to accommodate the number of generated tokens, considering your requirements and the terms of your OpenAI license.
Presence Penalty	No	Number	Value between -2.0 and 2.0 used to reduce the likelihood of the model generating repetitive sequences of tokens during text generation. Reasonable values typically range from 0.1 to 1. Default value: 0.
Top P	No	Number	Value ranging from 0 to 1 for defining the cumulative probability threshold for nucleus sampling (i.e., token selection). Nucleus sampling is an alternative to temperature sampling. Ignore this parameter if you configured the Temperature parameter.
Max Retries	No	Integer	Maximum number of attempts allowed for API requests before considering them unsuccessful. Default value: 3.
Boolean output <N> (BOOLEAN_OUT_<N>)	No	String	Descriptor for the Nth boolean output (N from 1 to 5) in the structured output generated by the enricher. It describes the boolean data to extract.
Date output <N> (DATE_OUT_<N>)	No	String	Descriptor for the Nth date output (N from 1 to 5) in the structured output generated by the enricher. It describes the date data to extract.
Number output <N> (NUMBER_OUT_<N>)	No	String	Descriptor for the Nth number output (N from 1 to 5) in the structured output generated by the enricher. It describes the number data to extract.
String output <N> (STRING_OUT_<N>)	No	String	Descriptor for the Nth string output (N from 1 to 5) in the structured output generated by the enricher. It describes the string data to extract.

The enricher can return up to 20 outputs (five of each type).
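
As an illustration of the naming scheme, the following Python sketch enumerates the 20 output keys (four types, numbered 1 to 5). The function name `output_keys` is hypothetical and is not part of the plug-in.

```python
# Output key names follow the documented pattern <TYPE>_OUT_<N>, with N from 1 to 5.
OUTPUT_TYPES = ("BOOLEAN", "DATE", "NUMBER", "STRING")
OUTPUTS_PER_TYPE = 5


def output_keys():
    """Return every output key, from BOOLEAN_OUT_1 to STRING_OUT_5."""
    return [
        f"{type_name}_OUT_{n}"
        for type_name in OUTPUT_TYPES
        for n in range(1, OUTPUTS_PER_TYPE + 1)
    ]


keys = output_keys()
print(len(keys))          # 20
print(keys[0], keys[-1])  # BOOLEAN_OUT_1 STRING_OUT_5
```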

The output descriptors are specifically designed to match the corresponding attribute types, whether they are dates, strings, numbers, or boolean values. For example, the Date output 1 descriptor exclusively matches date attributes. This matching process is automatically handled by the plug-in.
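
The generation parameters listed above share their names with request fields of the OpenAI Chat Completions API (`model`, `temperature`, `frequency_penalty`, `presence_penalty`, `max_tokens`, `top_p`). The following Python sketch shows one way such values could be validated against the documented ranges and assembled into request options. It is an assumption about how the values might be passed along, not a description of the plug-in's internals, and `build_request_options` is a hypothetical helper.

```python
SUPPORTED_MODELS = ("gpt-3.5-turbo", "gpt-4", "gpt-4o")


def build_request_options(model, temperature=0.0, frequency_penalty=0.0,
                          presence_penalty=0.0, max_tokens=50, top_p=None):
    """Validate values against the documented ranges and return a request dict.

    Defaults mirror the documented defaults. Top P has no documented default,
    so this sketch omits it unless a value is given (an assumption).
    """
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model: {model}")
    if not 0 <= temperature <= 1:
        raise ValueError("Temperature must be between 0 and 1")
    for name, value in (("frequency_penalty", frequency_penalty),
                        ("presence_penalty", presence_penalty)):
        if not -2.0 <= value <= 2.0:
            raise ValueError(f"{name} must be between -2.0 and 2.0")
    if max_tokens < 1:
        raise ValueError("Max Tokens must be a positive integer")

    options = {
        "model": model,
        "temperature": temperature,
        "frequency_penalty": frequency_penalty,
        "presence_penalty": presence_penalty,
        "max_tokens": max_tokens,
    }
    if top_p is not None:
        if not 0 <= top_p <= 1:
            raise ValueError("Top P must be between 0 and 1")
        options["top_p"] = top_p
    return options


print(build_request_options("gpt-4o", max_tokens=200))
```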

Language models

Language models are AI systems trained on vast amounts of text data to understand and generate human-like language, enabling tasks like text completion, translation, summarization, and sentiment analysis.

The OpenAI API offers a range of models with distinct capabilities. The models supported by the OpenAI enricher include:

gpt-3.5-turbo: the latest iteration of GPT-3.5 Turbo.
gpt-4: this model builds upon the advancements of GPT-3.5, with continuous upgrades.
gpt-4o: stands for GPT-4 Omni, an advanced, multimodal model with improved efficiency and accuracy, and superior performance in non-English language tasks.

Each option is a model alias that is expected to point to the latest version of the corresponding model.

For more information on OpenAI models, see the official OpenAI documentation.

Tokens

Tokens are units of text that language models use to process and generate language. They can range from individual characters to entire words, depending on the language and the specific model being used.
For more information about tokens, see the official OpenAI documentation.
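
To make the relationship between input tokens, Max Tokens, and the model's context length concrete, here is a rough Python sketch. It assumes about four characters per token, which is only a common rule of thumb for English text. Exact counts depend on the model's tokenizer, and the context length must be supplied by the caller. Both function names are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(prompt, max_tokens, context_length):
    """Check that estimated input tokens plus Max Tokens fit the context length."""
    return estimate_tokens(prompt) + max_tokens <= context_length


prompt = "The Aerodynamic Helmet by Velocity Bikes is expertly crafted in France."
# 8000 is an illustrative context length, not a value taken from any model.
print(estimate_tokens(prompt), fits_context(prompt, max_tokens=50, context_length=8000))
```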

Plug-in inputs

The following table lists the plug-in inputs.

Input name	Mandatory	Type	Description
User Prompt	No	String	Instructions specifying the information to be extracted and the method for structuring the outputs accordingly.
Source Text for Extraction	No	String	Unstructured text from which structured data is extracted.
System Prompt	No	String	Initial instruction designed to guide the model towards specific topics, styles, tones, or formats of generated text.

Although neither input is mandatory on its own, you must define at least one of them: if you do not set a user prompt, you must enter a source text for extraction, and vice versa.
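
This rule can be expressed as a simple check. The Python sketch below is illustrative only: `check_inputs` is a hypothetical name, and treating blank strings as unset is an assumption.

```python
def check_inputs(user_prompt=None, source_text=None):
    """Raise ValueError unless a user prompt or a source text is provided."""
    has_prompt = bool(user_prompt and user_prompt.strip())
    has_source = bool(source_text and source_text.strip())
    if not (has_prompt or has_source):
        raise ValueError(
            "Define either a User Prompt or a Source Text for Extraction"
        )


check_inputs(source_text="Priced at $129.99, crafted in France.")  # passes silently
```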

When opting for the Source Text for Extraction method, the enricher injects the provided text along with the configured descriptors (e.g., String output 1, Number output 1) into a standard user prompt. The pieces of information specified by the descriptors are then extracted from the text content.
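
The exact wording of the standard user prompt is not documented here. The Python sketch below only illustrates the idea of combining the source text with the configured descriptors. The template text and the function `build_extraction_prompt` are hypothetical.

```python
def build_extraction_prompt(source_text, descriptors):
    """Combine source text and output descriptors into one prompt.

    The template is an assumption made for illustration; the plug-in's
    actual standard prompt may differ.
    """
    fields = "\n".join(f"{key}: {description}"
                       for key, description in descriptors.items())
    return (
        "From the following text, extract the information listed below "
        "in a structured JSON format.\n"
        f"Text: {source_text}\n"
        f"Fields:\n{fields}"
    )


prompt = build_extraction_prompt(
    "The Aerodynamic Helmet by Velocity Bikes is crafted in France. Priced at $129.99.",
    {
        "STRING_OUT_1": "Name of the product",
        "STRING_OUT_2": "Country of origin of the product",
        "NUMBER_OUT_1": "Price of the product",
    },
)
print(prompt)
```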

When opting for the User Prompt method, model designers construct a user prompt containing unstructured values and output keys (e.g., STRING_OUT_1, NUMBER_OUT_1). These keys are then mapped to the relevant attributes in the plug-in output properties.

When formulating a user prompt, make sure to instruct the enricher to extract data in a structured JSON format.

For a detailed demonstration of these methods, see Examples and use cases.

Plug-in outputs

The following table lists the plug-in outputs.

Output name	Type	Description
Boolean output <N> (BOOLEAN_OUT_<N>)	String	Extracted boolean corresponding to the Nth boolean output descriptor, numbered from 1 to 5, and applied to a designated attribute.
Date output <N> (DATE_OUT_<N>)	String	Extracted date corresponding to the Nth date output descriptor, numbered from 1 to 5, and applied to a designated attribute.
Number output <N> (NUMBER_OUT_<N>)	String	Extracted number corresponding to the Nth number output descriptor, numbered from 1 to 5, and applied to a designated attribute.
String output <N> (STRING_OUT_<N>)	String	Extracted string corresponding to the Nth string output descriptor, numbered from 1 to 5, and applied to a designated attribute.

Examples and use cases

AI-powered product data extraction: streamlining record creation

Imagine a scenario where a user wants to expedite product record creation by automatically extracting a product’s name, price, and country of origin from a detailed description. In practice, the user wants the Product Name, Price, and Country of Origin fields to be automatically populated based on the Description field’s content.

For instance, consider a new product record with the following description:
"The Aerodynamic Helmet by Velocity Bikes is expertly crafted in France for speed, style, and safety. Its sleek profile reduces drag while prioritizing rider protection. Priced at $129.99, it’s the ultimate choice for safety-conscious cyclists."

Two methods can be employed to achieve the desired result.

Using the Source Text for Extraction method, a model designer can configure the enricher as follows:
- In the plug-in properties:
  - String output 1 (STRING_OUT_1): Name of the product
  - String output 2 (STRING_OUT_2): Country of origin of the product
  - Number output 1 (NUMBER_OUT_1): Price of the product
- In the plug-in input properties:
  - Source Text for Extraction: Description
- In the plug-in output properties:
  - ProductName: String output 1 (STRING_OUT_1)
  - Origin: String output 2 (STRING_OUT_2)
  - Price: Number output 1 (NUMBER_OUT_1)

Using the User Prompt method, the model designer can configure the enricher as follows:
- In the plug-in input properties:
  - User Prompt: 'From ' || Description || ' extract the following information in a structured JSON format: STRING_OUT_1: the product name, STRING_OUT_2: the country of origin, NUMBER_OUT_1: the product price.'
- In the plug-in output properties:
  - ProductName: String output 1 (STRING_OUT_1)
  - Origin: String output 2 (STRING_OUT_2)
  - Price: Number output 1 (NUMBER_OUT_1)

Regardless of the method selected, the enricher’s response populates the Product Name, Country of Origin, and Price fields in the new record with the following information:

Product Name: Aerodynamic Helmet
Country of Origin: France
Price: 129.99
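
The mapping from output keys to attributes in this example can be sketched in Python as follows. The raw JSON shown is an assumed shape for the model's response, inferred from the output keys used in the prompt. The plug-in performs this mapping itself, and `map_outputs` is a hypothetical helper.

```python
import json


def map_outputs(response_json, output_mapping):
    """Map output keys in a JSON response to attribute names.

    Keys missing from the response map to None.
    """
    values = json.loads(response_json)
    return {attribute: values.get(key)
            for attribute, key in output_mapping.items()}


# Assumed response shape, for illustration only.
response = (
    '{"STRING_OUT_1": "Aerodynamic Helmet", '
    '"STRING_OUT_2": "France", '
    '"NUMBER_OUT_1": 129.99}'
)
mapping = {
    "ProductName": "STRING_OUT_1",
    "Origin": "STRING_OUT_2",
    "Price": "NUMBER_OUT_1",
}
record = map_outputs(response, mapping)
print(record)
# {'ProductName': 'Aerodynamic Helmet', 'Origin': 'France', 'Price': 129.99}
```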