Automated Pipeline for Training Language Models
Summary
USPTO has published patent application US20260099674A1 assigned to U.S. Bank National Association, describing a method and system for generating training datasets for language models using an automated pipeline. The system receives input samples, performs rephrasing operations via a generative LLM to produce semantically equivalent versions with different phrasing, labels entity references, and aggregates them into an expanded labeled dataset for natural language processing model training.
What changed
USPTO has published patent application US20260099674A1 assigned to U.S. Bank National Association, describing an automated pipeline for generating training datasets for language models. The system receives input samples, performs rephrasing operations using a generative LLM to produce semantically equivalent versions with different phrasing, labels entity references in the generated versions, and aggregates them into an expanded labeled dataset for NLP model training.
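The three stages described above (rephrase, label, aggregate) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method claimed in the application: every name here (`stub_rephrase`, `label_entities`, `build_expanded_dataset`, `LabeledSample`) is hypothetical, the rephrasing step is a template-based placeholder standing in for the generative LLM call, and the entity labeler is a simple dictionary lookup chosen for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LabeledSample:
    text: str
    entities: List[Tuple[int, int, str]]  # (start, end, label) character spans


def stub_rephrase(sample: str) -> List[str]:
    """Hypothetical placeholder for the generative LLM rephrasing call.

    A real pipeline would prompt a model for paraphrases; these templates
    only wrap the input so the sketch runs without external services.
    """
    templates = ["Please note: {s}", "{s} (restated)"]
    return [t.format(s=sample) for t in templates]


def label_entities(text: str, gazetteer: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Label every occurrence of a known surface form (assumed dictionary labeler)."""
    spans = []
    for surface, label in gazetteer.items():
        for match in re.finditer(re.escape(surface), text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def build_expanded_dataset(
    samples: List[str],
    rephrase: Callable[[str], List[str]],
    gazetteer: Dict[str, str],
) -> List[LabeledSample]:
    """Rephrase each input sample, label the new versions, and aggregate them."""
    dataset = []
    for sample in samples:
        for version in rephrase(sample):
            dataset.append(LabeledSample(version, label_entities(version, gazetteer)))
    return dataset


if __name__ == "__main__":
    demo = build_expanded_dataset(
        ["Send $50 to Alice"],
        stub_rephrase,
        {"Alice": "PERSON", "$50": "AMOUNT"},
    )
    for item in demo:
        print(item.text, item.entities)
```

Passing the rephrasing step in as a callable keeps the sketch independent of any particular model or vendor API; a real LLM-backed function with the same signature could be substituted for `stub_rephrase`.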
This document is informational in nature and does not create compliance obligations for other entities. It represents a patent filing by a commercial bank related to AI/LLM training data generation technology.
Archived snapshot
Apr 18, 2026. GovPing captured this document from the original source. If the source has since changed or been removed, this is the text as it existed at that time.
AUTOMATED PIPELINE FOR TRAINING LANGUAGE MODELS
Application US20260099674A1, Kind: A1, Apr 09, 2026
Assignee
U.S. Bank National Association
Inventors
Giacomo Domeniconi, Ali Fathi, Samuel A. Assefa, Kausik Gangopadhyay, Christopher Taggert, Samuel Atkins
Abstract
The disclosed embodiments describe a method, system, and computer-readable medium for generating a training dataset for training a model in the field of natural language processing involving receiving a set of input samples and performing a rephrasing operation to produce new versions of the set of input samples, where the new versions preserve semantic equivalence as the set of input samples but have different phrasing. A dataset of generated versions of the input samples is generated using a generative Language Learning Model (LLM), all entity references present in the generated versions of the input samples are labeled, and the generated versions of the input samples and their corresponding labeled versions to form an expanded labeled dataset are aggregated.
CPC Classifications
G06F 40/295, G06F 40/284, G06F 40/40, G06N 3/0475, G06N 3/094
Filing Date
2024-10-08
Application No.
18909206
About this page
Source document text, dates, docket IDs, and authority are extracted directly from USPTO.
The summary, classification, recommended actions, deadlines, and penalty information are AI-generated from the original text and may contain errors. Always verify against the source document.