Natural language generation and patents: be ready!

Interview with François Veltz, Patent Attorney, Co-founder and CEO of qatent.com


Computational linguists and patent attorneys live in different worlds. Say “hyperonym”, “hyponym” or “meronym” to a patent attorney, and you are likely to get a puzzled look. Recent news about Natural Language Processing (e.g. about GPT-3) raises the prospect that the patent world (applicants, attorneys, Offices) will be shaken by new text generation capabilities. Moving beyond simple word processing applications, we will soon see more advanced tools, providing a whole new experience for drafting patent claims and applications.

François Veltz

What is the qatent project aiming for?

qatent aims to develop intelligent tools that help patent practitioners, making use of the latest generation of Natural Language Processing technology. The project is endorsed and supported by INRIA, a leading computer science research institute in France. The scientific board is composed of Kim Gerdes, Professor of computational linguistics at Paris-Saclay University, Jean-Marc Deltorn, former patent examiner at the EPO for A.I. inventions, and myself, a patent attorney with a track record at IBM, Roche, and law firms. Our developers come from China, India, Korea, Algeria, Spain, Germany, and France.

What is text generation?

Text Generation, also called Natural Language Generation (NLG), is a sub-domain of Natural Language Processing (NLP). It aims at producing computer-generated, human-readable texts. Texts can be generated from textual input (as in summarization or machine translation) or from non-textual data such as database content, numerical measurements, speech (as in automatic speech recognition, ASR), or images.

As a simple example, consider what a spell-checker does: it compares the user’s input to a dictionary and, for an unknown word, proposes the closest variant from the dictionary. When the user accepts the proposed term, the text is in a sense co-written by the machine, drawing on encoded knowledge about the language (simple orthographic information in this case). Moving to grammar correction, the task is already more demanding: in order to propose “he likes patents” for “he like patents”, the machine has to know the grammar and understand the context in order to propose correct variants. The tool that analyzes the syntactic structure of a sentence is called a parser.
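
To make the dictionary-lookup idea concrete, here is a toy sketch in Python; the dictionary and the similarity cutoff are illustrative assumptions, not how any particular spell-checker actually works:

```python
# Toy spell-checker: propose the closest dictionary variants for an unknown
# word, using simple string similarity (difflib's ratio-based matching).
import difflib

DICTIONARY = ["patent", "claim", "description", "embodiment", "apparatus"]

def suggest(word, n=3, cutoff=0.7):
    """Return up to n dictionary words closest to the input word."""
    if word in DICTIONARY:
        return [word]  # already a known word, nothing to correct
    return difflib.get_close_matches(word, DICTIONARY, n=n, cutoff=cutoff)

print(suggest("apparatis"))  # -> ['apparatus']
```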

Like many scientific domains, NLP has been changed profoundly by Big Data combined with Machine Learning, in particular Deep Learning. The availability of large amounts of electronic text has made it possible to train machines to capture the meaning of words and expressions. The popularization of distributed language representations, so-called word embeddings, in the early 2010s and, more recently, the advent of attention-based models and transformer architectures have opened new avenues for NLP. In this context, the open-source release of BERT by Google in 2018 has substantially accelerated the pace of change. BERT stands for Bidirectional Encoder Representations from Transformers; the original BERT models were pre-trained on the 800-million-word BooksCorpus and the 2,500-million-word English Wikipedia.
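
As an illustration (and not qatent’s internal pipeline), contextual representations can be obtained in a few lines with the Hugging Face transformers library:

```python
# Illustrative only: obtain contextual token representations from a
# pre-trained BERT model via the Hugging Face `transformers` library.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The claimed apparatus comprises a processor.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)word token: unlike static word embeddings, the same
# word receives a different vector in each sentence it appears in.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 9, 768])
```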

What types of technologies are you using?

We use existing advanced NLP tools, from static word embeddings to contextual attention-based models, which allow us to exploit the specific structure of patent data. For example, we use distributed language models that take, amongst other attributes, the technical contexts in which words occur into account. We compute CPC-class-specific word embeddings, which significantly improves the relevance and precision of the generated language, an essential requirement for assisting patent drafting.
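
A hypothetical sketch of the class-specific idea, using gensim’s Word2Vec; the per-class corpora and the loader are assumed placeholders, not qatent’s actual setup:

```python
# Hypothetical sketch: train one static embedding space per CPC class, so
# that an ambiguous term such as "cell" is modelled by its technical context.
from gensim.models import Word2Vec

def train_cpc_embeddings(corpora_by_cpc):
    """corpora_by_cpc maps a CPC class (e.g. 'H01') to tokenized sentences
    (lists of lists of words) drawn from patents in that class."""
    return {
        cpc: Word2Vec(sentences=sents, vector_size=100, window=5, min_count=5)
        for cpc, sents in corpora_by_cpc.items()
    }

# models = train_cpc_embeddings(load_corpora())  # load_corpora() is assumed
# models["H01"].wv.most_similar("cell")  # electrical sense (battery cell)
# models["A61"].wv.most_similar("cell")  # biological sense
```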

How can text generation be used for patent handling?

Once you can model and quantify natural language, a variety of services become possible: computer-assisted claim and patent drafting, including automated detection of unclaimed matter in applications (i.e. by comparing claims with the description); terminology suggestion based on general or domain-specific texts such as published patents or scientific articles; automatic classification; detection and quantification of similarities and plagiarism; various analytic services such as the detection of weak signals in textual time series (for example, predicting booming terms or CPC classes); prosecution accelerators; and so on. The list of possibilities is long.
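
As a toy illustration of the unclaimed-matter check, the idea can be approximated by comparing vocabularies; a real system would use lemmatization, term extraction and syntactic analysis rather than raw token sets:

```python
# Toy "unclaimed matter" check: flag content words that appear in the
# description but in no claim. Deliberately morphology-naive.
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "said", "wherein"}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def unclaimed_terms(claims, description):
    """Terms present in the description but absent from the claims."""
    return tokens(description) - tokens(claims)

print(unclaimed_terms(
    "1. An apparatus comprising a sensor and a processor.",
    "The apparatus comprises a sensor, a processor and an optional battery.",
))  # -> {'battery', 'optional', 'comprises'} (order may vary; note the
    #    false positive on 'comprises' caused by the lack of lemmatization)
```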

At qatent we try to help patent attorneys draft their claims and applications.

Our process follows a principle of AI-assisted patent drafting, leaving the human in command. We start from a first handcrafted draft of the claims, which we try to facilitate with different tools (e.g. display of word definitions, suggestions, etc.). The generation of the patent application is then continuously guided by the patent attorney, based on said claims. You not only have direct access to metrics about the current state of the text, to domain-specific dictionaries, and to guidelines and case law, but you are also assisted when looking for more specific or more general reformulations of terms and phrases. The editor also helps with renumbering and with detecting incorrect references to claims and terms, and it lets you easily store and retrieve intermediate versions of the text.

The qatent tools encode the patent attorney’s know-how and best practices acquired through our collective experience in drafting, in industry and law firms.

Is Natural Language Generation a threat to patent attorneys?

No, it’s rather an opportunity.

Drafting claims is an art: it is a compromise between legal, scientific and business parameters. The advent of Artificial General Intelligence is a long way off, and for any foreseeable future only patent attorneys are capable of making such trade-offs.

From our perspective, it is very likely that the role of patent attorneys will evolve, but humans will not be replaced by machines; they will be augmented by machines, just as in many other domains. At each step of the iterative generation, patent attorneys guide the content generation process and remain in full control of the outcome.

What would you envision as the possible new role(s) of patent attorneys?

Backed by AI and helped by visualization tools and quantitative assessments, patent attorneys will be able to concentrate more on the claims (the scope of protection) and less on routine work (copying or annotating claims in the description, label checks, etc.).

Cognitive effort is allocated differently: energy is focused on tasks of higher value, while significant parts of the burden can be outsourced to the machine. When drafting, you navigate a tree of alternative wordings, continuously testing your current ideas against parameters and alternatives computed by the machine. The cognitive approach becomes more combinatorial. It is a new dialogue between language and science, between humans and machines.

Can you detail some of these functionalities?

While drafting your claims, you benefit from practical features such as predictive typing (autocomplete), automatic renumbering of claims, and antecedence checks, so that you can focus on higher-value tasks.
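
For instance, a rudimentary antecedence check can be sketched as follows; real claims require noun-phrase parsing, whereas this toy version only matches single-word heads:

```python
# Toy antecedence check: every "the X" / "said X" in a claim should have an
# earlier "a X" / "an X" introducing it.
import re

def missing_antecedents(claim):
    introduced, missing = set(), []
    for article, noun in re.findall(r"\b(a|an|the|said)\s+(\w+)", claim.lower()):
        if article in ("a", "an"):
            introduced.add(noun)       # noun is now introduced
        elif noun not in introduced:
            missing.append(noun)       # referenced before introduction
    return missing

claim = ("An apparatus comprising a sensor, wherein the processor "
         "filters the sensor signal.")
print(missing_antecedents(claim))  # -> ['processor']
```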

As you draft, qatent continuously suggests word and sentence variants for the claims (e.g. based on synonyms, more generic terms, more specific terms, alternatives, or other fallbacks). You are offered definitions of terms (extracted from different sources). The system also seamlessly checks for legal issues, as specified in the EPO Guidelines or the MPEP (e.g. presence of relative terms, possible clarity objections, etc.). A quick search, which would probably also be useful for the e-EQE, lets you look first in the Convention, then in the Guidelines, and finally in the case law of G, T or J decisions. Prompts are shown on demand or triggered automatically, based on a set of user-customizable options.
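
The relative-term check, for example, can be reduced to its simplest form as below; the term list here is a small illustrative subset, not the Guidelines’ actual list:

```python
# Toy clarity check: flag relative terms that are potentially objectionable
# in claims under the EPO Guidelines (illustrative subset only).
RELATIVE_TERMS = {"thin", "strong", "about", "approximately", "substantially"}

def clarity_flags(claim):
    words = claim.lower().replace(",", " ").split()
    return [w for w in words if w in RELATIVE_TERMS]

print(clarity_flags("A substantially flat plate with a thin coating."))
# -> ['substantially', 'thin']
```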

Involving computers at the very heart of the drafting process is justified. Today’s patent attorneys still draft applications with word processing software that has not evolved for decades. Tomorrow’s drafters will be able to take snapshots of intermediate versions, automatically import definitions, ask the computer to reformulate a sentence or parts thereof, and so on.

With AI-assisted drafting, we may see an increase in the density of content within a single draft. For example, in the generation process, we try to use what we call “corporate sedimentation”. This can go beyond user-customized templates. For a given vertical, and in line with the 18-month publication window, it is conceivable to manage “evolutive boilerplates”, meaning that previous inventions can be concisely and recurrently reincorporated. For example, suppose you work in the domain of Human-Machine Interfaces in avionics. Unless disruptive technologies appear (e.g. retina displays), you may want to stack the different inventions you have filed in recent years and reuse them in combination in subsequent filings.

What about possible future functionalities?

Our roadmap remains open. Neural network technology in NLP is moving fast, and with it the possibilities of applying these technologies to patent drafting. We work closely with patent attorneys from various domains and professional contexts.

Depending on the feedback, we may develop tools for patent prosecution, facilitating responses to official communications. Yet our main focus will remain on building an efficient assistant for the patent drafting process.

With respect to Wikipedia, can you say a bit more about what you are doing?

Wikipedia is one of many useful resources for building reference language models and extracting specific relationships between technical terms, but our focus is on building specialized language models based on technical texts such as the patent corpus itself. We envision operating on a larger scale, in particular by taking scientific publications into account. What is at stake is capturing humanity’s scientific knowledge, in English and in any other language.

What about inventorship, a hot topic in AI-generated inventions?

We have kept close tabs on the recent debates about AI-generated inventions. qatent is unaffected by these issues because, in our approach, the patent claims are continuously driven and controlled by human drafters assisted by the machine. All generated text is parametrized by humans. qatent is a tool; the inventors and authors are human beings, and thus entitled to patent rights.

What are the limits of text generation? Can a generated text be novelty-destroying or used for assessing inventive step?

The key point here is that computer-generated texts can now hardly be distinguished from handcrafted documents, provided the texts are kept short. Controlled grammars and generation models can produce coherent texts, which until recently was not truly the case. For long texts, however, no system is yet able to produce semantically relevant output. In our view, the risk of GPT-3 producing semantically relevant inventions is low. We bet on a different outcome: AI-assisted human drafting.

Like any work produced by human-machine collaboration, the result can of course be novelty-destroying and used in inventive-step attacks.

What about drawings?

Simply put: we do not analyze, generate, interpret or handle drawings. We do not need them, at least for now. From a legal perspective, it is the words describing the images that are useful, and used. We do not deny the possible advantages of drawings, in particular for the readability or intelligibility of the invention, but for the time being we do not invest resources in image mining or generation.

What computations are you doing?

Our services are computationally intensive, even if some aspects are handcrafted. Our capital lies in part in the computations we have performed and continue to perform. The substantial resources of the INRIA GPU clusters allow us to compute highly specialized language models focused on the specificities of the patent corpus. Computing a large-scale language model requires know-how and tuning.

What are the value propositions of using text generation in IP?

Language models extract deeply embedded insights from large amounts of text and propose them to the patent attorney, who decides what to do with them. Before filing a patent, you may want to check our suggestions. In doing so, you make sure that you have considered the relevant options and viable alternatives identified by the machine, and ultimately optimize the scope of protection. We believe this will become a must-check step.

Natural Language Processing may also lead to a certified quality assurance system. Today, both experienced attorneys and trainees draft patent applications. A number of errors can persist, partly because peer review is not always feasible. With assistance systems, a higher level of quality control can be expected.

Today’s advanced language models not only assist patent attorneys with claim drafting, but also provide them with lexical directions to improve the scope of protection.

Text generation may also lead to a higher level of “standardization” of drafts, which does not necessarily mean poorer or more narrowly focused drafts. With tight control over dictionaries and definitions, it may make patent production more homogeneous, or at least less dependent on the talent or habits of the individual patent attorney. For a given company, the portfolio can also become more consistent, for example.

What impacts of NLP on patent law can you foresee?

In our opinion, impacts will be numerous and diverse.

One positive consequence is that “legalese”, which is in fact a type of obfuscation, can now be “decoded” by machines. In other words, some parts of sentences that are currently justified by case law (such as “... instructions which when executed by a processor cause said processor to perform the steps of”) may no longer be necessary. Some legalese may end up disappearing, which is good news for the accessibility of knowledge.

In our opinion, the “density” of prior art is likely to rise, as machines will boost the quantity of texts being produced. Whether search engines will keep up with this increase in quantity is an open (and critical) question.

Resorting to a wider range of language variants in the description may also help mitigate some of the stringent requirements of Article 123(2) EPC. Take the situation today: if the patent application has not described numerous adjacent developments (i.e. through lists of words or individualized combinations), the strict “copy and paste” requirement for legal support can lead to a situation that is quite unfair: the inventor has brought up something new and possibly inventive, but can hardly deviate from the initial wording. By using text generation, adjacent words can be captured and injected into drafts, thereby expanding the range of alternatives available in the original patent application. During prosecution, if need be, you have fallback options under the hood.
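
To give an idea of how such adjacent wording could be collected, here is a sketch using WordNet hypernyms (more generic terms) and hyponyms (more specific terms); qatent’s actual resources and models are, of course, different:

```python
# Hypothetical sketch: collect more generic and more specific wordings for a
# claim term from WordNet (requires nltk and its downloaded wordnet data).
from nltk.corpus import wordnet as wn

def fallback_variants(term):
    generic, specific = set(), set()
    for synset in wn.synsets(term, pos=wn.NOUN):
        generic.update(l.name() for h in synset.hypernyms() for l in h.lemmas())
        specific.update(l.name() for h in synset.hyponyms() for l in h.lemmas())
    return {"more_generic": generic, "more_specific": specific}

variants = fallback_variants("screw")
# 'fastener' should appear among the more generic terms, and specific screw
# types among the more specific ones: candidate fallback positions.
```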

Likewise, claim construction can be made more objective (e.g. through quantification of the quality of support, counts of occurrences, etc.).

What technical domains are you handling? Which ones are you not?

All patents are written in natural language, and there is no inherent reason to believe that some technical domains will remain inaccessible to machines, at least to a certain extent. For now, we generate texts in all IPC classes except C and D. Selection inventions and chemical formulas diverge more from general-purpose texts and require specific pre-treatment. We plan to extend our team with a European Patent Attorney and an NLP specialist to handle inventions in the fields of chemistry and biology.

What languages are you handling?

For now, we focus on English. If needed, high-quality, domain-specific Neural Machine Translation can provide the final loop of the system, allowing us to ultimately offer our tools in any well-resourced language.

One of our objectives is to get our hands on the substance of technological insights. Most of the available patent and scientific corpora are in English, along with Wikipedia dumps and other non-patent literature. We plan to ingest and possibly “digest” Asian scientific content, such as patent texts from China, Japan, and Korea. In other words, what matters is capturing the substance of the scientific ideas.

How do you train your models? Isn’t there a risk of cross-learning (between cases)?

Our models are trained independently of any user input. This prevents any possibility of contamination. Each client’s drafting work remains strictly confidential and does not in any way affect the construction of our models.

Do you intervene on patent claims?

We try to help practitioners write better claims, by suggesting keywords, by flagging possible clarity issues, and so on. The patent attorney has not less but more control over the text, and there is no human intervention on our end. Part of our knowledge is hard-coded in generation rules. For the same software release, all users have the same, user-configurable, experience.

What are your relations with patent offices?

The suite of tools we offer finds applications for patent offices as well as for practitioners. In many ways, both face similar, or at least related, situations, such as identifying potential irregularities (e.g. in terms of clarity) or facilitating the drafting of communications. We plan to develop our relationships with patent granting authorities to showcase our products and identify potential avenues for collaboration.

Access to open patent data is of prime importance to the NLP community. Patent granting authorities have already taken the initiative to share a significant portion of their corpora. This is a welcome trend, and we would be delighted to cooperate with all the parties involved to be at the forefront of this evolution.

What about NLP at the European level?

The computational linguistics community is booming everywhere, with clusters of excellence emerging in Europe: Paris is one of them; others are in Prague, Saarbrücken and Edinburgh. We have no doubt that the Americans, the Chinese and other AI superpowers will grasp the opportunities opened up by NLP applied to patents, which in turn might be a game changer for entire industries (e.g. quantum physics, robotics, vaccines).

What are your challenges today?

At the moment, we are focusing on software development. The main challenges today are the development of language models specific to patents and of a smooth user experience with the editor.

In the short term, we are setting up focus groups to test and orient software development. We are also open to investors, in order to accelerate our roadmap. If you are a decision maker in your industry and are willing to meet us, you are welcome.

