Mining information in unstructured text can be a real challenge. Patent documents, for example, provide a rich source of technological and scientific knowledge that can reveal technological trends as well as information on the legal landscape of the market. This makes analysis of the vast and ever-growing number of patents an important part of corporate business strategies.
One text mining technique incorporates latent variable methods and principles that are widely used in Multivariate Data Analysis (MVDA) to study collections of chemical patents represented as term-document matrices, or "bag-of-words." This approach illustrates how unstructured text can be converted into tabular data where the numerical values reflect word frequencies in the patents, making them available for deeper analysis. We'll look at an example using this data analytics technique for text mining and analysis of patent applications.
Applying data analytics techniques to text mining and analysis of patent submission details can help businesses understand technological trends and the legal landscape.
The challenge of text analysis
Efficient algorithms are essential when you are swamped with information. For example, the European Patent Office (EPO) received 310,784 filings in 2017 alone. Traditional patent analysis and patent mining rely on metadata such as filing company, patent classification and grant year. But gaining insight into the contents of a patent requires more advanced text analytics, or text mining. This starts by transforming unstructured text data into structured data that is ready for analysis.
One common way of transforming text data is to use a term-document matrix, also known as a bag-of-words model. Each row of the term-document matrix is a vector of one document’s term frequencies, which means that each document is represented by the counts of the words it contains. These counts can be used raw or transformed further. Patent documents have been analyzed using bag-of-words representations in a wide variety of ways, including the identification of keywords and phrases, and network-based analysis that groups patents based on shared keyword distributions.
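As a minimal sketch of the idea, the following Python snippet builds a small term-document matrix with one row per document and one column per term. The three example texts and the whitespace tokenizer are illustrative assumptions, not the patent data or the preprocessing used in the study.

```python
from collections import Counter

# Toy documents (hypothetical, for illustration only).
docs = [
    "pyrazine flavour compound for food",
    "method for preparing pyrazine compound",
    "food flavour composition",
]

def tokenize(text):
    """Lower-case and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# Vocabulary: every distinct term in the collection, in sorted order.
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})

def term_document_matrix(documents, vocab):
    """Return one row per document holding the raw count of each vocabulary term."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocab])
    return rows

matrix = term_document_matrix(docs, vocabulary)

for doc, row in zip(docs, matrix):
    print(row, "<-", doc)
```

Each row can now be treated like any other observation in a data table, with the terms acting as variables.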
Principal Component Analysis for text analysis
When faced with a large collection of text documents, unsupervised text analysis gives valuable insights into the contents of the collection. One commonly used unsupervised text analysis method is Latent Semantic Analysis (LSA), which was originally developed to reduce the dimensionality of bag-of-words matrices. The purpose of the dimensionality reduction was to improve information retrieval and filtering by studying the underlying meanings of text bodies instead of directly observed word usage.
LSA is closely related to Principal Component Analysis (PCA), which is widely used in MVDA, since both are used to study underlying, so-called latent, variables of the data instead of directly observed single variables. LSA and PCA are also closely related mathematically. LSA is commonly performed by decomposing the term-document matrix algebraically using Singular Value Decomposition (SVD), which is also one way to calculate the components of a PCA model. One difference is that the PCA component score vectors are scaled by the singular values after decomposition. Another is that the variables are commonly centred to zero mean prior to PCA but not prior to LSA.
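The relationship can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not the procedure used in the study: the small count matrix is made up, and a simple power iteration stands in for a full SVD routine, so only the first component is computed. It decomposes the matrix once as it is (LSA-style) and once after column centring (PCA-style), and forms the PCA scores by scaling the left singular vector by its singular value.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every variable has zero mean."""
    n_rows = len(X)
    means = [sum(column) / n_rows for column in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def matvec(X, v):
    """Compute the product X v."""
    return [sum(x * w for x, w in zip(row, v)) for row in X]

def rmatvec(X, u):
    """Compute the product X^T u."""
    return [sum(x * w for x, w in zip(column, u)) for column in zip(*X)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def leading_singular_triplet(X, iterations=500):
    """Power iteration on X^T X; returns (u, sigma, v) for the largest singular value."""
    v = [1.0] * len(X[0])
    for _ in range(iterations):
        w = rmatvec(X, matvec(X, v))
        length = norm(w)
        v = [x / length for x in w]
    projected = matvec(X, v)               # X v = sigma * u
    sigma = norm(projected)                # singular value
    u = [p / sigma for p in projected]     # unit-length left singular vector
    return u, sigma, v

# Made-up count matrix: 4 documents (rows) x 3 terms (columns).
counts = [
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 2.0, 2.0],
]

# LSA-style: decompose the matrix as it is, without centring.
u_lsa, sigma_lsa, v_lsa = leading_singular_triplet(counts)

# PCA-style: centre the columns first, then decompose.
centered = center_columns(counts)
u_pca, sigma_pca, v_pca = leading_singular_triplet(centered)

# PCA scores are the left singular vector scaled by its singular value.
pca_scores = [sigma_pca * u for u in u_pca]
print("first PCA loading vector:", v_pca)
print("first PCA score vector:  ", pca_scores)
```

Because the columns are centred, the PCA scores sum to zero, while the uncentred decomposition is dominated by the overall term frequencies.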
At the Computational Life Science Cluster at Umeå University in Sweden, we reasoned that we could extract latent variables to improve the performance of text analytics systems, and apply multivariate data analysis to improve the understanding of collections of chemical patents. To illustrate this, I’ll look at just one example from the paper we published: the analysis of patents involving the food additive methylpyrazine.¹
A text mining example: Methylpyrazine
A common scenario for a patent analyst is to map the patent landscape regarding a specific topic. The purpose may be to define Freedom-To-Operate (FTO) in order to find out if there are any legal barriers to entering the market around a specific topic. In this case study we looked at how a patent analyst could use MVDA to access more information when studying patents on a specific topic.
We used the EPO’s Open Patent Services (OPS) to collect patents where the title or abstract contained the term methylpyrazine. We removed duplicates and patents lacking an English abstract, resulting in 114 patents. Apart from its blocks of text, each patent is classified according to the Cooperative Patent Classification (CPC), a commonly used hierarchical classification scheme with nine broad top-level categories and five levels of sub-categories. A patent is often assigned more than one CPC code, but patents can also lack CPC codes altogether. The CPC codes are useful in the interpretation stage of the analysis.
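To make the hierarchy concrete, here is a small sketch that splits a CPC symbol into its section, class, subclass, main group and subgroup. The regular expression, the function name cpc_levels and the example symbol are illustrative assumptions. The sketch only handles symbols written in the common "A23L 27/20" form and is not a complete CPC validator.

```python
import re

# Section letter (A-H or Y), two-digit class, subclass letter,
# then main group and subgroup separated by a slash.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")

def cpc_levels(symbol):
    """Split a CPC symbol into progressively finer levels of the hierarchy."""
    match = CPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised CPC symbol: {symbol!r}")
    section, cls, subclass, group, subgroup = match.groups()
    prefix = f"{section}{cls}{subclass}"
    return {
        "section": section,
        "class": f"{section}{cls}",
        "subclass": prefix,
        "main_group": f"{prefix} {group}/00",
        "subgroup": f"{prefix} {group}/{subgroup}",
    }

# Example symbol used purely for illustration.
levels = cpc_levels("A23L 27/20")
print(levels["section"], levels["class"], levels["subclass"])
```

Truncating codes to a chosen level in this way makes it possible to colour or group patents by broad section first and by class later.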
Our technique for text mining and analysis
We used a bag-of-words model with counts of single words and bigrams (consecutive pairs of words) to enable multivariate analysis of the patent abstracts. We kept only terms that include nouns and, to reduce noise, removed all words and bigrams occurring in fewer than 5% of the documents. We then transformed the word and bigram counts using term frequency-inverse document frequency (tf-idf), which is widely used in information retrieval and text mining to compensate for the fact that some words are common across all documents and therefore have high counts regardless of content.
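The preprocessing steps above can be sketched roughly as follows. This is an illustration under several assumptions rather than the exact pipeline from the study: the documents are made up, the noun filter is omitted because it would need a part-of-speech tagger, and the weighting uses one common tf-idf variant (raw count multiplied by log(N / document frequency)). The toy example uses a 50% threshold instead of 5% so that the filter has a visible effect on four documents.

```python
import math
from collections import Counter

# Made-up abstracts, for illustration only.
docs = [
    "methylpyrazine flavour composition",
    "methylpyrazine flavour for food",
    "methylpyrazine synthesis method",
    "synthesis method for pyrazine",
]

def unigrams_and_bigrams(text):
    """Return single words plus pairs of consecutive words."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def tfidf_matrix(documents, min_df_fraction=0.05):
    """Count terms, drop rare ones, and weight the counts by inverse document frequency."""
    term_counts = [Counter(unigrams_and_bigrams(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    # Keep only terms that occur in at least the given fraction of documents.
    vocab = sorted(t for t, df in doc_freq.items() if df / n_docs >= min_df_fraction)
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    rows = [[counts[t] * idf[t] for t in vocab] for counts in term_counts]
    return vocab, rows

vocab, rows = tfidf_matrix(docs, min_df_fraction=0.5)
print(vocab)
print(rows[0])
```

Terms that appear in most documents, such as methylpyrazine here, receive a low weight, while terms that distinguish a subset of documents receive a higher one.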
Using the SIMCA data analytics tool, we first got an overview of the patent collection with PCA and then performed a more in-depth analysis using Orthogonal Projections to Latent Structures (OPLS). Prior to multivariate data analysis, we scaled the tf-idf-transformed bag-of-words model of the patent abstracts to unit variance (UV scaling). The number of components was determined using SIMCA’s autofit functionality.
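For readers who want to reproduce the scaling step outside SIMCA, here is a minimal sketch of UV scaling in plain Python. It assumes that each variable is mean-centred and then divided by its sample standard deviation, and that constant columns are set to zero. The function name uv_scale and the example values are illustrative, and the sketch does not reproduce SIMCA's own implementation or its autofit procedure.

```python
import statistics

def uv_scale(X):
    """Centre each column and divide it by its sample standard deviation."""
    columns = list(zip(*X))
    means = [statistics.mean(column) for column in columns]
    sds = [statistics.stdev(column) for column in columns]
    return [
        # Constant columns carry no variation, so they are set to zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in X
    ]

# Made-up tf-idf values: 3 documents (rows) x 3 terms (columns).
tfidf = [
    [1.0, 4.0, 0.0],
    [2.0, 4.0, 1.0],
    [3.0, 4.0, 5.0],
]

scaled = uv_scale(tfidf)
for row in scaled:
    print(row)
```

After scaling, every non-constant term has the same variance, so frequent and rare terms get an equal chance to influence the model.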
Figure 1: Overview of 114 patents, coloured according to CPC code, using PCA. The patents are divided into three main directions in the first two components, namely human necessities (CPC A), and chemical compounds and chemical methods (both CPC C).
Figure 2: A loading scatter plot of the terms that describe the patents in the overview PCA model.
Figure 3: Score scatter plot of the first and third components of the multi-label OPLS model. OPLS separates foodstuff patents from medical science patents (both are part of human necessities).
Figure 4: OPLS loadings of top terms explaining foodstuff patents (A) and medical science patents (B).
PCA revealed that the patents could be divided into three broad categories: human necessities, chemical methods and chemical compounds. These categories could easily be deduced by interpreting the PCA loadings, without investigating the classification categories. Although many patents lack CPC codes, they could still be classified using the semantic variation. Since a single patent can have several CPC codes, we also used a multi-label adaptation of OPLS-DA to separate patents one step further down the CPC hierarchy for a more fine-grained analysis. This enabled us to discover that human necessities patents could be sub-divided into foodstuff-related and medical science-related patents.
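The multi-label formulation itself is described in the paper. The sketch below only illustrates one plausible way to encode the response, and that encoding is an assumption on our part: a binary indicator matrix with one column per CPC class, in which a patent can have several ones or none. The class labels A23 (foodstuffs) and A61 (medical science) are real CPC classes, but the four example patents are made up.

```python
def multilabel_indicator(label_sets):
    """Build a 0/1 matrix with one row per patent and one column per class."""
    classes = sorted(set().union(*label_sets))
    Y = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, Y

# CPC classes per patent; the last patent has no CPC code at all.
patent_classes = [
    {"A23"},
    {"A23", "A61"},
    {"A61"},
    set(),
]

classes, Y = multilabel_indicator(patent_classes)
print(classes)
for row in Y:
    print(row)
```

A patent without CPC codes simply gets a row of zeros, which keeps it in the data set so that it can still be placed relative to the labelled patents.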
The value of MVDA in text analysis
This example shows how multivariate data analysis can be used to better understand a patent collection and quickly reveal semantic patterns in patent texts. A similar approach can be valuable in mining information in other textual data, such as customer reviews, medical journals, scientific literature and maintenance reports.
Learn more about OPLS and MVDA
Download a free chapter from our book "Multi- and Megavariate Data Analysis". The chapter introduces the data analysis method Orthogonal Projections to Latent Structures (OPLS), also known as orthogonal partial least squares.
1. Multivariate patent analysis: Using chemometrics to analyze collections of chemical and pharmaceutical patents. R Sjögren, K Stridh, T Skotare, and J Trygg. J. Chemometrics, 10 May 2018.