Co-author: Jennifer Davis
As an organization, you have no doubt amassed a large collection of documents: legacy reports, analyses, proposals, invoices, and presentations, to name a few. If you’re like most organizations, these documents have been written by numerous authors over the life of your company, each contributing a unique voice and writing style. Many artifacts are digital, but some may remain in handwritten form. As independent entities, each contains data, perhaps even information. Together they constitute a corpus of documents.
How do you unlock the information, knowledge, and insight within these documents in an efficient and repeatable manner? You need natural language processing.
Natural Language Processing (NLP) involves utilizing computers to process large amounts of text data and extract meaning from this data. Before utilizing Natural Language Processing, we must ingest our documents into a machine-readable format. This is a fairly straightforward process in some cases as the corpus of interest is already sitting on your laptop or in company drives. In other situations, the documents have to be translated into a machine-readable format typically using Optical Character Recognition (OCR), a process by which a machine “reads” a document and translates it into data it can further process.
We can then clean the documents using a host of processes. This may include removing stop words, which are common words (such as “the” or “and”) that add little meaning to the corpus. We may tokenize the text, breaking the overall document into smaller segments known as tokens, and we may normalize the text using techniques such as lemmatization or stemming, which reduce words to a base form so that variants of the same word are treated alike. Together, these steps turn raw text into a form from which knowledge and insights can be extracted.
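The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pipeline we use in practice: the stop-word list, the `crude_stem` helper, and the sample sentence are all assumptions made up for the example, and a real project would more likely rely on an established NLP library for tokenization and lemmatization.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "were"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then normalize each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The analysts were reviewing the invoices and reports."))
# ['analyst', 'review', 'invoice', 'report']
```

The suffix-stripping rule is deliberately naive and will mangle many words; it is here only to show where normalization sits in the sequence.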
Further processing depends on the desired goal of the analysis. Are we aiming to extract semantic meaning or syntactic meaning? Syntactic meaning typically comes from word or phrase similarity or presence. It may center on the number of times a particular word or phrase is present within a given document or corpus, or it may focus on a more sophisticated measure such as TF-IDF (Term Frequency-Inverse Document Frequency), which weights the presence of a particular word or phrase relative to how commonly it is used within the corpus of interest. In this case, rare language is given a higher weight for its uniqueness relative to common terms. Syntactic analysis is typically efficient to compute, but it lacks semantic context and, therefore, deeper meaning.
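To make the TF-IDF idea concrete, here is a small sketch that computes the weights by hand. It assumes one common formulation (term frequency as the count divided by document length, inverse document frequency as the natural log of the number of documents over the number containing the term); libraries often apply smoothing or normalization on top of this. The function name `tf_idf` and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in the corpus."""
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus of already-tokenized documents (an assumption for the example).
corpus = [
    ["invoice", "payment", "report"],
    ["invoice", "proposal", "report"],
    ["invoice", "resume", "hiring"],
]
scores = tf_idf(corpus)

for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "invoice" appears in every document and so receives a weight of zero, while "payment", which appears nowhere else, receives the highest weight.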
Semantic analysis focuses on the meaning resident within a document or corpus. Common steps include:
We can also utilize word embeddings to associate similar words or concepts with one another. This approach represents each word as a vector and then compares the vectors in vector space, where similar terms (e.g., king, queen) would be close together, whereas dissimilar words (e.g., bird, sandwich) would be far apart. These embeddings can feed downstream tasks such as sentiment analysis and text classification (e.g., spam or not spam), and they serve as inputs to deep learning models such as GPT-3. Ultimately, semantic analyses are more computationally expensive but provide greater knowledge by highlighting connections between documents and insights across documents, such as topics or themes that only become apparent when looking at a corpus holistically.
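Closeness in vector space is often measured with cosine similarity. The sketch below shows the calculation using hand-made three-dimensional vectors; the numbers are invented purely for illustration and do not come from any trained model, where embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors (an assumption); real embeddings are learned from text.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "bird": [0.1, 0.9, 0.0],
    "sandwich": [0.0, 0.1, 0.9],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["bird"], embeddings["sandwich"])
print(f"king vs. queen: {royal:.2f}")
print(f"bird vs. sandwich: {unrelated:.2f}")
```

With these toy values, the king/queen pair scores close to 1 and the bird/sandwich pair scores close to 0, mirroring the intuition that similar terms sit near each other.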
In order to prove useful to non-Machine-Learning (ML) users, Natural Language Processing results must finally be visualized in some capacity. This may be as simple as a word cloud of common terms or a knowledge graph that connects the concepts at a higher level. The aim of an effective visualization is to augment the analyst’s capabilities in a way that allows them to be more effective.
We have discovered in our work that effective Natural Language Processing generally leads to the following benefits:
If Natural Language Processing is so great, it may be natural to wonder why everyone isn’t using it. There are a number of drawbacks associated with the use of NLP that we’d like to address:
Mile Two joined the Human Resource Assistant (HRA) effort with AFRL/ACT3 in 2020. HRA aims to decrease the tedious workload of the staffing specialists, decrease biases during the resume evaluation process, and increase both reliability and consistency in hiring practices across the U.S. Government.
Staffing specialists are responsible for the manual, time-consuming work of ensuring alignment between employment candidates and the job guidelines established by the US Office of Personnel Management. Using innovative user interface development, significant end-user input, and novel natural language processing capabilities, we were able to significantly improve the workflow for staffing specialists, who have indicated in preliminary assessments that the Human Resource Assistant increased their focus on their work and significantly decreased resume review time. HRA is a prime example of the way Mile Two marries human-machine teaming with novel Natural Language Processing capabilities.
Mile Two specializes in Natural Language Processing and its use in human-machine teaming. We have used Natural Language Processing to work with organizations to support human resource personnel and staffing decisions, research information discovery, intelligence information gathering, and beyond. This experience allows us to apply Natural Language Processing where it works best: as an aid to the decision maker, increasing both their efficiency and effectiveness by extracting the knowledge and insights resident within a corpus of interest. If you would like to discuss Natural Language Processing and machine learning or book a demo of our Human Resource Assistant tool, you can reach out here!