Progression of data to insight
Published On: March 22, 2023 | Categories: Blog, Work

Co-author: Jennifer Davis

So. Much. Data.

As an organization, you have no doubt amassed a large collection of documents. Legacy reports, analyses, proposals, invoices, and presentations, to name a few. If you’re like most organizations, these documents have been written by numerous authors over the life of your company, each contributing a unique voice and writing style. Many artifacts are digital, but some may remain in handwritten form. As independent entities, each contains data, perhaps even information. Together they constitute a corpus of documents. 

How do you unlock the information, knowledge, and insight within these documents in an efficient and repeatable manner? You need natural language processing.

What is Natural Language Processing?

Natural Language Processing (NLP) uses computers to process large amounts of text data and extract meaning from it. Before applying NLP, we must ingest our documents in a machine-readable format. In some cases this is straightforward, as the corpus of interest is already sitting on your laptop or on company drives. In other situations, the documents must first be converted to a machine-readable format, typically using Optical Character Recognition (OCR), a process by which a machine “reads” a document and translates it into data it can further process.

We can then clean the documents using a host of processes. This may include removing stop words (common words that add little meaning to the corpus). We may tokenize the text, breaking the overall document into smaller segments known as tokens, and we may normalize the text using techniques such as lemmatization or stemming, which reduce words to their base forms so the machine can more readily extract knowledge and insights.
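The cleaning steps above can be sketched in pure Python. This is a toy illustration only: the stop-word list is deliberately tiny, and the suffix-stripping stemmer is a stand-in for a proper algorithm such as the Porter stemmer found in libraries like NLTK.

```python
import re

# Toy stop-word list; real pipelines use much larger lists (e.g., from NLTK or spaCy).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token: str) -> str:
    """Naive suffix-stripping stemmer; illustrative only."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean(text: str) -> list[str]:
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]

print(clean("The analysts reviewed the scanned invoices"))
```

Note that the naive stemmer produces crude stems (e.g., "scanned" becomes "scann"); that roughness is exactly why production pipelines rely on established stemmers or lemmatizers instead.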

Further processing depends on the desired goal of the analysis. Are we aiming to extract semantic meaning or syntactic meaning? Syntactic meaning typically comes from word or phrase similarity or presence. It may center on the number of times a particular word or phrase appears within a given document or corpus, or it may use a more sophisticated measure such as TF-IDF (Term Frequency-Inverse Document Frequency), which weights the presence of a particular word or phrase relative to how commonly it is used within the corpus of interest. In this case, rare language is given a higher weight for its uniqueness relative to common terms. Syntactic analysis is typically computationally efficient, but as a result it lacks semantic context and, therefore, deeper meaning.
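The TF-IDF weighting described above can be sketched in a few lines of pure Python. This implements one common variant of the formula; production libraries such as scikit-learn apply additional smoothing and normalization on top of it.

```python
import math
from collections import Counter

def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Compute TF-IDF scores for each tokenized document in a corpus.

    TF  = count of term in document / total terms in document
    IDF = log(number of documents / number of documents containing the term)
    """
    n_docs = len(corpus)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

corpus = [
    ["contract", "invoice", "total"],
    ["contract", "proposal", "draft"],
]
scores = tf_idf(corpus)
```

Because "contract" appears in every document of this toy corpus, its IDF is log(2/2) = 0 and it scores zero, while rarer terms like "invoice" receive positive weight, which is exactly the uniqueness-over-commonality behavior described above.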

Semantic analysis focuses on the meaning resident within a document or corpus. Common steps include:

  1. Part-of-speech (POS) tagging, which identifies what part of speech (e.g., noun, verb, adjective) a particular token is.
  2. Segmentation, which focuses on meaningful extraction of chunks of text within the document (e.g., sentences or paragraphs).
  3. Named entity recognition, which identifies and classifies entities such as people, locations, and organizations within a document.
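Of the steps above, POS tagging and named entity recognition generally require trained models (e.g., spaCy or NLTK), but the segmentation step can be illustrated with a naive regex-based sentence splitter in pure Python:

```python
import re

def segment_sentences(text: str) -> list[str]:
    """Naively split text into sentences on ., !, or ? followed by whitespace.

    Real segmenters (e.g., in spaCy or NLTK) use trained models to handle
    abbreviations, decimal numbers, and other edge cases this regex gets wrong.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "The invoice is overdue. Please remit payment! Questions? Contact us."
print(segment_sentences(doc))
```

Running this yields four sentence chunks; a document like "Dr. Smith arrived." would fool the regex, which is why production pipelines use model-based segmenters.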

We can also utilize word embeddings to associate similar words or concepts with one another. This approach vectorizes words and then measures them relative to one another in vector space, where similar terms (e.g., king, queen) would be close together, whereas dissimilar words (e.g., bird, sandwich) would be far apart. These embeddings can be utilized for further analysis such as sentiment analysis or text classification (e.g., spam or not spam), or as inputs to a deep learning model such as GPT-3. Ultimately, semantic analyses are more computationally expensive, but they provide greater knowledge by highlighting connections between documents and insights such as topics or themes resident within a corpus that only become apparent when looking at the corpus holistically.
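The intuition that similar words sit close together in vector space is typically quantified with cosine similarity. The sketch below uses hand-assigned three-dimensional toy vectors purely for illustration; real embeddings (e.g., word2vec or GloVe) have hundreds of dimensions and are learned from data rather than written by hand.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: ~1.0 = same direction, ~0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; real vectors are learned from a corpus, not hand-assigned.
embeddings = {
    "king":     [0.90, 0.80, 0.10],
    "queen":    [0.85, 0.75, 0.20],
    "sandwich": [0.10, 0.05, 0.90],
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))     # close to 1
print(cosine_similarity(embeddings["king"], embeddings["sandwich"]))  # much lower
```

The related pair scores near 1.0 while the unrelated pair scores far lower, mirroring the king/queen versus bird/sandwich contrast in the text.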

Knowledge Graph of Machine Learning Concepts

In order to prove useful to non-Machine-Learning (ML) users, Natural Language Processing results must finally be visualized in some capacity. This may be as simple as a word cloud of common terms or a knowledge graph that connects the concepts at a higher level. The aim of an effective visualization is to augment the analyst’s capabilities, allowing them to work more effectively.

4 Benefits of Natural Language Processing

We have discovered in our work that effective Natural Language Processing generally leads to the following benefits:

  1. Increased efficiency due to machine-driven document ingestion and analysis. Computers are fast at ingesting and processing documents. This leads to massive gains in efficiency as compared to a human reader, translating to more time spent on analytic tasks that matter to the user.
  2. Increased focus through relevant information highlighting driven by similarity analysis. Semantic or syntactic similarity analysis allows the user to focus on information of interest, leading to knowledge and insight, by connecting this information across the corpus in ways that would prove difficult, if not impossible, to accomplish manually.
  3. Improved transparency driven by user-provided feedback and notes. To avoid the user feeling like a passive participant in the document analysis process, we actively encourage purposeful human-machine teaming. This can be accomplished in part by user documentation of decisions and rationale, leading to increased transparency in system results.
  4. Decreased bias driven by relevant information presentation fed by user requirements. The old modeling adage of “garbage in, garbage out” pertains as much to NLP as to any other modeling approach. Effective user requirement elicitation powers effective Natural Language Processing, which leads to relevant information extraction. Transparency of results also leads to decreased bias as users better understand how results were obtained.

3 Drawbacks of Natural Language Processing

If Natural Language Processing is so great, then, it may be natural to wonder why everyone isn’t using it. There are a number of drawbacks associated with the use of NLP that we’d like to address:

  1. Individuals fear that Natural Language Processing (or AI/ML) will replace their role. This is an unfounded fear. One must simply look at the latest press surrounding ChatGPT and its troubled roll-out to understand that rumors of an AI/ML takeover are greatly exaggerated. These tools can, however, effectively aid users in executing their job duties, freeing them up to concentrate on the cognitive work that they excel at.
  2. A belief that AI/ML needs a “babysitter.” We concede as much, at least in part. AI/ML models are not yet at the point in their maturity where we can simply “set and forget” them. However, we see this as a benefit and not a drawback. We want interaction between humans and machines. It is in this interaction that we believe both entities can serve their best purpose.
  3. A lack of trust in the accuracy or reliability of AI/ML findings. This is a fair criticism of these techniques, including Natural Language Processing. They should not be viewed as a panacea by any means. However, if deployed effectively and within appropriate bounds, they can prove extremely useful in expediting the document analysis workflow, allowing analysts to increase their effectiveness by focusing on the most relevant pieces of information that require human intervention.

Natural Language Processing in Practice

The HRA Tool with PII Removed

Mile Two joined the Human Resource Assistant (HRA) effort with AFRL/ACT3 in 2020. HRA aims to decrease the tedious workload of the staffing specialists, decrease biases during the resume evaluation process, and increase both reliability and consistency in hiring practices across the U.S. Government.

Staffing specialists are tasked with the manually intensive and time-consuming work of ensuring alignment between employment candidates and the job guidelines established by the U.S. Office of Personnel Management. Using innovative user interface development, significant end-user input, and novel natural language processing capabilities, we were able to significantly improve the workflow for staffing specialists, who have indicated in preliminary assessments that the Human Resource Assistant increased their focus and decreased resume review time significantly. HRA is a prime example of the way Mile Two marries human-machine teaming with novel Natural Language Processing capabilities.

Artificial Intelligence at Mile Two

Mile Two specializes in Natural Language Processing and its use in human-machine teaming. We have used Natural Language Processing to work with organizations to support human resource personnel and staffing decisions, research information discovery, intelligence information gathering, and beyond. This allows us to utilize Natural Language Processing the best way it can be used: as an aid to the decision maker, increasing both their efficiency and effectiveness by extracting knowledge and insights resident within a corpus of interest. If you would like to discuss Natural Language Processing and machine learning or book a demo of our Human Resource Assistant tool, you can reach out here!
