LitCoin NLP Challenge by NCATS & NASA

Can you identify biomedical entities in research titles and abstracts?

Photo by Adrien Converse on Unsplash

In the past two competitions, you predicted video popularity and viral tweets with data science. Now it’s time for a whole new challenge!

Bitgrit has released an NLP challenge with a prize pool of $100,000 💵!

The first phase of this competition ends on 23rd December 2021, so sign up to get access to the data, and follow along with this article to get started!

The goal 🥅

Develop an NLP model that identifies mentions of biomedical 🧬 entities in research abstracts.

There are two parts to this competition.

(The following information is taken from the website)

Part 1: Given only an abstract text, the goal is to find all the nodes or biomedical entities (position in text and BioLink Model Category).

As stated on the website, the type of each biomedical entity comes from the BioLink Model Categories and can be one and only one of the following:

  • DiseaseOrPhenotypicFeature
  • ChemicalEntity
  • OrganismTaxon
  • GeneOrGeneProduct
  • SequenceVariant
  • CellLine

Part 2: Given the abstract and the nodes annotated from it, the goal is to find all the relationships between them (pair of nodes, BioLink Model Predicate and novelty).

From the description of the competition:

Each phase of the competition is designed to spur innovation in the field of natural language processing, asking competitors to design systems that can accurately recognize scientific concepts from the text of scientific articles, connect those concepts into knowledge assertions, and determine if that claim is a novel finding or background information.

This article will be focusing on the first part, which is to identify biomedical entities.

What does the data look like?

📂 LitCoin DataSet
 ├── abstracts_train.csv
 ├── entities_train.csv
 ├── relations_train.csv
 └── abstracts_test.csv

A little info about the data:

  • abstracts_train is where you’ll find the title and abstract of biomedical journal articles.
  • entities_train has the 6 categories that you need to predict, along with the start and end offsets of each mention within the combined title and abstract string.
  • relations_train is for phase 2 of the competition, so don’t worry about this data for now.

Relationship between the data

The relationship in this dataset is simple: entities is related to abstracts through abstract_id, and to relations through entity_ids.

Below is a visualization of the relationship.

Image by Author (Created with drawsql)

More info about the data in the guidelines section of the competition.

Now that you have an idea about the goal and some information about the data given to you, it’s time to get your hands dirty.

All the code can be found on Google Colab or on Deepnote.

Named Entity Recognition (NER)

There are different NLP tasks that solve different problems — text summarization, part of speech tagging, translation, and more.

This particular problem is a Named Entity Recognition task, specifically on biomedical entities.

What is it?

According to good ol’ Wikipedia, NER is defined as

a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

For example, take the following sentence:

“Albert Einstein is a physicist who was born in Germany, on March 14, 1879”

If you read it, you’d immediately be able to classify the named entities into the following categories:

  • person: Albert Einstein
  • job title: physicist
  • location: Germany
  • date: March 14, 1879

While it was simple for us humans to identify and categorize, computers need NLP to comprehend human language.

This is what NER does – it identifies and segments the main entities in a text. The entire goal is that computers extract relevant information from a large pile of unstructured text data.
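
To make the output format concrete, here is a small hand-written sketch (not a model prediction) that stores entities from the first part of the sentence above as (mention, label, start, end) tuples. The labels and the find_span helper are illustrative assumptions. A real NER model predicts these spans rather than looking them up.

```python
# Hand-written illustration of NER output; a real model predicts these spans.
sentence = "Albert Einstein is a physicist who was born in Germany"

# Hypothetical annotations: (mention, label)
annotations = [
    ("Albert Einstein", "person"),
    ("physicist", "job title"),
    ("Germany", "location"),
]


def find_span(text, mention):
    """Return (start, end) character offsets of the first occurrence of mention."""
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"{mention!r} not found in text")
    return start, start + len(mention)


entities = []
for mention, label in annotations:
    start, end = find_span(sentence, mention)
    entities.append((mention, label, start, end))

for mention, label, start, end in entities:
    # Slicing the text with the offsets recovers the mention exactly.
    print(f"{label:<10} {sentence[start:end]!r} [{start}:{end}]")
```

The same four fields (text, label, start offset, end offset) are what the competition data and the scispaCy models use later in this article.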

With enough data, you can train NER models that classify and segment these entities with high accuracy. Those models can also produce visualizations like the one below.

Image by author (Made with excalidraw)

How?

There are different libraries and packages such as spaCy and NLTK that allow you to perform NER, and many pretrained NER models with different approaches are also available online.

Since our problem is more specific for biomedical text, we will be using scispaCy, a Python package containing spaCy models for processing biomedical, scientific or clinical text.

scispaCy comes with the following pre-trained models for you to use.

pre-trained models in scispaCy (Source)

Notice that there are 4 NER models, each trained on a different corpus of biomedical articles.

The documentation also lists the entity types that each pretrained model predicts.

Entity types of models (Source)

Knowing the entity types and what they represent will be useful soon. But for now, let us install the necessary libraries and dive into the code.

Installing libraries and models

# install spaCy 3.0.1 (scispaCy needs spaCy >= 3.0.1)
!pip install spacy==3.0.1

# install scispacy, pinned to match the v0.4.0 models below
!pip install scispacy==0.4.0

# install models
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_craft_md-0.4.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_jnlpba_md-0.4.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz

To use scispacy, we’ll need spacy to be at least version 3.0.1, so we install that first.

Then, we’ll install scispacy and the four models.

Load Libraries

# essentials
import pandas as pd
import numpy as np

# spacy
import scispacy
import spacy

# display results
from spacy import displacy

# scispacy models
import en_ner_craft_md
import en_ner_jnlpba_md
import en_ner_bc5cdr_md
import en_ner_bionlp13cg_md

# utility
from pprint import pprint

# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline

We’ll then import the necessary libraries, and the models as well.

Import the data

data_path = 'PATH_TO_DATA'

# Load training datasets
abstracts_train = pd.read_csv(data_path + 'abstracts_train.csv', sep='\t')
entities_train = pd.read_csv(data_path + 'entities_train.csv', sep='\t')

# Load test data
abstracts_test = pd.read_csv(data_path + 'abstracts_test.csv', sep='\t')

The csv files are separated by tabs, so we use the sep parameter to specify that.

# print dimensions of data
print('Dimension of abstracts train: ', abstracts_train.shape)
print('Dimension of entities train: ', entities_train.shape)
print()
print('Dimension of abstracts test: ', abstracts_test.shape)
Dimension of abstracts train:  (400, 3)
Dimension of entities train:  (13636, 7)

Dimension of abstracts test:  (100, 3)

EDA

Abstract data

abstracts_train.info()
abstracts_train.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   abstract_id  400 non-null    int64 
 1   title        400 non-null    object
 2   abstract     400 non-null    object
dtypes: int64(1), object(2)
memory usage: 9.5+ KB
   abstract_id  title                                            abstract
0  1353340      Late-onset metachromatic leukodystrophy: molec…  We report on a new allele at the arylsulfatase…
1  1671881      Two distinct mutations at a single BamHI site …  Classical phenylketonuria is an autosomal rece…
2  1848636      Debrisoquine phenotype and the pharmacokinetic…  The metabolism of the cardioselective beta-blo…
3  2422478      Midline B3 serotonin nerves in rat medulla are…  Previous experiments in this laboratory have s…
4  2491010      Molecular and phenotypic analysis of patients …  Eighty unrelated individuals with Duchenne mus…

From the output, we have exactly 400 title and abstract pairs to use for predicting entity types.

Entities Data

entities_train.info()
entities_train.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13636 entries, 0 to 13635
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             13636 non-null  int64 
 1   abstract_id    13636 non-null  int64 
 2   offset_start   13636 non-null  int64 
 3   offset_finish  13636 non-null  int64 
 4   type           13636 non-null  object
 5   mention        13636 non-null  object
 6   entity_ids     13636 non-null  object
dtypes: int64(4), object(3)
memory usage: 745.8+ KB
   id  abstract_id  offset_start  offset_finish  type                        mention                       entity_ids
0  0   1353340      11            39             DiseaseOrPhenotypicFeature  metachromatic leukodystrophy  D007966
1  1   1353340      111           126            GeneOrGeneProduct           arylsulfatase A               410
2  2   1353340      128           132            GeneOrGeneProduct           ARSA                          410
3  3   1353340      159           187            DiseaseOrPhenotypicFeature  metachromatic leukodystrophy  D007966
4  4   1353340      189           192            DiseaseOrPhenotypicFeature  MLD                           D007966

A peek at our entities data shows that we have over 13k entities extracted from the title and abstract strings.

Checking for missing values

A helper function I wrote, missing_cols, confirms that there are no missing values in the dataset.

missing_cols(abstracts_train)
missing_cols(entities_train)
no missing values
no missing values
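
The body of missing_cols is not shown in this article, so the following is only a guess at what such a helper might look like. It assumes the helper prints each column that has missing values (with a count and a percentage) and prints "no missing values" otherwise. The demo frame is made up.

```python
import pandas as pd


def missing_cols(df):
    """Print columns of df that contain missing values, with counts and percentages.

    Hypothetical reconstruction; the author's original helper is not shown.
    """
    counts = df.isnull().sum()
    counts = counts[counts > 0]
    if counts.empty:
        print("no missing values")
        return counts
    for col, n in counts.items():
        print(f"{col} => {n} [{n / len(df):.2%}]")
    return counts


# Tiny made-up frame to show both outcomes
demo = pd.DataFrame({"title": ["a", None, "c"], "abstract": ["x", "y", "z"]})
with_gaps = missing_cols(demo)            # reports the 'title' column
without_gaps = missing_cols(demo.dropna())  # prints "no missing values"
```

Returning the counts as well as printing them is a design choice of this sketch, so the result can be reused in later checks.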

How many Entity types are there?

entities_train['type'].value_counts().plot(kind="barh").invert_yaxis();

With a simple bar plot, it seems that GeneOrGeneProduct is the most common type.

Now that we understand our data a little bit better, let’s start looking at what scispaCy can do.

scispaCy in action

Let’s start by taking the first title and abstract string.

text = abstracts_train.iloc[0].title + abstracts_train.iloc[0].abstract
len(text)

717

The first string has 717 characters.

Printing it shows the full string.

text

'Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.We report on a new allele at the arylsulfatase A (ARSA) locus causing late-onset metachromatic leukodystrophy (MLD). In that allele arginine84, a residue that is highly conserved in the arylsulfatase gene family, is replaced by glutamine. In contrast to alleles that cause early-onset MLD, the arginine84 to glutamine substitution is associated with some residual ARSA activity. A comparison of genotypes, ARSA activities, and clinical data on 4 individuals carrying the allele of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD.. '

Loading the model

Now let’s load the first model, en_ner_bionlp13cg_md, which is trained on the BioNLP13CG corpus, and pass our text into it.

nlp = en_ner_bionlp13cg_md.load()
doc_bionlp13cg = nlp(text)

We now have a document object that contains information about the entities.

Getting entities

Calling .ents on it, we can see the entities it extracted.

doc_bionlp13cg.ents


(arylsulfatase A,
 ARSA),
 MLD,
 arginine84,
 arylsulfatase,
 glutamine,
 MLD,
 arginine84,
 glutamine,
 ARSA,
 ARSA,
 individuals,
 patients,
 MLD,
 ARSA,
 MLD)

Visualizing entities with labels

We can even have spaCy visualize the entities and their labels right on our title+abstract string. This is done with the displacy.render function.

displacy.render(doc_bionlp13cg, jupyter=True, style='ent')


Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.We report on a new allele at the arylsulfatase A GENE_OR_GENE_PRODUCT ( ARSA) GENE_OR_GENE_PRODUCT locus causing late-onset metachromatic leukodystrophy ( MLD CANCER ). In that allele arginine84 GENE_OR_GENE_PRODUCT , a residue that is highly conserved in the arylsulfatase GENE_OR_GENE_PRODUCT gene family, is replaced by glutamine AMINO_ACID . In contrast to alleles that cause early-onset MLD GENE_OR_GENE_PRODUCT , the arginine84 GENE_OR_GENE_PRODUCT to glutamine AMINO_ACID substitution is associated with some residual ARSA GENE_OR_GENE_PRODUCT activity. A comparison of genotypes, ARSA GENE_OR_GENE_PRODUCT activities, and clinical data on 4 individuals ORGANISM carrying the allele of 81 patients ORGANISM with MLD GENE_OR_GENE_PRODUCT examined, further validates the concept that different degrees of residual ARSA GENE_OR_GENE_PRODUCT activity are the basis of phenotypical variation in MLD GENE_OR_GENE_PRODUCT .. 

Each entity in the document also has the attributes text, label_, start_char, and end_char, which are the pieces of information we need for this challenge.

pprint({(X.text, X.label_, X.start_char, X.end_char) for X in doc_bionlp13cg.ents})

{('ARSA', 'GENE_OR_GENE_PRODUCT', 441, 445),
 ('ARSA', 'GENE_OR_GENE_PRODUCT', 483, 487),
 ('ARSA', 'GENE_OR_GENE_PRODUCT', 654, 658),
 ('ARSA)', 'GENE_OR_GENE_PRODUCT', 127, 132),
 ('MLD', 'CANCER', 188, 191),
 ('MLD', 'GENE_OR_GENE_PRODUCT', 362, 365),
 ('MLD', 'GENE_OR_GENE_PRODUCT', 575, 578),
 ('MLD', 'GENE_OR_GENE_PRODUCT', 711, 714),
 ('arginine84', 'GENE_OR_GENE_PRODUCT', 209, 219),
 ('arginine84', 'GENE_OR_GENE_PRODUCT', 371, 381),
 ('arylsulfatase', 'GENE_OR_GENE_PRODUCT', 263, 276),
 ('arylsulfatase A', 'GENE_OR_GENE_PRODUCT', 110, 125),
 ('glutamine', 'AMINO_ACID', 305, 314),
 ('glutamine', 'AMINO_ACID', 385, 394),
 ('individuals', 'ORGANISM', 523, 534),
 ('patients', 'ORGANISM', 561, 569)}

And voila! You’ve extracted entities from a biomedical paper title and abstract with a pretrained NER model from scispaCy!
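
One detail in the output above is that the mention 'ARSA)' includes the closing parenthesis, while the training data annotates just 'ARSA'. A possible cleanup step, sketched below, trims stray punctuation from both ends of a span and adjusts the offsets to match. The trim_span helper and the STRAY character set are assumptions for illustration, not part of scispaCy, and mentions that legitimately end in a bracket would need a more careful rule.

```python
# Characters treated as stray punctuation around a mention (an assumption; tune as needed).
STRAY = "()[]{}.,;: "


def trim_span(text, start, end, stray=STRAY):
    """Shrink [start, end) so the span neither starts nor ends with stray punctuation."""
    while start < end and text[start] in stray:
        start += 1
    while end > start and text[end - 1] in stray:
        end -= 1
    return text[start:end], start, end


snippet = "the arylsulfatase A (ARSA) locus"
start = snippet.index("ARSA)")
end = start + len("ARSA)")

# The span shrinks by one character on the right; the start offset is unchanged.
print(trim_span(snippet, start, end))
```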

Now let’s see what entities the other 3 models extract.

nlp_craft = en_ner_craft_md.load()
nlp_jnlpba = en_ner_jnlpba_md.load()
nlp_bc5cdr = en_ner_bc5cdr_md.load()

doc_craft = nlp_craft(text)
doc_jnlpba = nlp_jnlpba(text)
doc_bc5cdr = nlp_bc5cdr(text)

CRAFT Corpus model

displacy.render(doc_craft, jupyter=True, style="ent")

Late-onset metachromatic leukodystrophy: molecular CHEBI pathology in two siblings.We report on a new allele SO at the arylsulfatase A GGP (ARSA) locus SO causing late-onset metachromatic leukodystrophy (MLD). In that allele SO arginine84, a residue that is highly conserved SO in the arylsulfatase gene GGP family, is replaced SO by glutamine CHEBI . In contrast to alleles SO that cause early-onset MLD, the arginine84 to glutamine substitution CHEBI is associated with some residual ARSA activity. A comparison of genotypes SO , ARSA GGP activities, and clinical data on 4 individuals TAXON carrying the allele SO of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD.. 

JNLPBA corpus model

displacy.render(doc_jnlpba, jupyter=True, style='ent')

Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.We report on a new allele DNA at the arylsulfatase A (ARSA) locus DNA causing late-onset metachromatic leukodystrophy (MLD). In that allele arginine84 DNA , a residue that is highly conserved in the arylsulfatase gene family DNA , is replaced by glutamine. In contrast to alleles that cause early-onset MLD, the arginine84 PROTEIN to glutamine substitution is associated with some residual ARSA PROTEIN activity. A comparison of genotypes, ARSA PROTEIN activities, and clinical data on 4 individuals carrying the allele DNA of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA PROTEIN activity are the basis of phenotypical variation in MLD.. 

BC5CDR corpus model

displacy.render(doc_bc5cdr, jupyter=True, style='ent')

 Late-onset metachromatic leukodystrophy DISEASE : molecular pathology in two siblings.We report on a new allele at the arylsulfatase A (ARSA) locus causing late-onset metachromatic leukodystrophy DISEASE ( MLD DISEASE ). In that allele arginine84, a residue that is highly conserved in the arylsulfatase gene family, is replaced by glutamine CHEMICAL . In contrast to alleles that cause early-onset MLD DISEASE , the arginine84 to glutamine CHEMICAL substitution is associated with some residual ARSA activity. A comparison of genotypes, ARSA activities, and clinical data on 4 individuals carrying the allele of 81 patients with MLD DISEASE examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD DISEASE .. 

Now let’s take all these entity categorizations and put them all together.

data_doc_bionlp13cg = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_bionlp13cg.ents]
data_doc_craft = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_craft.ents]
data_doc_bc5cdr = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_bc5cdr.ents]
data_doc_jnlpba = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_jnlpba.ents]

data = data_doc_bionlp13cg + data_doc_craft + data_doc_bc5cdr + data_doc_jnlpba

Now let’s create a data frame for our combined extracted entities, with the text, label, starting character, and the ending character.

attrs = ["text", "label_", "start_char", "end_char"]
temp_df = pd.DataFrame(data, columns=attrs)

Then, we compare it with the entity type given in our training data.

temp_train_df = entities_train.query('abstract_id == 1353340')
temp_train_df.head()
   id  abstract_id  offset_start  offset_finish  type                        mention                       entity_ids
0  0   1353340      11            39             DiseaseOrPhenotypicFeature  metachromatic leukodystrophy  D007966
1  1   1353340      111           126            GeneOrGeneProduct           arylsulfatase A               410
2  2   1353340      128           132            GeneOrGeneProduct           ARSA                          410
3  3   1353340      159           187            DiseaseOrPhenotypicFeature  metachromatic leukodystrophy  D007966
4  4   1353340      189           192            DiseaseOrPhenotypicFeature  MLD                           D007966

We can compare them side by side with an inner join between the two data frames.

merged_df = temp_df.merge(temp_train_df, how = 'inner', left_on ='text', right_on = 'mention')
merged_df[['text', 'label_', 'type']].drop_duplicates()
    text             label_                type
0   arylsulfatase A  GENE_OR_GENE_PRODUCT  GeneOrGeneProduct
1   arylsulfatase A  GGP                   GeneOrGeneProduct
2   MLD              CANCER                DiseaseOrPhenotypicFeature
6   MLD              GENE_OR_GENE_PRODUCT  DiseaseOrPhenotypicFeature
18  MLD              DISEASE               DiseaseOrPhenotypicFeature
34  arginine84       GENE_OR_GENE_PRODUCT  SequenceVariant
36  arginine84       PROTEIN               SequenceVariant
37  arylsulfatase    GENE_OR_GENE_PRODUCT  GeneOrGeneProduct
38  ARSA             GENE_OR_GENE_PRODUCT  GeneOrGeneProduct
50  ARSA             GGP                   GeneOrGeneProduct
54  ARSA             PROTEIN               GeneOrGeneProduct
66  patients         ORGANISM              OrganismTaxon

As for determining which category to map each label to, you need to go through the Biolink model, along with the documentation of the four corpora.

For example, based on the CRAFT corpus article, this is what the entity types stand for:

  • GGP: Gene or Gene Product
  • SO: Sequence Ontology
  • TAXON: NCBI Taxonomy
  • CHEBI: Chemical Entities of Biological Interest
  • GO: Gene Ontology (biological process, cellular component, and molecular function)
  • CL: Cell Ontology (cell types)
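
Alongside the documentation, the comparison table above can hint at a mapping empirically. The sketch below counts, for each model label, which training types it was paired with. The pairs are copied by hand from that table, so this covers a single abstract only. Counts over all 400 training abstracts would be far more informative, and joining on the mention text alone can pair a prediction with the wrong occurrence.

```python
from collections import Counter, defaultdict

# (model label, training type) pairs copied from the comparison table above
pairs = [
    ("GENE_OR_GENE_PRODUCT", "GeneOrGeneProduct"),           # arylsulfatase A
    ("GGP", "GeneOrGeneProduct"),                            # arylsulfatase A
    ("CANCER", "DiseaseOrPhenotypicFeature"),                # MLD
    ("GENE_OR_GENE_PRODUCT", "DiseaseOrPhenotypicFeature"),  # MLD
    ("DISEASE", "DiseaseOrPhenotypicFeature"),               # MLD
    ("GENE_OR_GENE_PRODUCT", "SequenceVariant"),             # arginine84
    ("PROTEIN", "SequenceVariant"),                          # arginine84
    ("GENE_OR_GENE_PRODUCT", "GeneOrGeneProduct"),           # arylsulfatase
    ("GENE_OR_GENE_PRODUCT", "GeneOrGeneProduct"),           # ARSA
    ("GGP", "GeneOrGeneProduct"),                            # ARSA
    ("PROTEIN", "GeneOrGeneProduct"),                        # ARSA
    ("ORGANISM", "OrganismTaxon"),                           # patients
]

# For every model label, tally the training types it co-occurs with.
votes = defaultdict(Counter)
for model_label, train_type in pairs:
    votes[model_label][train_type] += 1

for model_label, counter in sorted(votes.items()):
    print(model_label, dict(counter))
```

Labels whose counts are split across several training types are the ones where the documentation matters most.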

To change the values in the data, we can use the map function from pandas.

temp_df.label_ = temp_df.label_.map(
    {
        "GENE_OR_GENE_PRODUCT": "GeneOrGeneProduct",
        "GGP": "GeneOrGeneProduct",
        "ORGANISM": "OrganismTaxon",
        "CANCER": "DiseaseOrPhenotypicFeature",
        "DISEASE": "DiseaseOrPhenotypicFeature",
        "CHEBI": "ChemicalEntity",
        "CHEMICAL": "ChemicalEntity",
        "PROTEIN": "SequenceVariant",
        "AMINO_ACID": "SequenceVariant",
        "SO": "SO",
        "TAXON": "TAXON",
        "DNA": "DNA",
    }
)

Note: SO, TAXON, and DNA are mapped to themselves because the pandas map function returns NaN for any value that is missing from the dictionary.
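
The note above can be checked on a toy Series. The sketch below (with made-up labels and a deliberately incomplete dictionary) shows what map does to unmapped values, plus two possible ways to handle them: keep the original label with fillna, or drop the row with dropna. Which one fits depends on whether you want to inspect the leftover labels later.

```python
import pandas as pd

labels = pd.Series(["GGP", "SO", "DISEASE", "DNA"])
mapping = {"GGP": "GeneOrGeneProduct", "DISEASE": "DiseaseOrPhenotypicFeature"}

mapped = labels.map(mapping)                # "SO" and "DNA" become NaN
kept = labels.map(mapping).fillna(labels)   # unmapped labels keep their original value
dropped = mapped.dropna()                   # or discard rows with no target category

print(mapped.tolist())
print(kept.tolist())
print(dropped.tolist())
```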

The end product is similar to what you need to submit to the competition, except it’s still missing id, abstract_id, and the column names need to be renamed.

temp_df.head()
   text             label_                      start_char  end_char
0  arylsulfatase A  GeneOrGeneProduct           110         125
1  ARSA)            GeneOrGeneProduct           127         132
2  MLD              DiseaseOrPhenotypicFeature  188         191
3  arginine84       GeneOrGeneProduct           209         219
4  arylsulfatase    GeneOrGeneProduct           263         276
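
One thing worth checking before submitting: the offsets above (110 to 125 for arylsulfatase A) are one less than the annotated offsets in entities_train (111 to 126). A likely explanation is that the annotations assume one separator character between title and abstract, whereas text was built by joining them directly. The sketch below tests that idea on the first title and the first sentence of its abstract. The single space is an assumption, so check it against the competition guidelines.

```python
title = "Late-onset metachromatic leukodystrophy: molecular pathology in two siblings."
# First sentence of the abstract only, to keep the example short
abstract = (
    "We report on a new allele at the arylsulfatase A (ARSA) locus "
    "causing late-onset metachromatic leukodystrophy (MLD)."
)

no_gap = title + abstract          # how `text` was built in this walkthrough
with_gap = title + " " + abstract  # assumed layout behind the annotated offsets

# Offsets reported by the model on `text` (no separator)
print(repr(no_gap[110:125]))
# Annotated offsets from entities_train, applied to the version with a space
print(repr(with_gap[111:126]))
print(repr(with_gap[189:192]))
```

If that assumption holds, building text as title + " " + abstract before calling the model should make start_char and end_char line up with offset_start and offset_finish directly.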

The challenge now is to do that for the rest of the titles and abstracts.
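
A rough sketch of that loop is shown below. It is written against a pluggable extract function so it runs without any model installed. The toy_extract stand-in, the build_rows name, the made-up title and abstract, and the single-space separator are all assumptions for illustration. The output columns mirror entities_train, which is assumed to match the submission format.

```python
import re


def build_rows(abstracts, extract, label_map):
    """Run extract on every (abstract_id, title, abstract) and collect submission-style rows."""
    rows = []
    for abstract_id, title, abstract in abstracts:
        text = title + " " + abstract  # assumed separator between title and abstract
        for mention, label, start, end in extract(text):
            if label not in label_map:
                continue  # skip labels with no target category
            rows.append({
                "id": len(rows),
                "abstract_id": abstract_id,
                "offset_start": start,
                "offset_finish": end,
                "type": label_map[label],
                "mention": mention,
            })
    return rows


def toy_extract(text):
    """Stand-in for a scispaCy pipeline: tags all-caps tokens of 3+ letters as 'GGP'."""
    return [(m.group(), "GGP", m.start(), m.end())
            for m in re.finditer(r"\b[A-Z]{3,}\b", text)]


# Made-up, shortened title and abstract
abstracts = [(1353340, "Late-onset MLD.", "A new allele at the ARSA locus.")]
rows = build_rows(abstracts, toy_extract, {"GGP": "GeneOrGeneProduct"})
for row in rows:
    print(row)
```

With a real model, extract could be a small function that calls nlp(text) and returns (ent.text, ent.label_, ent.start_char, ent.end_char) for each entity in doc.ents, as in the list comprehensions earlier.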

Conclusion

Note that this article introduced just one approach to this problem, and I don’t claim that it’s the best solution.

Here are some ways to improve upon the given approach:

  1. Use the data to tune the frequency of concepts per sentence/abstract
  2. Correct some misclassifications based on how the categories are annotated in the data and on the corpus and Biolink documentation
  3. Use data to know which ontologies to search specifically

All the best in this competition and I’ll see you in the second phase!

Want to discuss the challenge with other data scientists? Join the discord server!

Follow Bitgrit’s socials 📱 to stay updated on workshops and upcoming competitions!