Natural Language Processing
LitCoin NLP Challenge by NCATS & NASA
Can you identify biomedical entities in research titles and abstracts?
In the past two competitions, you predicted video popularity and viral tweets with data science; now it’s time for a whole new challenge!
Bitgrit has released an NLP Challenge with a prize pool of $100,000 💵!
The first phase of this competition ends on 23rd December 2021, so sign up to get access to the data, and follow along with this article to get started!
The goal 🥅
Develop an NLP model that identifies mentions of biomedical 🧬 entities in research abstracts.
There are two parts to this competition.
(The following information is taken from the website)
Part 1: Given only an abstract text, the goal is to find all the nodes or biomedical entities (position in text and BioLink Model Category).
As stated on the website, the type of each biomedical entity comes from the BioLink Model Categories, and can be one and only one of the following:
- DiseaseOrPhenotypicFeature
- ChemicalEntity
- OrganismTaxon
- GeneOrGeneProduct
- SequenceVariant
- CellLine
Part 2: Given the abstract and the nodes annotated from it, the goal is to find all the relationships between them (pair of nodes, BioLink Model Predicate and novelty).
From the description of the competition:
Each phase of the competition is designed to spur innovation in the field of natural language processing, asking competitors to design systems that can accurately recognize scientific concepts from the text of scientific articles, connect those concepts into knowledge assertions, and determine if that claim is a novel finding or background information.
This article will be focusing on the first part, which is to identify biomedical entities.
What does the data look like?
📂 LitCoin DataSet
├── abstracts_train.csv
├── entities_train.csv
├── relations_train.csv
└── abstracts_test.csv
A little info about the data:
- abstracts_train is where you’ll find the title and abstract of biomedical journal articles.
- entities_train has the 6 categories that you need to predict, along with the offset positions of each mention in the combined title + abstract string.
- relations_train is for phase 2 of the competition, so don’t worry about this data for now.
Relationship between the data
The relationship is simple in this dataset: entities is related to abstracts through abstract_id, and to relations through entity_ids.
Below is a visualization of the relationship.
More info about the data in the guidelines section of the competition.
Now that you have an idea about the goal and some information about the data given to you, it’s time to get your hands dirty.
All the code can be found on Google Colab or on Deepnote.
Named Entity Recognition (NER)
There are different NLP tasks that solve different problems — text summarization, part of speech tagging, translation, and more.
For this particular problem, we are solving the problem of Named Entity Recognition, specifically on biomedical entities.
What is it?
According to good ol’ Wikipedia, NER is defined as
a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
For example, take the following sentence:
“Albert Einstein is a physicist who was born in Germany, on March 14, 1879”
If you read it, you’d immediately be able to classify the named entities into the following categories:
- person: Albert Einstein
- job title: physicist
- location: Germany
- date: March 14, 1879
While it was simple for us humans to identify and categorize, computers need NLP to comprehend human language.
This is what NER does: it identifies and segments the main entities in a text, so that computers can extract relevant information from large piles of unstructured text data.
With enough data, you can train NER models that classify and segment these entities with high accuracy, and those models can produce visualizations like the one below.
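To get a feel for what that looks like in code, here is a minimal sketch using spaCy’s general-purpose English model (en_core_web_sm, which needs to be downloaded separately and is not one of the biomedical models we’ll use later):
# Minimal NER sketch with spaCy's small English model.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein is a physicist who was born in Germany, on March 14, 1879")

# print every entity the model found, with its label and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

# highlight the entities inline (works inside a Jupyter notebook)
displacy.render(doc, style="ent", jupyter=True)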
How?
There are different libraries and packages such as spaCy and NLTK that allow you to perform NER, and many pretrained NER models with different approaches are also available online.
Since our problem is more specific for biomedical text, we will be using scispaCy, a Python package containing spaCy models for processing biomedical, scientific or clinical text.
scispaCy comes with the following pre-trained models for you to use.
Notice there are 4 NER models that are trained on different corpora of biomedical articles.
The documentation also lists the entity types each pretrained model predicts.
Knowing the entity types and what they represent will be useful soon. But for now, let us install the necessary libraries and dive into the code.
Installing libraries and models
# install spaCy version >= 3.0.1
!pip install spacy==3.0.1
# install scispacy
!pip install scispacy
# install models
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_craft_md-0.4.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_jnlpba_md-0.4.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bionlp13cg_md-0.4.0.tar.gz
To use scispaCy, we’ll need spacy to be at least version 3.0.1.
Then, we’ll install scispacy and the four models.
Load Libraries
# essentials
import pandas as pd
import numpy as np
# spacy
import scispacy
import spacy
# display results
from spacy import displacy
# scispacy models
import en_ner_craft_md
import en_ner_jnlpba_md
import en_ner_bc5cdr_md
import en_ner_bionlp13cg_md
# utility
from pprint import pprint
# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
We’ll then import the necessary libraries, and the models as well.
Import the data
data_path = 'PATH_TO_DATA'
# Load training datasets
abstracts_train = pd.read_csv(data_path + 'abstracts_train.csv', sep='\t')
entities_train = pd.read_csv(data_path + 'entities_train.csv', sep='\t')
# Load test data
abstracts_test = pd.read_csv(data_path + 'abstracts_test.csv', sep='\t')
The CSV files are separated by tabs, so we use the sep parameter to specify that.
# print dimensions of data
print('Dimension of abstracts train: ', abstracts_train.shape)
print('Dimension of entities train: ', entities_train.shape)
print()
print('Dimension of abstracts test: ', abstracts_test.shape)
Dimension of abstracts train: (400, 3)
Dimension of entities train: (13636, 7)
Dimension of abstracts test: (100, 3)
EDA
Abstract data
abstracts_train.info()
abstracts_train.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 abstract_id 400 non-null int64
1 title 400 non-null object
2 abstract 400 non-null object
dtypes: int64(1), object(2)
memory usage: 9.5+ KB
 | abstract_id | title | abstract |
---|---|---|---|
0 | 1353340 | Late-onset metachromatic leukodystrophy: molec… | We report on a new allele at the arylsulfatase… |
1 | 1671881 | Two distinct mutations at a single BamHI site … | Classical phenylketonuria is an autosomal rece… |
2 | 1848636 | Debrisoquine phenotype and the pharmacokinetic… | The metabolism of the cardioselective beta-blo… |
3 | 2422478 | Midline B3 serotonin nerves in rat medulla are… | Previous experiments in this laboratory have s… |
4 | 2491010 | Molecular and phenotypic analysis of patients … | Eighty unrelated individuals with Duchenne mus… |
From the output, we have exactly 400 titles and abstracts to use for predicting entity types.
Entities Data
entities_train.info()
entities_train.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13636 entries, 0 to 13635
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 13636 non-null int64
1 abstract_id 13636 non-null int64
2 offset_start 13636 non-null int64
3 offset_finish 13636 non-null int64
4 type 13636 non-null object
5 mention 13636 non-null object
6 entity_ids 13636 non-null object
dtypes: int64(4), object(3)
memory usage: 745.8+ KB
 | id | abstract_id | offset_start | offset_finish | type | mention | entity_ids |
---|---|---|---|---|---|---|---|
0 | 0 | 1353340 | 11 | 39 | DiseaseOrPhenotypicFeature | metachromatic leukodystrophy | D007966 |
1 | 1 | 1353340 | 111 | 126 | GeneOrGeneProduct | arylsulfatase A | 410 |
2 | 2 | 1353340 | 128 | 132 | GeneOrGeneProduct | ARSA | 410 |
3 | 3 | 1353340 | 159 | 187 | DiseaseOrPhenotypicFeature | metachromatic leukodystrophy | D007966 |
4 | 4 | 1353340 | 189 | 192 | DiseaseOrPhenotypicFeature | MLD | D007966 |
A peek at our entities data shows us we have over 13k entities that were extracted from the title and abstract string.
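As a quick sanity check (assuming the offsets index into the title and abstract joined together, as described earlier), we can slice the combined string with offset_start and offset_finish and compare the result with the mention column:
# Sanity check: slicing the combined string with the offsets should give back the mention.
first_abstract = abstracts_train.iloc[0]
combined = first_abstract.title + first_abstract.abstract

first_entity = entities_train.iloc[0]
print(combined[first_entity.offset_start:first_entity.offset_finish])
print(first_entity.mention)
If the two printed strings don’t match, it’s worth checking whether the title and abstract should be joined with a separator (see the competition guidelines).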
Checking for missing values
Using a helper function I coded, there are no missing values in the dataset.
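A minimal version of such a helper could look like this:
# Helper: print any columns that contain missing values.
def missing_cols(df):
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    if missing.empty:
        print('no missing values')
    else:
        for col, count in missing.items():
            print(f'{col} => {count} missing values')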
missing_cols(abstracts_train)
missing_cols(entities_train)
no missing values
no missing values
How many Entity types are there?
entities_train['type'].value_counts().plot(kind="barh").invert_yaxis();
With a simple bar plot, it seems that GeneOrGeneProduct is the most common type.
Now that we understand our data a little bit better, let’s start looking at what scispaCy can do.
scispaCy in action
Let’s start with taking the first title and abstract string.
text = abstracts_train.iloc[0].title + abstracts_train.iloc[0].abstract
len(text)
717
The first string has 717 characters.
Printing it below, we can see the full string.
text
'Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.We report on a new allele at the arylsulfatase A (ARSA) locus causing late-onset metachromatic leukodystrophy (MLD). In that allele arginine84, a residue that is highly conserved in the arylsulfatase gene family, is replaced by glutamine. In contrast to alleles that cause early-onset MLD, the arginine84 to glutamine substitution is associated with some residual ARSA activity. A comparison of genotypes, ARSA activities, and clinical data on 4 individuals carrying the allele of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD.. '
Loading the model
Now let’s load up the first model, bionlp13cg_md, which is trained on the BioNLP13CG corpus, and pass our text into it.
nlp = en_ner_bionlp13cg_md.load()
doc_bionlp13cg = nlp(text)
We now have a document object that contains information about the entities.
Getting entities
Calling .ents on it, we can see the entities it extracted.
doc_bionlp13cg.ents
(arylsulfatase A,
ARSA),
MLD,
arginine84,
arylsulfatase,
glutamine,
MLD,
arginine84,
glutamine,
ARSA,
ARSA,
individuals,
patients,
MLD,
ARSA,
MLD)
Visualizing entities with labels
We can even have spaCy visualize the entities and the labels right on our title+abstract string. This is done with the displacy.render function.
displacy.render(doc_bionlp13cg, jupyter=True, style='ent')
Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.We report on a new allele at the arylsulfatase A GENE_OR_GENE_PRODUCT ( ARSA) GENE_OR_GENE_PRODUCT locus causing late-onset metachromatic leukodystrophy ( MLD CANCER ). In that allele arginine84 GENE_OR_GENE_PRODUCT , a residue that is highly conserved in the arylsulfatase GENE_OR_GENE_PRODUCT gene family, is replaced by glutamine AMINO_ACID . In contrast to alleles that cause early-onset MLD GENE_OR_GENE_PRODUCT , the arginine84 GENE_OR_GENE_PRODUCT to glutamine AMINO_ACID substitution is associated with some residual ARSA GENE_OR_GENE_PRODUCT activity. A comparison of genotypes, ARSA GENE_OR_GENE_PRODUCT activities, and clinical data on 4 individuals ORGANISM carrying the allele of 81 patients ORGANISM with MLD GENE_OR_GENE_PRODUCT examined, further validates the concept that different degrees of residual ARSA GENE_OR_GENE_PRODUCT activity are the basis of phenotypical variation in MLD GENE_OR_GENE_PRODUCT ..
Each entity in the document also has the attributes text, label_, start_char, and end_char, which are exactly the pieces of information we need for this challenge.
pprint({(X.text, X.label_, X.start_char, X.end_char) for X in doc_bionlp13cg.ents})
{('ARSA', 'GENE_OR_GENE_PRODUCT', 441, 445),
('ARSA', 'GENE_OR_GENE_PRODUCT', 483, 487),
('ARSA', 'GENE_OR_GENE_PRODUCT', 654, 658),
('ARSA)', 'GENE_OR_GENE_PRODUCT', 127, 132),
('MLD', 'CANCER', 188, 191),
('MLD', 'GENE_OR_GENE_PRODUCT', 362, 365),
('MLD', 'GENE_OR_GENE_PRODUCT', 575, 578),
('MLD', 'GENE_OR_GENE_PRODUCT', 711, 714),
('arginine84', 'GENE_OR_GENE_PRODUCT', 209, 219),
('arginine84', 'GENE_OR_GENE_PRODUCT', 371, 381),
('arylsulfatase', 'GENE_OR_GENE_PRODUCT', 263, 276),
('arylsulfatase A', 'GENE_OR_GENE_PRODUCT', 110, 125),
('glutamine', 'AMINO_ACID', 305, 314),
('glutamine', 'AMINO_ACID', 385, 394),
('individuals', 'ORGANISM', 523, 534),
('patients', 'ORGANISM', 561, 569)}
And voila! You’ve extracted entities from a biomedical paper title and abstract with a pretrained NER model from scispaCy!
Now let’s see what entities the other 3 models extract.
nlp_craft = en_ner_craft_md.load()
nlp_jnlpba = en_ner_jnlpba_md.load()
nlp_bc5cdr = en_ner_bc5cdr_md.load()
doc_craft = nlp_craft(text)
doc_jnlpba = nlp_jnlpba(text)
doc_bc5cdr= nlp_bc5cdr(text)
CRAFT Corpus model
displacy.render(doc_craft, jupyter=True, style="ent")
Late-onset metachromatic leukodystrophy: molecular CHEBI pathology in two siblings.We report on a new allele SO at the arylsulfatase A GGP (ARSA) locus SO causing late-onset metachromatic leukodystrophy (MLD). In that allele SO arginine84, a residue that is highly conserved SO in the arylsulfatase gene GGP family, is replaced SO by glutamine CHEBI . In contrast to alleles SO that cause early-onset MLD, the arginine84 to glutamine substitution CHEBI is associated with some residual ARSA activity. A comparison of genotypes SO , ARSA GGP activities, and clinical data on 4 individuals TAXON carrying the allele SO of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD..
JNLPBA corpus model
displacy.render(doc_jnlpba, jupyter=True, style='ent')
Late-onset metachromatic leukodystrophy: molecular pathology in two siblings.We report on a new allele DNA at the arylsulfatase A (ARSA) locus DNA causing late-onset metachromatic leukodystrophy (MLD). In that allele arginine84 DNA , a residue that is highly conserved in the arylsulfatase gene family DNA , is replaced by glutamine. In contrast to alleles that cause early-onset MLD, the arginine84 PROTEIN to glutamine substitution is associated with some residual ARSA PROTEIN activity. A comparison of genotypes, ARSA PROTEIN activities, and clinical data on 4 individuals carrying the allele DNA of 81 patients with MLD examined, further validates the concept that different degrees of residual ARSA PROTEIN activity are the basis of phenotypical variation in MLD..
BC5CDR corpus model
displacy.render(doc_bc5cdr, jupyter=True, style='ent')
Late-onset metachromatic leukodystrophy DISEASE : molecular pathology in two siblings.We report on a new allele at the arylsulfatase A (ARSA) locus causing late-onset metachromatic leukodystrophy DISEASE ( MLD DISEASE ). In that allele arginine84, a residue that is highly conserved in the arylsulfatase gene family, is replaced by glutamine CHEMICAL . In contrast to alleles that cause early-onset MLD DISEASE , the arginine84 to glutamine CHEMICAL substitution is associated with some residual ARSA activity. A comparison of genotypes, ARSA activities, and clinical data on 4 individuals carrying the allele of 81 patients with MLD DISEASE examined, further validates the concept that different degrees of residual ARSA activity are the basis of phenotypical variation in MLD DISEASE ..
Now let’s take all these entity categorizations and put them all together.
data_doc_bionlp13cg = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_bionlp13cg.ents]
data_doc_craft = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_craft.ents]
data_doc_bc5cdr = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_bc5cdr.ents]
data_doc_jnlpba = [(X.text, X.label_, X.start_char, X.end_char) for X in doc_jnlpba.ents]
data = data_doc_bionlp13cg + data_doc_craft + data_doc_bc5cdr + data_doc_jnlpba
Now let’s create a data frame for our combined extracted entities, with the text, label, starting character, and the ending character.
attrs = ["text", "label_", "start_char", "end_char"]
temp_df = pd.DataFrame(data, columns=attrs)
Then, we compare it with the entity type given in our training data.
temp_train_df = entities_train.query('abstract_id == 1353340')
temp_train_df.head()
 | id | abstract_id | offset_start | offset_finish | type | mention | entity_ids |
---|---|---|---|---|---|---|---|
0 | 0 | 1353340 | 11 | 39 | DiseaseOrPhenotypicFeature | metachromatic leukodystrophy | D007966 |
1 | 1 | 1353340 | 111 | 126 | GeneOrGeneProduct | arylsulfatase A | 410 |
2 | 2 | 1353340 | 128 | 132 | GeneOrGeneProduct | ARSA | 410 |
3 | 3 | 1353340 | 159 | 187 | DiseaseOrPhenotypicFeature | metachromatic leukodystrophy | D007966 |
4 | 4 | 1353340 | 189 | 192 | DiseaseOrPhenotypicFeature | MLD | D007966 |
We can compare them side by side with an inner join between the two data frames.
merged_df = temp_df.merge(temp_train_df, how = 'inner', left_on ='text', right_on = 'mention')
merged_df[['text', 'label_', 'type']].drop_duplicates()
 | text | label_ | type |
---|---|---|---|
0 | arylsulfatase A | GENE_OR_GENE_PRODUCT | GeneOrGeneProduct |
1 | arylsulfatase A | GGP | GeneOrGeneProduct |
2 | MLD | CANCER | DiseaseOrPhenotypicFeature |
6 | MLD | GENE_OR_GENE_PRODUCT | DiseaseOrPhenotypicFeature |
18 | MLD | DISEASE | DiseaseOrPhenotypicFeature |
34 | arginine84 | GENE_OR_GENE_PRODUCT | SequenceVariant |
36 | arginine84 | PROTEIN | SequenceVariant |
37 | arylsulfatase | GENE_OR_GENE_PRODUCT | GeneOrGeneProduct |
38 | ARSA | GENE_OR_GENE_PRODUCT | GeneOrGeneProduct |
50 | ARSA | GGP | GeneOrGeneProduct |
54 | ARSA | PROTEIN | GeneOrGeneProduct |
66 | patients | ORGANISM | OrganismTaxon |
As for determining which BioLink category to map each label to, you need to go through the BioLink Model, along with the documentation of the four corpora.
For example, based on the CRAFT corpus article, these are what the entity types stand for:
- GGP — Gene or Gene Product
- SO — Sequence Ontology
- TAXON — NCBI Taxonomy
- CHEBI — Chemical Entities of Biological Interest
- GO — Gene Ontology (biological process, cellular component, and molecular function)
- CL — Cell Line
To change the values in the data, we can use the map function from pandas.
temp_df.label_ = temp_df.label_.map(
{
"GENE_OR_GENE_PRODUCT": "GeneOrGeneProduct",
"GGP": "GeneOrGeneProduct",
"ORGANISM": "OrganismTaxon",
"CANCER": "DiseaseOrPhenotypicFeature",
"DISEASE": "DiseaseOrPhenotypicFeature",
"CHEBI": "ChemicalEntity",
"CHEMICAL": "ChemicalEntity",
"PROTEIN": "SequenceVariant",
"AMINO_ACID": "SequenceVariant",
"SO": "SO",
"TAXON": "TAXON",
"DNA": "DNA",
}
)
Note: SO, TAXON, and DNA are mapped to themselves because any value missing from the dictionary would become NaN after map.
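Alternatively, pandas’ replace leaves values that aren’t in the dictionary unchanged, so the identity entries for SO, TAXON, and DNA can be dropped:
# replace() keeps values that are not in the dictionary as they are,
# so SO, TAXON and DNA don't need explicit entries.
temp_df.label_ = temp_df.label_.replace(
    {
        "GENE_OR_GENE_PRODUCT": "GeneOrGeneProduct",
        "GGP": "GeneOrGeneProduct",
        "ORGANISM": "OrganismTaxon",
        "CANCER": "DiseaseOrPhenotypicFeature",
        "DISEASE": "DiseaseOrPhenotypicFeature",
        "CHEBI": "ChemicalEntity",
        "CHEMICAL": "ChemicalEntity",
        "PROTEIN": "SequenceVariant",
        "AMINO_ACID": "SequenceVariant",
    }
)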
The end product is similar to what you need to submit to the competition, except it’s still missing the id and abstract_id columns, and the remaining columns need to be renamed. A rough sketch of those steps follows the table below.
temp_df.head()
 | text | label_ | start_char | end_char |
---|---|---|---|---|
0 | arylsulfatase A | GeneOrGeneProduct | 110 | 125 |
1 | ARSA) | GeneOrGeneProduct | 127 | 132 |
2 | MLD | DiseaseOrPhenotypicFeature | 188 | 191 |
3 | arginine84 | GeneOrGeneProduct | 209 | 219 |
4 | arylsulfatase | GeneOrGeneProduct | 263 | 276 |
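As a rough sketch (the exact submission format is in the competition guidelines, so double-check the column names there), here’s one way to add those columns and rename the rest to match entities_train:
# Rough sketch: reshape temp_df toward the layout of entities_train.
submission_df = temp_df.rename(
    columns={
        "text": "mention",
        "label_": "type",
        "start_char": "offset_start",
        "end_char": "offset_finish",
    }
)
submission_df["abstract_id"] = abstracts_train.iloc[0].abstract_id
submission_df["id"] = range(len(submission_df))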
The challenge now is to do that for the rest of the titles and abstracts.
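One straightforward way (though certainly not the only one) is to loop over every row of abstracts_train, run a model over the combined title and abstract, and collect the predictions into a single data frame, reusing the label mapping from above:
# Sketch: run the bionlp13cg model over every title + abstract and collect the entities.
rows = []
for _, row in abstracts_train.iterrows():
    doc = nlp(row.title + row.abstract)
    for ent in doc.ents:
        rows.append(
            {
                "abstract_id": row.abstract_id,
                "offset_start": ent.start_char,
                "offset_finish": ent.end_char,
                "type": ent.label_,  # still needs mapping to the BioLink categories
                "mention": ent.text,
            }
        )

all_entities = pd.DataFrame(rows)
all_entities["id"] = range(len(all_entities))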
Conclusion
Note that this article introduced just one approach to this problem, and I don’t claim it’s the best solution.
Here are some ways to improve upon this approach:
- Use the data to tune the frequency of concepts per sentence/abstract
- Correct some misclassifications based on how the categories are annotated in the data and from reading the corpus and BioLink documentation
- Use data to know which ontologies to search specifically
All the best in this competition and I’ll see you in the second phase!
Want to discuss the challenge with other data scientists? Join the Discord server!
Follow Bitgrit’s socials 📱 to stay updated on workshops and upcoming competitions!