Back to blog

Intro to bitgrit AI-Generated Text Identification Challenge

Introduction

Data science competition platform bitgrit will launch its latest competition, the “AI-Generated Text Identification Challenge,” on November 1st.

With its rise over the past year, generative AI has caught the attention of both good and bad actors. This competition aims to tackle some of the issues generative AI could create in the future (if they aren't already here), as laid out on the competition page:

– 

1. Misinformation and Fake News: AI-generated texts can be used to spread misinformation or generate fake news articles that appear authentic. This poses a significant risk to society as false information can quickly spread, leading to confusion, manipulation, and the erosion of trust in media and communication channels.

2. Bias and Discrimination: AI models are trained on vast amounts of data, which can reflect biases present in society. If not properly identified and addressed, AI-generated texts can perpetuate and amplify existing biases, including racial, gender, or social biases. This can lead to biased recommendations, unfair decision-making processes, or discriminatory content.

3. Lack of Accountability: As AI systems generate texts autonomously, it becomes challenging to hold someone accountable for the content produced. The absence of clear guidelines or oversight can allow malicious actors to exploit AI-generated texts for unethical purposes or illegal activities.

4. Restricted Privacy: AI models are built using large amounts of information, often the personal data of users. As the technology’s conversational ability improves, the inability to distinguish between human and machine puts the human user at risk of having their personal information leaked and used for unethical or malicious purposes i.e. sales marketing data, deep fakes, etc.

bitgrit wants to tackle these complex issues by identifying AI-generated texts using machine learning. Your task is to develop an algorithm that classifies AI-generated texts versus human-generated texts.

– 

In this article, we will guide you through the datasets provided on the platform as well as a sample solution built with a plain and simple algorithm in Python.

Dataset Overview

The “datasets” folder you can download from our platform (be sure to register an account on bitgrit and hit the ‘Participate’ button on the competition page!) contains the following files:

  • training_set.csv
  • test_set.csv
  • solution_format.csv

You can find the column definitions in the “Data Breakdown” section of the competition page:

  • feature_0 ~ feature_767: word embeddings of the sentence.
  • word_count: the number of words in the sentence.
  • punc_num: the number of punctuations present in the sentence.

Word embeddings are created by machine learning models that analyze large amounts of text data and learn the relationships between words based on their context. The resulting embeddings can then be used for a variety of natural language processing tasks, such as sentiment analysis, machine translation, named entity recognition, and more.

By using word embeddings, machine learning models can better understand the meaning and similarity between words, even if they haven’t seen those specific word combinations before. This helps improve the performance of language-related tasks and enables machines to process and interpret human language more effectively.
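To build some intuition for what these vectors capture, here is a minimal, self-contained sketch (not part of the sample solution) that compares two tweets' embeddings with cosine similarity, assuming the training set described above:

import numpy as np
import pandas as pd

# Load the training set and pick out the 768 embedding columns
df = pd.read_csv('training_set.csv')
embedding_cols = [col for col in df.columns if col.startswith('feature_')]

# Take the embedding vectors of the first two tweets
vec_a = df.loc[0, embedding_cols].to_numpy(dtype=float)
vec_b = df.loc[1, embedding_cols].to_numpy(dtype=float)

# Cosine similarity: closer to 1 means the two tweets are more semantically similar
cos_sim = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(cos_sim)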

Since the word embeddings are complex to work with (each tweet is represented by 768 dimensions!), we're going to use them as is.

The word and punctuation counts are much more straightforward, so we can take a quick look at how useful they are for predicting whether a tweet was generated by an AI or a human.

Exploratory Data Analysis (EDA)

If you’ve ever thought about being a data scientist or taking on a role that touches data, you might have heard the term “EDA.”

Exploratory Data Analysis (EDA) is a crucial step in the model development process. It involves examining and summarizing the main characteristics, patterns, and relationships present in a dataset before applying any formal statistical techniques.

The goal of EDA is to gain a deeper understanding of the data and uncover insights that can guide further analysis or decision-making. Here are a few key aspects of EDA:

1. Data Summary: checking the number of rows and columns as well as missing values, identifying the types of variables (e.g., numeric, categorical), and understanding the overall structure of the data.

2. Descriptive Statistics: describing the distribution and variability of each variable in the dataset to find insights.

3. Data Visualization: visually exploring the relationships and patterns within the data. Visualizations can help identify outliers, trends, clusters, and other interesting features that may not be evident in raw data.

So let’s use the above methodologies to see what the data looks like.

[Data Summary]

# Import packages
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('training_set.csv')

# Check the data shape (number of rows and columns)
df.shape

# Check the distribution of values in "ind"
df['ind'].value_counts(normalize=True)

# Check the first 5 rows of the data
df.head()
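The data summary aspect mentioned above also covers missing values; a quick check along these lines (just a sketch using the same df) confirms whether any cells are empty:

# Count missing values per column, then across the whole dataframe
df.isnull().sum()
df.isnull().sum().sum()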

[Descriptive Statistics]

# Remove word embeddings columns and see the statistics
df_no_text = df[['ID', 'word_count', 'punc_num', 'ind']]
df_no_text.info()

# Check if there's any difference between AI generated tweets and human generated tweets.

# word count

ai_text_word_count = df_no_text.loc[df['ind'] == 1, ['word_count']].describe()
human_text_word_count = df_no_text.loc[df['ind'] == 0, ['word_count']].describe()

stats_word_count = pd.concat([ai_text_word_count, human_text_word_count], axis=1)
stats_word_count.columns = ['AI', 'Human']
stats_word_count

# punctuation count

ai_text_punc_count = df_no_text.loc[df['ind'] == 1, ['punc_num']].describe()
human_text_punc_count = df_no_text.loc[df['ind'] == 0, ['punc_num']].describe()
stats_punc_count = pd.concat([ai_text_punc_count, human_text_punc_count], axis=1)
stats_punc_count.columns = ['AI', 'Human']
stats_punc_count

[Data Visualization]

# Percentage plot side by side
num_bins = 100
range_val = (0, 100)

# Use the raw per-tweet counts (not the describe() summaries) for the histograms
ai_word_count = df_no_text.loc[df['ind'] == 1, 'word_count']
human_word_count = df_no_text.loc[df['ind'] == 0, 'word_count']
ai_punc_count = df_no_text.loc[df['ind'] == 1, 'punc_num']
human_punc_count = df_no_text.loc[df['ind'] == 0, 'punc_num']

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axs[0].hist(ai_word_count, bins=num_bins, range=range_val, alpha=0.8, color='#FA62FF', edgecolor='None', density=True, label='AI Generated')
axs[0].hist(human_word_count, bins=num_bins, range=range_val, alpha=0.4, color='grey', edgecolor='None', density=True, label='Human Generated')
axs[1].hist(ai_punc_count, bins=num_bins, range=range_val, alpha=0.8, color='#FA62FF', edgecolor='None', density=True, label='AI Generated')
axs[1].hist(human_punc_count, bins=num_bins, range=range_val, alpha=0.4, color='grey', edgecolor='None', density=True, label='Human Generated')
axs[0].set_xlabel('Count')
axs[0].set_ylabel('Frequency')
axs[0].set_title('Word Count')
axs[0].legend(loc='upper right')
axs[1].set_xlabel('Count')
axs[1].set_title('Punctuation Count')
axs[1].legend(loc='upper right')
axs[0].grid(False)
axs[1].grid(False)
plt.tight_layout()
plt.show()

It's evident that the two distributions (AI vs. human) are different. We could process the data (e.g., scale or standardize it) so the distinction between the two classes becomes clearer and yields better model performance. However, for the main purpose of this article we're going to use these features as is, just like the word embeddings.
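For reference, here is a minimal sketch of what that preprocessing could look like, standardizing word_count and punc_num with scikit-learn's StandardScaler (this is not part of the sample solution, and the new column names are just for illustration):

from sklearn.preprocessing import StandardScaler

# Rescale the two count features to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['word_count', 'punc_num']])
df['word_count_scaled'] = scaled[:, 0]
df['punc_num_scaled'] = scaled[:, 1]
df[['word_count_scaled', 'punc_num_scaled']].describe()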

Model Development

For model development, we're going to use logistic regression to see how well this plain, simple, yet powerful model can perform on this data.

For the word embeddings, we want to reduce the number of dimensions so our simple algorithm can handle them. We use Principal Component Analysis (PCA) to reduce the number of variables from 768 to 200.

# Get the column names starting with 'feature_'
feature_columns = [col for col in df.columns if col.startswith('feature_')]
# Select the embedding columns into their own dataframe
df_features = df[feature_columns]

# PCA on features.
from sklearn.decomposition import PCA
# Create an instance of the PCA class
pca = PCA(n_components=200)
# Perform PCA on the selected variables
pca_features = pca.fit_transform(df_features)

pca_features_df = pd.DataFrame(pca_features)
pca_features_df.columns = ['feature_pca_' + str(i) for i in range(200)]

pca_features_df.head()

# Add PCAs back to the dataframe
df_pca_joined = pd.concat([df.drop(columns = feature_columns), pca_features_df], axis=1)

df_pca_joined.head()
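If you're wondering how much of the embeddings' information survives the reduction to 200 components, the fitted PCA object can tell you (a quick sanity check, not required for the solution):

# Total fraction of variance in the original 768 dimensions captured by the 200 components
pca.explained_variance_ratio_.sum()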

Now let's run the logistic regression model on our dataset.

# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import random
random.seed(1101)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_pca_joined.drop('ind', axis=1), df['ind'], test_size=0.2, random_state=42)
# Create an instance of the Logistic Regression model
logistic_regression = LogisticRegression(max_iter=1500)
# Fit the model to the training data
logistic_regression.fit(X_train, y_train)
# Make predictions on the test data
y_pred = logistic_regression.predict(X_test)
# Evaluate the performance using the F1 score
f1score = f1_score(y_test, y_pred)
# Print the F1 score
print("F1 Score:", f1score)

Even with this simple model, you get an F1 score of 0.61, which is better than random prediction (the F1 score was 0.15 in our case; a rough sketch of such a baseline is shown below). At this point, you can choose to run the model on the test set and see what score you'd get on the platform, or fine-tune the model to get a higher F1 score.
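Here is a minimal sketch of a random baseline on the same held-out split (the exact score will vary with the seed and class balance, so don't expect to reproduce 0.15 exactly):

import numpy as np

# Predict 0 or 1 uniformly at random for each row of the held-out set
rng = np.random.default_rng(1101)
random_pred = rng.integers(0, 2, size=len(y_test))
print("Random baseline F1 Score:", f1_score(y_test, random_pred))

With that point of comparison in mind, below is an example of generating the submission file for bitgrit.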

# Load the test data
test = pd.read_csv('test_set.csv')
test.head()

Now apply the PCA model you just fitted on the training set to the test data.

# Select the embedding columns from the test data
feature_columns = [col for col in test.columns if col.startswith('feature_')]
test_features = test[feature_columns]

# Apply (don't refit) the PCA model fitted on the training set
pca_features_test = pca.transform(test_features)

# Convert the numpy array into a dataframe and combine it with the original dataframe
pca_features_test_df = pd.DataFrame(pca_features_test)
pca_features_test_df.columns = ['feature_pca_' + str(i) for i in range(200)]

# Add PCAs back to the dataframe
test_pca_joined = pd.concat([test.drop(columns = feature_columns), pca_features_test_df], axis=1)
test_pca_joined.head()

We’re ready to run the model on the test data.

# Scoring
prediction = logistic_regression.predict(test_pca_joined)

# Check the predictions
from collections import Counter
Counter(prediction)

Create the submission file.

# First let's add "ind" column to the test data with the predictions.
test['ind'] = prediction

# Subset the "ID" and "ind" columns so the file structure is the same as 'solution_format.csv'
solution = test[['ID', 'ind']]

# Check that the format matches 'solution_format.csv'
solution_format = pd.read_csv('solution_format.csv')
solution_format.head()
solution.head()

# Save the solution in csv format
solution.to_csv('solution.csv', index = False)

Conclusion

As you can see, the overall process from data analysis to model development is very straightforward, especially since the dataset is extremely clean (i.e., no empty values).

Now it is your turn. Feel free to explore other ways to analyze the dataset, preprocess the data, and try out stronger algorithms to improve the score. Please note that you can only submit a file 3 times a day on the platform, so we'd advise submitting only when you think your predictions are good enough.
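As one possible direction (just a sketch, not a tuned solution; the hyperparameters here are arbitrary starting values), you could swap logistic regression for the XGBoost classifier we imported at the beginning and see whether gradient-boosted trees handle these features better:

import xgboost as xgb
from sklearn.metrics import f1_score

# Train a gradient-boosted tree model on the same train/test split as before
xgb_model = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)

# Evaluate with the same F1 metric used by the competition
xgb_pred = xgb_model.predict(X_test)
print("XGBoost F1 Score:", f1_score(y_test, xgb_pred))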

Register here and see our competitions. We look forward to your participation!