Uncovering Insights from the COVID-19 Open Research Dataset using Data Analytics

The COVID-19 pandemic brought the world to a standstill, and researchers worked tirelessly to find a cure for this infectious disease. As the coronavirus literature grew rapidly, it became increasingly difficult for the medical research community to keep up. To address this issue, I developed an interactive COVID-19 literature clustering solution that uses t-SNE, PCA, and the K-means algorithm to group similar research articles.


The COVID-19 Open Research Dataset (CORD-19) is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.


The interactive tool grouped similar research articles into clusters, making it easier for researchers to discover relevant literature and insights. It also surfaced research articles that were highly cited and popular, an indication of their importance in the field.
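To make the approach more concrete, here is a minimal sketch of how a PCA, K-means, and t-SNE pipeline can be wired together with scikit-learn. It is not the notebook's actual code. The TF-IDF vectorizer, the function name `cluster_articles`, the parameter values, and the toy `sample_texts` are all assumptions chosen for illustration; the real project runs on CORD-19 article text and likely uses different settings.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical stand-ins for article abstracts (not taken from CORD-19).
sample_texts = [
    "coronavirus spike protein binds the ACE2 receptor",
    "spike protein structure of the novel coronavirus",
    "receptor binding domain of the viral spike protein",
    "hospital ventilator capacity during the outbreak",
    "intensive care capacity and ventilator shortages",
    "hospital bed capacity planning for the outbreak",
]


def cluster_articles(texts, n_clusters=2, n_components=5, seed=0):
    """Return (cluster label per text, 2-D t-SNE coordinates per text)."""
    # Assumption: represent each text as a TF-IDF vector.
    features = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()

    # PCA compresses the sparse, high-dimensional vectors before clustering.
    # The component count cannot exceed the number of samples or features.
    n_components = min(n_components, features.shape[0], features.shape[1])
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(features)

    # K-means assigns each article to one of n_clusters groups.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)

    # t-SNE projects the articles to 2-D for plotting only.
    # Perplexity must stay below the number of samples.
    perplexity = min(30, len(texts) - 1)
    coords = TSNE(
        n_components=2, perplexity=perplexity, init="random", random_state=seed
    ).fit_transform(reduced)
    return labels, coords


if __name__ == "__main__":
    labels, coords = cluster_articles(sample_texts)
    for text, label, (x, y) in zip(sample_texts, labels, coords):
        print(f"cluster {label}  ({x:.1f}, {y:.1f})  {text}")
```

In this sketch, K-means runs on the PCA output and t-SNE is used only to produce plot coordinates, which can then be colored by cluster label in an interactive scatter plot. On a corpus the size of CORD-19, the number of clusters and PCA components would need tuning.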

Link to Jupyter Notebook

Written on October 15, 2022