Building a Search Engine for Electronic Thesis

Academic institutions the world over are known to produce hundreds of thousands of ETDs (Electronic Theses and Dissertations) every year. At the end of an academic year, we are left with large volumes of ETD data that are rarely used for further research or ever cited in future work, writings, or publications. As part of the CS5604: Information Storage and Retrieval graduate-level course at Virginia Tech, we collectively created a search engine for a collection of more than 500,000 ETDs from academic institutions in the United States, which constitutes the class-wide project.

This system enables users to ingest, pre-process, and store ETDs in a repository; apply deep learning models to perform topic modeling, text segmentation, chapter summarization, and classification, backed by a DevOps, user experience and integrations team. We are Team 1 or the “ETD Collection Management” team. During the course of the Fall 2022 semester at Virginia Tech, we were responsible for setting up the repository of ETDs, which encompasses broadly the following three components: (1) setting up a database, (2) storing digital objects in a file system, and (3) creating a knowledge graph. Our work enabled other teams to efficiently retrieve the stored ETD data, and perform appropriate pre-processing operations, and during the final few months of the semester, to apply the aforementioned deep learning models to the ETD collection we created. The key deliverable for Team 1 was to create an interactive user interface to perform CRUD operations (create, retrieve, update, and delete) in order to interact with the repository of ETDs, which is essentially an extrapolation of the work already taken up at Virginia Tech’s Digital Library Research Laboratory. Owing to the fact that the other teams had no direct access to the repository set up by us, we designed a host of Application Programming Interfaces (APIs) which are elaborated in depth in the subsequent sections of the report. The end goal for Team 1 was to be able to set up an accessible repository of ETDs so that they can be used for further research work. This is taking into account how each ETD is a well-curated resource and how it may even prove to be an excellent asset for an in-depth analysis on a certain topic, not limited to academic or research purposes.

Full Report here

Written on November 1, 2022