A project developed as a platform to keep track of books that are being read or have been read previously. This allowed users to rate books they had read and allow other users to view those ratings. A future goal of the project was to use NLP and predictive modeling with the reviews and book attributes (genre, topics, subject) to predict books that a user would enjoy based on their past read books. As this was a greenfield project, the first task was to develop a dataset which could be used as a sample for a recommendation engine and use in NLP. For this reason, my group was tasked with developing a database and dataset to hold the book data. I created and maintained a PostgreSQL database on Amazon Web Services for this purpose. I also worked on sourcing and cleaning a 23 million row sample of book data. This required the consideration of time complexity and the processing times required to solve the problem as data as this scale would have taken an estimated 30 days to process. I refined and developed new processing methods that allowed the processing to take 5-6 hours in total.