Digital Humanities Programming Pedagogy in the Age of AI

DHSI 2025 Workshop with Anastasia Salter and John Murray

Exercise: Working Across Interfaces

Now, let’s combine the tools from Monday and get a better grasp on Python and its capabilities. In this exercise, we’ll revisit collecting, processing, and analyzing a data set of texts, this time bringing Python scripts directly into play. We’ll primarily be making use of a few Python libraries for web scraping and text analysis.

You might find it helpful to look at documentation of these libraries, or even web scraping and distant reading tutorials in Python, for ideas of things to try. While you can install Python directly on your machine to complete these tasks, for this process we will continue using Google Colab.
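If you’d like to set up your notebook before prompting, a typical first Colab cell installs the libraries this kind of work commonly relies on. This is only a sketch based on common choices (requests and BeautifulSoup for scraping, NLTK for text processing); the scripts your LLM generates may call for others.

```python
# Typical Colab setup cell: install common scraping and text-analysis libraries.
# These are assumptions about what generated scripts will need, not a fixed list.
!pip install requests beautifulsoup4 nltk

import nltk

# Fetch the English stopword list used in the processing step below.
nltk.download("stopwords")
```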

Collecting Our Data

Our first script will enable us to automatically download a large set of texts in a particular category from Project Gutenberg. The sample prompts I used for this process are:

This sample downloads from the American Science Fiction subject. Modify the query to select a topic of interest to you that returns multiple pages of results. You may need to debug further, or modify my prompts to reflect errors that arise in your own testing. When successful, you should see all the text files in a sub-folder in your Google Colab workspace.
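For reference, here is a minimal sketch of the kind of script such a prompt might produce: it walks a Project Gutenberg search, collects book IDs across result pages, and downloads each plain-text file. The URL patterns reflect Gutenberg’s current site layout and may change, and the `gutenberg_texts` folder name is my own placeholder, so expect to debug.

```python
# Sketch: scrape a Project Gutenberg search, then download each book's plain text.
import os
import re
import time

import requests
from bs4 import BeautifulSoup

QUERY = "science fiction"   # swap in your own subject of interest
PAGES = 2                   # how many result pages to walk
OUT_DIR = "gutenberg_texts"

os.makedirs(OUT_DIR, exist_ok=True)

book_ids = []
for page in range(PAGES):
    # Gutenberg paginates search results 25 at a time via start_index.
    url = ("https://www.gutenberg.org/ebooks/search/"
           f"?query={QUERY.replace(' ', '+')}&start_index={page * 25 + 1}")
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for link in soup.select("li.booklink a[href^='/ebooks/']"):
        match = re.search(r"/ebooks/(\d+)", link["href"])
        if match:
            book_ids.append(match.group(1))

for book_id in dict.fromkeys(book_ids):  # de-duplicate while keeping order
    # Most books expose a plain-text file at this cache URL.
    text_url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    response = requests.get(text_url, timeout=30)
    if response.ok:
        with open(os.path.join(OUT_DIR, f"{book_id}.txt"), "w", encoding="utf-8") as f:
            f.write(response.text)
    time.sleep(1)  # be polite to the server
```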

Processing Our Data

Next, we’re going to run the stopword and punctuation removal that caused challenges in our previous file processing. We’ll be able to access all the text files you’ve saved and run this as a batch. To avoid duplicating your previous steps, ask for a new script each time. I used the sample prompt:

This is a baseline - you can try further refining it with prompts to remove structured sections outside the main text, such as the Project Gutenberg header and license boilerplate. Use “add code” to add the new script - this way, you can run and debug it separately, as your downloaded files will remain in the folder for further work. Check the text files as you go.
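As a point of comparison, here is one minimal version of the batch-cleaning step, assuming the downloads live in `gutenberg_texts` and using NLTK’s English stopword list; the script your LLM generates will likely differ.

```python
# Sketch: strip punctuation and English stopwords from every downloaded file,
# writing cleaned copies to a separate folder so the originals stay intact.
import os
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

IN_DIR = "gutenberg_texts"   # wherever your download script saved the files
OUT_DIR = "cleaned_texts"
os.makedirs(OUT_DIR, exist_ok=True)

stop_words = set(stopwords.words("english"))

for filename in os.listdir(IN_DIR):
    if not filename.endswith(".txt"):
        continue
    with open(os.path.join(IN_DIR, filename), encoding="utf-8") as f:
        text = f.read().lower()
    # Remove punctuation, then filter out stopwords token by token.
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in stop_words]
    with open(os.path.join(OUT_DIR, filename), "w", encoding="utf-8") as f:
        f.write(" ".join(words))
```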

Analyzing The Data

There are lots of options for analyzing across your texts: you can try using sentiment analysis or topic modeling, or you can try to use this data towards a creative project (like the Markov chains we experimented with earlier). Notice how you can now work more reliably across a much larger dataset.

Here are my prompts for creating a concordance for the individual texts as well as the full dataset:
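The generated code might resemble this sketch, which builds a concordance with NLTK’s `Text.concordance` for each file and then for the combined tokens. The folder name and keyword are placeholders of mine; note that concordances read better on the original downloads than on the stopword-stripped copies, since the surrounding context is preserved.

```python
# Sketch: print keyword-in-context concordances per file and across the dataset.
import os
import re

from nltk.text import Text

IN_DIR = "gutenberg_texts"   # the unstripped originals give readable context lines
KEYWORD = "machine"          # pick a word relevant to your subject

all_tokens = []
for filename in sorted(os.listdir(IN_DIR)):
    if not filename.endswith(".txt"):
        continue
    with open(os.path.join(IN_DIR, filename), encoding="utf-8") as f:
        # A simple regex tokenizer keeps this self-contained; a generated
        # script might use nltk.word_tokenize instead.
        tokens = re.findall(r"[A-Za-z']+", f.read())
    all_tokens.extend(tokens)
    print(f"--- {filename} ---")
    Text(tokens).concordance(KEYWORD, width=80, lines=5)

# Concordance across the whole dataset at once.
print("--- full dataset ---")
Text(all_tokens).concordance(KEYWORD, width=80, lines=20)
```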

Visualizing the Data

It’s possible to convert Python outputs and data visualizations to other formats: try asking for code to convert elements of your experiments to web-friendly output, or to switch between formats so your data is ready for analysis in other software. At this stage, you might also find it helpful to bring the outputted data back to the LLM for conversion - but always check the outputs.
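As one example of that kind of conversion, this sketch counts word frequencies across the cleaned files and writes the results out as both CSV and JSON, two formats most spreadsheet software and web visualization libraries can ingest. All file and folder names here are assumptions from the earlier sketches.

```python
# Sketch: tally word frequencies across the cleaned corpus, then export
# the top results as CSV (spreadsheets) and JSON (web visualization).
import csv
import json
import os
from collections import Counter

IN_DIR = "cleaned_texts"

counts = Counter()
for filename in os.listdir(IN_DIR):
    if filename.endswith(".txt"):
        with open(os.path.join(IN_DIR, filename), encoding="utf-8") as f:
            counts.update(f.read().split())

top_words = counts.most_common(100)

with open("word_counts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])
    writer.writerows(top_words)

with open("word_counts.json", "w", encoding="utf-8") as f:
    json.dump(dict(top_words), f, indent=2)
```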