This is the end. After a whirlwind semester of new tools coming at us at breakneck speed, we have reached the end of our tour through the basics of the digital tool-belt. We learned how the computer, beyond being a machine for number manipulation, can do a lot of work in the work of the humanities. With this course’s emphasis on the command line, what we learn can be automated and scripted to run any number of times. So, to sum up, I want to go over each week’s tools, describe them, show how powerful they can be, and pose a hypothetical problem each could solve.
Virtual Machines

Virtual machines are computers within computers. What we worked with over the semester was a tool called HistoryCrawler, a virtual Linux machine filled with historians’ tools to play with.
Virtual machines are powerful for their consistency. You control every aspect of the machine doing your work, and that machine can be saved in case of catastrophe, shared with others, or run in several copies at once. This makes for powerful automation projects.
Say, for instance, I was doing some sort of complex analysis of a huge dataset. It takes several weeks of hard work, but eventually I complete it. I go on to use the data in a publication, but while presenting it at a conference, I am called out on my methodology. With a virtual machine, months later I could return to the project in the exact state it was in when I finished work, and see exactly how I did everything, no matter what computer I currently own. Ass perpetually covered.
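To make that concrete, here is a minimal sketch of saving and restoring that exact state at the command line, assuming the virtual machine runs in VirtualBox and is named HistoryCrawler (the machine and snapshot names are stand-ins for your own setup):

    # save the machine exactly as it stands when the analysis is done
    VBoxManage snapshot "HistoryCrawler" take "post-analysis" --description "state at publication"

    # months later, roll back to that state before facing the conference
    VBoxManage snapshot "HistoryCrawler" restore "post-analysis"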
Basic Text Analysis
The week we worked on basic text analysis was really just a small dip into how we can use the command line to take large bodies of text and sift out important statistics such as word frequencies. What this opens up at the command-line level is the ability to batch process large amounts of text. My current plan is to use this to find connections in ideas between entire books through word frequencies.
Say I have an entire library of some historical figure’s letters as plain text files. I could use a few basic tools, compiled into a simple script, to distill out the most common words used, in order. With a single command I could process an entire catalogue of letters this way. With that information, we can then look through these word frequencies and quickly find the words that show up most often. Those that are more than common words can later be used to link, say, the people being written to with the topics of writing, or even to trace the development of new terms or concepts.
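As a sketch of what that script might look like, here is the classic command-line pipeline for word frequencies; letters.txt stands in for any hypothetical plain-text file:

    # lowercase everything, split the text into one word per line,
    # then count and rank the words by frequency, showing the top twenty
    tr 'A-Z' 'a-z' < letters.txt | tr -cs 'a-z' '\n' | sort | uniq -c | sort -rn | head -20

Wrapped in a loop over something like letters/*.txt, the same pipeline processes the whole catalogue in one command.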
Pattern Matching and Permuted Term Indexing
This is the lesson that expanded beyond single words and put us into the realm of phrases. We learned how to find words by filtering out everything that appears in the dictionary, which makes for strong keywords. We also learned phrase searching, which can match multiple words in combination. Again this seems simple, but at the command line, done thousands or millions of times, it becomes a very powerful tool. My partner is actually working on a task this could be very powerful for. She is often tasked with reading books and searching for specific terms. With automation, she could escape the page-by-page reading that eats up unimaginable time and resources, and simply find the terms using egrep searches, with their contexts. Hell, you could even get the entire page they were written on.
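A minimal sketch of that kind of search, assuming the books have already been converted to hypothetical files named chapter-*.txt (the phrase is just a stand-in query):

    # case-insensitive phrase search with two lines of context on either side;
    # -n adds line numbers so each hit is easy to track down in the source
    egrep -i -n -C 2 'responsible government' chapter-*.txt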
Batch Downloading and Simple Search Engines
This lesson involved downloading large numbers of files from a site, using a URL encoder, and one of my favourite tools, the search engine Swish-e. Swish-e is a command-line search engine that indexes a massive amount of text and finds the files that contain your query.
Batch downloading, from sites that allow it, makes the process of getting the documents you want much easier. If you can download something at the command line, you can use a loop or script to download massive numbers of files for research. You then need a way to find the particular pages or sections you are interested in. Swish-e, after a quick indexing step, can find the files that contain the phrases or queries you want. While we did not explore it in class, Swish-e has more feature-heavy options, and can search across many files for your term at once.
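Here is a minimal sketch of the whole loop, download through search; the URL pattern, page count, index name, and query are all hypothetical:

    # batch download fifty numbered pages (only from sites that permit it)
    for n in $(seq 1 50); do
        wget "http://example.org/letters/page-$n.txt"
    done

    # build a Swish-e index over the downloaded files, then query it
    swish-e -i . -f letters.index
    swish-e -f letters.index -w 'confederation AND railway'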
This tool features in my scripting for secondary-source work. My plan is to use multiple digital tools to summarize a secondary source and tease out its important arguments. After reading the bookends of the work, the introduction and conclusion, you can find keywords that show up often in the text, and, like a powerful index, Swish-e can find them throughout the work. Better yet, if you do this with multiple pieces of writing, you can surface different perspectives on the same idea by seeing where it shows up across several works.
Named Entity Recognition
Named Entity Recognition is another of my favourites from the course. This software scans through text to find and label the people, organizations, and locations within it. You can then filter them out and use them in various ways. This tool is also part of my secondary-source work, but it has tonnes of uses. When trying to find keywords, or when tracing the story of some person, organization, or location through the literature, this is the tool. At the command line you can run it on entire libraries at a time.
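As a sketch, here is the sort of invocation the command-line Stanford NER tool takes, piped into a quick tally of the people it finds; the jar path, classifier path, and letters.txt are assumptions that will vary by installation:

    # tag people, organizations, and locations in the text,
    # then pull out the PERSON tags and rank them by frequency
    java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
        -textFile letters.txt \
        | tr ' ' '\n' | grep '/PERSON' | sort | uniq -c | sort -rn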
Recently, I was writing a historiography of a historical figure’s relationship with another man. While there was no book on this directly, every biographer addressed it in one way or another. By using this system I was able to find, in a vast collection of books, the ones in which he appeared, in order to pick out the ones I wanted to index for the search engine. Since his name could be spelled a few different ways, I trusted this approach more than a simple grep search.
Optical Character Recognition

Optical Character Recognition is really the engine that drives all of this. It is the software that takes a scan of a piece of writing and converts it to text. I once spent an entire afternoon in the digital history lab scanning a book in order to show off its power. It does, however, have its drawbacks. The software isn’t perfect, so errors are a fact of life. Still, there are fixes, and given the alternative of transcribing by hand, you’re better off.
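A minimal sketch, assuming the common command-line OCR engine Tesseract is installed; page-001.png and the scans/ folder are hypothetical:

    # convert one scanned page to text (writes page-001.txt)
    tesseract page-001.png page-001

    # OCR every scanned page in a folder with a loop
    for f in scans/*.png; do tesseract "$f" "${f%.png}"; done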
PDF Manipulation

What we learned in this class was the process of breaking apart and putting together PDF files with command-line tools. PDFs are an excellent way to transmit graphical and textual information in a single file format. What automation brings to this process is very much linked to the other tools. We could take a series of scanned photos and stitch them together into a book file to use. We could break a PDF down into a series of photos for OCR and analysis.
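A minimal sketch of both directions, assuming the poppler utilities and ImageMagick are installed and book.pdf is a hypothetical file:

    # break a PDF into one PNG per page, ready for OCR
    pdftoppm -png book.pdf page

    # stitch a folder of scanned images back into a single PDF
    convert page-*.png book-rebuilt.pdf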
Structured Data

Structured data is information compartmentalized into identifying rows and columns. Think of an Excel file, a spreadsheet. These can be used for all kinds of information gathering and analysis. The power of command-line structured data comes down to a few important things. First, tallies and bits of information can be collected from text and organized in an automated manner. Second, in computer science, linear algebra works on structured-data-like objects. In the realm of mathematics, we can go beyond two-dimensional structured data and move into many dimensions.
Say I am working on a project where I need to compare and contrast a few different pieces of writing in different ways. Say we want to measure a thinker’s letters by length, average word length, and other metrics that imply a greater mastery of the language they are writing in. By treating each metric as an axis, you can place each letter as a point in multi-dimensional space. You can then measure the multi-dimensional distance between the points to see if there is indeed a trend. This is actually similar to the technology behind facial recognition software.
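As a sketch of the first step, here is how those raw metrics could be pulled into a structured CSV at the command line; the letters/ folder and the two metrics are assumptions:

    # one row per letter: filename, word count, average word length
    for f in letters/*.txt; do
        awk -v name="$f" '{ for (i = 1; i <= NF; i++) { words++; chars += length($i) } }
            END { printf "%s,%d,%.2f\n", name, words, chars / words }' "$f"
    done > metrics.csv

From there, each row can be treated as a point’s coordinates and compared with an ordinary Euclidean distance.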
XML Parsing and Graph Visualization
This week covered two fairly different topics. The first was XML parsing. XML is a format for texts that works like HTML, with tags, but unlike HTML the tags are defined by the creator of the text. This lets you create tags like <author> or <idNumber>, or whatever else helps make a piece of text far more searchable and parsable by a machine. I can extract that data from hundreds of files at once to collect it. The second thing we worked on was graph visualization. Starting from a text file, which we can manipulate with command-line tools, we can build connected graphs, webs of connections perfect for analysing influence.
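A minimal sketch of both halves, assuming xmlstarlet and Graphviz are installed; the <author> tag, the letters/ folder, and the names are hypothetical:

    # pull every <author> element out of a folder of XML files and tally them
    xmlstarlet sel -t -v '//author' -n letters/*.xml | sort | uniq -c | sort -rn

    # render a tiny web of connections from a plain-text graph description
    echo 'digraph { "Macdonald" -> "Cartier"; "Macdonald" -> "Brown" }' | dot -Tpng -o web.png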
Spiders

The spider works as a searcher of information for you. You can program it with a checklist of things to search for, and it moves through web pages to gather them. This is the kind of tool Google uses to index the Internet for its search engine. Spiders automate the task of fishing through large collections of files and websites to gather small bits of important information, and they can store what they find in structured hierarchies that you determine. One good example is using a spider to attach Twitter users to their followers, to find networks of users following the same account.
The spider will likely play a large role in my own research. The web is an almost infinitely vast place with endless amounts of data. To structure that for the research I am doing, simply looking through page after page becomes a very tedious task. A spider set to look for particular things could be set loose on a website to find every single piece of data I am looking for and store it. Fox News search, anyone?
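A crude spider can be sketched with nothing more than wget; the crawl depth, file filter, and URL here are hypothetical settings:

    # crawl a site two links deep, keep only HTML pages,
    # and wait a second between requests to be polite to the server
    wget --recursive --level=2 --accept html --wait=1 "http://www.example.org/"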
Scrivener

Scrivener is a bit of an extracurricular tool I began using; it is the tool I am writing this on right now. It’s a word processor that can hold an entire project’s research, structure, and words in a single project file. It can break writing down into its smallest components and make first drafts much less daunting. I haven’t even begun to imagine the kinds of tools lurking inside it for when I have time to play.
I will refer to an essay I wrote last week. Normally, I am fighting with the minimum word count for a paper. On this one I felt as if I were as well, but lo and behold, writing section upon section fleshed out my ideas much better. I felt behind the entire time I was working on it, but the finished product was better organized and more logical than anything I had written before. Much pride. Many words. Wow.
To zoom out, this new tool-belt has opened a lot of doors for historians. As a historian of the 21st century, I find these tools not only helpful but necessary to weed through otherwise unmanageable masses of primary source data. They will play a large part in my current work, and hopefully my future studies as well.