Wrapping Up

This is the end. After a whirlwind semester of new tools coming at us at breakneck speed, we have reached the end of exploring just the basics of our digital tool-belt. We learned how the computer, beyond being a machine for number manipulation, can do a lot of work in the humanities. With this course’s emphasis on the command line, what we learn can be automated and scripted to run any number of times. So, to sum up, I want to go over each week’s tools, describe them, show how powerful they can be, and pose a hypothetical problem each one could solve.

Virtual Machines

Virtual machines are computers within computers. What we worked with over the semester was a tool called HistoryCrawler, a virtual Linux machine filled with historians’ tools to play with.

Virtual machines are powerful for their consistency. You control every aspect of the machine doing your work, and that machine can be saved in case of a catastrophe, shared with others, or run multiple times at once. This makes for powerful automation projects.

Say, for instance, I was doing some sort of complex analysis of a huge dataset. It takes several weeks of hard work, but eventually I complete it. I go on to use my data in a publication, but while presenting it at a conference, I am called out on my methodology. With a virtual machine, months later I could return to the project, in the exact state it was in when I finished the work, and see exactly how I did everything I did, no matter what computer I currently own. Ass perpetually covered.
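
As a concrete sketch of how that saving and restoring works, assuming the machine runs under VirtualBox and is registered there under the name HistoryCrawler (both assumptions on my part):

    # Take a named snapshot so the exact working state can be restored later
    VBoxManage snapshot "HistoryCrawler" take "analysis-finished" --description "state at the end of the analysis"

    # Months later, power the machine off and roll it back to that state
    VBoxManage controlvm "HistoryCrawler" poweroff
    VBoxManage snapshot "HistoryCrawler" restore "analysis-finished"

    # Or export the whole machine as a portable appliance to archive or share
    VBoxManage export "HistoryCrawler" -o historycrawler-backup.ova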

Basic Text Analysis

The week we worked on basic text analysis was really just a small dip into how we can use the command line to take large bodies of text and sift out important statistics such as word frequencies. What this opens up at the command line is the ability to batch process large amounts of text. My current plan is to use this to find connections between the ideas in entire books using word frequencies.

Say I have an entire library of some historical figure’s letters. If I have them as .txt files, I could use a few basic tools, compiled into a simple script, to distill out the most common words used, in order. With a single command I could process an entire catalogue of letters this way. With that information, we can then look through these word frequencies and quickly find the words that show up most often. Those that are more than everyday words can be used later on to link, say, the people being written to with the topics of the writing, or even to trace the development of new terms and concepts.
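
To make that concrete, here is a minimal sketch of such a frequency count, assuming the letters already sit in a folder called letters/ (the folder and file names are placeholders):

    # lower-case everything, split into one word per line, count, sort by frequency
    cat letters/*.txt \
      | tr 'A-Z' 'a-z' \
      | tr -cs 'a-z' '\n' \
      | grep . \
      | sort \
      | uniq -c \
      | sort -rn \
      > frequencies.txt

    head -20 frequencies.txt    # the twenty most common words across the whole collection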

Pattern Matching and Permuted Term Indexing

This is the lesson that expanded beyond single words and put us into the realm of phrases. We learned how to find words filtered against the dictionary, which make for strong keywords. We also learned phrase searching, which can match multiple words in combination. Again this seems simple, but at the command line, done thousands or millions of times, it becomes a very powerful tool. My partner is actually working on a task this could be very powerful for. She is often asked to read books while searching for specific terms. With automation, she could escape the page-by-page reading that eats up unimaginable time and resources, and simply find the terms with egrep searches, complete with their contexts. Hell, you could even get the entire page each one was written on.
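
A rough sketch of what that search might look like, assuming the books are already OCRed into a folder of text files (the folder name and the phrase are placeholders):

    # Find a phrase in every text file, case-insensitively, with the file name,
    # line number, and two lines of context around each hit
    egrep -r -i -n -C 2 "responsible government" collected_books/ > hits.txt

    # A quick tally of how many matching lines each file contains
    egrep -r -i -c "responsible government" collected_books/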

Batch Downloading and Simple Search Engines

This lesson involved downloading large numbers of files from a site, using a URL encoder, and one of my favourite tools, the search engine Swish-e. Swish-e is a command-line tool that searches a massive amount of text to find the files that contain your query.

Batch downloading, from sites that allow it, makes the process of getting the documents you want much easier. If you can download something at the command line, you can use a loop or a script to download massive numbers of files for research. You then need a way to find the particular pages or sections you are interested in. After the files are burst into pieces, Swish-e can find the ones that contain the phrases or queries you want. While we did not explore it in class, Swish-e has many more features than we used, and can search multiple files for your term.
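
A minimal sketch of that workflow, with a made-up URL pattern and a placeholder query, and assuming the site’s terms of use allow batch downloading:

    # Download a numbered run of documents, pausing politely between requests
    for n in $(seq 1 250); do
        wget -nc -w 2 -P downloads "http://example.org/collection/item_${n}.txt"
    done

    # Index the downloaded folder with Swish-e, then search it from the command line
    swish-e -i downloads -f index.swish-e
    swish-e -w "treaty" -f index.swish-e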

This tool features in my scripting for secondary-source work. My plan is to use several digital tools to summarize a secondary source and tease out its important arguments. After reading the bookends of the work, you can find keywords that show up often in the text and, like a powerful index, locate them with Swish-e. Better yet, if you are doing this with multiple pieces of writing, you can show different perspectives on the same idea by seeing where it shows up across the works.

Named Entity Recognition

Named Entity Recognition is another of my favourites from the course. This software scans through text to find and label the people, organizations, and locations within it. You can then filter them out and use them in various ways. This tool is also part of my secondary-source work, but it has tonnes of uses. When trying to find keywords or trace the story of some person, organization, or location through a literature, this is the tool to use. At the command line you can do it to entire libraries at a time.
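
For the record, a minimal sketch of running the tagger over one file, assuming the Stanford NER jar and its English three-class classifier sit in the working directory (paths vary by installation):

    # Tag a letter's named entities, then tally the people mentioned
    java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
        -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
        -textFile letter.txt > letter.ner.txt

    egrep -o '[^ ]+/PERSON' letter.ner.txt | sed 's|/PERSON||' | sort | uniq -c | sort -rn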

Recently, I was writing a historiography of a historical figure’s relationship with a man. While there was no book on this directly, every biographer addressed it in one way or another. Using this system I was able to find, in a vast collection of books, the ones in which he appeared, in order to pick out the ones I wanted to index for the search engine. Since his name could be spelled a few different ways, I trusted this approach more than a simple grep search.

OCR

Optical Character Recognition is really the engine that drives all of this. It is the software that takes a scan of a piece of writing and converts it to text. I once spent an entire afternoon in the digital history lab scanning a book in order to show off its power. It does, however, have its drawbacks. The software isn’t perfect, so errors are a fact of life. However, there are fixes, and given the alternative of transcribing by hand, you’re better off.
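
A minimal sketch of the command-line version, assuming tesseract is installed and the scans sit in a folder as PNG files (all the names here are placeholders):

    # OCR every scanned page, then join the results into one text file
    for page in book_scans/*.png; do
        tesseract "$page" "${page%.png}"    # writes a .txt file beside each image
    done
    cat book_scans/*.txt > book.txt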

PDFs

What we learned in this class was the process of breaking apart and putting together PDF files with command-line tools. PDFs are an excellent way to transmit graphical and textual information in a single file format. What automation brings to this process is very much linked to the other tools. We could take a series of scanned photos and stitch them together into a book file to use. We could break down a PDF into a series of images for OCR and analysis.
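
Two quick sketches of those jobs, assuming pdftk and ImageMagick are installed (the file names are placeholders):

    convert scans/page_*.png book.pdf                          # stitch scanned images into one PDF
    pdftk book.pdf burst output page_%03d.pdf                  # break a PDF into one file per page
    pdftk chapter1.pdf chapter2.pdf cat output combined.pdf    # join PDFs end to end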

Structured Data

Structured data is information compartmentalized into identifiable rows and columns. Think of an Excel file, a spreadsheet. These can be used for all kinds of information gathering and analysis. The power of command-line structured data involves a few important things. First, tallies and bits of information can be collected from text and organized in an automated manner. Second, in computer science, linear algebra works on structured data in much the same way. In the realm of mathematics, we can go beyond two-dimensional structured data and move into many dimensions.

Say I am working on a project where I need to compare and contrast a few different pieces of writing in different ways. Say we want to measure a thinker’s letters by length, word length, and some other metrics that imply a greater mastery of the language they’re writing in. By treating each metric as an axis, you can place each letter as a point in multi-dimensional space. You can then measure the distances between those points to see whether there is indeed a trend. This is actually similar to the technology behind facial recognition software.
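
A small sketch of building that structured data at the command line, one CSV row per letter with a few crude metrics (the metric choices, folder, and file names are my own assumptions); measuring the distances would come afterwards in whatever stats tool you like:

    # one CSV row per letter: file name, word count, average word length, average sentence length
    echo "file,words,avg_word_len,avg_sentence_len" > metrics.csv
    for f in letters/*.txt; do
        awk -v name="$f" '
            { for (i = 1; i <= NF; i++) { words++; chars += length($i) }
              sentences += gsub(/[.!?]/, "") }
            END { printf "%s,%d,%.2f,%.2f\n", name,
                         words,
                         (words ? chars / words : 0),
                         (sentences ? words / sentences : words) }
        ' "$f" >> metrics.csv
    done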

XML Parsing and Graph Visualization

This week covered two fairly different topics. The first is XML parsing. XML is a format for texts that works like HTML, with tags, but unlike HTML the tags are defined by the creator of the text. This lets you create tags like <author> or <id> or whatever else will make a piece of text far more searchable and parsable for a machine. I can extract that data from hundreds of files and collect it. The second thing we worked on was graph visualization. Using a text file that we can manipulate with command-line tools, we can build graphs of connections, webs that are perfect for analysing influence.
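
For the XML half, a minimal sketch using xmlstarlet, one common command-line XML parser (the tag name and folder are assumptions):

    # Pull every <author> element out of a folder of XML files and tally them
    xmlstarlet sel -t -v "//author" -n letters/*.xml | sort | uniq -c | sort -rn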

Spiders

The spider works as a searcher of information for you. You can program it with a checklist of things to search for, and it moves through web pages to gather them. This is the kind of tool Google uses to index the Internet for its search engine. Spiders can automate the task of fishing through large collections of files and websites to gather small bits of important information, and they can store them in structured hierarchies that you determine. One good example is using a spider to attach Twitter users to their followers, to find networks of users following the same account.

The spider will likely play a large role in my own research. The web is an infinitely vast place with endless amounts of data. For the research I am doing, simply looking through page after page would be a very tedious task. A spider set to look for particular things could be set loose on a website to find and store every single piece of data I am after. Fox News search, anyone?
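
A very small spider can even be improvised with wget; here is a sketch, assuming the site permits crawling (the domain and search phrase are placeholders):

    # Crawl two levels of a site, keep the HTML pages, then grep the haul
    wget --recursive --level=2 --wait=2 --accept html,htm \
         --directory-prefix=crawl http://www.example.com/
    egrep -r -i -n "ground zero" crawl/ > mentions.txt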

Scrivener

Scrivener is a bit of an extracurricular tool I began using; it is the tool I am writing this on right now. It’s a word processor that gathers an entire project’s research, structure, and words into a single project file. It can break writing down to its smallest components and make first drafts much less daunting. I haven’t even begun to imagine the kinds of tools it has lurking around for when I have time to play with it.

I will refer to an essay I wrote last week. Normally I am fighting with the minimum word count for a paper. On this one I felt as if I was as well, but lo and behold, writing section by section fleshed out my ideas much better. I felt behind the entire time I was working on it, but the finished product was better organized and more logical than anything I had written before. Much pride. Many words. Wow.

To zoom out, this new tool-belt has opened a lot of doors for historians. For a historian of the 21st century like me, these tools are not only helpful but necessary to weed through otherwise unmanageable masses of primary source data. They will play a large part in my current work, and hopefully in my future studies as well.


Digital Historiography

Howdy readers, I know you probably want an update on my progress with my digital reading workflow, but work on my script has been slow, and when it is finished it will get a blog post of its own. Instead, I am taking this post to wax philosophical about digital tools and historiography.

For the uninitiated, historiography is the study of the writing of history. The history of history, or historyception, acts as a mirror of the society writing it, and can teach us not only how people approached the past in their own time, but also how our own approach to the past is itself a historical moment.

Over the last few months I have come to understand some of the techniques of digital history, and they have the potential to cause major tectonic shifts in how history is done. What could this do to our historical writing, though?

One of the first things I noticed is the increased weight placed on quantitative analysis of sources. Computers are by nature number manipulators, so this does not surprise me. The weight of historical analysis could, however, drift away from finding a few key sources or word choices and toward a bigger emphasis on patterns emerging from the multitude. While this is good, it does mean that a bit may be lost in the outliers.

If I were to interject on one thing historiography will need alongside these more sophisticated techniques, it is methodology. Other fields have to explain the methods they use to investigate a topic. While this might take away from the narrative flow of good historical writing, methodology is becoming sophisticated to the point where its critique and improvement could become a sub-discipline in itself.

Next time I promise to show you my fancy script, and where I am at in turning readings into something you can take to a seminar.


Automating Seminar Readings Part 2: Summaries, Scanners, and First Attempts

Hello there blogosphere, I know you have been waiting eagerly to find out how my automation attempts have worked. I have had some struggles, some successes, and some ideas for future iteration.

My first struggle came in the digitization process. Western has a great collection of resources available for digital humanities work, except for access to the quick scanning technology of a photocopier. This means that digitizing an entire book can be tricky. I managed to get my readings for one seminar scanned after a long goose chase around campus. With the better part of a Tuesday gone, I resigned myself to the slow scanner we have in the lab.

After discussing this with Bill, I learned that the process could be improved in a number of ways. One possible solution is to reconfigure the scanner to a lower resolution to increase scanning speed. But the best idea of all was turning me on to the world of cell-phone-based scanning apps. For about 8 dollars I set myself up with the ability to scan documents (with OCR) using my iPhone’s camera. There are several of these on the App Store, and I imagine Android is equally well stocked. So going forward, I can scan using this tool.

My second hurdle is also a function of time. At the moment I find myself typing a lot of commands whose monotony I feel I could reduce. My current process goes as follows:

1. Download all the .txt files from Dropbox into an individual folder for each reading
2. Run a word-frequency count on each file and take a list of keywords from it
3. Use the Stanford NER tool to create lists of the people, organizations, and locations mentioned
4. With these in hand, copy all the notes into a new folder
5. Burst all the files into a burst folder
6. Index the burst folder with Swish-e
7. Search for the keywords, look up the best results, and from the contexts put together an idea of what the articles are about and how they interconnect

I did all of this with individual commands, so I feel each of these jobs could be made very quick with the right handful of scripts. I also had some problems with bursting files, and may consider using a size-based method of breaking up files next time rather than a line-based one, since the latter produced some very ugly, huge text files that reduced the utility of Swish-e. I will keep working at it, as iteration is how great things are made.
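
Here is a sketch of what that size-based bursting might look like, assuming the readings sit in a readings/ folder (the 2 KB chunk size is just a guess to tune):

    mkdir -p burst
    for f in readings/*.txt; do
        # split into ~2000-byte chunks without breaking lines, with numbered suffixes
        split -C 2000 -d "$f" "burst/$(basename "$f" .txt)_"
    done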

Here is an example of one reading’s summary made this way. Combined with some googling, access to the reading’s text in class, and some good old-fashioned context, you could do pretty well in a seminar. I hope so, anyway; I will find out tomorrow.

Bennett – theoretical Issues

Keywords:

Women, status, brewsters, transformation, trade, brewing, continuity, medieval, reproduc*, prohibited, power, permission, patriarchy, feminist, university, patriarch*, institution*, gender, england
People: Kelly, Judith M. Bennett, Joann McNamara, Owlen Hufton, Caroline Walker Bynum
Organizations: Yale, Berkshire conference
Locations: England, London, New York, Europ*, United States,

Notes:
-brewing shows that jobs considered womens work would be coopted by men if too successful
-a bit of intersectionality with oppressive systems-was a good job for women’s time, but by 18th century they were pushed out
-Kelly, inverted synchronization-Periodization of womens history or traditional history invert on each other
-Histories of european women should not be the paradigm for women everywhere (We are doing another article on Kuhn so this line is applicable through that)

Until next time, over and out.

EDIT: I have thought about this project over the last day or so, and I have come to an idea I will work on over the weekend. I want to make an all-in-one script for this process. With one script run in a folder of .txt files, I would like to:

  1. Make a folder for each file
  2. Put a word-frequency file into each file’s folder
  3. Run Stanford NER on each file and export person, location, and organization frequency files into each folder
  4. Create a folder for bursting documents
  5. Burst each file into many small text files in the burst folder
  6. Get Swish-e to index the burst folder

All of this takes up the majority of the time I spend processing readings. With this script I could do it in a matter of seconds, and expand it out to larger and larger sets, possibly even making it usable for something like comprehensive exams or secondary-source research across an area of scholarship.
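
To make the plan concrete, here is a rough, untested sketch of what that all-in-one script could look like, just stringing together the pieces already described; the folder names, chunk size, and Stanford NER paths are all assumptions to adjust:

    #!/bin/bash
    # Run inside a folder of .txt readings
    mkdir -p burst
    for f in *.txt; do
        base="${f%.txt}"
        mkdir -p "$base"                                     # 1. a folder per reading

        # 2. word-frequency list
        tr 'A-Z' 'a-z' < "$f" | tr -cs 'a-z' '\n' | grep . \
            | sort | uniq -c | sort -rn > "$base/frequencies.txt"

        # 3. people, locations, and organizations via Stanford NER
        java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
            -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
            -textFile "$f" > "$base/ner.txt"
        egrep -o '[^ ]+/(PERSON|LOCATION|ORGANIZATION)' "$base/ner.txt" \
            | sort | uniq -c | sort -rn > "$base/entities.txt"

        # 4-5. burst the reading into small chunks for indexing
        split -C 2000 -d "$f" "burst/${base}_"
    done

    # 6. index the burst folder with Swish-e
    swish-e -i burst -f index.swish-e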


Automating Seminar Readings Part 1: Setup and Surveying

One of the goals of history is to absorb massive amounts of data and distill out the important, corroborated evidence and the solid ideas. So far in this class we have been taught to do this with an eye towards primary source material. These tools, however, could also serve as a strong way to accelerate seminar reading and increase comprehension. With your toolkit available during the seminar itself, you could take this even further.

Ethically, I don’t see a problem with this. Embracing the cyborg really helps to build comprehension of the massive amounts of text data our brains have a tough time batch processing anyway. Bringing tools to a class is not much different from bringing readings to class. Also, for students like myself who have learning disabilities like ADHD, this is in a sense a workaround for our limitations.

Bringing our recently acquired tools into the mix, we have seen over the last few weeks the power of automation for tasks like these. Most of the way we were taught to read involves selective reading of a text: introductions, conclusions, headings, and emphasized text. This series of blog posts will document my process of developing a seminar reading analysis and recall system.

1. Setup

This step is pretty straightforward. Downloading the assigned readings into a folder shared with your HistoryCrawler is the essential first step. Some assigned readings might need to be scanned in order to do this, so part of your reading time will be spent at a copy machine scanning book chapters into a digital format (really useful for those 2-hour loans).

What you should end up with is a collection of PDFs, images, and text files. The text files, say from an article on a website or something copied into Notepad, are already finished; you don’t need to do extra work on them. Images and PDFs, however, will need to be OCRed into a text format. We will be covering this in just a few days, but the tutorial is here. Make a backup of your readings, and then we proceed.

2. Surveying

First we want to get a general view of what is going on in the readings. A simple way of doing this is to break the readings up into pieces that are easier to sort through, as we did a few weeks ago. To get an idea of what all the readings converge on, we can do a word-frequency analysis and a search for words not in the dictionary to find a series of important keywords and phrases. grep can even show these in context and name the files they come from. With these tools you can piece together where major ideas cross over between readings, and by reading the context around them you can see what each reading has to say about them and make interesting connections.
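
A minimal sketch of that dictionary comparison, assuming the system word list lives at /usr/share/dict/words (the path varies between machines, and the search term at the end is a placeholder):

    # Words in a reading that are not in the dictionary: names, jargon, coined terms
    tr 'A-Z' 'a-z' < reading.txt | tr -cs 'a-z' '\n' | grep . | sort -u > reading-words.txt
    tr 'A-Z' 'a-z' < /usr/share/dict/words | sort -u > dict-words.txt
    comm -23 reading-words.txt dict-words.txt > keywords.txt

    # grep then shows any keyword in context, naming the file it comes from
    egrep -i -n -C 1 "KEYWORD" readings/*.txt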

Where I see a problem I can’t yet figure out a solution to is the tyranny of the majority in word frequencies across multiple files. If you were working with, for example, two journal articles and a book, the book will be over-represented in the word frequencies and could drown out the journals. If there is a solution you can imagine (short of establishing some form of source senate), just let me know in the comments. Iteration is king!

In my next post I will expand on this project and discuss ways to put together a report on each reading, giving a quick reference for the major parts of each document. Cheers!


Wikipedia and Text Mining

One idea for my upcoming cognate is to text mine the different iterations of the September 11th Wikipedia page to look for changes in language from its first version to the present. Later in the course we plan on doing visualization, but for now I am simply planning how to find the important words within the document.

Last week I learned that while I can produce a simple word-frequency list, it still leaves me with a lot of the filler words needed for English grammar. These are useful for determining the language of a document, but they create a lot of mess when trying to build this data set. I could weed them out by hand, but there has to be an easier way.

I would like to find something like the dictionary comparison we did last class, but for these types of words that I would not find useful. If such a thing exists, I would be very happy to have it. If not, then I have a whole new interesting project on my hands.

Lastly, I attempted to patch my little machine against an exploit named Shellshock. Sadly, I still have an issue where the console does not seem to let me enter my password to download and install the update.


Text Mining, Social Memory, and Web 2.0

Currently, our work in my digital research methods course has been on the basics of text mining. Specifically, we learned how to find word frequencies in .txt files. I have had a few problems with my HistoryCrawler, the shared documents folder, and internet access, so I have been limited in my experimentation, but the possibilities are very tantalizing.

My research interests are in 21st-century topics. Specifically, I have an interest in the War on Terror. My current project has to do with the memorializing of the September 11th attacks, and how Americans integrate them into their national narrative. My plan is to work with a few archives that will require this type of text mining. Here are a few ideas.

Bill turned me on to the Internet Archive’s collection of TV news. Since 2009 it has been recording some 25 news channels around the clock, with text searchability from their closed-captioning files. Text mining those closed captions may point to information about word choice, and how, over time and in different contexts, 9/11 changes in its representation as history. With access, I would like to try this with LexisNexis (an archive of text-searchable news stories) too.

Other ideas came to mind for using text mining in primary-source work. One would be an analysis of 9/11 anniversaries as they show up on the social media platform Twitter. I am not sure about access, but finding and mining popular 9/11 memorial hashtags would build some insight into 9/11 as history.

Lastly, I would like to find an academically sound way to use Wikipedia. The site has been around since 2001, and thus for nearly all the time that has passed since 9/11. Wikipedia keeps change logs for every page, showing how its public editors have built and edited the page over the years. If I could find a way to text mine how this article has changed from the early days of reporting to today, there could be some really useful insight into the reconceptualization of 9/11 from that day to the present.
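
One possible route is the MediaWiki API; here is a minimal sketch (the parameters shown are an assumption on my part, and the API can page through the full revision history):

    # Fetch ten early revisions of the article as wikitext, oldest first
    curl "https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=September%2011%20attacks&rvprop=timestamp%7Ccontent&rvlimit=10&rvdir=newer&format=json" \
        > sept11_revisions.json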

Text mining seems like a simple activity, but it is full of opportunity. These are just a few ideas off the top of my head. As I develop my skills as a researcher, maybe my imagination will unlock more. Text is powerful.


On Transitions and Expansions

On the internet, the newest competition is for attention. An internet user has only so much time per day to dedicate to consuming media online, and so in order to succeed you must create something readers will find valuable. With that in mind, I will attempt through this humble blog to create that value for the reader.

My name is Tristan Johnson, and I am a historian. Breaking down that term over the last week of my second master’s degree has revealed that being a historian means many different skills and tasks, and that they change constantly.

A historian is the shaper of humanity’s collective memory. We try to take the surviving data of our short lifetime of writing and synthesize it into a coherent origin story for the modern world. What does that mean exactly?

Teaching, writing, and researching make up the bulk of a historian’s day-to-day work, all of which are changing. This blog was created for a course that teaches the skills needed to take the last of these into the future.

To any academic, research is known for the long and tedious collecting and processing of sources in order to find useful gems of information, or to gather the tiny grains of sand that build your sandcastle of a book / journal article / whatever. The course I am excited to be a part of takes the tools of scripting and programming and uses them to automate some of that work.

In another life, I wish I had been a computer scientist, especially in the field of automation. In a perfect world, each time we use a machine to replace the work of a person, the result should be a collective liberation from drudgery for mankind.

We are only one class in, and even with a cursory glance at the tools we’ll be using I can see the huge potential of incorporating automation into research. Given my interest in 21st-century America, automation is not only helpful but necessary for the sheer mass of sources I wish to work with. Mastery of these tools will change how I work completely.

I shall update you, dear reader, on my progress as my research moves along through my graduate school experience and I try to take on this new model of research. Hold on, things could get strange.
