Generating Algorithmic Poetry


In this post, I’m sharing the Python code that Sven Anderson and I used in our “Family Weekend” class for “Technologies of Reading: Human and Machine Approaches to Literature.” We began the class with two poems from an NPR article called “Human or Machine: Can you tell who wrote these poems?”, which presents a series of sonnets and asks readers to distinguish between ones written by people and ones generated by computers. For our class, this challenge offered an occasion to think about what might be entailed in attempting to program poetic style, and it helped to reveal what even far more sophisticated algorithms than the one we share here have failed to achieve in imitating human poetry.

The question of what constitutes poetic style has produced and will continue to produce varied and complicated answers, but, at bottom, it comes down to choice. This experiment in generating “poetry” represents one way of simulating this process of choice by randomly picking words according to their frequency in a particular poet’s corpus. (Just to note, some of the functions in this script are in a separate file (functions.py) to make it easier to read. You can access those functions and the accompanying text files here.)

Objective: We will be able to articulate specific features of poetic writing that can distinguish computer-generated poems from ones written by people.

In this class, we’ll be working with Python code in the Jupyter notebook interface. What’s great about this platform is that we can easily move between formatted text, code, images, and pretty much any other type of digital object.

A word cloud generated from Abraham Lincoln’s Gettysburg Address. The size of each word corresponds to its frequency in the speech. Source: http://marketingworks360.com/2012/03/marketing-tips/visualizing-quantitative-data-a-lesson-in-identifying-untapped-insight/.

We will use this code to produce a randomly-generated poem based on a particular poet’s style.

How might we mathematically represent authorial choice?

We can start by attempting to simulate the process by which particular authors make stylistic choices — for instance, how they choose which words to use.

In programming, we produce simulations using algorithms, which are sets of rules for the computer to follow. They are computational recipes for turning a given input into an output. Here we create an algorithm in a function for calculating the tip for a meal.

def tipCalculator(mealTotal):              ### Our "mealTotal" is our input.
    return mealTotal * 0.20 + mealTotal    ### We add 20% of the total to the input to get our output

We’ve created the function, now we can run it:

print(tipCalculator(50.00))
60.0

Now let’s think about how we might simulate poetic word choice to produce randomly generated poetry.

# Load some functionality and some pre-computed probabilities.
import functions
import pickle

dickinsonProbs = pickle.load(open('dickinsonProbs.p', 'rb'))

print(dickinsonProbs['the'])
print(dickinsonProbs['heart'])
0.06506088945704097
0.0009419363520150709

These numbers tell us the likelihood of picking each word if we were to draw them at random from the text: the more frequent the word in the text, the greater chance of choosing it.
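The probabilities themselves were pre-computed and stored in the pickle files that accompany functions.py, but the idea is simple: count each word and divide by the total number of words in the corpus. Here is a minimal sketch of how such a dictionary might be built from a plain-text file of a poet’s work; the filename dickinson.txt and the helper name wordProbabilities are illustrations, not the code we actually used.

from collections import Counter

def wordProbabilities(filename):
    '''Return a dictionary mapping each word in the file to its relative frequency.'''
    with open(filename) as f:
        words = f.read().lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    # Each word maps to its share of all the words in the corpus.
    return {word: count / total for word, count in counts.items()}

# Assumed usage (the text file is hypothetical):
# dickinsonProbs = wordProbabilities('dickinson.txt')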

from random import random

def randomWord(probs):
    '''Given probs, a dictionary of word probabilities, this returns a word
    according to how frequently that word is found in the dictionary.'''
    rnum = random()
    sumprob = 0.0
    for k in probs.keys():
        sumprob += probs[k]
        if sumprob > rnum:
            return k
    return k

We can simulate this process by using a function that randomly chooses words from Dickinson’s poetry, based on her word usage.

print(randomWord(dickinsonProbs))
drum

for i in range(10):
    print(randomWord(dickinsonProbs))
since
but
them
most
few
sinks
one
repeal
the
delight

This frequency-based approach gives us a way to simulate the poet’s process of choosing words. But after we choose a random word, what comes next? Our approach (and the approach of most text generators, for that matter) draws upon the statistical features of language, which means that the question of what comes next can be answered by considering what came before. It’s tempting to think that word sequences are consequences of the grammatical rules of language, but, in practice, such rules are commonly ignored and defied. We can get a better model of language by observing the frequency of particular word combinations in actual usage. In The Information (2011), James Gleick explains how this view emerged from Claude Shannon’s mathematical theory of information. From the unit of the character to phrases to full passages, preceding items can be used to predict a range of possible items that might follow:

Immediately after the word yellow, some words have a higher probability than usual and others virtually zero. After the word an, words beginning with consonants become exceedingly rare… And the structure can extend over long distances: in a message containing the word cow, even after many other characters intervene, the word cow is relatively likely to occur again. As is the word horse. A message, as Shannon saw, can behave like a dynamical system whose future is conditioned by its past history. (226)

From this perspective, modeling language might be considered akin to modeling weather or some other natural phenomenon, and we can build the implications of this approach into our poetry generator. We can choose the next word based on which words are likely to follow the first word, according to the poet’s usage.
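The bigram tables loaded below were pre-computed in the same spirit as the word probabilities. As a rough sketch of how such a table might be built, we can count, for each word, which words follow it and normalize those counts; the helper name bigramProbabilities and its word-list input are assumptions for illustration, not the code in functions.py.

from collections import defaultdict, Counter

def bigramProbabilities(words):
    '''Given a list of words in order, return a nested dictionary mapping
    each word to the probabilities of the words that follow it.'''
    followers = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    bigrams = {}
    for first, counts in followers.items():
        total = sum(counts.values())
        bigrams[first] = {second: count / total for second, count in counts.items()}
    return bigrams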

dickinsonBigrams = pickle.load(open('dickinsonBigrams.p', 'rb'))
shakespeareProbs = pickle.load(open('shakeProbs.p', 'rb'))
shakespeareBigrams = pickle.load(open('shakeBigrams.p', 'rb'))

print(shakespeareBigrams['compare'])
{'thou': 0.3333333333333333, 'with': 0.16666666666666666, 'myself': 0.16666666666666666, 'thee': 0.16666666666666666, 'them': 0.16666666666666666}

In Shakespeare’s Sonnets, “thou” is the most likely word to follow “compare.” With this process of randomly choosing words likely to follow one another, we can build entire poems. In the following algorithm, we combine this process with a method of making the poems rhyme by matching the last word of every odd line with a random rhyming word in every even line.
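The line-building step in the algorithm below relies on functions.generateFromBigrams, which lives in the accompanying functions.py. Here is a plausible sketch of what that helper might look like, reusing the randomWord function from above; the fallback to the overall word probabilities when a word has no recorded followers is an assumption on my part.

def generateFromBigrams(probs, bigrams, lineLength):
    '''A sketch: build one line as a list of words, starting from a
    frequency-weighted random word and then choosing each next word from
    the bigram probabilities of the previous word.'''
    line = [randomWord(probs)]
    while len(line) < lineLength:
        previous = line[-1]
        if previous in bigrams:
            line.append(randomWord(bigrams[previous]))
        else:
            # Assumed fallback: draw from the overall word probabilities.
            line.append(randomWord(probs))
    return line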

def generatePoem(probs, bigrams, lineLength, poemLength):
    poemLines = []
    # create poemLines from probabilities
    for i in range(poemLength):
        line = functions.generateFromBigrams(probs, bigrams, lineLength)
        poemLines.append(line)
    lineCount = 1
    newLines = []
    for line in poemLines:
        if lineCount % 2 == 1:        # an odd line sets the rhyme
            rhymedLine = line
            newLines.append(rhymedLine)
            lineCount += 1
        elif lineCount % 2 == 0:      # an even line must match rhyme of preceding line
            rhymingLine = line
            try:
                if rhymedLine[-1] in functions.pronounce.keys():
                    newWord = functions.rhyme(rhymedLine[-1], 3)[0]
                    rhymingLine[-1] = newWord    # fix the rhyme to match
                newLines.append(rhymingLine)
                lineCount += 1
            except:
                newLines.append(rhymingLine)
                lineCount += 1
    # Now we concatenate to a single string.
    fullPoem = []
    for line in newLines:
        line.append('\n')
        fullLine = ' '.join(word for word in line)
        fullPoem.append(fullLine)
    return ''.join(line for line in fullPoem)

Now we can generate our own “poems” from the word choice of either Shakespeare or Emily Dickinson.

Generate Shakespearean poetry

The generatePoem function takes the poet’s word probabilities, the bigram probabilities, and numbers for the line length and number of lines.

print(generatePoem(shakespeareProbs, shakespeareBigrams, 8, 14))

praises worse than at that i not so affined
that our time and eyes have might to give
in my bosoms ward but let me do forgive
yore those children nursed deliverd from serving with thee
which in the strength and they look what thee
from the first the morning have been mine eye
thy fair assistance in loves fire shall in ai
yet the perfumed tincture of my desire keep open
they see barren tender feeling but yet eyes alagappan
doubting the face should transport me that is kind
enforced to my deepest sense to his brief affined
and gives thee back the past i have often
not my love thy store when days when acetaminophen

The Shakespearean “poem” begins with “praises” as the randomly chosen first word, weighted by the word frequencies in the sonnets. Each succeeding word is then chosen according to the likelihood of its following the preceding one. This doesn’t mean, though, that each word is the most likely to follow the previous one; rather, like the first word, it is selected randomly with the weighted frequencies factored in. This is why running the generatePoem function will produce new lines each time. The resulting poem-like thing resembles English grammar, but most lines don’t make sense, or at least require a form of creative reading. Additionally, the rhyming words that appear at the end of the even-numbered lines come from outside of Shakespeare’s sonnets, giving the poem a bit of a mad-libs quality — I actually do like the concluding couplet.

Generate Dickinsonian Poetry

print(generatePoem(dickinsonProbs, dickinsonBigrams, 8, 6))

for heaven she had come nor simplified her
consummate plush how cold i had the her
off for pearl then of love is finished
you lost was there came out time abolished
held but internal evidence gives us is overcome
he stayed away upon the merchant smiled branscome

Privacy Performed at Scale


You wear a Fitbit during your jog in the morning, swipe your grocery loyalty card when you check out in the evening, and share your contacts when you play Candy Crush at night. This “Internet of Things” offers automation, customization, and convenience and, in turn, demands access. The daily choices to allow access or not repeatedly require you to produce the contours of your private life, to open certain spaces and keep others closed. In this critical potluck, we asked: how do those contours take shape? (If your first question is, what the heck is a critical potluck? In short, it’s a workshop-like collaborative event; this one was held on September 15 at Bard. Check our handy guide for more details.)

We began with an object of inquiry: Part 4 of the Pew Research Center’s Internet Survey. It was conducted on 461 participants between Jan. 27 and Feb. 16, 2015. The survey offers a snapshot of how people face the ever-growing number of technologies and products that make claims upon various types of private information (from your Google search history to your average driving speed). Pew released the survey data with summary analysis that provides a glimpse of the general sentiments of respondents. (Hover over the pie charts to see the corresponding questions.)

The aggregated responses illustrate that, for instance, most people are more wary of potential encroachments on their homes than on their workplaces. The critical potluck, however, was intended to interrogate more subtle patterns that might be observed in how respondents conceived of that fuzzy line between acceptable and creepy. The task was to pose a question to the survey data designed to generate results that reveal points of tension in how survey respondents constitute their private lives. Consider the great number of scholars who have examined the boundary between public and private—Michel Foucault, Jürgen Habermas, Nancy Fraser, Michael Warner, and so on. Many of these writers have discussed the relationship between public and private in terms of sites, spaces, or spheres. In The Human Condition (1958), Hannah Arendt claims, “the four walls of one’s private property offer the only reliable hiding place from the common public world, not only from everything that goes on in it but also from its very publicity, from being seen and being heard.”1 Private spaces are homes, glove boxes, and safe deposit boxes. What happens to the language and concepts of privacy when what is private is represented in 1s and 0s and distributed across databases and server farms?

The potluck discussion began with this familiar idea of a private space: Is it strange that a smart thermostat is so much more worrisome than workplace cameras that use facial recognition algorithms? But as we worked through more of the responses, participants became more interested in a peculiar incongruence: the charged, aggressive manner in which respondents discussed such fuzzy, abstract ideas as their data, information, and even stuff.

Rather than just look at specific lexical patterns, though, we wanted to track what kinds of words and constructions were used to describe emergent notions of privacy. We ended up turning to Docuscope, a dictionary of over a hundred abstract categories (what the designers call Language Action Types, or LATs) that can classify over 40 million linguistic patterns according to their rhetorical usage. Docuscope was designed by the Carnegie Mellon rhetoricians David Kaufer and Suguru Ishizaki, who describe it as a tool for “corpus-based rhetorical analysis.” We uploaded the survey responses to the web-based tool Ubiqu+Ity, which tags the text according to the Docuscope categories. Here’s how one tagged response looks:

A survey response tagged with Docuscope Language Action Types.

The interface of tagged responses provided more precise vocabulary for describing the relationship between exclamatory adjectives and vague nouns that seemed to recur in the survey answers. Abstract Concepts is the most frequent LAT in the data, and Reporting States, Negativity, and Deny and Disclaim are also very prevalent. In this example, contingent verb constructions shape how the respondent relates to abstract ideas of “information” and “basic activities.” The respondent considers the different variables at play that would determine whether being monitored via video at work would be acceptable or not.

When the scale of analysis is shifted from individual responses to the LATs in the larger corpus, it appears that the Contingency category best explains the variation among the answers. The graph below is a plot of all 1085 text-box responses to the survey. The points are positioned according to the frequencies of every LAT category in each response. This means that each data point initially has over a hundred dimensions in which it could be plotted, but, of course, there’s no way to present this virtual space and make sense of it. Instead, I’ve reduced the space to two dimensions using the statistical technique of Principal Component Analysis, which produces two new composite axes (each made up of a combination of the original 100+ axes) that represent the greatest spread of the data: the most variation in values along one axis, and then along the axis orthogonal to the first. These are the principal components.

All 1085 text-box responses plotted along the first two principal components.
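For readers who want to reproduce this kind of reduction, here is a minimal sketch using scikit-learn, assuming the Ubiqu+Ity output has been saved as a CSV with one row per response and one numeric column per LAT frequency; the filename lat_frequencies.csv is an assumption, not the file used here.

import pandas as pd
from sklearn.decomposition import PCA

# Assumed input: one row per survey response, one column per LAT frequency.
lats = pd.read_csv('lat_frequencies.csv')

pca = PCA(n_components=2)
coords = pca.fit_transform(lats.values)    # each response projected onto two composite axes
print(pca.explained_variance_ratio_)       # share of the variation captured by each axis

# Which LATs 'pull' hardest on the first principal component (the x-axis)?
loadings = pd.Series(pca.components_[0], index=lats.columns)
print(loadings.sort_values(ascending=False).head(10))

The loadings printed at the end are one way of seeing which categories exert the greatest ‘pull’ on the x-axis, which is the kind of examination described below.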

By examining the LAT categories that have the greatest ‘pull’ on the data, we can observe that, along the x-axis, the presence of the Contingency language category seems to explain the most difference between responses. Points positioned to the left of zero on the x-axis correspond to answers that have virtually no words or phrases tagged as Contingency, while most of the points to the right, especially in the blue cluster, have a relatively high frequency of Contingency-tagged terms. Below are representative responses for different areas of the graph, and you can look at the other LAT categories that contribute to the variation between responses here. For the rest of this brief analysis, though, I want to consider how contingency operates in the survey answers. This result raises a new question: What’s at stake in contingently conceiving the contours of one’s private life? The responses on the right side of the graph indicate participants imaginatively delimiting what ought to be private.

Representative responses from different areas of the PCA plot.

The Contingency tag often refers to responses written in the subjunctive mood, which, as the Oxford English Dictionary reminds us, is a grammatical form used to “denote an action or a state as conceived (and not as a fact) and therefore used to express a wish, command, exhortation, or a contingent, hypothetical, or prospective event.” (I like this definition because it acknowledges the use of the subjunctive for both prospective situations and commands, which gestures to the angry frustration exhibited in many of these responses.) Psychologist Jerome Bruner observed, “To be in the subjunctive mode is to be trafficking in human possibilities rather than in settled certainties.”2 Contingency in the responses tends to correlate with Abstract Concepts and Reporting Events, creating a profile that is similar to the tagged example included above. Here are other representative answers:

  • “I feel like this information would be used against me if I got into an accident. For example, the insurance agency will look at the data and say that I was speeding right before the accident and claim that the accident was my fault. I also feel as though law enforcement would attempt to use this information in the event of a criminal case.”
  • “If they just tracked employee attendance and performance for statistical purposes I would be fine with it. But if they used that footage to punish employees that show up late/goof off then I would not be okay with it.”

What is interesting about using the presence of contingent language to separate responses is that taking a suppositional stance turns out not to be a very popular perspective, despite the fact that almost all of the questions are phrased in the subjunctive mood. Sixty-eight percent of the respondents did not use any contingent language according to the Docuscope tagger. In the survey, the more common stance is a flat-out denial of any demands upon one’s privacy (though this seems unlikely to be as common in real-world decisions). Contingent answers, by contrast, exhibit a willingness to play along. The Pew Research Center administered this survey digitally, with participants recording their answers at computer terminals. Indeed, at times responses even come to suspect the survey itself of being a privacy violation:

“I do not want to give any answer to this.”

This arrangement is fitting since the method of the survey conveniently echoes the typical media interface in which commercial demands are made upon privacy – the screen. Media historian Lisa Gitelman links the plane of the screen to what Walter Benjamin calls the ‘dictatorial perpendicular’ of modern reading.3 Locating the screen in its natural post-capitalist habitat, Gitelman explains that “the office walls contain cubicles, the cubicles contain screens, the screens contain windows, and the windows contain page images. These vertical surfaces nest within each other, interfacing like a sequence of Russian dolls, waiting to funnel attention toward documents as if their very perpendicular sequence could ward off distraction” (130).

Even outside of the workplace, screens prescribe conditions of attention and use that proliferate the number and kinds of spaces in which private content might reside. Contingent responses to the Pew Survey index users grappling with the circumstances of new and vulnerable private spaces.

One response to a question regarding access to one’s phone camera and location:

“Because it is not a good idea. By giving anyone access to our camera and your location would mean they could activate your camera anytime without your permission. This means they could spy on you any time they wanted to without your knowledge. To me, this would be a huge invasion of privacy.”

In the emergent environment of digital media, privacy cannot properly be delimited to a physical space (although the vehement rejections of any type of home monitoring in the survey responses indicate that this circumstance has not been accepted calmly). Being connected creates a paradoxical condition for privacy. As media theorist Wendy Chun suggests, “Internet users are curiously inside out — they are framed as private subjects exposed in public.”4 Digital media is governed by a “logic of containment, which is always imagined as already transgressed” (13). The word imagined is key here. Private material, data, information is distributed, networked, and re-sold to the point that we can really only imagine the borders of our private lives. The concept of privacy might then be more accurately located in the context of performance, play, heuristic, or supposition, rather than in Arendt’s four walls. It is fluid and demands to be continually re-defined, re-constituted, and re-imagined.

With significant planning assistance from Gretta Tritch Roman and potluck contributions from Maria Cecire, Adhaar Desai, Miriam Felton-Dansky, and Natalie Desrosiers.

References

1. Hannah Arendt, The Human Condition, 2nd ed. (Chicago: University of Chicago Press, 1958), 71.

2. Jerome Bruner, Actual Minds, Possible Worlds (Cambridge, MA: Harvard University Press, 1987), 26.

3. Lisa Gitelman, Paper Knowledge: Toward a Media History of Documents (Durham, NC: Duke University Press, 2014), 129.

4. Wendy Hui Kyong Chun, Updating to Remain the Same: Habitual New Media (Cambridge, MA: MIT Press, 2016), 12.
