Graphical tools for text analysis

Further coverage appears in the latest issue of Etapes and in the upcoming Data Flow 2.

Despite the considerable effort authors put into arranging the written word, the structure produced is inevitably a long line. The structure contained within this line is usually revealed only by a table of contents and an index. The contents point to subsections of the line that may share a subject, and the index points to the positions of examples or similar topics within the whole line. This project explores the relationship between the two and the structural forms within the text. We are investigating graphical tools that use dictionaries of the words contained within documents, together with the spatial position of those words, to bring text documents out of the line and into an interactive format. The first product of this work is the...


(En)tangled Word Bank


Charles Darwin’s The Origin of Species contains many illustrious descriptive phrases (such as ‘Tangled Bank’) that enabled powerful communication of his concept of ecological and evolutionary systems. These slogans did not always exist in their best-known form, nor did they all exist from the beginning. In the case of ‘Tangled Bank’ – a famous description of ecological communities that appears in the fifth and sixth editions (1869, 1872) – Darwin chose to shorten the phrase from the “Entangled Bank” that occurred in the first to fourth editions (1859, 1860, 1861, 1866). Across the six editions of The Origin of Species produced within Darwin’s lifetime, a variety of changes were made as the evolutionary ideas were framed within a changing scientific and cultural environment. We have visualised these changes in the ‘(En)tangled Word Bank’.

We bring the books to life, as the textual code with all its insertions and deletions is brought out of line and into a ‘literary organism’ (see ‘Writing Without Words’). The literary organism has a structure defined by the division of words amongst the hierarchy of structures that make up the book. Words form sentences, which are collected in paragraphs, which in turn form the larger structures of subchapters and chapters. Structural differences emerge from changes in the amount of text devoted to each component and in the number of those components. The textual code defines this macroscopic structure, so structural changes occurred as that code was altered, producing a unique literary organism for each edition.
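
The hierarchy described above can be sketched as a small nested data structure. This is only an illustration of the idea, not the code behind the exhibition: the class names, the structure_summary helper and the naive whitespace word count are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Paragraph:
    sentences: List[str] = field(default_factory=list)

    def word_count(self) -> int:
        # Assumption: words are split on whitespace, which is cruder
        # than any real tokenisation of the text would be.
        return sum(len(s.split()) for s in self.sentences)


@dataclass
class Subchapter:
    title: str
    paragraphs: List[Paragraph] = field(default_factory=list)

    def word_count(self) -> int:
        return sum(p.word_count() for p in self.paragraphs)


@dataclass
class Chapter:
    title: str
    subchapters: List[Subchapter] = field(default_factory=list)

    def word_count(self) -> int:
        return sum(s.word_count() for s in self.subchapters)


def structure_summary(chapters: List[Chapter]) -> Dict[str, int]:
    """Count the components of one edition and its total size in words."""
    subchapters = [s for c in chapters for s in c.subchapters]
    paragraphs = [p for s in subchapters for p in s.paragraphs]
    return {
        "chapters": len(chapters),
        "subchapters": len(subchapters),
        "paragraphs": len(paragraphs),
        "sentences": sum(len(p.sentences) for p in paragraphs),
        "words": sum(c.word_count() for c in chapters),
    }
```

Comparing the summaries of two editions built this way would show the two kinds of structural difference mentioned above: a change in the number of components and a change in the amount of text each one holds.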

We follow these temporal variations in the form of The Origin of Species from 1859 to 1872, producing specimen plates to present these previously cryptic organisms. This presentation is akin to the work of the earliest evolutionary biologists: John Henslow – a crucial teacher of Darwin in his Cambridge days – and Ernst Haeckel, a vital supporter of Darwin’s work. As in a botanical collection, the differences between the variations of a ‘type’ are illustrated by dissection, arrangement and exploration of important structures: the whole organism, key generic structures (Sentences, Paragraphs, Subchapters, Chapters) and a comparative focus on corresponding branches (first and last chapters).

By ageing each structural component we can understand its temporal origins (see legend), and this codification shows the literary organism responding to the scientific, philosophical and cultural environmental change that it itself engineered. The sentences forming the ‘leaflets’ of the organism are given orange, senescent tones when they will be deleted in the following edition. Green, growth tones are applied to those sentences that live on in the following edition. The tone of each colour is determined by the sentence’s age, in editions, to that point. Through these differences in colouration, the simple structure of the organism’s early life can be seen developing into a complex form, showing when structures developed in response to its changing environment. Around the organisms the textual code is provided, showing the changes in the size of the organism and where in that code the senescence and growth originate. A series of re-arrangements of the organism focuses on changes at each level of organisation.
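
The colour coding described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used to produce the plates: the function name colour_sentences is hypothetical, sentences are matched between editions by exact string equality (a real collation of the editions would need something more forgiving), and sentences in the final edition are treated as green because no later edition exists to delete them.

```python
def colour_sentences(editions):
    """Tag every sentence in every edition with a hue and an age.

    editions: list of editions, oldest first; each edition is a list of
    sentence strings.
    Returns one list per edition of (sentence, hue, age) tuples, where
    hue is "orange" if the sentence is absent from the following edition
    and "green" otherwise, and age is the number of consecutive editions,
    up to and including the current one, in which the sentence appears.
    """
    tagged = []
    previous_ages = {}
    for i, sentences in enumerate(editions):
        # None marks the final edition, which has no successor.
        following = set(editions[i + 1]) if i + 1 < len(editions) else None
        # Assumption: identity between editions is exact string equality.
        ages = {s: previous_ages.get(s, 0) + 1 for s in sentences}
        plate = []
        for s in sentences:
            # Assumption: final-edition sentences count as growth.
            survives = following is None or s in following
            plate.append((s, "green" if survives else "orange", ages[s]))
        tagged.append(plate)
        previous_ages = ages
    return tagged


# Toy placeholder sentences, not text from the Origin.
toy_editions = [["a", "b"], ["a", "c"], ["a"]]
plates = colour_sentences(toy_editions)
```

In the toy data, sentence "a" is green in all three editions with ages 1, 2 and 3, while "b" and "c" are each orange with age 1 in the only edition that contains them. The age would then select the tone within the green or orange family.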

The (En)tangled Word Bank uses visual tools – a mode of communication close to Darwin’s heart – to engage new audiences and provide new insight into the ‘Origin’. Visual communication allows an intuitive understanding of its development. Interestingly, like Darwin’s idea, this work was not uniquely conceived. There is a ‘Wallace’ who, unbeknownst to us, was working on exactly the same thing. A fortunate meeting with Ben Fry (Phyllotaxis Lab) in this anniversary year revealed that both projects were working on visual analysis of The Origin of Species across these six editions, but with different ultimate aims. We are very grateful for the use of the data and for the support of Microsoft Research, Cambridge (turning ideas into reality).

Come see the (En)tangled Word Bank!

3rd - 20th July 2009

Cambridge University Centre, Main Dining Hall, Granta Place
Admission: free

This is a collaboration with STEFANIE POSAVEC.


Stefanie Posavec is a book cover designer with Penguin Books who works at the interface of design, art and information communication. Stefanie’s previous text visualisation work on Jack Kerouac’s On the Road has been shown in multiple exhibitions, and her new work, such as visualising Kraftwerk’s ‘Computer Love’ album, continues to reveal previously unseen beauty and to inspire new understanding and novel directions of enquiry in creative data sets. Born in Loveland, Colorado, Stefanie holds an MA in Communication Design from Central Saint Martins College of Art & Design.


Coverage of the work on the 'Origins' Science blog.


See pictures from Darwin 2009 festival, Cambridge

... and the stage banners

Some pictures taken by Steve McInerny of the "(En)tangled Word Bank" at the Darwin Festival.

Some pictures of the laser-etched "paragraph plates" we did for the Royal Academy summer show (we didn't get in, mind!). These have "posavec" diagrams for each paragraph in the first and sixth editions. If a paragraph survives from the first to the sixth edition, it has a square around it. This shows how much was 'retouched' in the 13 years between these editions. It was a prototype for a "Darwin data cube", which was too expensive to make.

Darwin Festival

Darwin Online

See Ben Fry's high-quality work on a very similar topic