Reading from a Distance – Data as Text
An extended survey of digital initiatives in arts and humanities practices in India was undertaken during the last year. Provocatively called 'mapping digital humanities in India', this enquiry began with the term 'digital humanities' itself, as a 'found' name for which one needs to excavate some meaning, context, and location in India at the present moment. Instead of importing this term to describe practices taking place in this country - especially when the term itself is relatively unstable and undefined even in the Anglo-American context - what I chose to do was to take a few steps back, and outline a few questions/conflicts that the digital practitioners in arts and humanities disciplines are grappling with. The final report of this study will be published serially. This is the third among seven sections.
03. Reading from a Distance – Data as Text
The concepts of text and textuality have been central to the discourse on language and culture, and therefore by extension to most of the humanities disciplines, which are often referred to as text-based disciplines. The advent of new digital and multimedia technologies and the internet has brought about definitive changes in the ways in which we see and interpret texts today, particularly as manifested in the new practices of reading and writing facilitated by the tools and dynamic interfaces now available in the digital age. The ‘text’ as an object of enquiry is also central to much of the discussion and literature on DH, given that many scholars, particularly in the West, trace its antecedents to practices of textual criticism and scholarship that stem from efforts in humanities computing. Everything from the early attempts in character and text encoding to new forms and methods of digital literary curation, either on large online archives or in the form of social media such as Storify or Scoop-it, has been part of the development of this discourse on the text. Significant among these is the emergence of processes such as text analysis, data mining, distant reading, and not-reading, all of which essentially refer to a process of reading by recognising patterns over a large corpus of texts, often with the help of a clustering algorithm. The implications of this for literary scholarship are manifold, with many scholars seeing this as a point of ‘crisis’ for the traditional practices of reading and meaning-making such as close reading, or as an attempt to introduce objectivity and a certain quantitative aspect, often construed as a form of scientism, into what is essentially a domain of interpretation (Wieseltier 2013). But advocates of the process see the use of these tools as enabling newer forms of literary scholarship by enhancing the ability to work with and across a wide range and number of texts.
The simultaneous emergence of new kinds of digital objects, and a plethora of them, and the supposed obscuring of traditional methods in the process is perhaps the immediate source of this perceived discomfort. There are different perspectives on the nature of changes this has led to in understanding a concept that is elementary to the humanities. Apart from the fact that digitisation makes a large corpus of texts now accessible, subject to certain conditions of access of course, it also makes texts 'massively addressable at different levels of scale' as suggested by Michael Witmore (Witmore 2012: 324-327, emphasis as in the original). According to him: "[A]ddressable here means that one can query a position within the text at a certain level of abstraction" (Ibid. 325). This could be at the level of character, words, lines etc that may then be related to other texts at the same level of abstraction. The idea that the text itself is an aggregation of such ‘computational objects’ is new, but as Witmore points out in his essay, it is the nature of this computational object that requires further explanation. In fact, as he concludes in the essay, "textuality is addressability and further ... this is a condition, rather than a technology, action or event" (Ibid. 326). What this points towards is the rather flexible and somewhat ephemeral nature of the text itself, particularly the digital text, and the need to move out of a notion of textuality which has been shaped so far by the conventions of book culture, which look to ideal manifestations in provisional unities such as the book (Ibid. 327).
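Witmore's idea of querying "a position within the text at a certain level of abstraction" can be made concrete with a small sketch. The function name `address` and the three levels it supports are assumptions made for illustration and are not taken from Witmore's essay; the sample text is the opening of a well-known Tagore poem in English, split into two lines for the example.

```python
import re


def address(text: str, level: str, index: int) -> str:
    """Return the unit at position `index` for the chosen level of abstraction."""
    if level == "character":
        units = list(text)
    elif level == "word":
        units = re.findall(r"\w+", text)
    elif level == "line":
        units = text.splitlines()
    else:
        raise ValueError(f"unknown level: {level}")
    return units[index]


poem = "Where the mind is without fear\nand the head is held high"

# The same text answers different queries depending on the level chosen.
print(address(poem, "character", 0))
print(address(poem, "word", 2))
print(address(poem, "line", 1))
```

The point of the sketch is that the text is not stored differently for each query; it is the level of abstraction chosen by the reader that decides what counts as a unit.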
Of Texts and Hypertextuality
An example much closer to home of such new forms of textual criticism is that of 'Bichitra', an online variorum of Rabindranath Tagore’s works developed by the School of Cultural Texts and Records at Jadavpur University. The traditional variorum in itself is a work of textual criticism, where all the editions of the work of an author are collated as a corpus to trace the changes and revisions made over a period of time. The Tagore variorum, while making available an exhaustive resource on the author’s work, also offers a collation tool that helps trace such variations across different editions of works, with much less effort than manually reading through these texts would require. Like paper variorum editions, this online archive too allows for the study of a wider number and diversity of texts on a single author through cross-referencing and collation. Prof. Sukanta Chaudhuri, Professor Emeritus, Department of English and School of Cultural Texts and Records at Jadavpur University, Kolkata, has been part of the process of setting up this variorum. According to him, the most novel aspects of this platform, or as he calls it, 'integrated knowledge site', have to do with these functions of cross-referencing and integration. The bibliography is a hyperlinked structure, which connects to all the different digital versions of a particular text (the most being 20 versions of a single poem). The notion of a bibliography has always evoked hypertextuality, the possibility of linking and cross-referencing texts, but with the advent of the digital this possibility has been fully realized, as seen in the case of the hypertext. For collation, the project team developed a dedicated software tool titled 'Prabhed' (meaning difference in Bengali) that helps to assemble text at three levels: (a) chapter in novel, act/scene in drama, canto in poem; (b) para in novel or other prose, speech in drama, stanza in poem; (c) individual words.
For instance, you can choose a particular section of a book, poem or play - and compare its occurrences across different editions and versions of the work to note their matches and differences. If two paragraphs have been removed from one chapter, and put into another, that can be traced through the collation software. If a particular word has been omitted in a later edition, or if certain lines have been rearranged in a poem, these changes can be tracked. What makes the search engine 'integrated' is not simply that it can search all Tagore's works in one go, but that it links up with the bibliography and thereby with the actual text of the works. It is interesting to note here the different changes that the text undergoes to become available for study on a digital platform, where it is amenable to intense searching and querying of this kind. It is now possible to search across a large corpus of texts, for minute changes in words or sentences, and ask questions of these in terms of their usage, instances and contexts of their occurrence, thus facilitating a kind of enquiry previously never undertaken in textual studies.
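The kind of comparison described above can be illustrated with a minimal sketch. This is not the Prabhed software, whose internals are not described here; it uses Python's standard `difflib` module as a stand-in, and the function names, sample sentences and chapter labels are all invented for the example.

```python
import difflib


def collate_words(version_a: str, version_b: str):
    """List the word-level differences between two versions of a passage."""
    a, b = version_a.split(), version_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete" or "insert"
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


def find_moved_paragraphs(chapters_a: dict, chapters_b: dict):
    """Report paragraphs that sit in one chapter in version A and in another in version B."""
    location_b = {para: name for name, paras in chapters_b.items() for para in paras}
    moved = []
    for name, paras in chapters_a.items():
        for para in paras:
            new_home = location_b.get(para)
            if new_home is not None and new_home != name:
                moved.append((para, name, new_home))
    return moved


# Invented sample data: two versions of one sentence, and two chapter layouts.
word_changes = collate_words("the quick brown fox jumps", "the brown fox leaps")
moved = find_moved_paragraphs(
    {"chapter 1": ["para one", "para two"], "chapter 2": ["para three"]},
    {"chapter 1": ["para one"], "chapter 2": ["para two", "para three"]},
)
print(word_changes)
print(moved)
```

A real collation tool would also have to cope with paragraphs that were revised while being moved; this sketch only recognises paragraphs that are carried over unchanged.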
The project, however, is not without its challenges, as Prof. Chaudhuri further outlines. Working with Indic scripts is a persistent problem for digital initiatives in India. In Bengali some work has been done in the form of a scientifically designed keyboard software called Avro, which stores all the conjunct letters preserving their separate characteristics. Developing Optical Character Recognition (OCR) for scanned material in Indian languages remains a crucial issue for most digitization and archival initiatives in India. A second issue is that of vowel markers appearing before the consonants, even though phonetically they follow them and are keyed in afterwards. To get the font and keyboard software to recognize this is a big challenge. The third challenge, especially in the case of works printed from the nineteenth century to the middle of the twentieth century, is that there are vast differences in spelling; the same word can be spelt in different ways, and as there is no lexicon, one cannot do any kind of general search. There is also the issue of a high degree of inflection in the language. A word may have a suffix (or vibhakti) attached to it to indicate the case: one for the subject of the sentence, another for the object, another for the possessive case and so on. These are multiplied by the different forms of the verbs. The development of a lexicon in Bengali would be one of the ways to resolve many of these issues. However, as most people only see and interact with the digital interface of Bichitra, and do not really understand the process behind it or the amount of work involved in making the platform work the way it does, funding for research and development, maintenance and sustainability is difficult to obtain. Backroom file management, which includes both paper and digital files, remains a big but largely invisible task on such a platform.
The total number of files generated from Bichitra is tens of millions or hundreds of millions, and many of these are offline files which would not even go on to the website. Hence while uploading the files, the basic groundwork for a retrieval system for different files serving different functions had already been laid, including the creation of a bibliography, which was a huge exercise in itself. The process of making text available as hypertext is labor that is invisibilized, and is rarely or never available to the end user.
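Two of the script-related issues described above, the vowel marker that is displayed before a consonant but stored after it, and variant spellings that defeat a general search, can be shown in a short sketch. It relies only on Python's standard `unicodedata` module. The one-entry `VARIANTS` table is a hypothetical stand-in for the kind of lexicon the passage says does not yet exist, and the word pair in it (an older and a current spelling of the word for 'bird') is offered as an illustrative assumption, not as data from Bichitra.

```python
import unicodedata


def code_point_names(text: str):
    """Name each stored code point, in storage order rather than display order."""
    return [unicodedata.name(ch) for ch in text]


# The syllable "ki": on screen the vowel sign is drawn to the left of the
# consonant, but in storage the consonant comes first.
KI = "\u0995\u09bf"

# Hypothetical mini-lexicon: variant spelling -> headword used for searching.
OLDER = "\u09aa\u09be\u0996\u09c0"    # spelling with the long vowel sign
CURRENT = "\u09aa\u09be\u0996\u09bf"  # spelling with the short vowel sign
VARIANTS = {OLDER: CURRENT}


def headword(word: str) -> str:
    """Map a word to its headword, leaving unknown words unchanged."""
    word = unicodedata.normalize("NFC", word)
    return VARIANTS.get(word, word)


def search(corpus_words, query):
    """Return every word in the corpus that shares a headword with the query."""
    target = headword(query)
    return [w for w in corpus_words if headword(w) == target]


print(code_point_names(KI))
print(search([OLDER, CURRENT, KI], CURRENT))
```

Handling inflection would need a further step of the same kind, mapping each inflected form to its stem, which is again the work a lexicon would make possible.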
Prof. Chaudhuri also speaks of ways in which the notion of textuality has been rendered differently through the use of the internet and digital technologies. Digital or electronic text has helped theorize better the notion of a fluid text - the fact that a text is never complete, but only bound between the covers of a book at a given point in several processes that are technological as well as social. The notion of the text itself as an object of enquiry has undergone significant change in the last several decades. Various disciplines have long engaged with the text - as a concept, method or discursive space - and its changing definitions have added dimensions to ways of doing the humanities. With every turn in literary and cultural criticism in particular, the primacy of the written word as text has been challenged, and what was understood as ‘textual’ in a very narrow sense has expanded to include the visual and other kinds of objects. The digital object presents a new kind of text that is difficult to grasp - the neat segregations of form, content and process seem to blur here, and there is a need to unravel these layers to understand its textuality. As Dr. Madhuja Mukherjee of the Department of Film Studies at Jadavpur University points out, with the opening up of the digital field there are more possibilities to record, upload and circulate, as a result of which the very object of study has changed; the text as an object has therefore become very unstable, more so than it already is. Film is an example, where often DVDs of old films no longer exist, so one approaches the 'text' through other objects such as posters or found footage. Such texts, now also available through several online archives, offer possibilities of building layers of meaning through annotations and referencing.
Another example she cites is that of the Indian Memory Project, where objects such as family photographs become available for study as texts for historiography or ethnographic work. She points out that this is not a new phenomenon, as the disciplines of literary and cultural studies, critical theory and history have explored and provided a base for these questions, but there is definitely a newfound interest now due to the increasing prevalence of digital methods and spaces.
Shaina Anand, artist and filmmaker, further espouses this thought when she talks about the new possibilities of textual analysis of film that have opened up, particularly in terms of temporal control, first with the DVD, then the internet and now with online archival platforms like Indiancine.ma and the Public Access Digital Media Archive, or Pad.ma. The first is an online archive of Indian film from the pre-copyright era (so effectively before 1955), while the second is an archive of found and archival footage, images, sound clips and unfinished films. Both platforms allow the user to search through an array of material, view or listen to it, and download or embed it as links. They make available to users not just an online database for storage and retrieval but also a space to work with a range of materials in multiple video and audio formats and themes through annotations and referencing. The annotation tool is perhaps the most innovative aspect of these platforms, wherein a user can pause, isolate a section of a sequence and annotate it using a range of options and filters. The annotations are textual, in the form of comments, commentary and marginalia (in the case of Pad.ma) and can also link to other paraphernalia around the film object, such as posters, images, advertisements and other literature. Users can also contextualize material by adding transcripts, descriptions, events, keywords, and even locating the events in the video on a map. These have brought to the fore several questions on relevance, accessibility and ownership, as in the case of raw footage from films, and opened up possibilities for such materials to be re-contextualized by the reader in different ways. This layering of annotations around the film object also creates a new research object, or text, that then necessitates new methods of studying it as well.
As opposed to the earlier practice of the researcher/critic having to watch the film first and then comment on or analyse it, relying on memory to generate the scholarship, it is now possible to pause, analyse or read, and come back to the film and annotate the text in several ways. What this does to the film text is a question that comes up quite prominently here, given that it is the process of documenting the form that is new, not cinema as a form itself. The computational aspect is also important here, given the vast amount of footage that is now available, which then requires better lexical indexing to compute and manage large data sets. This has been a constant endeavour with Pad.ma and Indiancine.ma as well.
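The layering of time-based annotations around a film object can be pictured as a simple data structure. The sketch below is a hypothetical model written for this discussion; the class names, field names and layer labels are assumptions and do not describe how Pad.ma or Indiancine.ma actually store their annotations.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: float  # seconds from the beginning of the clip
    end: float    # seconds from the beginning of the clip
    layer: str    # e.g. "transcript", "description", "keyword"
    text: str


@dataclass
class AnnotatedClip:
    title: str
    annotations: list = field(default_factory=list)

    def annotate(self, start: float, end: float, layer: str, text: str) -> None:
        """Attach a note to the isolated section between `start` and `end`."""
        if end < start:
            raise ValueError("an annotation cannot end before it starts")
        self.annotations.append(Annotation(start, end, layer, text))

    def at(self, seconds: float, layer: str = None):
        """Return the annotations covering a given moment, optionally for one layer."""
        return [
            a for a in self.annotations
            if a.start <= seconds <= a.end and (layer is None or a.layer == layer)
        ]


# Invented sample data for illustration.
clip = AnnotatedClip("untitled footage")
clip.annotate(0, 30, "description", "street scene, early morning")
clip.annotate(10, 20, "transcript", "speaker introduces the neighbourhood")
clip.annotate(45, 60, "keyword", "market")

print([a.text for a in clip.at(15)])
print([a.text for a in clip.at(15, layer="transcript")])
```

Pausing at a given second and asking what has been written about that moment is then a query over the annotation layers, which is one way of seeing how the annotated film becomes a different research object from the film alone.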
As in the case of film, what becomes prominent here is the move to a digital text of some sort. One such example of a digital text perhaps is the hypertext. George Landow, in his book on hypertext, draws upon both Barthes's and Foucault’s conceptualisation of textuality in terms of nodes, links, networks, web and path, which has been posited as the 'ideal text' by Barthes (Landow 2006: 2). Landow’s analysis emphasises the multilinearity of the text, in terms of its lack of a centre, and therefore the reader being able to organise the text according to their own organising principle - possibilities that hypertext now offers which the printed book could not. While hypertext illustrates the possibilities of multilinearity of a text that can be realised in the digital, it may still be linear in terms of embodying certain ideological notions which shape its ultimate form. Hypertext, while in a pragmatic sense being the text of the digital, is still at the end of a process of signification or meaning-making, often defined within the parameters set by print culture. As such, it is only the narrative, and not the form itself, that is multi-linear in hypertext fiction.
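The vocabulary of nodes, links and paths lends itself to a small illustration of multilinearity. In the sketch below, a hypertext is modelled as a dictionary mapping each text chunk to the chunks it links to; the function name `reading_paths` and the chunk names are invented for the example.

```python
def reading_paths(links: dict, start: str, end: str, path=None):
    """Enumerate every route from `start` to `end` that visits no chunk twice."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for nxt in links.get(start, []):
        if nxt not in path:
            paths.extend(reading_paths(links, nxt, end, path))
    return paths


# Invented hypertext: each chunk lists the chunks it links to.
links = {
    "opening": ["letter", "memory"],
    "letter": ["ending"],
    "memory": ["letter", "ending"],
}

paths = reading_paths(links, "opening", "ending")
for p in paths:
    print(" -> ".join(p))
```

The reader chooses among the paths, but the set of chunks and links is fixed in advance by whoever built the structure, which is one way of reading the claim that it is the narrative rather than the form that is multi-linear.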
Textual Criticism in the Digital
But to return to what has been one of the fundamental notions of textual criticism, the 'text' is manifested through practices of reading and writing (Barthes 1977). So what have been the implications of digital technologies for these processes, which have now become technologised, and by extension for our understanding of the text? While processes such as distant reading and not-reading demonstrate precisely the variability of meaning-making processes and the fluid nature of textuality, they also seem to question the premise of the method and form of criticism itself. Franco Moretti, in his book Graphs, Maps, Trees, talks about the possibilities accorded by clustering algorithms and pattern recognition as a means to wade through corpora, thus attempting to create what he calls an 'abstract model of literary history' (Moretti 2005: 1). He describes this approach as "within the old territory of literary history, a new object of study." He further says, "Distant reading, I have once called this type of approach, where distance is however not an obstacle, but a specific kind of knowledge: fewer elements, hence a sharper sense of their overall interconnection. Shapes, relations, structures. Forms. Models" (Moretti 2005: 1, emphasis as in original). The emphasis for Moretti therefore is on the method of reading or meaning-making. There seem to be two questions that emerge from this perceived shift - one is the availability of the data and tools that can 'facilitate' this kind of reading, and the second is a change in the nature of the object of enquiry itself, so much so that close reading or textual analysis is no longer engaging or adequate and calls for other methods of reading.
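To give a sense of what reading by pattern recognition over a corpus involves, here is a minimal sketch of one very simple clustering procedure: each text is reduced to word counts, texts are compared by cosine similarity, and texts that are similar enough are grouped together. This is not Moretti's method or any particular DH tool; the stopword list, the similarity threshold of 0.5 and the four sample 'texts' are all assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "of", "in", "at", "with", "over"}


def vectorize(text: str) -> Counter:
    """Reduce a text to counts of its words, ignoring very common words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def cluster(texts: dict, threshold: float = 0.5):
    """Single-link grouping: a text joins every cluster holding a similar enough member."""
    vectors = {name: vectorize(t) for name, t in texts.items()}
    clusters = []
    for name in vectors:
        linked = [
            c for c in clusters
            if any(cosine(vectors[name], vectors[m]) >= threshold for m in c)
        ]
        merged = {name}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


# Invented miniature corpus.
corpus = {
    "a": "the ship sailed the sea and the storm rose over the sea",
    "b": "a storm at sea broke the ship",
    "c": "the garden bloomed with roses in spring",
    "d": "roses and lilies filled the spring garden",
}

groups = sorted(sorted(c) for c in cluster(corpus))
print(groups)
```

Nothing in the procedure 'reads' any text in the close-reading sense: the grouping emerges entirely from shared vocabulary, and choices such as the stopword list and the threshold are interpretive decisions that shape what pattern appears.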
As is apparent in the development of new kinds of tools and resources to facilitate reading, there is a problem of abundance that follows once the problem of access has been addressed to some extent. Clustering algorithms have been used to generate and process data in different contexts, apart from their usage in statistical data analysis. The role of data is pertinent here, and particularly that of big data. But the understanding of big data is still shrouded within the conventions of computational practice, so much so that its social aspects are only slowly being explored now, particularly in the context of reading practices. Understanding big data not just in terms of volume but also in terms of other aspects such as velocity, scope, and granularity, among others, significantly increases the ambit of what the term covers, with implications for new epistemologies and modes of research (Kitchin 2014). But if one were to treat data as text, as is an eventual possibility with literary criticism that uses computational methods, what becomes of the critical ability to decode the text? And does this further change the nature of the text itself as a discursive object, and the practice of reading and textual criticism as a result? Reading data as text then also presupposes a different kind of reader, one that is no longer the human subject. This would be a significant move in understanding how the processes of textuality also change to address new modes of content generation, and how much the contours of such textuality reflect the changes in the discursive practices that construct it. Most of the debate, however, has been framed within a narrative of loss - of criticality and of a particular method of making meaning of the world. Close reading as a method too came with its own set of problems - which can be seen as part of a larger critique of the Formalists and later New Criticism, specifically in terms of its focus on the text.
As such, this further contributes to canonising a certain kind of text and thereby a certain form of cultural and literary production (Wilkens 2012). Distant reading as a method, though also seen as an attempt to address this problem by working with corpora as opposed to select texts, still poses the same issues in terms of its approach, particularly as the text still serves as the primary and authoritative object of study. The emphasis therefore comes back to reading as a critical and discursive practice. The objects and tools are new; the skills to use them need to be developed. However, as much of the literature and processes demonstrate, the critical skills essentially remain the same, but now function at a meta-level of abstraction. Kathleen Fitzpatrick, in her book on the rise of electronic publishing and planned technological obsolescence, dwells on the manner in which much of our reading practice is still located in print or specifically book culture; the conflict arises with the shift to a digital process and interface, in terms of trying to replicate the experience of reading on paper (Fitzpatrick 2011). Add to this the problem of abundance of data, and processes like curation, annotation, referencing, visualisation, abstraction etc. acquire increased valence as methods of creatively reading or making meaning of content (Ibid.). More importantly, it also points towards a change and diversity in the disciplinary method. Where close reading was once the only method by which a text became completely accessible to the reader, it is now possible to approach it through a set of processes, thus urging us to rethink the method of enquiry itself.
Whether as object, method or practice, the notion of textuality and the practice of reading have undergone significant changes in the digital context, but whether this is a new domain of enquiry is a question we may still need to ask. Matthew G. Kirschenbaum, in his essay on re-making reading (quoted earlier in this chapter), suggests that perhaps the function of these clustering algorithms, apart from serving to supplant or reiterate what we already know, is to also ‘provoke’ new ideas or questions (Kirschenbaum XXXX: 3). The conflict produced between close and distant reading, and the shift from print to digital interfaces, would therefore emerge as a space for new questions around the given notion of text and textuality. But if one were to extend that thought, it may be pertinent to ask if DH can now provide us with a vibrant field that will help produce a better and more nuanced understanding of the notion of the text itself as an object of enquiry. This would require one to work with and in some sense against the body of meaning already generated around the text, but in essence the very conflict may be where the epistemological questions about the field are located. The digital text, owing to the possibilities of ‘massive addressability’ mentioned earlier, is now more fluid and socialized. The renewed focus on the textual is most apparent in this manner of imagining the text, using the metaphor of a highly interlinked, networked and shared text. It also puts forth important questions about how we understand technology in a certain way, especially in the context of language and representation as an important factor in understanding new textual objects. Is technology a tool for textual analysis, or is it inherent to our understanding of the nature of the text? Is the development of these methods of enquiry shaped by certain disciplinary requirements, and do they also challenge or create new conflicts for traditional methods of enquiry?
The growth in the study of different media objects, such as video and cinema, and the advent of areas such as media studies, oral history, and media archaeologies have further prompted concerns regarding the study of the digital object in these disciplines, and a rethinking of how we understand the notion of the text.
 "The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation." See: http://www.tei-c.org/.
 See: https://storify.com/
 See: http://www.scoop.it/
 For more on text mining see Lisa Guernsey in 'Digging for Nuggets of Wisdom,' in The New York Times, October 16, 2003 http://www.nytimes.com/2003/10/16/technology/circuits/16mine.html?pagewanted=print. For more on data mining, distant reading, and the changing nature of reading practices see Matthew Kirschenbaum in 'The Remaking of Reading,' http://www.csee.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf.
 See: http://bichitra.jdvu.ac.in/.
 Interview with author, July 30, 2015.
 A term coined by Theodor H. Nelson, which he describes as "a series of text chunks connected by links which offer the reader different pathways." As quoted in George Landow, Hypertext: The Convergence of Contemporary Critical Theory and Technology, Baltimore: Johns Hopkins University Press, 1992, 2-12.
 Bichitra, 'Collation Guide,' accessed on September 17, 2015, http://bichitra.jdvu.ac.in/bichitra_collation_guide.php.
 Omicron Lab, accessed September 17, 2015. https://www.omicronlab.com/avro-keyboard.html.
 See: http://pad.ma/.
 See: http://indiancine.ma/.
 For more on these platforms see the section on DH institutions in India.
Barthes, Roland. "From Work to Text". In Image, Music, Text. London: Fontana Press, 1977.
Fitzpatrick, Kathleen. "Texts" in Planned Obsolescence: Publishing, Technology and the Future of the Academy. New York: New York University Press, 2011.
Kirschenbaum, Matthew. "The Remaking of Reading". http://www.csee.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf.
Kitchin, Rob. 'Big Data, New Epistemologies, and Paradigm Shifts,' Big Data & Society, 2014, April–June, pp. 1–12, DOI: 10.1177/2053951714528481.
Landow, George. Hypertext: The Convergence of Contemporary Critical Theory and Technology. Baltimore: Johns Hopkins University Press, 1992.
Moretti, Franco. Graphs, Maps, Trees: Abstract Models for a Literary History, Verso, 2005.
Wieseltier, Leon, 'Crimes Against Humanities,' The New Republic, September 3, 2013, http://www.newrepublic.com/article/114548/leon-wieseltier-responds-steven-pinkers-scientism.
Wilkens, Matthew. "Canons, Close Reading and the Evolution of Method". In Debates in the Digital Humanities, Ed. M.K. Gold. Minneapolis: University of Minnesota Press, 2012.
Witmore, Michael. "Text: A Massively Addressable Object". In Debates in the Digital Humanities, Ed. M.K. Gold. Minneapolis: University of Minnesota Press, 2012.