Introduction to Computer-aided Text Analysis (CATA):
Computer coding involves the automated tabulation of variables for target content that has been prepared for the computer. Typically, computer coding means having software analyze a set of text, counting key words, phrases, or other text-only markers (Content Analysis Guidebook). Computer coding relies on dictionaries, which are lists of words or phrases that text-analysis programs use to analyze content. Researchers may need to develop their own dictionaries; however, some dictionaries used in previous research are available and may be reused in future research.
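As a rough illustration of dictionary-based coding, the Python sketch below counts how often the words and phrases of each dictionary category occur in a text. The category names, the dictionary entries, and the function names are hypothetical and are not taken from any of the programs listed on this page; real CATA packages typically add stemming, wildcards, and disambiguation.

```python
import re

# Hypothetical dictionary (illustrative only): category -> words or phrases.
DICTIONARY = {
    "optimism": ["hope", "bright", "good news"],
    "blame": ["fault", "guilty", "careless"],
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def code_text(text, dictionary):
    """Count occurrences of each category's words or phrases in the text."""
    tokens = tokenize(text)
    counts = {category: 0 for category in dictionary}
    for category, entries in dictionary.items():
        for entry in entries:
            entry_tokens = entry.split()
            n = len(entry_tokens)
            # Slide a window of the entry's length over the token list.
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == entry_tokens:
                    counts[category] += 1
    return counts

sample = "Good news: there is hope. It was nobody's fault, but hope remains."
result = code_text(sample, DICTIONARY)
print(result)
```

Matching on whole tokens rather than substrings keeps "hope" from matching "hopeless"; whether that is desirable depends on how the dictionary was built.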
Quantitative CATA programs:
CATPAC reads text files and produces a variety of
outputs ranging from simple diagnostics (e.g., word and alphabetical
frequencies) to a summary of the "main ideas" in a text. It
uncovers patterns of word usage and produces such outputs as simple word
counts, cluster analysis (with icicle plots), and interactive neural
cluster analysis. A nifty add-on program called Thought View can generate
two- and three-dimensional concept maps based on the results of CATPAC
analyses (one especially neat feature of Thought View allows users to look
at the results through 3-D glasses and experience MDS-style output like
never before, in true, movie theater-style, 3-D fashion!).
This program lets you make full concordances to texts of any size,
limited only by available disk space and memory. You can also make fast concordances, picking your selection of words from text, and make
Web Concordances: turn your concordance into linked HTML files, ready for publishing on
the Web, with a single click. See the original Web Concordances for examples.
Diction 5.0 contains a series of built-in dictionaries that search text
documents for 5 main semantic features (Activity, Optimism, Certainty,
Realism and Commonality) and 35 sub-features (including tenacity, blame,
ambivalence, motion, and communication). After the user’s text is
analyzed, Diction compares the results for each of the 40 dictionary
categories to a "normal range of scores" determined by running
more than 20,000 texts through the program. Users can compare their text
to either a general normative profile of all 20,000-plus texts OR to any
of 6 specific sub-categories of texts (business, daily life,
entertainment, journalism, literature, politics, scholarship) that can be
further divided into 36 distinct types (e.g., financial reports, computer
chat lines, music lyrics, newspaper editorials, novels and short stories,
political debates, social science scholarship). In addition, Diction
outputs raw frequencies (in alphabetical order), percentages, and
standardized scores; custom dictionaries can be created for additional
DIMAP stands for DIctionary MAintenance
Programs, and its primary purpose is dictionary development. The program
includes a variety of tools for lexicon building rooted in computational
linguistics and natural language processing (Litkowski, 1992). With DIMAP,
users can build, manage, edit, maintain, search and compare custom and
established dictionaries. The program also includes a text analysis module
called MCCA (the lite version of which is described below).
General Inquirer (Internet version) (http://www.wjh.harvard.edu/~inquirer/)
This venerable, still widely used program has found new life on the World
Wide Web. The online version of the General Inquirer gets our vote for the
simplest and quickest way to do a computer text analysis: simply visit
the Internet General Inquirer site, type or paste some text into a box,
click submit, and your text will be analyzed. The Internet General
Inquirer codes and classifies text using the Harvard IV-4 dictionary,
which assesses such features as valence, Osgood’s three semantic
dimensions, language reflecting particular institutions, emotion-laden
words, cognitive orientation, and more. The program also returns
cumulative statistics (e.g., simple frequencies for words appearing in the
text) at the end of each analysis. Though we could not find any
information on a software-based version of the Inquirer, creator Philip
J. Stone holds summer seminars on the program at the University of Essex.
"The main idea of HAMLET © is to search a text file for words in a given vocabulary list, and to count joint frequencies within any specified context unit, or as collocations within a given span of words. Individual word frequencies (fi) , joint frequencies
(fij) for pairs of words (i,j), both expressed in terms of the chosen unit of context, and the corresponding standardised joint frequencies are displayed in a similarities matrix, which can be submitted to a simple cluster analysis and multi-dimensional scaling. A further option allows comparison of the results of applying multi-dimensional scaling to matrices of joint frequencies derived from a number of texts, using Procrustean Individual Differences Scaling."
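The joint-frequency idea described above can be sketched in a few lines of Python. This is an illustration only, not HAMLET's own code: the sentence is assumed to be the context unit, the function names are hypothetical, and the Jaccard coefficient f_ij / (f_i + f_j - f_ij) is assumed as the standardisation, since the description above does not say which formula the program uses.

```python
import re
from itertools import combinations

def joint_frequencies(units, vocabulary):
    """Count the context units containing each word (f_i) and each pair (f_ij)."""
    vocab = sorted(set(w.lower() for w in vocabulary))
    f = {w: 0 for w in vocab}
    fij = {pair: 0 for pair in combinations(vocab, 2)}
    for unit in units:
        tokens = set(re.findall(r"[a-z']+", unit.lower()))
        present = [w for w in vocab if w in tokens]  # stays in sorted order
        for w in present:
            f[w] += 1
        for pair in combinations(present, 2):
            fij[pair] += 1
    return f, fij

def jaccard(f, fij):
    """Standardise joint frequencies (assumed formula: Jaccard coefficient)."""
    sims = {}
    for (i, j), n in fij.items():
        denom = f[i] + f[j] - n
        sims[(i, j)] = n / denom if denom else 0.0
    return sims

units = [
    "The king spoke to the queen.",
    "The queen left.",
    "The king and the prince fought.",
    "The king praised the queen.",
]
f, fij = joint_frequencies(units, ["king", "queen", "prince"])
sims = jaccard(f, fij)
print(f, fij, sims)
```

The resulting pairwise values form the kind of similarities matrix that could then be passed to a clustering or multidimensional scaling routine.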
INTEXT/TextQuest--Text Analysis Software
INTEXT is a program designed for the analysis of texts in the humanities
and the social sciences. It performs text analysis, indexing, concordance,
KWIC, KWOC, readability analysis, personality structure analysis, word
lists, word sequence, word permutation, stylistics, and more.
TextQuest is the Windows version of INTEXT. It performs all of the
INTEXT analyses, but through an easier-to-use Windows interface.
Designed with linguists in mind, Lexa Corpus Processing Software is a
suite of programs for tagging, lemmatization, type/token frequency counts,
and several other computer text analysis functions.
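For readers unfamiliar with type/token counts, here is a minimal Python sketch. It is not Lexa's implementation; the tokenizer and the function name are assumptions, and no tagging or lemmatization is attempted, so inflected forms count as separate types.

```python
import re
from collections import Counter

def type_token_counts(text):
    """Return word frequencies, number of types, number of tokens, and their ratio."""
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    n_tokens = len(tokens)   # every running word
    n_types = len(freq)      # distinct word forms
    ratio = n_types / n_tokens if n_tokens else 0.0
    return freq, n_types, n_tokens, ratio

freq, n_types, n_tokens, ratio = type_token_counts(
    "The cat saw the dog and the dog saw the cat"
)
print(n_types, n_tokens, round(ratio, 3))
```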
LIWC (Linguistic Inquiry and Word Count software) (https://www.erlbaum.com/shop/tek9.asp?pg=products&specific=1-56321-208-0)
LIWC has a series of 68 built-in dictionaries that search text files and
calculate how often the words match each of the 68 pre-set dimensions
(dictionaries), which include linguistic dimensions, word categories
tapping psychological constructs, and personal concern categories. The
program also allows users to create custom dictionaries. The program seems
especially useful to psychologists who wish to examine patient narratives.
Though somewhat hampered by quirks such as limited function availability,
the lite version of MCCA analyzes text by producing frequencies,
alphabetical lists, KWIC, and coding with built-in dictionaries. The
built-in dictionaries search for textual dimensions such as activity,
slang, and humor expression. The program’s window-driven output makes
sorting and switching between results easy. MCCA also includes a
multiple-person transcript analysis function suitable for examining plays,
focus groups, interviews, hearings, TV scripts, and other such texts.
As the name suggests, MonoConc primarily produces concordance information. These results can be sorted and displayed in several different user-configurable ways. The program also produces frequency and alphabetical information about the words in a given corpus.
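A keyword-in-context (KWIC) concordance of the kind these programs produce can be approximated as follows. This is a simplified sketch, not MonoConc's algorithm: the function name, the word-based context width, and the choice to sort on the right-hand context are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each keyword hit."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

text = "The whale surfaced. Ahab saw the whale and cursed. No whale escaped him."
hits = kwic(text, "whale")

# One possible user-configurable ordering: sort by the words after the keyword.
by_right = sorted(hits, key=lambda hit: hit[2].lower())

for left, kw, right in by_right:
    print(f"{left:>12} [{kw}] {right}")
```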
ParaConc is a bilingual/multilingual concordance program designed to be used for contrastive corpus-based language research. It is available for Macintosh, and a Windows version has been announced.
PCAD 2000 (http://www.gb-software.com/)
PCAD 2000 applies the Gottschalk-Gleser content analysis scales (which measure the magnitude of clearly defined and categorized mental or emotional states) to transcriptions of speech samples and other machine-readable texts. In addition to deriving scores on a variety of scales, including anxiety, hostility, social alienation, cognitive impairment, hope, and depression, the program compares scores on each scale to norms for the demographic groups of subjects. It can also explain the significance and clinical implications of scores and place subjects into clinical diagnostic classifications derived from the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV), developed by the American Psychiatric Association.
PolyAnalyst offers data mining and text mining capabilities. PolyAnalyst derives actionable knowledge from large volumes of text and structured data, delivers custom reports and predictive models. Covering the complete data analysis cycle from data loading and integration to modeling and reporting, PolyAnalyst offers a comprehensive selection of algorithms for automated analysis of text and structured data. The system enables users to perform numerous knowledge discovery operations: Categorization, clustering, prediction, link analysis, keyword and entity extraction, pattern discovery, and anomaly detection.
PROTAN (for PROTocol ANalyzer) is a computer-aided content analysis system. Its first task is to address the question of what the text looks like. To achieve this, PROTAN relies on a series of semantic dictionaries that are part of the system. The second task for which PROTAN is tuned is to answer the question of what the text is talking about: what are its main themes?
SALT (Systematic Analysis of Language Transcripts) (http://www.languageanalysislab.com/)
This program is designed mainly to help clinicians identify and document specific language problems. It executes a myriad of analyses, including types of utterances (e.g., incomplete, unintelligible, nonverbal), mean length of utterances, number and length of pauses and rate of speaking, and frequencies for sets of words (e.g., negatives, conjunctions, and custom dictionaries). The SALT Reference Database, described online, allows users to compare the results of their SALT analyses to normative language measures collected via a sample of more than 250 children of various ages, economic backgrounds, and abilities in the Madison, Wisconsin area.
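Two of the measures mentioned above, mean length of utterance and frequencies for a set of words, are simple enough to sketch in Python. The sketch assumes a transcript already split into utterances and measures length in words only; SALT's own transcript conventions and its handling of morphemes, pauses, and unintelligible segments are not modeled, and the function names and the word set are hypothetical.

```python
def mean_length_of_utterance(utterances):
    """Average number of words per non-empty utterance."""
    lengths = [len(u.split()) for u in utterances if u.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0

def word_set_frequency(utterances, word_set):
    """Count how many words in the transcript belong to the given set."""
    words = [w.strip(".,!?").lower() for u in utterances for w in u.split()]
    return sum(1 for w in words if w in word_set)

# Hypothetical transcript and word set, for illustration only.
transcript = ["I want the ball", "no", "he did not go and I stayed", "not now"]
NEGATIVES = {"no", "not"}

mlu = mean_length_of_utterance(transcript)
negatives = word_set_frequency(transcript, NEGATIVES)
print(mlu, negatives)
```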
Social Science Automation (http://socialscience.net/Default.aspx)
Social Science Automation, Inc. provides three programs for automated text analysis, along with related products and services. Offerings include solutions for media analysis, campaign and election media evaluation, athlete achievement, profiling, and forensic psycholinguistics.
TABARI (Text Analysis By Augmented Replacement Instructions) (http://www.ku.edu/~keds/software.dir/tabari.html)
The successor to KEDS, this program is specifically designed for analyzing short news stories, such as those found in wire service reports. It codes international event data (which are essentially codes recording the interactions between actors) using pattern recognition and simple grammatical parsing. The authors have developed a number of dictionaries to help code event data. The WEIS coding scheme, for example, can determine who acts against whom, as in the case of an Iraqi attack against Kuwait. When such an event is reported in a news story, the program can automatically code the aggressor, victim and action, as well as the date of the event. TABARI is currently only available for Macintosh, but a Windows version is in the works.
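The coding of "who acts against whom" can be illustrated with a toy pattern matcher in Python. This is not TABARI's parser or its dictionary format: the actor codes, the event labels, and the function name are hypothetical placeholders (they are not actual WEIS codes), and the matching rule (nearest actor before the verb is the source, first actor after it is the target) is a deliberate simplification.

```python
import re

# Hypothetical actor and verb dictionaries (illustrative placeholders only).
ACTORS = {"iraq": "IRQ", "iraqi": "IRQ", "kuwait": "KUW", "kuwaiti": "KUW"}
VERBS = {"attacked": "ATTACK", "invaded": "ATTACK", "praised": "APPROVE"}

def code_event(date, sentence):
    """Return a source/event/target record for the first codable verb, else None."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for idx, tok in enumerate(tokens):
        if tok in VERBS:
            # Source: nearest known actor before the verb.
            source = next(
                (ACTORS[t] for t in reversed(tokens[:idx]) if t in ACTORS), None
            )
            # Target: first known actor after the verb.
            target = next(
                (ACTORS[t] for t in tokens[idx + 1:] if t in ACTORS), None
            )
            if source and target:
                return {
                    "date": date,
                    "source": source,
                    "event": VERBS[tok],
                    "target": target,
                }
    return None

event = code_event("1990-08-02", "Iraqi forces attacked Kuwait.")
uncoded = code_event("1990-08-02", "Officials met in Geneva.")
print(event)
```

Sentences with no recognised verb or with a missing actor are left uncoded rather than guessed at, which mirrors the general practice of reviewing unmatched stories by hand.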
TextAnalyst is an intelligent text mining and semantic information search system. TextAnalyst implements a unique neural network technology for structural processing of texts written in natural language. This technology automates the work with large volumes of textual information and can be applied effectively to perform the following tasks: creation of knowledge bases expressed in a natural language, as well as creation of hypertext, searchable, and expert systems; and automated indexing, topic assignment, and abstracting of texts.
The TEXTPACK program was originally designed for the analysis of open-ended survey responses. It now produces word frequencies, alphabetical lists, KWIC and KWOC (KeyWord Out of Context) searches, cross references, word comparisons between two texts, and coding according to user-created dictionaries. Its multi-unit data file output can be imported into statistical analysis software. A free demo version is available.
TextSmart by SPSS Inc. (Program no longer supported)
This software, designed primarily for the analysis of open-ended survey responses, uses cluster analysis and multidimensional scaling techniques to automatically analyze key words and group texts into categories. Thus, it can "code" without the use of a user-created dictionary. TextSmart has a pleasant, easy-to-use Windows interface that allows for quick sorting of words into frequency and alphabetical lists. It also produces colorful, rich-looking graphics like bar charts and two-dimensional MDS plots. For more information about this program, a PDF is available here.
VBPro (Program available but no longer supported)
Outputs frequency and alphabetical word lists, key words in context (KWIC), and coded strings of word-occurrence data based on user-defined dictionaries. In addition, it includes a multidimensional concept-mapping sub-program called VBMap that measures the degree to which words co-occur in a text or series of texts. Miller, Andsager and Riechert (1998), for example, used the program to compare the press releases sent by 1996 GOP presidential candidates to the coverage the candidates received in the press. The program helped the researchers (a) generate a list of key words appearing in the text and (b) generate a map showing the relative positions of candidates, in both press releases and media coverage, to each other and on key issues in the election (e.g., family values, education). Developed by Mark Miller, it is available for download on the Yoshikoder website.
WordStat v5.1 (http://www.kovcomp.co.uk)
This add-on to the Simstat statistical analysis program includes several exploratory tools, such as cluster analysis and multidimensional scaling, for the analysis of open-ended survey responses and other texts. It also codes based on user-supplied dictionaries and generates word frequency and alphabetical lists, KWIC, multi-unit data file output, and bivariate comparisons between subgroups. The differences between subgroups or numeric variables (e.g., age, date of publication) can be displayed visually in high resolution line and bar charts and through 2-D and 3-D correspondence analysis bi-plots. One particularly noteworthy feature of the program is a dictionary building tool that uses the WordNet lexical database and other dictionaries (in English and five other languages) to help users build a comprehensive categorization system.
"The Yoshikoder is a cross-platform multilingual content analysis program developed as part of the Identity Project at Harvard's Weatherhead Center for International Affairs.
You can load documents, construct and apply content analysis dictionaries, examine keywords-in-context, and perform basic content analyses, in any language.
The Yoshikoder works with text documents, whether in plain ASCII, Unicode (e.g. UTF-8), or national encodings (e.g. Big5 Chinese.) You can construct, view, and save keywords-in-context. You can write content analysis dictionaries. Yoshikoder provides summaries of documents, either as word frequency tables or according to a content analysis dictionary. You can also apply a dictionary analysis to the results of a concordance, which provides a flexible way to study local word contexts. Yoshikoder's native file format is XML, so dictionaries and keyword-in-context files are non-proprietary and human readable."
Computer software for the support of text interpretation, text management and the extraction of conceptual knowledge from documents (theory building). Also has the capability to handle video sequences, recorded interviews, photos, maps, music, movies, the nightly news, videocasts and podcasts. Application areas include social sciences, economics, educational sciences, criminology, market research, quality management, knowledge acquisition, and theology.
Code-A-Text is a software package that was written to help in the training of psychotherapists. It was originally designed to facilitate the analysis of therapeutic conversations where clinicians, teachers and research workers wanted to understand the ideas and structures underlying the "texts". Recently, Code-A-Text has been applied to other types of "texts", including process (field notes), responses to open ended questionnaires and metaplan analyses. Code-A-Text was written by Dr Alan Cartwright, a psychotherapist who is Director of the Centre for the Study of Psychotherapy, University of Kent at Canterbury, UK.
Computer Assisted Qualitative Data Analysis Software (CAQDAS) Networking Project (http://caqdas.soc.surrey.ac.uk/)
CAQDAS provides practical support, training and information in the use of a range of software programs designed to assist qualitative data analysis. They also provide platforms for debate concerning the methodological and epistemological issues arising from the use of such software packages and conduct research into methodological applications. Download demo versions of various qualitative analysis packages through this organization.
Crawdad Desktop 2.0 (http://www.crawdadtech.com/html/01_software.html)
Crawdad Desktop 2.0 was specially designed for academic researchers and students who need to accurately and deeply analyze qualitative data. Crawdad Desktop 2.0 provides key word scores, concept mapping, browsing, full-text search, comparison, clustering, and theme analysis capabilities.
The Ethnograph v6.0 (http://www.qualisresearch.com/)
The Ethnograph is software for qualitative research and data analysis that facilitates the management and analysis of text-based data such as transcripts of interviews, focus groups, field notes, diaries, meeting minutes, and other documents. According to the Ethnograph homepage, it has been the most widely used software for qualitative data analysis since 1985.
Kwalitan is a support program for the analysis of qualitative data, such as the protocols of interviews and observations, or existing written material, such as articles from newspapers, annual reports of enterprises, ancient manuscripts, and so on. In effect, Kwalitan is a special-purpose database program. The program has been developed in accordance with the narrowly elaborated procedures of the so-called grounded theory approach, in which the researcher tries to generate a theoretical framework by means of an interpretative analysis of the qualitative material.
MAXQDA supports all individuals performing qualitative data analysis and helps to systematically evaluate and interpret texts. It is also a powerful tool for developing theories and testing the theoretical conclusions of the analysis. It is used in a wide range of academic and non-academic disciplines, such as in Sociology, Political Science, Psychology, Public Health, Anthropology, Education, Marketing, Economics and Urban Planning.
QDA Miner (http://www.provalisresearch.com/QDAMiner/QDAMinerDesc.html)
QDA Miner is an easy-to-use qualitative data analysis software package for coding textual data, annotating, retrieving and reviewing coded data and documents. The program can manage complex projects involving large numbers of documents combined with numerical and categorical information. QDA Miner also provides a wide range of exploratory tools to identify patterns in codings and relationships between assigned codes and other numerical or categorical properties.
QSR NVIVO 8 (http://www.qsrinternational.com/)
NVIVO 8 removes many of the manual tasks associated with the analysis of audio, video, pictures or documents (classifying, sorting and arranging information), so researchers can explore trends, build and test theories and ultimately arrive at answers to questions.
Designed for Semantic Classification, Keyword Extraction, Linguistic and Qualitative Analysis, Tropes software is a tool for content analysis research in the Information Science, Market Research, Sociological Analysis, Scientific and Medical studies fields.
Video, Audio and Image Analysis
C-I-SAID offers a set of unique features which allow the depth of analysis of sound, text or video, normally associated with a programme dedicated to qualitative analysis, within an analytic framework which is quantitative in orientation. Thus open-ended coding, using comments and textual annotations, is accompanied by rating scales that can be categorical or numerical in format. There are also sections within the programme such as the lexicon (largely devoted to thematic content analysis) and the acoustic manager (devoted to the measurement of the volume, pitch and speed of speech) that generate statistics. The main outputs from C-I-SAID are reports, tables and charts, which can be accompanied by a range of statistics; these can be used to describe the data in tandem with qualitative methods.
FaceReader is the world's first tool that is capable of automatically analyzing facial expressions, providing users with an objective assessment of a person's emotion.
MoCA Project (http://www.informatik.uni-mannheim.de/informatik/pi4/projects/MoCA/)
The aim of the MoCA project is to extract structural and semantic content of videos automatically. Different applications have been implemented and the scope of the project has concentrated on the analysis of movie material such as can be found on TV, in cinemas and in video-on-demand databases. Analysis features developed and used within the MoCA project fall into four different categories: (1) features of single pictures (frames) like brightness, colors, text, (2) features of frame sequences like motion, video cuts, (3) features of the audiotrack like audio cuts, loudness and (4) combination of features of the three classes to extract e.g. scenes.
Transana is software for researchers who want to analyze digital video or audio data. Transana lets you analyze and manage your data: transcribe it, identify analytically interesting clips, assign keywords to clips, arrange and rearrange clips, create complex collections of interrelated clips, explore relationships between applied keywords, and share your analysis with colleagues. Transana is free and open-source.
CATA Presentations from K. Neuendorf's COM 533, Content Analysis Class:
WordStat and Yoshikoder (.ppt)
CATPAC and LIWC2001 (.ppt)
CATPAC and MCCALite (.ppt)
Diction5 and General Inquirer PC (.ppt)
PCAD 2000 (.ppt)
VBPro and Yoshikoder (.ppt)