Introduction to Computer-aided Text Analysis (CATA):
Computer coding involves the automated tabulation of variables for target content that has been prepared for the computer. Typically, computer coding means having software analyze a set of texts, counting key words, phrases, or other text-only markers (Content Analysis Guidebook). Computer coding relies on dictionaries: lists of words or phrases that text-analysis programs use to analyze content. Researchers may need to develop their own dictionaries; however, some dictionaries have been used in previous research and are available for reuse.
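The dictionary-based coding described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular program's method; the two-category dictionary here is invented (real research dictionaries are far larger and validated against human coding).

```python
import re

# Hypothetical two-category dictionary for illustration only.
DICTIONARY = {
    "optimism": {"hope", "improve", "bright", "confident"},
    "certainty": {"always", "never", "must", "definitely"},
}

def code_text(text):
    """Count how many tokens in `text` fall into each dictionary category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {category: 0 for category in DICTIONARY}
    for token in tokens:
        for category, words in DICTIONARY.items():
            if token in words:
                counts[category] += 1
    return counts

print(code_text("We must always hope that things improve."))
# {'optimism': 2, 'certainty': 2}
```

Programs differ mainly in how the counts are normalized (raw frequencies, percentages, or standardized scores) and in the size and validation of their dictionaries.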
Quantitative CATA programs:
CATPAC II reads text files and produces a variety of
outputs ranging from simple diagnostics (e.g., word and alphabetical
frequencies) to a summary of the "main ideas" in a text. It
uncovers patterns of word usage and produces such outputs as simple word
counts, cluster analysis (with icicle plots), and interactive neural
cluster analysis. A nifty add-on program called Thought View can generate
two- and three-dimensional concept maps based on the results of CATPAC.
This program has become unavailable for download, but the author welcomes email inquiries.
Diction 7 contains a series of built-in dictionaries that search text
documents for 5 main semantic features (Activity, Optimism, Certainty,
Realism and Commonality) and 35 sub-features (including tenacity, blame,
ambivalence, motion, and communication). After the user's text is
analyzed, Diction compares the results for each of the 60+ dictionary
categories to a "normal range of scores" determined by running
more than 50,000 texts through the program. Users can compare their text
to either a general normative profile of all 50,000+ texts OR to any
of six specific sub-categories of texts (business, daily life,
entertainment, journalism, literature, politics, scholarship) that can be
further divided into 36 distinct types (e.g., financial reports, computer
chat lines, music lyrics, newspaper editorials, novels and short stories,
political debates, social science scholarship). In addition, Diction
outputs raw frequencies (in alphabetical order), percentages, and
standardized scores; custom dictionaries can be created for additional analyses.
DIMAP stands for DIctionary MAintenance
Programs, and its primary purpose is dictionary development. The program
includes a variety of tools for lexicon building rooted in computational
linguistics and natural language processing (Litkowski, 1992). With DIMAP,
users can build, manage, edit, maintain, search and compare custom and
established dictionaries. The program also includes a text analysis module
called MCCA (the lite version of which is described below).
This historic, venerable program found new life on the Internet as an online version for quite a while, but in recent years appears to have been somewhat "orphaned." Interested users may contact the keeper of the program via email.
HAMLET II 3.0
"The main idea of HAMLET II 3.0(c) is to search text files for words or categories in a given vocabulary list, and to count their joint frequencies within any specified context unit, within sentences, or as collocations within a given span of words."
LIWC 2015 (Linguistic Inquiry and Word Count) was developed to measure emotional, cognitive, social and other psychological constructs within written or transcribed text. The program has a set of 82 built-in dictionaries that include linguistic dimensions, word categories
tapping psychological constructs, and personal concern categories. The
program also allows users to create custom dictionaries.
Though somewhat hampered by quirks such as limited function availability,
the lite version of MCCA analyzes text by producing frequencies,
alphabetical lists, KWIC (Key Word In Context) displays, and coding with built-in dictionaries. The
built-in dictionaries search for textual dimensions such as activity,
slang, and humor expression. The program's window-driven output makes
sorting and switching between results easy. MCCA also includes a
multiple-person transcript analysis function suitable for examining plays,
focus groups, interviews, hearings, TV scripts, and other such texts.
As the name suggests, MonoConc primarily produces concordance information. These results can be sorted and displayed in several different user-configurable ways. The program also produces frequency and alphabetical information about the words in a given corpus.
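A concordance, or KWIC display, lines up every occurrence of a keyword with a window of surrounding words. The sketch below shows the basic idea behind the KWIC output that several of these programs produce; the tokenizer and window size are simplifying assumptions.

```python
import re

def kwic(text, keyword, span=3):
    """Return keyword-in-context lines: each hit with `span` words on either side."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{tok}] {right}")
    return lines

text = "The press covered the election. The election coverage grew."
for line in kwic(text, "election"):
    print(line)
```

Full-featured concordancers add user-configurable sorting (e.g., by the first word to the left or right of the keyword), which is where programs like MonoConc differentiate themselves.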
ParaConc is a bilingual/multilingual concordance program designed for contrastive corpus-based language research. It is available for Macintosh; a Windows version has been announced.
PCAD 3 applies the Gottschalk-Gleser content analysis scales (which measure the magnitude of clearly defined and categorized mental or emotional states) to transcriptions of speech samples and other machine-readable texts. In addition to deriving scores on a variety of scales, including anxiety, hostility, social alienation, cognitive impairment, hope, and depression, the program compares scores on each scale to norms for the demographic groups of subjects. It can also explain the significance and clinical implications of scores and place subjects into clinical diagnostic classifications derived from the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV), developed by the American Psychiatric Association.
PolyAnalyst offers data mining and text mining capabilities. PolyAnalyst derives actionable knowledge from large volumes of text and structured data and delivers custom reports and predictive models. Covering the complete data analysis cycle from data loading and integration to modeling and reporting, PolyAnalyst offers a comprehensive selection of algorithms for automated analysis of text and structured data. The system enables users to perform numerous knowledge discovery operations: categorization, clustering, prediction, link analysis, keyword and entity extraction, pattern discovery, and anomaly detection.
The Profiler Plus text-coding platform emerged out of work by the developers in conjunction with government agencies. It is a general purpose text analysis system that allows the application of numerous provided coding schemes developed by scholarly experts in political science, psychology, and psychiatry.
This program is designed mainly to help clinicians identify and document specific language problems. It executes a myriad of analyses, including types of utterances (e.g., incomplete, unintelligible, nonverbal), mean length of utterances, number and length of pauses, rate of speaking, and frequencies for sets of words (e.g., negatives, conjunctions, and custom dictionaries).
Designed for the analysis of sentiment, or opinion mining, in "short, informal text" (e.g., Tweets), SentiStrength produces "automatic sentiment analysis of up to 16,000 social web texts per second." The algorithm is designed to assess both positive and negative strength separately for each text.
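SentiStrength's distinctive design choice is reporting positive and negative strength as two separate scores rather than one net value. A minimal sketch of that dual-polarity idea follows; the word lists and weights here are invented, not SentiStrength's actual lexicon or algorithm.

```python
# Invented mini-lexicons; real sentiment lexicons are far larger and
# include boosters, negations, and emoticon handling.
POSITIVE = {"love": 3, "good": 2, "nice": 2}
NEGATIVE = {"hate": -4, "bad": -2, "awful": -3}

def sentiment_strength(text):
    """Score positive (1..5) and negative (-1..-5) strength separately."""
    tokens = text.lower().split()
    pos = max([POSITIVE.get(t, 1) for t in tokens])   # strongest positive word
    neg = min([NEGATIVE.get(t, -1) for t in tokens])  # strongest negative word
    return pos, neg

print(sentiment_strength("i love this but the ending was awful"))
# (3, -3)
```

Keeping the two polarities separate matters for short informal texts, where a single message often carries both sentiments at once, as in the example above.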
Social Science Automation
Social Science Automation, Inc. provides automated text analysis products and services. Offerings include solutions for media analysis, campaign and election media evaluation, athlete achievement, profiling, and forensic psycholinguistics.
Text Analytics for Surveys 4.0.1 (IBM SPSS)
This software, designed primarily for the analysis of open-ended survey responses, uses natural language processing (NLP) technologies to automatically analyze key words and group texts into categories.
The TEXTPACK program was originally designed for the analysis of open-ended survey responses. It now produces word frequencies, alphabetical lists, KWIC and KWOC (KeyWord Out of Context) searches, cross references, word comparisons between two texts, and coding according to user-created dictionaries. Its multi-unit data file output can be imported into statistical analysis software.
"TextQuest is a program for the analysis of texts. It offers a variety of analyses, from a simple word list up to a content analysis or readability analysis."
"T-LAB software is an all-in-one set of linguistic, statistical and graphical tools designed to allow you to enjoy text analysis."
VBPro (Program available but no longer supported)
Outputs frequency and alphabetical word lists, key words in context (KWIC), and coded strings of word-occurrence data based on user-defined dictionaries. In addition, it includes a multidimensional concept-mapping sub-program called VBMap that measures the degree to which words co-occur in a text or series of texts. Miller, Andsager and Riechert (1998), for example, used the program to compare the press releases sent by 1996 GOP presidential candidates to the coverage the candidates received in the press. The program helped the researchers (a) generate a list of key words appearing in the text and (b) generate a map showing the relative positions of candidates, in both press releases and media coverage, to each other and on key issues in the election (e.g., family values, education). Developed by Mark Miller, it served as the inspiration for Yoshikoder, and is available for download on the Yoshikoder website.
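The co-occurrence measurement behind VBMap-style concept mapping can be sketched simply: count how often pairs of key words appear together in the same text unit, then feed those counts to a scaling or mapping routine. The texts, vocabulary, and pairing-by-document rule below are illustrative assumptions only.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(texts, vocab):
    """Count how often each pair of vocab words appears in the same text."""
    pairs = Counter()
    for text in texts:
        present = sorted(set(text.lower().split()) & vocab)
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs

texts = [
    "education funding and family values",
    "candidate speech on education funding",
    "family values in the campaign",
]
vocab = {"education", "funding", "family", "values"}
print(cooccurrence(texts, vocab))
```

A matrix of such pair counts is what multidimensional scaling operates on to place frequently co-occurring terms (or the sources that use them) near one another on a concept map.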
WordSmith performs basic CATA functions such as concordances, key word searches, and word lists, with a rich selection of options for each.
This companion to the Simstat statistical analysis program includes several exploratory tools, such as cluster analysis and multidimensional scaling, for the analysis of open-ended survey responses and other texts. It also codes based on user-supplied dictionaries and generates word frequency and alphabetical lists, KWIC, multi-unit data file output, and bivariate comparisons between subgroups. It provides particularly impressive graphic display options. One particularly noteworthy feature of the program is a dictionary building tool that uses a variety of textual databases to help users build a comprehensive categorization system.
Yoshikoder is perhaps the ultimate do-it-yourself freeware CATA program. "The Yoshikoder is a cross-platform multilingual content analysis program developed by Will Lowe as part of the Identity Project at Harvard's Weatherhead Center for International Affairs. You can load documents, construct and apply content analysis dictionaries, examine keywords-in-context, and perform basic content analyses, in any language...You can construct, view, and save keywords-in-context. You can write content analysis dictionaries. Yoshikoder provides summaries of documents, either as word frequency tables or according to a content analysis dictionary. You can also apply a dictionary analysis to the results of a concordance, which provides a flexible way to study local word contexts. Yoshikoder's native file format is XML, so dictionaries and keyword-in-context files are non-proprietary and human readable."
Computer software for the support of text interpretation, text management and the extraction of conceptual knowledge from documents (theory building). Also has the capability to handle video sequences, recorded interviews, photos, maps, music, movies, the nightly news, videocasts and podcasts. Application areas include social sciences, economics, educational sciences, criminology, market research, quality management, knowledge acquisition, and theology.
The Ethnograph v6.0
This software for qualitative research and data analysis facilitates the management and analysis of text-based data such as transcripts of interviews, focus groups, field notes, diaries, meeting minutes, and other documents. According to the Ethnograph homepage, it has been the most widely used software for qualitative data analysis since 1985.
MAXQDA supports all individuals performing qualitative data analysis and helps to systematically evaluate and interpret texts. It is used in a wide range of academic and non-academic disciplines, such as in Sociology, Political Science, Psychology, Public Health, Anthropology, Education, Marketing, Economics and Urban Planning.
"More than just a tool for organizing and managing data, NVivo offers an intuitive qualitative data analysis experience that helps you uncover deeper research insights."
"Designed for Semantic Classification, Keyword Extraction, Linguistic and Qualitative Analysis, Tropes software is a tool for content analysis research in the Information Science, Market Research, Sociological Analysis, Scientific and Medical studies and more..."
Video, Audio and Image Analysis
"To gain accurate and reliable data about facial expressions, FaceReader is the most robust automated system that will help you out."
The aim of the MoCA project is to extract structural and semantic content of videos automatically. Different applications have been implemented, and the scope of the project has concentrated on the analysis of movie material such as can be found on TV, in cinemas, and in video-on-demand databases. Analysis features developed and used within the MoCA project fall into four categories: (1) features of single pictures (frames), like brightness, colors, and text; (2) features of frame sequences, like motion and video cuts; (3) features of the audio track, like audio cuts and loudness; and (4) combinations of features from the three classes to extract, e.g., scenes.
The Observer XT
The Observer XT is a program for behavioral coding and analysis. It allows researchers to gather rich and meaningful data, record time automatically and accurately, integrate video and physiology in behavioral studies, calculate statistics, assess reliability, and create transition matrices.
Transana is software for researchers who want to analyze digital video or audio data. Transana lets you analyze and manage your data: transcribe it, identify analytically interesting clips, assign keywords to clips, arrange and rearrange clips, create complex collections of interrelated clips, explore relationships between applied keywords, and share your analysis with colleagues. Transana is free and open-source.
CATA presentations from K. Neuendorf's content analysis grad students:
WordStat and Yoshikoder (.ppt)
CATPAC and LIWC2001 (.ppt)
CATPAC and MCCALite (.ppt)
Diction5 and General Inquirer PC (.ppt)
PCAD 2000 (.ppt)
VBPro and Yoshikoder (.ppt)