Citation of this article: Islam, M.N. (2012). Automatic Indexing: Problems & Prospects. In A. Osswald, S. S. Zabed Ahmed (Eds.), Dynamics of Librarianship in the Knowledge Society: Festschrift in honour of Prof. B. Ramesh Babu, Vol. 1 (pp. 221-232). New Delhi: B.R. Pub. Corp.
Abstract
This paper clarifies the usability of computer-aided indexing, so-called “automatic indexing”, as a supplement to the indexing that is usually done by humans. It presents some compelling reasons for putting automatic indexing into practice. It then highlights a few techniques and methods in general use. It further describes common software packages used to produce index entries throughout the world. The paper concludes by exploring some general problems of automatic indexing that normally hinder the creation and use of usable indexes.
Keywords
Automatic indexing, Assigned-Term System, Derived-Term System, Indexing software
1. Introduction
In the present age of information explosion, libraries and information centers face tremendous problems in meeting user demand. Library services and user demands are no longer restricted to a library’s own collection. Since the inception of the Internet, and more specifically of online cataloging systems in libraries at the end of the 1970s, there has been a massive change in retrieval systems as well as in the organization of library collections. The Internet and modern ICT-related services have been introduced within library premises to make the system more usable, and consequently libraries and other information centers face the following multidimensional problems:
- Exponential growth of information
- Information being published in different forms and media
- Organization of such huge collections
- Multidimensional approaches of users
- Ever-increasing pressure from library users
- Above all, the unsatisfied nature of users
As a result, dependency on ICT-related activities and services has been increasing in this sector. As the world shifts from manual to automated practices, information centers are following suit, paving the way for automated acquisition, processing, and dissemination of information to clientele. Indexing services may be the solution to providing current and reliable information to information seekers. This is a major challenge to information managers, who are faced not only with the challenge of selecting, acquiring, and storing the information, but also with the perennial problem of how to make it available to potential users quickly and easily [1].
2. Indexing System
“Indexing is the process of analyzing the informational content of records of knowledge and expressing the informational content in the language of the indexing system. It involves:
(a) Selecting indexable concepts in a document; and
(b) Expressing these concepts in the language of indexing system (as index entries) and an ordered list.
An indexing system is the set of prescribed procedures (manual and/or machine) for organizing the contents of records of knowledge for purposes of retrieval and dissemination [2].”
2.1 Categories of Indexing System
Simply stated, indexing is the procedure that produces entries in an index. Indexing systems can broadly be categorized into the following two groups [3]:
- Assigned-Term System: In this system, an indexer must assign terms or descriptors on the basis of a subjective interpretation of the concepts implied in the document, and in so doing must apply some intellectual effort. Indexers determine the subject matter of the document and then decide what terms in their own filtered vocabulary are appropriate. All indexing languages with vocabulary control devices, such as subject heading lists, thesauri, and classification schemes, are assigned-term systems. This is the so-called artificial-language system.
- Derived-Term System: A derived-term system uses the author’s actual words as descriptors, without modification. Thus, author indexes, title indexes, citation indexes and automatic indexes are derived-term systems. Derived-term systems are sometimes called natural-language or free-text indexing, or indexing by extraction.
3. Automatic Indexing
Using computers to construct indexes is called automatic indexing. Automatic indexing is the process of assigning and arranging index terms for natural-language texts without human intervention [4]. Automatic indexing is based on the assumption that the words in the text and their relationships to each other are sufficient to represent content concepts. It is a derived-term indexing system, as it uses the author’s actual words as descriptors [5]. As the number of documents increases exponentially with the proliferation of the Internet, automatic indexing will become essential to maintaining the ability to find relevant information in a sea of irrelevant information [6].
3.1 Reasons for Automatic Indexing
Human indexing is costly and can range in quality from excellent to appalling. With the rapid growth of information, the time lag between publication of a paper and the availability of indexes to that paper has grown frightfully. Adding new people to the staff is not always a solution; it may be economically infeasible, and professionally qualified people may not be available. This is one of the practical reasons that interest turned to the possibility of automatic methods [7].
Tony I. Obaseki (2010) pointed out the following attributes of automatic indexing [8]:
(a) Faster, Easier and Cheaper to Produce: Although Seth A. Maislin observes various shortcomings of automated indexing, he argues for its use because it is faster and cheaper. Maislin asserts that this is one way of achieving the goals of information centers. This view is welcomed by numerous scholars, because automated indexing can deal with the increasing amount of new material being produced, which has made manual indexing slow and expensive. Automatic indexing simplifies and speeds up the process by alphabetizing entries and assigning page numbers. Repetition of index terms is minimal.
(b) Easily Modified: An automated index is easily retrieved, revisited, and modified when errors are noticed or when future developments require it. This is an obvious advantage over manual indexing.
(c) Transferability: ICT has turned the world into a global information village. Automated indexing permits information centers to share their information resources globally.
Madely du Preez (2010) listed the following advantages of automatic indexing [9]:
- Predictable
- Becoming more sophisticated
- Less expensive
- Able to extract terms, as well as use clustering
- Helps searchers find information
- Is as effective as human indexing
- Can be applied to large volumes of text where human indexing becomes impossible
- Is cost-effective compared to expensive human indexing
- Speeds up the indexing process
3.2 General Techniques of Automatic Indexing
Borko & Bernier (1978), Madely du Preez (2010), and Cleveland & Cleveland (1990) describe a number of techniques for selecting index terms automatically. The major observations include the following [10] [11] [12]:
A. Automatic Indexing: Surface View
Automatic indexing starts with words. Word association prompts the linking of target words in a search statement. This is done as follows:
- Computers scan texts and create an ‘inverted file’ (indexed file), which associates words in the file with their positions in the texts.
- The system matches words in a search statement against the ‘inverted file’ to identify texts that have words in common.
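The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of any particular retrieval system: the function names build_inverted_file and search, the sample texts, and the ranking by number of shared words are all assumptions made for the example.

```python
import re
from collections import defaultdict

def build_inverted_file(texts):
    """Map each word to the (text_id, word_position) pairs where it occurs."""
    inverted = defaultdict(list)
    for text_id, text in texts.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            inverted[word].append((text_id, position))
    return inverted

def search(inverted, statement):
    """Return ids of texts that share words with the search statement,
    ordered by the number of distinct words in common (an assumed ranking)."""
    scores = defaultdict(int)
    for word in set(re.findall(r"[a-z0-9]+", statement.lower())):
        for text_id in {tid for tid, _ in inverted.get(word, [])}:
            scores[text_id] += 1
    return sorted(scores, key=lambda tid: (-scores[tid], tid))

# Made-up sample texts, for illustration only.
texts = {
    "doc1": "Automatic indexing starts with words",
    "doc2": "Human indexing is costly",
    "doc3": "Libraries face growing user demand",
}
inverted_file = build_inverted_file(texts)
print(inverted_file["indexing"])                      # [('doc1', 1), ('doc2', 1)]
print(search(inverted_file, "automatic indexing"))    # ['doc1', 'doc2']
```

Storing positions rather than just text ids keeps the sketch close to the description above, where the inverted file associates words with their positions in the texts.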
B. Automatic Indexing: Deep View
(a) Stop Lists: Function words (such as articles, conjunctions, prepositions, and pronouns) are usually excluded. As a result,
- It improves simple keyword indexing
- It reduces the size of the index
- It speeds up the processing of search queries
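A stop list can be illustrated with a short Python sketch. The stop list below is deliberately tiny and hypothetical (real lists contain hundreds of function words), and the function name keywords is an assumption made for the example.

```python
import re

# A tiny, hypothetical stop list of function words.
STOP_LIST = {"a", "an", "and", "at", "in", "is", "it", "of", "on", "or", "the", "to", "with"}

def keywords(text, stop_list=STOP_LIST):
    """Return the words of the text in order, with stop-list words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stop_list]

print(keywords("The size of the index and the processing of search queries"))
# ['size', 'index', 'processing', 'search', 'queries']
```

Of the eleven words in the sample phrase, only five remain as candidate index terms, which is how a stop list reduces the size of the index.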
(b) Counting Words (Go-List): All words that have been used more than a specified minimum number of times in the work being indexed are selected as index terms and placed on the list.
- The frequency of index terms in a document is used as a criterion for retrieval
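Word counting can be sketched as follows. The function name go_list and the sample text are invented for illustration, and the sketch assumes that a word qualifies when it occurs at least min_count times; the exact cut-off rule is a design choice.

```python
import re
from collections import Counter

def go_list(text, min_count=2, stop_list=frozenset()):
    """Return, alphabetically, the words that occur at least min_count times."""
    counts = Counter(
        w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_list
    )
    return sorted(word for word, count in counts.items() if count >= min_count)

# Made-up sample text.
text = "Index terms describe a document. An index lists terms; a good index helps users."
print(go_list(text, min_count=2, stop_list={"a", "an"}))   # ['index', 'terms']
print(go_list(text, min_count=3, stop_list={"a", "an"}))   # ['index']
```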
(c) Weighting and Association Factors: The occurrence of words in a work is not always an indication of subject content. Thus, word counts cannot be used as the sole basis for selection. If the cut-off number of words is set too high, for example, 10 to 12 repetitions, then many useful index headings will be eliminated; if it is set too low, for example, 1 or 2 repetitions, many terms useless as subject guides will be included. This problem can be addressed in the following ways:
(i) Weighting by Location: For example, a word appearing in the title might be assigned a greater weight than a word appearing in the body of the work.
(ii) Relative Frequency Weighting: This is based upon the relation between the number of times the word is used in the document being indexed and the number of times the same word appears in the information retrieval system as a whole.
(iii) Maximum-depth Indexing: This procedure indexes a document by all of its content words and weights these words, if desired, by the number of occurrences in the document.
(iv) Use of Association Factors: By means of statistical association and correlation techniques, the degree of term relatedness, that is, the likelihood that two terms will appear in the same document, is computed and used for selecting the index terms.
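Two of these weighting schemes, weighting by location and relative frequency weighting, can be sketched in Python. The sketch rests on assumptions not fixed by the text: the title weight of 3.0 is arbitrary, relative frequency is taken to be the simple ratio of a word's count in the document to its count in the whole collection, and the collection is assumed to include the document itself. All names and sample texts are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def location_weights(title, body, title_weight=3.0, body_weight=1.0):
    """Weighting by location: title occurrences count more than body occurrences."""
    weights = Counter()
    for word in tokenize(title):
        weights[word] += title_weight
    for word in tokenize(body):
        weights[word] += body_weight
    return dict(weights)

def relative_frequency_weights(document, collection):
    """Occurrences in the document divided by occurrences in the collection.
    The collection must contain the document, so no count is ever zero."""
    doc_counts = Counter(tokenize(document))
    collection_counts = Counter(w for text in collection for w in tokenize(text))
    return {w: c / collection_counts[w] for w, c in doc_counts.items()}

by_location = location_weights("Automatic indexing", "Indexing starts with words")
print(by_location["indexing"], by_location["words"])    # 4.0 1.0

collection = [
    "thesaurus terms control vocabulary",
    "library users search library",
    "library catalog",
]
by_frequency = relative_frequency_weights(collection[1], collection)
print(by_frequency["users"])      # 1.0: the word occurs only in this document
print(by_frequency["library"])    # two of its three occurrences are in this document
```

Under the ratio used here, a word that is frequent everywhere in the collection receives a lower weight than a word concentrated in the document being indexed.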
(d) Stemming
(i) Automatically removes suffixes and word endings to improve retrieval: e.g. ‘indexes’, ‘indexing’, ‘indexer’ and ‘indexable’ all become ‘index’
(ii) Can be limited to the removal of ‘s’
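A very naive suffix stripper shows the idea. It is an illustration only and is not a published stemming algorithm: the suffix list, the minimum stem length of three letters, and the function name stem are assumptions chosen so that the example words above behave as described.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("able", "ing", "ers", "er", "es", "s")

def stem(word, suffixes=SUFFIXES, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["indexes", "indexing", "indexer", "indexable"]])
# ['index', 'index', 'index', 'index']

# Stemming limited to the removal of 's':
print(stem("terms", suffixes=("s",)))      # term
print(stem("indexing", suffixes=("s",)))   # indexing
```

A stripper this crude will over-stem many ordinary words, which is one reason a system may restrict itself to removing only ‘s’.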
(e) Word Parsing
(i) Use of Noun Phrases: Only nouns and adjective-noun phrases are used as index terms; these are selected from the title or abstract.
(ii) Grammatical Structure: The relative position of words in a sentence is used to select index terms. Consider the sentence “The mosquitoes attacked with the ferocity of a tiger”. Here ‘mosquitoes’ is the important term, not ‘tiger’. But in the sentence “The queen looked at me with her mosquito eyes”, ‘mosquito’ is probably not important.
(iii) Use of Thesaurus: A thesaurus is used to combine synonyms, distinguish homonyms and group related terms together.
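The synonym-combining role of a thesaurus can be sketched as a simple lookup table. The table below is hypothetical and contains a single group of variants mapped to an assumed preferred term; it does not attempt to distinguish homonyms, which would require context.

```python
# Hypothetical thesaurus: each variant maps to one preferred index term.
THESAURUS = {
    "application": "software",
    "program": "software",
    "software": "software",
}

def normalize_terms(words, thesaurus=THESAURUS):
    """Replace each word by its preferred term; unknown words pass through."""
    return [thesaurus.get(w.lower(), w.lower()) for w in words]

print(normalize_terms(["Application", "program", "manual"]))
# ['software', 'software', 'manual']
```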
(f) Clustering: An IR system can provide alternative searches based on clustering:
(i) Clustering is based on similarities between the documents and the search statement.
(ii) Clustering can be used to organize contiguous files in the database.
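The text does not name a clustering procedure, so the following is only one possible sketch: a single-pass clustering that groups texts by Jaccard similarity of their word sets. The threshold of 0.25, the function names, and the sample texts are all assumptions.

```python
import re

def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Shared words divided by all distinct words in the two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(texts, threshold=0.25):
    """Single-pass clustering: a text joins the first cluster whose first
    member is similar enough; otherwise it starts a new cluster."""
    clusters = []
    for text_id, text in texts.items():
        words = word_set(text)
        for members in clusters:
            if jaccard(words, word_set(texts[members[0]])) >= threshold:
                members.append(text_id)
                break
        else:
            clusters.append([text_id])
    return clusters

# Made-up sample texts.
texts = {
    "a": "automatic indexing software",
    "b": "indexing software packages",
    "c": "library user demand",
}
print(cluster(texts))   # [['a', 'b'], ['c']]
```

The same jaccard function could be applied to a search statement and each cluster's first member to suggest a cluster as an alternative search.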
3.3 Methods of Automatic Indexing
Borko & Bernier (1978) mentioned the following three basic methods of automatic indexing [13]:
A. Statistical or Frequency Analysis of Text
One hypothesis underlying the statistical method of indexing is that the more times a word is used in a document, the more likely it is that the word is an indicator of the subject matter. Based upon this hypothesis, a computer program lists all of the words in a document; the words are grouped by number of occurrences and arranged alphabetically within each frequency. Function words (stop lists) are usually excluded.
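The listing described above, with words grouped by number of occurrences and arranged alphabetically within each frequency, can be sketched as follows. The function name, the stop list, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

def frequency_listing(text, stop_list=frozenset()):
    """Group words by number of occurrences (highest first),
    alphabetically within each frequency."""
    counts = Counter(
        w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_list
    )
    groups = defaultdict(list)
    for word, count in counts.items():
        groups[count].append(word)
    return [(count, sorted(groups[count])) for count in sorted(groups, reverse=True)]

text = "the index of the book lists terms and the index lists pages"
print(frequency_listing(text, stop_list={"the", "of", "and"}))
# [(2, ['index', 'lists']), (1, ['book', 'pages', 'terms'])]
```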
B. Syntactic Method
In the syntactic method, the computer analyzes sentences according to a grammar stored in its memory (for example, whether the word “work” is used as a noun or a verb) and according to the relations among the words in the sentence (for example, “dog bites man” vs “man bites dog”), or at least takes the relative positions of words (co-occurrence) into account when selecting those to be used for indexing. The linguistic model proposed by Chomsky distinguishes between the surface and deep structure of language.
For example, “Mary went home with John” and “Mary and John went home together” have different surface structures but the same deep structure. By means of transformational grammar, a sentence can be changed; it can go through a series of transformations that will exhibit its deep structure.
C. Semantic Method
Semantic analysis helps to establish class relationships among terms so as to associate words with simple concepts. This method tends to identify the subjects and content-bearing words of the document or surrogate text. A number of procedures have been studied for indexing under this method:
- Keyword normalization (to exclude prefixes and suffixes);
- Dictionary or thesaurus references in which the extracted word is looked up in a thesaurus; and
- Various classification techniques aimed at grouping related words [14].
3.4 Automatic Indexing Software
"Automated indexing software" is, according to the common definition, software that analyzes text and produces an index without human involvement [15]. There are a number of different types of microcomputer based software packages which are used for indexing
A. Concordance Generators
The simplest are concordance generators, in which a list of the words found in the document, with the pages they are on, is generated [16].
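What a concordance generator produces can be sketched in a few lines. This is not the code of any actual product; the function name concordance and the sample pages are invented, and pages are assumed to be numbered from 1.

```python
import re
from collections import defaultdict

def concordance(pages):
    """Map every word to the sorted page numbers on which it appears.
    `pages` is a list of page texts; page numbering starts at 1."""
    entries = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[a-z]+", text.lower()):
            entries[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in sorted(entries.items())}

pages = ["Indexing software", "Software for indexing books", "Books"]
for word, numbers in concordance(pages).items():
    print(word, numbers)
# books [2, 3]
# for [2]
# indexing [1, 2]
# software [1, 2]
```

Every word is listed, including the function word "for", with no judgment about which occurrences matter.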
B. Computer-aided Indexing Packages or Standalone Programs
Computer-aided indexing packages are used by many professional indexers to enhance their work. They enable the indexer to view the index in alphabetical or page-number order, can automatically produce various index styles, and save much typing [17]. Short descriptions of some such packages follow:
- Macrex was the first back-of-the-book indexing software package available for professional indexers. Today, Macrex handles back-of-the-book indexing, periodical indexes and web indexing [18]. It is developed by Macrex Indexing Services and runs under Windows NT, 2000, XP, Vista and Windows 7. It is also used successfully on Intel Macs running Parallels [19].
- Cindex provides standard features for indexing books, newspapers and periodicals. These features include sorting, cross-reference checking and formatting [20]. It is developed by Indexing Research and runs on both Windows (Windows XP or Windows Vista) and Macintosh (OS 10.4 or higher) [21].
- SKY Index also provides standard features for back-of-the-book indexing. Advanced features include auto-complete and "drag-and-drop" embedding into Microsoft Word documents [22]. It is developed by Sky Software and runs on Windows XP, Vista, or Windows 7 [23].
C. Embedded Indexing
Embedded indexing software is available with computer packages such as word processors (e.g. Microsoft Word), PageMaker, and Adobe FrameMaker. With embedded indexing the document to be indexed is on disk, and the indexer inserts tags into the document to indicate which index terms should be allocated to that page. It does not matter if the document is then changed, as the index tags will move with the part of the document to which they refer [24].
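The way embedded tags travel with the text can be illustrated with a small sketch. The tag syntax `<<index:term>>` is purely hypothetical and is not the syntax of Microsoft Word, PageMaker, or FrameMaker; the function name build_index and the sample pages are likewise assumptions.

```python
import re
from collections import defaultdict

TAG = re.compile(r"<<index:(.+?)>>")   # hypothetical tag syntax

def build_index(pages):
    """Collect tagged terms and the page numbers (from 1) on which they fall."""
    index = defaultdict(list)
    for page_number, text in enumerate(pages, start=1):
        for term in TAG.findall(text):
            if page_number not in index[term]:
                index[term].append(page_number)
    return dict(sorted(index.items()))

pages = [
    "Stop lists<<index:stop lists>> exclude function words.",
    "Stemming<<index:stemming>> removes suffixes.",
]
before = build_index(pages)
print(before)   # {'stemming': [2], 'stop lists': [1]}

# The document changes: a new page is added at the front.
pages.insert(0, "A new preface page.")
after = build_index(pages)
print(after)    # {'stemming': [3], 'stop lists': [2]}
```

Because the tags sit inside the text, regenerating the index after an edit gives the new page numbers without re-indexing by hand.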
D. Special-Purpose Application Programs
There are also special-purpose application programs to assist indexers in their work. Some facilitate tasks that may arise when indexing any type of work; some facilitate tasks that are unique to a specific type of indexing, such as legal indexing; most work in conjunction with one or more standalone indexing programs. They include: CaseAbbrev, CaseRev, emDEX, EM/Index, EntryExpander, etc. These special-purpose programs are used almost exclusively by professional indexers or technical writers [25].
3.5 Problems of Automatic Indexing
Automatic indexing is an easy and quick way of assigning and arranging index terms for natural-language texts without human intervention. Nevertheless, computer-generated results are often more like concordances (lists of words in a document) than truly usable indexes [26]. There are several reasons for this.
A. Lack of Good Artificial Intelligence
Seth A. Maislin (2004) pointed out that a machine can easily cull capitalized words from a textbook to create an approximation of an index of names, but for lack of good artificial intelligence, software is not going to differentiate between names like "David Kelley" and places like "San Francisco," since both have the same format and are used in the same way. It also won't know that "Bill Clinton" is also "William Jefferson Clinton." And certainly it can't tell when a name is being mentioned in an un-useful and trivial way, as are the names in this paragraph! So machines often fail to parse full sentences and to recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text. He also recommended automatic software as a supplement to human indexing to speed up and simplify the indexing process; i.e., the machine can be used to alphabetize the entries, reformat the index, and manipulate page numbers [27].
B. Problem of Determining the Relationships among Terms
Furthermore, a computer cannot determine relationships among words and concepts, and therefore cannot place subentries, synonyms and cross-references properly, nor decide what is and is not a relevant reference [28].
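The capitalized-word culling described above is easy to sketch, and the sketch shows the limitation directly. The regular expression, the function name, and the sample sentences are made up for illustration; the pattern simply matches runs of two or more capitalized words.

```python
import re

# Two or more consecutive capitalized words: a crude stand-in for "a name".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def cull_capitalized(text):
    return CAPITALIZED_RUN.findall(text)

sample = "Last year David Kelley moved to San Francisco and met Bill Clinton."
print(cull_capitalized(sample))
# ['David Kelley', 'San Francisco', 'Bill Clinton']

print(cull_capitalized("William Jefferson Clinton spoke."))
# ['William Jefferson Clinton']
```

The person and the place come back as the same kind of entry, and nothing connects "Bill Clinton" to "William Jefferson Clinton"; the pattern sees only letter case.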
C. Misspellings and Varied Usages of Words
A computer-assisted indexing system can only sort the terms that appear in a document according to certain preprogrammed patterns. It cannot recognize concepts that are discussed over a range of pages; limit the search to relevant entries (as opposed to every occurrence of a word); function when a word is misspelled (for example, google the word "backwords" and notice how often it is used where the word "backwards" is meant); or consider how terms develop varied meanings, for example, a "key" on pianos, for computers, to unlock doors, to unlock puzzles, for security, or as a geographic feature, as in Key West. At the same time, a computer is unable to distinguish an author's use of multiple terms to indicate one concept: for example, in the computer manual field, 'application', 'software', and 'program' are often used interchangeably [29].
D. Indexing Software Cannot Substitute for the Human Brain
Indexing software is a tremendous aid to the professional indexer. Although some vendors claim that the services of a professional indexer can be replaced by running a software program on the text of a book, the intellectual and analytical work of indexing is the task of the human brain, and no software program can duplicate it [30].
E. Unavailable E-Books
Another reason that automatic indexing may be unsuited to book indexing is that book indexes are not usually available electronically, and cannot be used in conjunction with powerful search software [31].
F. Problem of Full Text Searching
When trying to locate specific information quickly, users find the full-text search method troublesome. Full-text searching requires users to specify search terms that exactly match the terminology of the text. Differences in word usage among different authors, plus variations in spelling and hyphenation, can lead to missed "hits." Because full-text searches look for specific character strings, not ideas or concepts, many trivial hits result. Using full-text search features such as Boolean operators (AND, OR, NOT) also requires some skill and experience [32]. Some automatic indexing algorithms treat the hyphen as a space, so that the characters before and after the hyphen become separate words ("on-line" becomes "on" and "line"!). Some systems ignore the hyphen, treating it as nothing, so that "MS-DOS" becomes "MSDOS" and "full-text" becomes "fulltext" [33].
G. Problem of Determining Headings and Subheadings
Headings in an index do not depend solely on terms used in the document; they also depend on terminology employed by intended users of the index and on their familiarity with the document. For example, in medical indexing, separate entries may need to be provided for brand names of drugs, chemical names, popular names and names used in other countries, even when certain of the names are not mentioned in the text. Another reason is that headings and subheadings should be tailored to the needs and viewpoints of anticipated users. Some are aimed at users who are very knowledgeable about topics addressed in the document; others at users with little knowledge. Some are reminders to those who have already read the document; others are enticements to potential readers. To date, no one has found a way to provide computer programs with the judgment, expertise, intelligence or audience awareness that is needed to create usable indexes. Until they do, automatic indexing will remain a pipe dream [34].
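The two hyphen treatments mentioned above can be compared side by side in a short sketch. The function names are invented for the example, and the tokenizer (runs of letters and digits) is an assumption.

```python
import re

def hyphen_as_space(text):
    """Treat the hyphen as a space: the parts become separate words."""
    return re.findall(r"[A-Za-z0-9]+", text.replace("-", " "))

def hyphen_ignored(text):
    """Treat the hyphen as nothing: the parts are joined into one word."""
    return re.findall(r"[A-Za-z0-9]+", text.replace("-", ""))

text = "on-line MS-DOS full-text"
print(hyphen_as_space(text))   # ['on', 'line', 'MS', 'DOS', 'full', 'text']
print(hyphen_ignored(text))    # ['online', 'MSDOS', 'fulltext']

# An exact-match search for "online" finds nothing under the first treatment.
print("online" in hyphen_as_space(text))   # False
print("online" in hyphen_ignored(text))    # True
```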
4. Conclusion
Information and communication technology, especially the Internet, has brought a revolutionary change in the storage, organization, demand, and dissemination of information. Nowadays, producing an index and delivering it to the right person at the right time is a real challenge. As a result, library professionals have been compelled to depend on automatic indexing systems. There is always a gap between the amount of pertinent information available and the time actually available to read it. Users always need relevant information quickly. All this makes automatic indexing the ultimate choice for the library. Although automatic systems have some limitations, we have to focus on the development of good artificial intelligence to generate more user-friendly, intuitive and expert knowledge programs for producing indexes.
5. References
[1] Tony I. Obaseki. “Automated Indexing: The Key to Information Retrieval in the 21st Century”. Library Philosophy and Practice (2010), <http://www.webpages.uidaho.edu/~mbolin/obaseki.htm> (15 January 2012).
[2] Borko, Harold, and Charles L. Bernier. Indexing Concepts and Methods. New York: Academic Press, 1978.
[3] Cleveland, Donald B., and Ana D. Cleveland. Introduction to Indexing and Abstracting. Englewood: Libraries Unlimited, 1990.
[4] Martin Tulic. “Automatic indexing”, 3 April 2005, <http://www.anindexer.com/about/auto/autoindex.html> (20 January 2012).
[5] Cleveland, op. cit.
[6] Wikipedia. “Automatic indexing”, 2011, <http://en.wikipedia.org/wiki/Automatic_indexing> (10 January 2012).
[7] Cleveland, op. cit.
[8] Obaseki, op. cit.
[9] Madely du Preez. “Automatic indexing: what is it and how does it work?” The indexer in publication (2010), <www.asaib.org.za/docs/DuPreez_Automatic_indexing.pps> (25 January 2012).
[10] Borko, op. cit.
[11] du Preez, op. cit.
[12] Cleveland, op. cit.
[13] Borko, op. cit.
[14] Riaz, Muhammad. Advanced Indexing and Abstracting Practices. New Delhi: Atlantic Publishers, 1989.
[15] Seth A. Maislin. “Notes on Automatic Indexing”, October 2004, <http://taxonomist.tripod.com/indexing/autoindex.html> (25 January 2012).
[16] Shuter, Janet. “Standards for indexes: Where do they come from and what use are they?” Indexing, Providing Access to Information: Looking Back, Looking Ahead (1993).
[17] Glenda Browne. “Automatic indexing”, 2 September 2007, <http://www.webindexing.biz/glendas-articles-mainmenu-117/34-indexing/362--automatic-indexing> (15 January 2012).
[18] Fred Brown. “Book Indexing Software: tools for creating professional indexes”, <http://www.allegrotechindexing.com/tools.htm> (25 January 2012).
[19] Drusilla Calvert and Hilary Calvert. “Macrex homepage”, 19 August 2010, <www.macrex.com> (14 January 2012).
[20] Brown, op. cit.
[21] Indexing Research. “CINDEX homepage”, <www.indexres.com> (14 January 2012).
[22] Brown, op. cit.
[23] Sky Software. “SKY Index homepage”, 9 September 2011, <www.sky-software.com> (14 January).
[24] Browne, op. cit.
[25] Martin Tulic. “Software for indexing”, 30 April 2004, <http://www.anindexer.com/about/sw/swindex.html> (15 January 2012).
[26] Ibid.
[27] Maislin, op. cit.
[28] Martha Osgood. “Back Words Indexing: CAN'T THE INDEX BE WRITTEN BY A COMPUTER?” 1996, <http://backwordsindexing.com/Comp.html> (15 January 2012).
[29] Ibid.
[30] Ross, Marilyn, and Sue Collier. The complete guide to self publishing: Everything you need to know to write, publish, promote, and sell your own book. Cincinnati: Writer’s digest book, 2002.
[31] Mulvany, Nancy, and Jessica Milstead. “Indexicon, The Only Fully Automatic Indexer: A Review”, Key Words, 2 (1994): 17-23.
[32] Fred Brown. “Electronic Media and the Future of Indexing”, 1995, <http://www.allegrotechindexing.com/article01.htm> (12 January 2012).
[33] Anderson, James D., and Perez-Carballo. “The nature of indexing: how humans and machines analyze messages and texts for retrieval. Part II: Machine indexing, and the allocation of human versus machine effort”, Information Processing and Management, 37 (2001): 255-277.
[34] Tulic, op. cit.