Digital Discovery System (DDS)
NSDL Search API Server (prod1)

Configuring Search Fields, Facets, and Relationships

 

These instructions describe how to configure standard and custom search fields, facet categories, and relationships for any XML framework that is made available through the Search API. This information is provided for system administrators who are installing or managing a DDS repository system, which includes the Digital Discovery System (DDS) and the NSDL Catalog System (NCS). A framework does not need to be configured in order to be used effectively in the repository, but doing so adds search functionality that may be useful.

This document assumes familiarity with Apache Tomcat, Lucene, servlet configurations, and XML.

How search fields, facets, and relationships are generated

At index creation time, each record is inserted in the repository in its native XML format. The indexer extracts standard, custom, and XPath search fields and facet categories from the contents of the XML, establishes any relevant relationships, then generates a single entry containing each of the fields, facet categories, and data from related records and inserts it into the index. All records are guaranteed to contain certain fields, such as the default and stems fields, as well as XPath fields for the native XML record and its related records, which are created automatically.

For detailed information about search fields and the content within them, see the Search Service documentation (Search fields section).

How to configure search fields, facets, and relationships

Each XML framework in the DDS can have a corresponding configuration file that defines standard and custom search fields, facet categories, and relationships for that framework. Standard search fields include title, description, ID, URL, and geospatial bounding box coordinates. Custom search fields and facet categories can be defined for any content extracted from the XML document and/or its related documents. Relationships can be defined to connect records in one XML framework with records in another for the purpose of optimized searching.

 

To configure a specific XML framework, follow these steps:

  1. Add the given XML framework to the search fields configuration index file
  2. Create a configuration file for the XML framework and define the standard and custom search fields, facet categories, and relationships as needed
  3. Start or restart Tomcat for the changes to take effect

1. Add XML frameworks to the configuration index file

Add the given XML framework to the search fields configuration index file, which contains a list of the individual configuration files for each XML framework. Entries in the index may contain relative or absolute URIs to the individual framework configuration files, which may be located on the local file system (file://) or anywhere on the Web (http://).

The index file is named xmlIndexerFieldsConfigIndex.xml. In a typical DDS installation it can be found in the Tomcat context at $tomcat/$context/WEB-INF/conf/xmlIndexerFieldsConfigIndex.xml. The exact location of the index file is indicated by the DDS/DCS/NCS web application's context-param repositoryConfigDir (found in web.xml or server.xml).

Example index file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<XMLIndexerFieldsConfigIndex>
	<!-- List the location of each framework-specific configuration file -->
	<configurationFiles>
		<configurationFile>xmlIndexerFieldsConfigs/oai_dc_search_fields.xml</configurationFile>
		<configurationFile>xmlIndexerFieldsConfigs/my_framework_search_fields.xml</configurationFile>
	</configurationFiles>		
</XMLIndexerFieldsConfigIndex>
	

Each configurationFile element indicates a relative or absolute URI to the individual configuration for the XML framework. The above example points to two framework configuration files, oai_dc_search_fields.xml and my_framework_search_fields.xml, which reside in the directory xmlIndexerFieldsConfigs relative to the index configuration file.
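
Because entries may also be absolute URIs, an index file can mix local and remote configurations. The following is a minimal sketch of that idea. The file:// path and the www.example.org host are hypothetical placeholders, not locations from an actual installation, and the exact URI forms accepted should be confirmed against your DDS version.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<XMLIndexerFieldsConfigIndex>
	<configurationFiles>
		<!-- Relative URI: located relative to this index file -->
		<configurationFile>xmlIndexerFieldsConfigs/oai_dc_search_fields.xml</configurationFile>
		<!-- Absolute URI on the local file system (hypothetical path) -->
		<configurationFile>file:///opt/dds/conf/adn_search_fields.xml</configurationFile>
		<!-- Absolute URI on the Web (hypothetical host) -->
		<configurationFile>http://www.example.org/dds/comm_anno_search_fields.xml</configurationFile>
	</configurationFiles>
</XMLIndexerFieldsConfigIndex>
```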

2. Define search fields, facets, and relationships for each XML framework

Each configuration file describes the standard and/or custom search fields and facet categories for an XML framework and where the content for those fields resides in the XML instance documents, as well as relationships across XML frameworks in the repository. For the following discussion, see the example configuration file below.

The xmlFormat or schema attribute of the XMLIndexerFieldsConfig element defines which framework the configuration is for; use one or the other, not both. The xmlFormat corresponds to the XML format key that the repository system is indexing, for example oai_dc, nsdl_dc, comm_anno, adn, etc. To provide a schema-specific configuration, for example if a given repository is working with two versions of the same framework, indicate the schema location in the schema attribute. If two configurations operate over the same framework, one indicated by xmlFormat and the other by schema, the schema definition takes precedence.
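
As an illustration of the two attributes, the root elements of two separate configuration files might look as follows. This is only a sketch: the schema URL is a hypothetical placeholder, and each root element would live in its own file listed in the index.

```xml
<!-- File 1 (sketch): applies to records indexed under the oai_dc format key -->
<XMLIndexerFieldsConfig xmlFormat="oai_dc">
	<!-- standardFields, customFields, and relationships go here -->
</XMLIndexerFieldsConfig>

<!-- File 2 (sketch): applies only to records that use this schema location
     (hypothetical URL); it would take precedence over File 1 for those records -->
<XMLIndexerFieldsConfig schema="http://www.example.org/schemas/oai_dc/2.0/oai_dc.xsd">
	<!-- standardFields, customFields, and relationships go here -->
</XMLIndexerFieldsConfig>
```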

Standard search fields

Standard search fields are processed by the indexer uniformly, which allows clients to search them consistently across frameworks.

The standard fields are the following:

| Standard Search Field | Description | Index Fields Generated |
|---|---|---|
| id | Contains the ID for the record. If not defined, the ID is derived automatically by the indexer. | idvalue |
| url | Contains the URL for the resource described by the XML metadata. | url |
| title | Contains the title text for the item. | title, titlestems |
| description | Contains the description text for the item. | description, descriptionstems |
| geoBBNorth, geoBBSouth, geoBBWest, geoBBEast | Contain the north and south latitudes [-90, 90] and the west and east longitudes [-180, 180] for the geographic bounding box footprint that represents this item. | n/a - Handled internally by the Search request. |

 

To configure a standard search field for a framework, add a standardField element in the configuration file as shown in the example below. The attribute name defines the standard field name (id, url, title, etc.). Inside standardField, nested xpath elements should contain XPaths that select the desired content. The xpath element can be repeated, and the contents of all matching elements in the instance documents will be included in the content for that field. The exception is the geographic bounding box fields, each of which must select a single element only.
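
The oai_dc example later on this page does not use the id or bounding box fields, so here is a minimal sketch of how they might be declared. The framework structure (a root element named record with recordID and coverage/boundingBox children) is hypothetical and only serves to show the shape of the configuration.

```xml
<standardFields>
	<!-- Record ID; omit this field to let the indexer derive the ID automatically -->
	<standardField name="id">
		<xpaths>
			<xpath>/record/recordID</xpath>
		</xpaths>
	</standardField>
	<!-- Bounding box: each field should select exactly one element -->
	<standardField name="geoBBNorth">
		<xpaths>
			<xpath>/record/coverage/boundingBox/north</xpath>
		</xpaths>
	</standardField>
	<standardField name="geoBBSouth">
		<xpaths>
			<xpath>/record/coverage/boundingBox/south</xpath>
		</xpaths>
	</standardField>
	<standardField name="geoBBWest">
		<xpaths>
			<xpath>/record/coverage/boundingBox/west</xpath>
		</xpaths>
	</standardField>
	<standardField name="geoBBEast">
		<xpaths>
			<xpath>/record/coverage/boundingBox/east</xpath>
		</xpaths>
	</standardField>
</standardFields>
```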

Custom fields and facet categories

Custom fields and facet categories can be defined for any content extracted from the XML document.

To define a custom field or facet category, add a customField element in the configuration file as shown in the example below. To define a regular custom field, create an attribute named name; to define a facet category instead, create an attribute named facetCategory. Add additional attributes as needed. These vary depending on whether a regular custom field or a facet category is being defined, as described in the table below.

Attributes that may appear on the customField element:

| Attribute Name | Description | Valid Values | Use in Conjunction With name or facetCategory |
|---|---|---|---|
| name or facetCategory | Indicates the name of the custom search field or facet category that is being defined. Use the name attribute to define a custom field or the facetCategory attribute to define a facet category. One or the other must be indicated but not both. | The value of the name or facetCategory attribute should contain alpha-numeric characters without spaces. | n/a |
| store | Indicates whether to store the content in the index. Stored fields are visible in the admin pages of the DDS/DCS/NCS repository system web application. | yes, no | name |
| type | Indicates the type of field this should be. If type is used, analyzer must not be. | text - Text is processed using the Lucene StandardAnalyzer; stems - Text is processed using the Lucene SnowballAnalyzer for the English language; key - Text is processed using the Lucene KeywordAnalyzer, which is case-sensitive and includes the entire element or attribute as a single token. | name |
| analyzer | Indicates the specific Lucene Analyzer to use when processing this field. If analyzer is used, type must not be. | Include the fully-qualified Java class that implements a Lucene Analyzer. The class must be in the classpath of the DDS web application. | name |
| indexFieldPreprocessor | (Optional) Indicates a concrete instance of IndexFieldPreprocessor that should be used to preprocess the content of this field prior to indexing. | Indicate the fully-qualified Java class that implements the org.dlese.dpc.repository.indexing.IndexFieldPreprocessor interface. The class must be in the classpath of the DDS web application. Omit this attribute if no preprocessing is to be done. | name or facetCategory |
| facetPathDelimeter | (Optional) Indicates a delimiter character used to split the input string into a facet path hierarchy. Examples: facetPathDelimeter=":" to split on a colon; facetPathDelimeter="/" to split on a forward slash. If omitted, the facet category will be flat (i.e. one level deep only). | A single character, for example : or / | facetCategory |

 

Note that the Lucene Analyzer that is defined for a given field is automatically applied both in the indexer and the searcher.

Inside customField, nested xpath elements should contain XPaths to the content. The xpath element can be repeated and the contents of all repeated elements in the instance documents will be included in the content for that field.
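
To illustrate how these attributes combine, the fragment below sketches one hierarchical facet category and one custom field with repeated xpath elements for oai_dc records. The names dcSubjectFacets and dcPeople are hypothetical, and the comment on how a delimited value is split reflects the attribute description above, not verified indexer behavior.

```xml
<customFields>
	<!-- Hierarchical facet category: a subject value such as
	     "Science:Physics:Optics" would be split on the colon into a facet path -->
	<customField facetCategory="dcSubjectFacets" facetPathDelimeter=":">
		<xpaths>
			<xpath>/dc/subject</xpath>
		</xpaths>
	</customField>
	<!-- Stemmed custom field that is not stored; content from both
	     XPaths is combined into the single field dcPeople -->
	<customField name="dcPeople" store="no" type="stems">
		<xpaths>
			<xpath>/dc/creator</xpath>
			<xpath>/dc/contributor</xpath>
		</xpaths>
	</customField>
</customFields>
```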

Relationships

Relationships can be defined for a given XML framework to connect the records written in that framework with other records in the repository. Related records may be connected by either record ID or URL, as defined by the standard id and url fields for the target framework. When a relationship is established, the target record takes on searchable fields from the source record and acquires the given relationship, thereby establishing a link from the target record to the source record with the given relationship name. In addition, the Search and GetRecord API requests can return the related records in the same response, making them quickly and easily accessible.

For example, an annotation framework might define the relationship isAnnotatedBy, which attaches the annotation record with the relationship isAnnotatedBy to other records in the repository that contain a given URL. The target records with that URL then acquire searchable XPath fields and optionally custom fields from the annotation record. Suppose, for example, that an annotation record contains a user-defined tag named "top pick." It would then be possible to construct a Search query that returns all the records in the repository that contain the given URL and that have been tagged, by way of the user annotation, as a "top pick," and to retrieve in the same service response not only the resource metadata records but also the associated annotation record(s).

To define a relationship for a given XML framework, add a relationships element with nested relationship elements in the configuration file as shown in the example below. Each relationship element must have a name attribute that contains the name of the relationship. Nested inside the relationship element must be one xpaths element with one or more nested xpath elements. Each xpath element must have a type attribute with the value of either id or url. The content of the xpath element must be an XPath to the ID or URL within the source record that will be used to connect the target record to the source.
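
The isAnnotatedBy scenario described above could be configured along the following lines in the configuration file for the annotation framework. This is a sketch under assumptions: the element names url and annotatedID inside the comm_anno record are hypothetical placeholders for wherever the annotated resource's URL or record ID is actually stored.

```xml
<XMLIndexerFieldsConfig xmlFormat="comm_anno">
	<relationships>
		<!-- Target records matched below become 'isAnnotatedBy' this annotation record -->
		<relationship name="isAnnotatedBy">
			<xpaths>
				<!-- Connect to target records by their standard url field (hypothetical XPath) -->
				<xpath type="url">/comm_anno/url</xpath>
				<!-- Connect to target records by their standard id field (hypothetical XPath) -->
				<xpath type="id">/comm_anno/annotatedID</xpath>
			</xpaths>
		</relationship>
	</relationships>
</XMLIndexerFieldsConfig>
```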

XPaths

As the indexer processes the XML records, it first removes namespaces from the documents. This simplifies the XPath notation necessary to select the desired elements and attributes within. Therefore, do not include namespaces in your XPath notation.

To specify the content elements that should be pulled from an oai_dc Dublin Core record, for example, these XPaths would be used to select the given elements:

  • /dc/title - Selects all title elements that are children of the dc element
  • /dc/title[1] - Selects the first title element that is a child of the dc element
  • //title - Selects all title elements anywhere in the XML document
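
To make the namespace handling concrete, here is a sketch of a small oai_dc record in its native form, with a trailing comment showing which XPaths from the list above would apply once namespaces are removed. The namespace URIs are the standard oai_dc and Dublin Core ones; the record content is invented for illustration.

```xml
<!-- Native record, with namespace prefixes (content is illustrative) -->
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
	<dc:title>Plate Tectonics Animation</dc:title>
	<dc:title>Moving Continents</dc:title>
	<dc:identifier>http://www.example.org/plates</dc:identifier>
</oai_dc:dc>
<!-- With namespaces removed, use unprefixed names in XPaths:
       /dc/title       selects both title elements
       /dc/title[1]    selects only "Plate Tectonics Animation"
       /dc/identifier  selects the identifier element
     Prefixed forms such as /oai_dc:dc/dc:title should not be used. -->
```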

It is also possible to pull in custom field content from related documents, that is, documents that are associated with the one being indexed by way of a relation. To specify custom field content that should be pulled from a related document, add an XPath in the framework configuration file that starts with the relation prefix specifier (e.g. '/relation.isAnnotatedBy/') followed by the XPath to the content within the related document.

For example, the following notation would be used to index content for the given record from all comm_anno records that assign the isAnnotatedBy relation:

  • /relation.isAnnotatedBy//comm_anno/ASNstandard

For more information about the XPath language, see XPath Language 1.0 and the ZVON XPath Tutorial.

Example search configuration for the oai_dc Dublin Core framework:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- XMLIndexerFieldsConfig attributes: [xmlFormat OR schema] -->
<XMLIndexerFieldsConfig xmlFormat="oai_dc">
	<standardFields>
		<!-- standardField attributes include: 
			name=[id|url|title|description|geoBBNorth|geoBBSouth|geoBBWest|geoBBEast] -->
		<standardField name="url">
			<xpaths>
				<xpath>/dc/identifier</xpath>
			</xpaths>		
		</standardField>
		<standardField name="title">
			<xpaths>
				<xpath>/dc/title</xpath>
			</xpaths>		
		</standardField>
		<standardField name="description">
			<xpaths>
				<xpath>/dc/description</xpath>
			</xpaths>		
		</standardField>	
	</standardFields>
	<customFields>
		<!-- customField attributes include: [name OR facetCategory], [store], [type OR analyzer], 
        		[indexFieldPreprocessor], [facetPathDelimeter] -->
		
		<!-- Regular custom fields (use the name attribute) -->
		<customField name="dcIdentifier" store="yes" type="key">
			<xpaths>
				<xpath>/dc/identifier</xpath>
			</xpaths>
		</customField>		
		<customField name="dcType" store="yes" type="text">
			<xpaths>
				<xpath>/dc/type</xpath>
			</xpaths>
		</customField>
		<customField name="dcMySubjectTags" store="yes" analyzer="org.example.MySubjectTagAnalyzer" 
        			indexFieldPreprocessor="org.example.MySubjectTagIndexFieldPreprocessor">
			<xpaths>
				<xpath>/dc/subject</xpath>
			</xpaths>
		</customField>

		<!-- Facet category fields (use the facetCategory attribute) -->
		<customField facetCategory="dcTypeFacets">
			<xpaths>
				<xpath>/dc/type</xpath>
			</xpaths>
		</customField>

		<!-- Index standards found in comm_anno records that annotate this oai_dc record -->
		<customField name="ASNIDFromAnno" store="yes" type="key">
			<xpaths>
				<xpath>/relation.isAnnotatedBy//comm_anno/ASNstandard</xpath>
			</xpaths>
		</customField>
	</customFields>

	<!-- Relationships that this format of record defines. -->
	<relationships>
		<!-- Relationship of the target record to this.

			The given relationship name will be attached to the target record, not this record.

			Examples:
			target 'isAnnotatedBy' this
			target 'isRelatedTo' this
			etc.

			attributes: name=[relationship name] -->

		<!-- Xpath where the source record's ID or URL is stored
			attribute type=[id|url] defines whether to look in the target record's id or url field to make the relationship. -->

		<relationship name="isRelatedTo">
			<xpaths>
				<xpath type="url">/dc/relation</xpath>
			</xpaths>
		</relationship>
	</relationships>
</XMLIndexerFieldsConfig>

How to verify it's working

Follow these steps to verify that the desired content is being indexed for search as expected:

  1. Place the configuration files in the repository system and make sure Tomcat has been restarted.
  2. Index or re-index the files.
  3. Use the ListFields and ListTerms service requests to verify that the fields are appearing in the index. Standard and custom fields should appear under the name for which they are defined. XPath fields should appear in the ListFields response with the field prefix of /key//[xpath], /stems//[xpath], /text//[xpath].
  4. Facet categories should appear in the ListTerms $facets field response.
  5. Established relationships should appear in the ListTerms indexedRelations response. Relation XPath fields should appear in the ListFields response with the field prefix of /relation.[relationshipName]//key//[xpath], /relation.[relationshipName]//stems//[xpath], /relation.[relationshipName]//text//[xpath].
  6. Perform searches using the Search API or admin search pages to verify the expected results are returned for specific queries against known data in one or more records.

 

Last revised: $Date: 2012/08/15 23:11:06 $

University Corporation for Atmospheric Research (UCAR) National Science Foundation (NSF) National Science Digital Library (NSDL)