Search API Documentation
Service version: DDSWS v1.1
Document last revised: $Date: 2012/09/26 22:54:04 $
The Search API uses a REST-RPC hybrid approach to accept requests expressed as HTTP argument/value pairs and respond with structured data in XML or JSON format. Search requests operate over a Lucene index of terms. The API is available from the Digital Discovery System (DDS), the Digital Collection System (DCS), and the NSDL Collection System (NCS).
Definitions and concepts
The Search API uses a REST-RPC hybrid approach to accept requests expressed as HTTP argument/value pairs. Requests may be made using the HTTP GET or POST method, which behave identically and vary only in the length of the request allowed (GET has a limited request length whereas POST is unlimited). Responses are returned in XML or JSON format (XML by default), which varies in structure and content depending on the request as shown below in the examples section of this document.
HTTP request format
The format of the request consists of the base URL followed by the ? character followed by one or more argument=value pairs, which are separated by the & character. Each request must contain one verb=request pair, where verb is the literal string 'verb' and request is one of the API request strings defined below. All arguments must be encoded using the syntax rules for URIs. This is the same encoding scheme that is described by the OAI-PMH.
Service requests
This section defines the available requests, or verbs.
The HTTP request format has the following structure:
[base URL]?verb=request[&additional arguments].
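The request structure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the service: the helper name build_request is hypothetical, and the base URL is the one that appears in this document's examples (it will differ per installation). Values are percent-encoded with the standard library.

```python
from urllib.parse import quote, urlencode

# Verbs taken from the summary of available requests in this document.
VERBS = {"Search", "GetRecord", "ListFields", "ListTerms",
         "ListCollections", "ListXmlFormats", "UrlCheck", "ServiceInfo"}


def build_request(base_url, verb, **arguments):
    """Return '[base URL]?verb=request[&additional arguments]' (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    pairs = [("verb", verb)] + list(arguments.items())
    # quote_via=quote percent-encodes spaces as %20 rather than '+'.
    return base_url + "?" + urlencode(pairs, quote_via=quote)


# Search for the term "ocean", returning 10 results starting at position 0.
print(build_request("http://nsdl.org/dds-search", "Search", q="ocean", s=0, n=10))
```

Keeping the verb as the first pair is only a readability choice; the format described above does not state that argument order matters.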
Summary of available requests:
Search - Search across items in the repository using Lucene queries and get a list of matching records.
GetRecord - Get a single record from the repository by ID.
ListFields - Get a list of the search fields in the index.
ListTerms - Get a list of the terms in a given search field or fields.
ListCollections - Get a list of the collections in the repository.
ListXmlFormats - Get a list of the available XML formats from the service.
UrlCheck - Check whether a given resource URL or URLs exists in the repository.
ServiceInfo - Get information about the service end-point and index version.
Search
Summary and usage
The Search request allows a client to search across items in the repository using standard Lucene queries and get a list of matching records. The Search index is composed of search fields, and through the use of query clauses, can be used to apply custom search rank algorithms (see example search queries). The request also provides faceted search, sorting, searching by XML format, date ranges, geospatial bounding box search, and other functionality.
The Search response consists of an ordered set of metadata records, sorted by relevancy. The Search request searches over all XML formats that are available in the repository, unless otherwise specified in the 'xmlFormat' argument as described below. Flow control is managed by the client, which may 'page through' a set of results using the 's' and 'n' arguments as described below.
The Search request accepts queries supplied in the standard Lucene Query Syntax. Lucene supports advanced Information Retrieval query clauses such as term and field boosting, wildcard and fuzzy searches, etc. Queries are supplied in the q argument of the request.
The following request performs a search for the term "ocean" and returns 10 search results, starting at position 0:
Textual and fielded searches: The following argument is used to conduct textual and fielded searches and may be performed independently or in combination with other search criteria described below.
Search by collection(s): Records in the repository are grouped into collections. The collection key argument limits the search to one or more collections. If one or more collection key arguments are included with no other search criteria, the search will return all records in the given collection(s). The available collections and their corresponding collection keys may be discovered using the ListCollections request.
Date range searches: The following arguments instruct the service to search in a given index date field and may be performed independently or in combination with other search criteria. The values provided in the fromDate or toDate arguments must be a union date type string of the form yyyy-MM-dd or an ISO8601 UTC datestamp of the form yyyy-MM-ddTHH:mm:ssZ. Example dates include 2004-07-08 or 2004-07-26T21:58:25Z. The fields that are available for searching by date are listed below. If supplied, the date range portion of the search criteria must match a given record in order for it to be included in the results.
Geospatial searches: Geospatial searches operate over each record that has associated with it a geographic footprint (a geographic region representing the record's area of relevance) in the form of a box (defined below). A geospatial query takes a query region (also in the form of a box) and a spatial predicate (one of "within", "contains", or "overlaps") and returns all documents that 1) have a geographic footprint that 2) has the predicate relationship to the query region.
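A client may want to check fromDate and toDate values before sending a request. The following is a minimal sketch of such a check, assuming only the two date forms described above; the function name is hypothetical and the service's own validation may differ.

```python
import re
from datetime import datetime

# (shape, strptime format) pairs for the two accepted date forms.
_DATE_FORMS = (
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"), "%Y-%m-%dT%H:%M:%SZ"),
)


def is_valid_search_date(value):
    """Return True if value looks like yyyy-MM-dd or yyyy-MM-ddTHH:mm:ssZ."""
    for shape, fmt in _DATE_FORMS:
        if shape.fullmatch(value):
            try:
                # Rejects impossible calendar dates such as 2004-02-30.
                datetime.strptime(value, fmt)
                return True
            except ValueError:
                return False
    return False


print(is_valid_search_date("2004-07-08"))            # True
print(is_valid_search_date("2004-07-26T21:58:25Z"))  # True
print(is_valid_search_date("07/08/2004"))            # False
```

The regular expression enforces zero-padded fields, which strptime alone would not.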
Formally, a box is a geographic region defined by north and south bounding coordinates (latitudes expressed in degrees north of the equator and in the range [-90,90]) and east and west bounding coordinates (longitudes expressed in degrees east of the Greenwich meridian and in the range [-180,180]). The north bounding coordinate must be greater than or equal to the south. The west bounding coordinate may be less than, equal to, or greater than the east; in the latter case, a box that crosses the ±180° meridian is described. As a special case, the set of all longitudes is described by a west bounding coordinate of -180 and an east bounding coordinate of 180.
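The box definition above can be expressed as a small data type. This is an illustrative sketch only: the class and method names are hypothetical and do not correspond to request argument names, which are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A geographic box as defined above (hypothetical client-side type)."""
    north: float
    south: float
    east: float
    west: float

    def __post_init__(self):
        for lat in (self.north, self.south):
            if not -90 <= lat <= 90:
                raise ValueError(f"latitude out of range [-90,90]: {lat}")
        for lon in (self.east, self.west):
            if not -180 <= lon <= 180:
                raise ValueError(f"longitude out of range [-180,180]: {lon}")
        if self.north < self.south:
            raise ValueError("north must be greater than or equal to south")

    @property
    def crosses_antimeridian(self):
        # West greater than east describes a box spanning the +/-180 meridian.
        return self.west > self.east

    def contains_longitude(self, lon):
        if self.crosses_antimeridian:
            return lon >= self.west or lon <= self.east
        return self.west <= lon <= self.east


pacific = Box(north=10, south=-10, east=-170, west=170)
print(pacific.crosses_antimeridian)      # True
print(pacific.contains_longitude(175))   # True
print(pacific.contains_longitude(0))     # False
```

Validating in __post_init__ means an invalid box can never be constructed, so later code does not need to re-check the ranges.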
The following arguments instruct the service to conduct a geospatial query over the subset of records that contain a geospatial footprint. Geospatial queries may be performed independently or in combination with other search criteria. The five required geospatial arguments are conditionally required: to perform a geospatial query, all five must be included; otherwise, none may be included. If an error in the request arguments is encountered, the service will return an appropriate error response and message. The optional geospatial argument may be included if desired.
Flow control: A search client can control the flow of paging through a set of search results and the size of the result set using the s (starting offset) and n (number returned) arguments. As an example, when a search is initially performed, the client might construct a request that supplies the arguments s=0 and n=10 to return up to the first 10 matching results. The client would then page through the set of results by issuing subsequent requests indicating s=10 and n=10 for the next ten results, s=20 and n=10 for results 20 through 29, and so forth up to
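The paging scheme can be sketched as a generator of (s, n) pairs. This is a minimal illustration that assumes the client already knows the total number of matches; how the response reports that total is not covered in this section, and the function name is hypothetical.

```python
def page_arguments(total_results, page_size=10):
    """Yield (s, n) argument pairs that cover total_results matches."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total_results, page_size):
        yield start, page_size


# 25 matches, 10 per page: three requests are needed.
for s, n in page_arguments(25):
    print(f"s={s}&n={n}")
```

The last page may return fewer than n records, since n is an upper bound on the number returned.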
Additional arguments: The following arguments may also be supplied in the request.
Sorting the response: The following arguments instruct the service to sort the response by one or more index fields or relevancy score. The service sorts the entire result set lexically prior to returning the requested portion of the results. Only one of the sort arguments may be supplied in the request. If no sort argument is indicated, results are sorted in descending order by relevancy score. To use the contents of an element or attribute in the record XML for sorting, specify a keyword XPath search field. Any other field that exists in the index as a single token or keyword may also be used.
As a convenience, the following sort arguments may be used as a shorthand way to specify sorting by a single field only:
The Search request supports faceting over categories that have been defined in the index. A category is an aspect of indexed documents which can be used to classify the documents. For example, in a collection of books at an online bookstore, categories of a book can be its price, author, publication date, binding type, and so on. A facet category may be flat or hierarchical, containing one or more levels in a taxonomy/hierarchy path.
In faceted search, in addition to the standard set of search results, the service also returns facet results, which are lists of subcategories for certain categories. For example, for the price facet, one might get a list of relevant price ranges; for the author facet, one might get a list of relevant authors; and so on. In most UIs, when users click one of these subcategories, the search is narrowed, or drilled down, and a new search limited to this subcategory (e.g., to a specific price range or author) is performed.
Include the following required and optional arguments to enable faceting features. These must be included in addition to the normal arguments for Search like
To Find Available Facet Categories
To find the categories that are available for faceted search in the index, use the
To Retrieve Facet Counts
To Drill-down into a Facet
The facet functionality in the Search API is implemented with the Lucene faceting library. For background information about this library and faceting in general, see the Faceted Search User's Guide.
Errors and exceptions
See error and exception conditions.
Search for the word ocean.
Search for the word ocean and limit the search to grade range High (9-12).
Search for all ADN records new to the repository since July 7th, 2004 and sort descending by the wndate field.
http://nsdl.org/dds-search?verb=Search&s=0&n=10&fromDate=2004-07-08&dateField=wndate&sortDescendingBy=wndate&xmlFormat=adn-localized
GetRecord
Summary and usage
The GetRecord request is used to retrieve a single record from the repository in one of the available XML formats.
The following request displays the metadata for record ID DLESE-000-000-000-001 in its native XML format:
Errors and exceptions
See error and exception conditions.
Request the record id DLESE-000-000-000-337 and get the response in ADN format. Shown without the required encoding, for clarity.
ListFields
Summary and usage
The ListFields request is used to get all search fields that reside in the index.
The following request lists all fields in the index:
ListTerms
Summary and usage
The ListTerms request is used to get all search terms that exist in the index for a given field or fields. For each term the response indicates the number of times it appears in the index (termCount) as well as the number of documents (records) it appears in (docCount).
The following request lists all terms in the index for field 'title':
ListCollections
Summary and usage
The ListCollections request is used to discover the collections in the repository, collection metadata and the collection keys used in the Search request. Clients should use this request to generate user interface widgets for selecting collections to search from, and to display collection information and metadata to users.
The following request lists the collections that are in the repository and all available metadata about each collection:
Errors and exceptions
See error and exception conditions.
See link above
ListXmlFormats
Summary and usage
The ListXmlFormats request is used to discover the XML formats that are available from the repository as a whole or for a single record in the repository. Clients should use this request to discover the available XML formats and the keys that may be supplied in the 'xmlFormat' argument of the Search or GetRecord requests.
The Service is able to disseminate any number of XML formats depending on the record collections that reside in the repository. Some common formats include OAI Dublin Core (oai_dc), NSDL Dublin Core (nsdl_dc), DLESE collection (dlese_collect), ADN (ADEPT/DLESE/NASA) (adn), News&Opps (news_opps), and DLESE annotation (dlese_anno).
Certain records may be disseminated in multiple alternative formats. For example, records that were originally cataloged in the ADN format may also be returned in the oai_dc, nsdl_dc, and other formats. When a record is requested in a non-native format, its XML is transformed to the requested format using XSLT or another transformation prior to being returned by the service.
The following request lists the XML formats that may be disseminated from this service and their corresponding search keys:
Errors and exceptions
See error and exception conditions.
Show all XML formats available for ID DLESE-000-000-000-001.
UrlCheck
Summary and usage
The UrlCheck request is used to check whether a given URL is in the DDS repository. This request supports the use of the * wildcard construct. The * character, or wildcard construct, indicates that any character combination is a valid match. For example, a search for http://www.dlese.org/myResource* will match the two URLs http://www.dlese.org/myResource1.html and http://www.dlese.org/myResource2.html. The wildcard construct may appear at any position in the URL argument except the first position.
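The wildcard rules above can be illustrated on the client side. The sketch below only mirrors the described semantics (any character combination matches *, and * may not be in the first position); the matching itself is performed by the service, whose case sensitivity and other details are not specified here. The function names are hypothetical.

```python
import re


def wildcard_to_regex(url_pattern):
    """Translate a UrlCheck-style pattern into a compiled regular expression."""
    if url_pattern.startswith("*"):
        raise ValueError("the wildcard may not appear in the first position")
    # Escape the literal parts and let each * match any run of characters.
    literal_parts = [re.escape(part) for part in url_pattern.split("*")]
    return re.compile(".*".join(literal_parts))


def pattern_matches(url_pattern, url):
    return wildcard_to_regex(url_pattern).fullmatch(url) is not None


pattern = "http://www.dlese.org/myResource*"
print(pattern_matches(pattern, "http://www.dlese.org/myResource1.html"))  # True
print(pattern_matches(pattern, "http://www.dlese.org/myResource2.html"))  # True
print(pattern_matches(pattern, "http://www.dlese.org/other.html"))        # False
```

A check like this can be useful for rejecting a leading wildcard before a request is sent.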
The following request searches for all records in the repository that have a URL ending in '.pdf':
Errors and exceptions
See error and exception conditions.
Determine whether the URL 'http://epod.usra.edu/' is in the repository. Shown without the required encoding, for clarity.
Determine whether the URL 'http://epod.usra.edu/' or 'http://www.marsquestonline.org/index.html' is in the repository.
http://nsdl.org/dds-search?verb=UrlCheck&url=http://epod.usra.edu/&url=http://www.marsquestonline.org/index.html
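Because the URLs being checked contain reserved characters, a real request needs them encoded. The following sketch shows one way to build a UrlCheck request with a repeated url argument using the Python standard library; the helper name is hypothetical and the base URL is the one used in this document's examples.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def url_check_request(base_url, urls):
    """Build a UrlCheck request with one encoded url argument per URL."""
    pairs = [("verb", "UrlCheck")] + [("url", url) for url in urls]
    return base_url + "?" + urlencode(pairs)


request = url_check_request(
    "http://nsdl.org/dds-search",
    ["http://epod.usra.edu/", "http://www.marsquestonline.org/index.html"],
)
print(request)

# Decoding the query string recovers the original URLs.
print(parse_qs(urlsplit(request).query)["url"])
```

Passing a list of pairs rather than a dict is what allows the url argument to appear more than once.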
Determine whether a URL that begins with 'http://www.dlese.org' is in the repository. The * character acts as a wildcard, which may appear at any position in the URL argument except the first position.
Determine whether the URL 'http://epod.usra.edu/zzzz' is in the repository. In this case no matching records are found.
Summary and usage
Service responses
Service responses are returned in XML or JSON format and vary in structure and content depending on the request made. The content and structure of the response from each of the requests are described above in their respective sections. This section describes common response structures that are returned by the service across all requests.
Common response elements
Several requests in the protocol share common XML elements in their responses. These include the <head> and <additionalMetadata> elements, which are described below.
The head element
The head element appears in the Search, GetRecord, and UrlCheck responses. The head element is used to return information about a single record. This includes the ID of the record, the collection of which the record is a member, the XML format of the record that was returned, the native XML format of the record, the date the record was last modified, the whatsNewDate and an additionalMetadata element.
Head element example:
The additionalMetadata element
The additionalMetadata element appears in the Search, GetRecord, and UrlCheck responses and in the vocabulary list class of responses. The additionalMetadata element is used to return additional information related to the record's format type, referred to as realms. The information realms include adn and dlese_collect, and each contains slightly different information related to the underlying format type.
additionalMetadata element example:
Error and exception conditions
If an error or exception occurs, the service returns an <error> element with the type of error indicated by a code attribute. Clients are advised to test the value of these codes and respond with an appropriate message to users. For example, if a user conducts a search that has no matches, the code
Example error response
Request a record id that does not exist in the repository using GetRecord.
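A client can test for the error element before processing a response. The sketch below is an assumption-laden illustration: it relies only on the fact stated above that errors are returned in an <error> element with a code attribute. The wrapper element, the code value, and the message in the sample are invented for demonstration and are not actual service output.

```python
import xml.etree.ElementTree as ET


def find_error(response_xml):
    """Return (code, message) if the response holds an <error> element, else None."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # Compare local names so the check works with or without namespaces.
        if element.tag.rsplit("}", 1)[-1] == "error":
            return element.get("code"), (element.text or "").strip()
    return None


# Hypothetical response: element names other than <error> are made up.
sample = '<ExampleResponse><error code="exampleCode">No such record.</error></ExampleResponse>'
print(find_error(sample))                   # ('exampleCode', 'No such record.')
print(find_error("<ExampleResponse/>"))     # None
```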
By default, all responses are output in XML format. To get JSON output, include the argument output=json in the request. Additionally, a callback argument callback=function may be included to wrap the JSON output in parentheses and a function name of your choosing. The JSON output by the service is a direct translation of the XML structure into JSON.
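The output and callback arguments can be sketched as follows. The helper names are hypothetical, the base URL is the one used in this document's examples, and the sample payload is invented; the only assumption about the wrapped form is the one stated above (the JSON is wrapped in parentheses preceded by the function name).

```python
import json
import re
from urllib.parse import urlencode


def json_request(base_url, verb, callback=None, **arguments):
    """Build a request that asks for JSON output, optionally with a callback."""
    pairs = [("verb", verb), *arguments.items(), ("output", "json")]
    if callback:
        pairs.append(("callback", callback))
    return base_url + "?" + urlencode(pairs)


def unwrap_callback(body, callback):
    """Strip a 'callback( ... )' wrapper and parse the JSON inside it."""
    wrapper = re.compile(
        r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$", re.DOTALL
    )
    match = wrapper.match(body)
    if match is None:
        raise ValueError("body is not wrapped in the expected callback")
    return json.loads(match.group(1))


print(json_request("http://nsdl.org/dds-search", "Search", callback="handle", q="ocean"))
print(unwrap_callback('handle({"example": 1})', "handle"))  # {'example': 1}
```

Unwrapping is only needed when the callback argument is supplied; without it the body can be passed to json.loads directly.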
Removing namespaces from responses
Namespaces can be removed from the XML and JSON output from the service, which can simplify working with and processing the output.
By default, all responses are returned with the namespaces that appear in the requested format disseminated from the repository. To remove namespaces, include the argument transform=localize in the request.
This section describes the search fields that are available in the Search request.
The repository index contains fields that are extracted from each of the XML records within, and a given repository may contain records in many different native XML formats. Searches within a given field operate over the set of records that contain that field. For example, a search in the
Fields may contain plain text, controlled vocabularies or encoded field values.
The default field for queries
The default field used by the query parser is
How search fields are generated
At index creation time, each record is inserted in the repository in its native XML format. The indexer extracts standard, XPath and custom search fields from the content of the native XML. Additional fields associated with the item may also be extracted from other sources, such as text derived from a crawl of the resource described by the metadata record. The indexer then generates a single entry containing each of the fields and inserts it into the repository. All records are guaranteed to contain certain fields such as the
Searching across and within specific XML formats
The Search request operates over and disseminates records in any available XML format. By default, searches operate over the available fields for all records in the repository regardless of format, and results may contain records of mixed XML formats. For example, a search for default:ocean searches for the term ocean in the default field across all records in the repository and may return records in
Requesting search results in a specific XML format: Certain XML formats can be disseminated from the service in multiple formats, for example records that reside natively as
Limiting search to specific XML formats: Each record contains the special field
The xml format keys that may be used in the
Text versus stemmed text
When searching in a text field, exact terms are matched. For example a search for ocean will return all records that contain the exact term ocean in the given field. Where indicated, certain textual fields have stemming applied to them using the Porter stemmer algorithm (snowball variation). When searching in a field that has been stemmed, all records containing morphologically similar terms in the given field are matched. For example a search for stems:ocean will return all records that contain the terms ocean, oceans or oceanic in the stems field. Note that when searching in a stemmed field, the client should not apply stemming to the terms it supplies for search. Stemming will be applied automatically by the search engine for these fields and no pre-processing is necessary by the client.
Standard Search Fields
The following search fields are generally available for all XML formats in the repository. This is implementation specific for each repository - see Configure Search Fields, facets, and relationships.
XPath Search Fields
XPath search fields provide separate searchable fields for the contents of every element and attribute found in the native XML of the records. For each element and attribute there are three forms of search fields: text, stemmed text and untokenized keywords. These provide a powerful, flexible way to search for specific text or data within and across the records in the repository.
The XPath fields consist of a prefix followed by an XPath that addresses a specific XML element or attribute in the XML record. Prefixes are one of
The three types of search fields are processed in the following manner:
The XPaths used for the search fields are the simplest form of XPath expression, containing no namespaces or position specifiers. For more information about XPath see XPath Language 1.0. The ZVON XPath Tutorial is also useful. Note that this is not an implementation of XQuery but rather a mapping of simple XPaths to searchable Lucene fields.
For example, consider this simple XML instance document:
<book>
  <author birthDate="1955-01-25">
    <firstName>John</firstName>
    <lastName>Doe</lastName>
  </author>
  <identifier>http://books.org/catalog_123</identifier>
</book>
The index will contain the following search fields for this record:
As another example, consider the following Dublin Core
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Ocean Science Leadership Awards</dc:title>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This is a description of the Ocean Science Leadership Awards... </dc:description>
  <dc:subject xmlns:dc="http://purl.org/dc/elements/1.1/">Earth system science</dc:subject>
  <dc:subject xmlns:dc="http://purl.org/dc/elements/1.1/">Education</dc:subject>
  <dc:format xmlns:dc="http://purl.org/dc/elements/1.1/">text/html</dc:format>
  <dc:type xmlns:dc="http://purl.org/dc/elements/1.1/">Text</dc:type>
  <dc:identifier xmlns:dc="http://purl.org/dc/elements/1.1/"> http://www.usc.edu/org/cosee-west/quikscience/OceanLeadershipAwards.html </dc:identifier>
</oai_dc:dc>
The following Lucene queries are examples that match specific text and data in this record. As with all fielded Lucene queries, these queries consist of a field name followed by a colon ":" and then followed by the term(s) to search for. Note that XPaths do not contain namespaces or position specifiers:
/stems//dc/title:oceans - Matches the stemmed form of the term ocean found in the title element of the XML record.
/text//dc/subject:education - Matches the term education found in one of the subject elements of the XML record.
/key//dc/format:"text/html" - Matches the untokenized keyword term text/html found in the format element of the XML record.
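The mapping from a record's XML to XPath search field names can be approximated on the client side, for example to discover which fields a record is likely to expose. The sketch below is an assumption, not the indexer's implementation: it uses the /text/, /stems/ and /key/ prefixes seen in the queries above, drops namespaces and positions, writes attributes as @name (an assumed notation), and only lists elements that contain text.

```python
import xml.etree.ElementTree as ET

PREFIXES = ("/text/", "/stems/", "/key/")


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def simple_xpaths(xml_text):
    """Return namespace-free, position-free paths for text elements and attributes."""
    paths = []

    def walk(element, parent_path):
        path = parent_path + "/" + local_name(element.tag)
        for attribute in element.attrib:
            paths.append(path + "/@" + local_name(attribute))
        if element.text and element.text.strip():
            paths.append(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return list(dict.fromkeys(paths))  # de-duplicate, keeping first-seen order


def field_names(xml_text):
    return [prefix + path for path in simple_xpaths(xml_text) for prefix in PREFIXES]


BOOK = """<book>
  <author birthDate="1955-01-25">
    <firstName>John</firstName>
    <lastName>Doe</lastName>
  </author>
  <identifier>http://books.org/catalog_123</identifier>
</book>"""

for name in field_names(BOOK):
    print(name)
```

Repeated elements, such as multiple subject elements, collapse to a single path because the fields carry no position specifiers.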
Searching by indexed XPath
In addition to the XPath fields, a special field named
indexedXpaths:"/dc/subject" - Matches all records that have any value in the /dc/subject field.
Conversely, the following query:
allrecords:true !indexedXpaths:"/dc/subject" - Matches all records that have no value in the /dc/subject field.
Relation Search Fields
The DDS data model supports a notion of relationships between records in the repository. Relation search fields provide a means to search for records based on the content of their related records. For example, each record in the repository has a
XPath relation search fields
XPath relation search fields follow the same syntax as the XPath fields except they contain an additional prefix that specifies the relation. Fields begin with
For example, consider a record that is a member of the following collection:
<collectionRecord>
  <general>
    <fullTitle>Science books</fullTitle>
    <description>This collection has books about science.</description>
  </general>
  <approval>
    <collectionStatuses>
      <collectionStatus date="2010-10-05T13:23:43Z" state="Accessioned"/>
    </collectionStatuses>
  </approval>
  <access>
    <key libraryFormat="book" static="true" redistribute="false">sciBooks</key>
  </access>
  <metaMetadata>
    <catalogEntries>
      <catalog entry="COLLECTION-123"/>
    </catalogEntries>
  </metaMetadata>
</collectionRecord>
The index will contain the following relation search fields for the records in this collection:
etc. (not all fields shown here).
A search for /relation.memberOfCollection//text//collectionRecord/general/fullTitle:oceans will return all records whose collection has the word oceans in their
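The relation field name used in the search above can be assembled from its parts. This is a small sketch with a hypothetical helper; it assumes, based on the statement that relation fields follow the same syntax as XPath fields, that the stems and key forms are composed the same way as the text form shown.

```python
FIELD_KINDS = ("text", "stems", "key")


def relation_field(relation, kind, xpath):
    """Compose '/relation.<relation>//<kind>/<xpath>' (hypothetical helper)."""
    if kind not in FIELD_KINDS:
        raise ValueError(f"kind must be one of {FIELD_KINDS}")
    if not xpath.startswith("/"):
        raise ValueError("xpath must start with '/'")
    return f"/relation.{relation}//{kind}/{xpath}"


field = relation_field("memberOfCollection", "text", "/collectionRecord/general/fullTitle")
print(field + ":oceans")
# /relation.memberOfCollection//text//collectionRecord/general/fullTitle:oceans
```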
Additional relation search fields
Custom Search Fields
Custom search fields are available for specific XML formats as indicated below. Additional implementation specific custom search fields that are not described here may also be available for a given DDS repository configuration.
Textual content - These fields contain the text of the content of the resources themselves, extracted by crawling the first page of the resource. These are available for all ADN resources in the repository whose primary content is in HTML or PDF.
Textual vocabulary fields - These fields contain DLESE controlled vocabularies that have been indexed as plain text.
Defined key fields - These fields contain finite sets of key values that may be used to limit searches to a sub-set of records.
Fields available for searching by value or range of value - These fields may be searched by exact value or by range of value:
Fields available for searching by date - These fields may be supplied in the 'dateField' argument of the Search request:
Example search queries
This section shows some examples of performing searches using the Search request. To perform these searches, the values shown below should be supplied in the 'q' argument, using the Lucene query syntax. Additional arguments may be supplied to the Search request to further limit the search, such as xmlFormat, dateField and the vocabulary fields gr, su, re and cs.
Search for the term 'ocean' in the default field:
Search for the term 'ocean' in the stems field. This will return documents containing morphologically similar terms including ocean, oceans and oceanic:
Search for the terms 'currents in the oceans' in the stems field. Notice that the client should supply the plain English version of the terms without pre-stemming them. In this example the resulting search matches documents that contain one of currents, current or currently AND one of oceans, ocean or oceanic (the terms 'in' and 'the' are stop words that are dropped for the purpose of search):
stems:(currents in the oceans)
Search for resources that have an average star rating of 3.5 to 5.0:
itemannoaveragerating:[3.500 TO 5.000]
Search for resources that contain 'noaa.gov' in their URL:
Search for the term ocean within resources from 'noaa.gov':
url:http*noaa.gov* AND ocean
Search for term 'estuary' in the stems field, and limit the search to subject biological oceanography (subject key 02):
stems:estuary AND su:02
Search for the term 'ocean' in the default field, and boost the ranking of results that contain 'ocean' in their title (stemmed) (uses the special clause allrecords:true to select the set of all records). Note that this clause returns the same number of results as if the search were performed only over the word 'ocean' in the default field, but it applies additional boosting for records that contain the term 'ocean' in their title (stemmed), which augments the search rank of the results that are returned.
ocean AND (allrecords:true OR titlestems:ocean^2)
Show all records with subject biological oceanography, and boost results that contain florida in the title (stemmed), description or placeNames fields (uses the clause allrecords:true to select the set of all records):
su:02 AND (allrecords:true OR titlestems:florida*^20 OR description:florida*^20 OR placeNames:florida^20)
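The boosting pattern used in the last two queries can be generated and then encoded for use as the q argument. This is a minimal sketch with a hypothetical helper name; it does not escape Lucene special characters in the term, so it is only suitable for simple single-word terms.

```python
from urllib.parse import quote, unquote


def boosted_query(term, boost_field, boost):
    """Match term in the default field and boost records that also match boost_field."""
    return f"{term} AND (allrecords:true OR {boost_field}:{term}^{boost})"


query = boosted_query("ocean", "titlestems", 2)
print(query)  # ocean AND (allrecords:true OR titlestems:ocean^2)

# Percent-encode the query before placing it in the q argument.
encoded = quote(query, safe="")
print(encoded)
print(unquote(encoded) == query)  # True
```

Because allrecords:true matches every record, the parenthesized clause does not restrict the result set; it only contributes to the score of records that match the boosted field.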
whatsNewDate - A date that describes when an item was new to the repository. Generally this corresponds to the item's accession date or the date in which the item first became accessible in the repository.
Configure search fields, facets, and relationships
The following document provides information for system administrators who are installing and managing a DDS repository system, which includes the Digital Discovery System (DDS) and the NSDL Collection System (NCS).