Sitecore Search - API Crawler with Edge Pagination

Sitecore Search provides different ways to index items from third-party systems with the help of Source Connectors.

There are different types of Source Connectors available:

API Crawler
API Push
Feed Crawler
Web Crawler
Web Crawler (Advanced)

I have provided more details here about the different sources for crawling different types of documents to get indexed into Sitecore Search.

In this article, I would like to share how to use the API Crawler to query live data and add it to the index.

I had a scenario where I needed to index all the items based on a specific template that has a layout assigned to it, along with their placeholder components.

I decided to utilize the Experience EDGE GraphQL Query to get the data and index it to Sitecore Search. We can use this for non-layout items as well.

This article covers the following:

GraphQL - Layout Query with Components
GraphQL - Item Query
GraphQL - Item Query with Pagination
GraphQL - Search Query with Pagination

GraphQL - Layout Query with Components

Steps to get the item's component details which are provided in the presentation details.

Create a new source with the API Crawler connector selected.
Select Trigger and configure Trigger Type to Request.
In the body use the edge GraphQL query that we used in the playground.

{"query":"query($path: String!,$language: String!,$sitename: String!){layout(site:$sitename,language:$language,routePath:$path){item{rendered} }}","variables":{"path":"/mobilehome","language":"en","sitename":"website"}}

Add header key = "X-GQL-Token" and Value = (the EDGE token for this environment).
Add header key = "content-type" and Value = "application/json" (this Content-Type header is important; without it the query will not return results).
Set Method to POST
Set the URL to "https://edge.sitecorecloud.io/api/graphql/v1"
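Since the Trigger's body has to be a single line of JSON, it can help to write the query in a readable form and let a small script flatten it. The following is a minimal sketch, not part of Sitecore Search itself: buildTriggerBody is a hypothetical helper name, and the query and variables are the ones used in the steps above.

```javascript
// Hypothetical helper: produces the single-line JSON body for the Trigger.
// Note: it collapses every whitespace run in the query, so it assumes the
// query has no string literal that relies on repeated spaces or line breaks.
function buildTriggerBody(query, variables) {
  const singleLineQuery = query.replace(/\s+/g, " ").trim();
  return JSON.stringify({ query: singleLineQuery, variables: variables });
}

// The layout query, written over several lines for readability.
const layoutQuery = `
  query($path: String!, $language: String!, $sitename: String!) {
    layout(site: $sitename, language: $language, routePath: $path) {
      item { rendered }
    }
  }`;

const triggerBody = buildTriggerBody(layoutQuery, {
  path: "/mobilehome",
  language: "en",
  sitename: "website",
});

// Paste the printed line into the Trigger's Body field.
console.log(triggerBody);
```

The same output can be pasted into Postman first to confirm that the query returns results.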

Select Document Extractor and Select the Extractor type to JS.
Select Tagger and Add a new Tagger.
Next, we map the results of the query in the Trigger's body field to the attributes that we have already created. If the required attributes have not been created yet, we need the Tech Admin role to create them.
Attributes are created from Administration Tools ==> Domain Settings ==> Attributes.

The following Document Extractor code snippet extracts the results for the item and its components. In this snippet I'm extracting the content of one component, named "QuickLinks".

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    let requests = [];
    let route = response.body.data.layout.item.rendered.sitecore.route;
    if (route.placeholders['headless-mobile']) {
        route.placeholders['headless-mobile'].forEach(function(e) {
            if (e.componentName == "QuickLinks") {
                e.fields.QuickLinks.forEach(function(links) {
                    requests.push({
                        pageTitle: route.fields.Title.value,
                        name: links.name,
                        type: "layoutResult",
                        url: links.url,
                        id: links.id
                    });
                });
            }
        });
    }
    return requests;
}
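To try this kind of extractor outside Sitecore Search, it can be run against a hand-made response. The sketch below is a slightly more defensive variant of the extractor above, together with a made-up response whose shape is only inferred from the fields the extractor reads. The sample values ("Mobile Home", "Contact", "link-1") are invented for illustration.

```javascript
// Defensive variant: returns an empty list instead of throwing when a level is missing.
function extract(request, response) {
  const requests = [];
  const route = response?.body?.data?.layout?.item?.rendered?.sitecore?.route;
  const components = route?.placeholders?.["headless-mobile"] ?? [];
  components
    .filter((component) => component.componentName === "QuickLinks")
    .forEach((component) => {
      (component.fields?.QuickLinks ?? []).forEach((link) => {
        requests.push({
          pageTitle: route.fields?.Title?.value,
          name: link.name,
          type: "layoutResult",
          url: link.url,
          id: link.id,
        });
      });
    });
  return requests;
}

// Made-up response, shaped like the layout query result (assumption).
const mockResponse = {
  body: { data: { layout: { item: { rendered: { sitecore: { route: {
    fields: { Title: { value: "Mobile Home" } },
    placeholders: {
      "headless-mobile": [
        { componentName: "Banner", fields: {} },
        { componentName: "QuickLinks",
          fields: { QuickLinks: [{ id: "link-1", name: "Contact", url: "/contact" }] } },
      ],
    },
  } } } } } } },
};

const documents = extract({}, mockResponse);
console.log(documents);
```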

GraphQL - Item Query

Steps to get the details of a particular Sitecore item.

Follow all the steps from the previous approach for creating a Source Connector and setting Triggers.
Update the body with the item query to get its own field details.
Example query to get the item details:

{"query":"query($path: String!,$language: String!){item(language:$language,path:$path){name  id pageUrl:url{url}}}",
"variables":{"path":"/sitecore/content/Experiences/home","language":"en"}}

Set the Document Extractor with Extractor type to JS with the following code snippet to get the results mapped to the index attributes.

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    let requests = [];
    if (response.body && response.body.data && response.body.data.item) {
        // Read the item from the response body
        let item = response.body.data.item;
        requests.push({
            name: item.name,
            type: "website",
            url: item.pageUrl.url,
            id: item.id
        });
    }
    return requests;
}

GraphQL - Item Query with Pagination

Steps to get all the item descendants of a particular item.

Follow all the steps from the previous approach for creating a Source Connector and setting Triggers.
Update the Trigger's Body field with the item query to get an item's first-level child item details. These results will then be used to drill down to their child and so on.
Example query to get the first-level child items of the item given by the path variable:

{"query":"query($path: String!,$language: String!){item(language:$language,path:$path){name  children{  results{   name id  path}}}}",
"variables":{"path":"/sitecore/content/site/Home","language":"en"}}

Update the Document Extractor for the JS type with the following code snippet to get the results mapped to the index attributes.

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    let requests = [];
    if (response.body && response.body.data &&
        response.body.data.item && response.body.data.item.children) {
        let item = response.body.data.item;
        item.children.results.forEach(function(e) {
            requests.push({
                name: e.name,
                type: "website",
                // The query above returns only name, id and path for each child,
                // so the item path is used as the url here
                url: e.path,
                id: e.id
            });
        });
    }
    return requests;
}

All the records from the GraphQL request are now received and processed by the Document Extractor. To also index the children of the returned items, we need to add a Request Extractor.
Request Extractor: This extractor creates new API requests to EDGE to get the child details of the retrieved items.
Each API request consists of a GraphQL query (with the path variable set dynamically to each retrieved item), the API URL, and the headers needed to get the child item results from Experience Edge.
There is no need to modify the Document Extractor again for this change.
In the variables section, set $path from the results.
Code Snippet:

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    let requests = [];
    if (response.body && response.body.data &&
        response.body.data.item && response.body.data.item.children) {
        response.body.data.item.children.results.forEach(function(e) {
            requests.push({
                method: request.method,
                // Reuse the URL and headers of the incoming request
                url: request.url,
                headers: request.headers,
                // path is requested for the children too, so the next level can be crawled
                body: JSON.stringify(
                    {"query":"query($path: String!,$language: String!){item(language:$language,path:$path){name children{results{ name id path } } }}","variables":{"path": e.path,"language":"en"}}
                )
            });
        });
    }
    return requests;
}
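The way the two extractors work together can be pictured with a small local simulation. This is only a sketch under assumptions: the tree is made-up data standing in for Experience Edge, simulateCrawl is a hypothetical name, and the Trigger request is treated as depth 0, which may not match how Sitecore Search counts Max Depth.

```javascript
// Made-up content tree: item path -> first-level children.
const tree = {
  "/Home": [
    { name: "About", id: "1", path: "/Home/About" },
    { name: "News", id: "2", path: "/Home/News" },
  ],
  "/Home/News": [{ name: "Post", id: "3", path: "/Home/News/Post" }],
};

// Each queue entry stands for one request. Its children become documents
// (Document Extractor) and also new requests (Request Extractor).
function simulateCrawl(rootPath, maxDepth) {
  const documents = [];
  let queue = [rootPath];
  for (let depth = 0; depth <= maxDepth && queue.length > 0; depth++) {
    const nextQueue = [];
    for (const path of queue) {
      for (const child of tree[path] ?? []) {
        documents.push({ id: child.id, name: child.name, type: "website", url: child.path });
        nextQueue.push(child.path);
      }
    }
    queue = nextQueue;
  }
  return documents;
}

console.log(simulateCrawl("/Home", 0).map((d) => d.name)); // first level only
console.log(simulateCrawl("/Home", 1).map((d) => d.name)); // first and second level
```

In this model a depth limit that is too low silently leaves deeper items out of the index.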

GraphQL - Search Query with Pagination

Steps to get all the items based on a particular template and the item path.

We will try to get the results from Experience Edge based on the Search query with multiple where conditions.
Follow all the steps from the previous approach for creating a Source Connector and setting Triggers.
Update the body with the following GraphQL query.
We can't pass the Template and Item IDs as variables in the where condition of the GraphQL query, as the query then returns no results, so these IDs need to be written directly into the query itself. We can still pass variables for other parameters such as page size and cursor.

{ "query": "query{ details: search( where:{ AND:[ { name: \"_hasLayout\" value: \"true\" } { name: \"_path\" value: \"{40B46996-E445-4C40-84CB-2045E3862835}\",operator: CONTAINS } { name: \"_templates\" value: \"{63252DE3-1B17-4B29-B8E3-70BAED7EEF65}\", operator: CONTAINS } ] }  first: 5 ) { total pageInfo { endCursor hasNext } results { Id: id Pageurl:url{ url } name path } } }","variables":null}
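Because the IDs have to be written into the query text, escaping the quotes by hand is easy to get wrong. A minimal sketch of generating this body with a script follows. buildSearchBody is a hypothetical helper name, and the IDs and page size are the ones from the query above.

```javascript
// Hypothetical helper: inlines the root item ID, template ID and page size
// into the search query and returns the single-line JSON body.
function buildSearchBody(rootItemId, templateId, pageSize) {
  const query =
    "query{ details: search( where:{ AND:[ " +
    '{ name: "_hasLayout" value: "true" } ' +
    `{ name: "_path" value: "${rootItemId}", operator: CONTAINS } ` +
    `{ name: "_templates" value: "${templateId}", operator: CONTAINS } ] } ` +
    `first: ${pageSize} ) ` +
    "{ total pageInfo { endCursor hasNext } results { Id: id Pageurl:url{ url } name path } } }";
  // JSON.stringify takes care of escaping the double quotes inside the query.
  return JSON.stringify({ query: query, variables: null });
}

const searchBody = buildSearchBody(
  "{40B46996-E445-4C40-84CB-2045E3862835}",
  "{63252DE3-1B17-4B29-B8E3-70BAED7EEF65}",
  5
);
console.log(searchBody);
```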

Configure Document Extractor with the JS Extractor Type and add the following code snippet to crawl the results for the above query.

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    let requests = [];
    if (response.body && response.body.data &&
        response.body.data.details && response.body.data.details.results) {
        response.body.data.details.results.forEach(function(e) {
            requests.push({
                name: e.name,
                type: "website",
                url: e?.Pageurl?.url,
                id: e.Id
            });
        });
    }
    return requests;
}

This GraphQL query returns more results than fit in one page, so we paginate using a Request Extractor.
Request Extractor: This extractor creates a new API request to EDGE to get the next page of results.
The API request consists of a GraphQL query (with the cursor variable set dynamically for pagination), the API URL, and the headers needed to get the next page of results from Experience Edge.
There is no need to modify the Document Extractor again for this change.
In the variables section, set $cursor from the endCursor value in the results.
Code Snippet for the cursor-based pagination request:

// Sample extractor function. Change the function to suit your individual needs
function extract(request, response) {
    let requests = [];
    if (response.body && response.body.data && response.body.data.details) {
        let pageInfo = response.body.data.details.pageInfo;
        // Request the next page only while Edge reports more results
        if (pageInfo && pageInfo.hasNext == true && pageInfo.endCursor) {
            requests.push({
                method: request.method,
                // Reuse the URL and headers of the incoming request
                url: request.url,
                headers: request.headers,
                body: JSON.stringify({
                    "query": `query($cursor: String!)
                          { details: search(
                          where:{ AND:[ { name: "_hasLayout" value: "true" } ,
                          { name: "_path" value: "{40B46996-E445-4C40-84CB-2045E3862835}", operator: CONTAINS },
                          { name: "_templates" value: "{63252DE3-1B17-4B29-B8E3-70BAED7EEF65}", operator: CONTAINS } ] }
                           first: 5, after:$cursor )
                           { total pageInfo { endCursor hasNext }
                           results { Id: id Pageurl:url{ url } name path } } }`,
                    "variables": {"cursor": "" + pageInfo.endCursor + ""}
                })
            });
        }
    }
    return requests;
}
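The stop condition of the pagination can be checked locally before saving the extractor. The sketch below isolates that logic in a hypothetical helper, nextPageRequests, and calls it with two made-up responses. The cursor value "abc", the shortened query text and the token placeholder are invented for illustration.

```javascript
// Hypothetical helper: returns one follow-up request when another page exists,
// otherwise an empty list, which ends the pagination.
function nextPageRequests(request, response, query) {
  const pageInfo = response?.body?.data?.details?.pageInfo;
  if (!pageInfo || pageInfo.hasNext !== true || !pageInfo.endCursor) {
    return [];
  }
  return [{
    method: request.method,
    url: request.url,
    headers: request.headers,
    body: JSON.stringify({ query: query, variables: { cursor: pageInfo.endCursor } }),
  }];
}

// Made-up incoming request and responses.
const incoming = {
  method: "POST",
  url: "https://edge.sitecorecloud.io/api/graphql/v1",
  headers: { "content-type": "application/json", "X-GQL-Token": "<token>" },
};
const pagedQuery = "query($cursor: String!){ details: search(first: 5, after:$cursor){ total } }";
const middlePage = { body: { data: { details: { pageInfo: { hasNext: true, endCursor: "abc" } } } } };
const lastPage = { body: { data: { details: { pageInfo: { hasNext: false, endCursor: "" } } } } };

const followUps = nextPageRequests(incoming, middlePage, pagedQuery);
const noFollowUps = nextPageRequests(incoming, lastPage, pagedQuery);
console.log(followUps.length, noFollowUps.length);
```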

Notes and Observations:

While paging, it is important to set Max Depth to a suitable value. For example, if Max Depth is set to 2, the crawler will not follow all the requests returned by the Request Extractor.
Also, do not put line breaks in the GraphQL query provided directly in the Trigger's body field, as this causes a GraphQL error. We can use Postman to validate that our query works before placing it in the Trigger's Body.
In the Request Extractor, a GraphQL query can span multiple lines if it is wrapped in backticks (`), that is, written as a JavaScript template literal.
Always provide the ID attribute when indexing API-based data.
Always provide the Type attribute; without it, indexing will fail.
Do not use the = sign when assigning values to attributes in the extractor; always use a colon.
Any API that returns a JSON response is supported; at present, XML responses are not supported.
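As a rough aid for choosing Max Depth, the number of follow-up requests needed for cursor paging can be estimated from the total reported by the search query and the page size. This is a sketch that assumes each follow-up request sits one level deeper than the request that produced it. How Sitecore Search actually counts depth should be confirmed in its documentation, and followUpRequestCount is a hypothetical helper name.

```javascript
// Hypothetical helper: number of paged follow-up requests after the Trigger request.
function followUpRequestCount(total, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  const pages = Math.ceil(total / pageSize);
  // The first page comes from the Trigger request itself.
  return Math.max(pages - 1, 0);
}

// 23 results with "first: 5" means 5 pages, so 4 follow-up requests.
console.log(followUpRequestCount(23, 5)); // 4
```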

Let's learn and grow together, happy programming 😊
