December 14, 2020

Bash Scripting to Submit Appropriate URLs through Bing Webmaster API

An earlier post, Accessing Bing Webmaster Tools API using cURL, demonstrated how to submit URLs to Bing in batches. All that remains is to pass the appropriate URLs to this API. In this post, I create a Bash script that automatically selects the URLs of updated pages by referring to the last-modified dates in the sitemap.

Retrieve Sitemap and Extract Newer Entries

First, retrieve the sitemap using curl and extract the entries that are newer than the last submitted entry. Because jq can filter JSON data concisely in Bash scripts, convert the XML to JSON using the xq command from the yq package. Suppose the sitemap and its JSON conversion are as follows:

<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-0</loc>
    <lastmod>2020-12-31T00:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.com/page-1</loc>
    <lastmod>2021-01-01T00:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.com/page-2</loc>
    <lastmod>2021-01-02T00:00:00Z</lastmod>
  </url>
</urlset>

{
  "urlset": {
    "@xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "url": [
      {
        "loc": "https://example.com/page-0",
        "lastmod": "2020-12-31T00:00:00Z"
      },
      {
        "loc": "https://example.com/page-1",
        "lastmod": "2021-01-01T00:00:00Z"
      },
      {
        "loc": "https://example.com/page-2",
        "lastmod": "2021-01-02T00:00:00Z"
      }
    ]
  }
}

Then, filter the elements of the array under the url key using jq. Let the date of the last submitted entry be 2020-12-31T00:00:00Z. Select the elements whose lastmod value is newer than this date. Then, collect the resulting stream into an array using the -s option and sort it in ascending order of lastmod. Because ISO 8601 orders its fields from the largest time unit to the smallest, timestamps written in the same format and time zone can be compared as plain strings.

curl https://example.com/sitemap.xml |
    xq |
    jq '.urlset.url[] | select(.lastmod > "2020-12-31T00:00:00Z")' |
    jq -s 'sort_by(.lastmod)'

The resulting JSON array is:

[
  {
    "loc": "https://example.com/page-1",
    "lastmod": "2021-01-01T00:00:00Z"
  },
  {
    "loc": "https://example.com/page-2",
    "lastmod": "2021-01-02T00:00:00Z"
  }
]
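
The filter and the sort can also be expressed as a single jq invocation. The following is a minimal sketch, not the script's actual code: it uses inline sample data in place of the curl and xq steps, and the sitemap_json and last_submitted variable names are assumptions made for illustration. Passing the date with --arg keeps it out of the filter string.

```bash
# Sample data standing in for the transcoded sitemap (assumed shape, unsorted on purpose).
sitemap_json='{"urlset":{"url":[
  {"loc":"https://example.com/page-2","lastmod":"2021-01-02T00:00:00Z"},
  {"loc":"https://example.com/page-0","lastmod":"2020-12-31T00:00:00Z"},
  {"loc":"https://example.com/page-1","lastmod":"2021-01-01T00:00:00Z"}]}}'

# Hypothetical variable holding the date of the last submitted entry.
last_submitted="2020-12-31T00:00:00Z"

# Wrap the filtered stream in [...] so that no second jq -s pass is needed,
# then sort by lastmod; -c prints the array on one line.
newer_list=$(echo "$sitemap_json" |
    jq -c --arg last "$last_submitted" \
        '[.urlset.url[] | select(.lastmod > $last)] | sort_by(.lastmod)')

echo "$newer_list"
```

With this sample data, the array contains page-1 followed by page-2, as in the output above.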

The number of newer entries is the length of this array. Suppose you assign this array to a newer_list variable:

echo "$newer_list" | jq length
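
As a small sketch of how the length might be kept for later use, the following stores it in a newer_length variable, the name used by the loop later in this post. The sample value of newer_list is the array shown above.

```bash
# The array of newer entries from the previous step (sample data).
newer_list='[{"loc":"https://example.com/page-1","lastmod":"2021-01-01T00:00:00Z"},
             {"loc":"https://example.com/page-2","lastmod":"2021-01-02T00:00:00Z"}]'

# Quote the variable so that the shell passes the JSON to jq unchanged.
newer_length=$(echo "$newer_list" | jq length)

echo "$newer_length"
```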

Request Daily Quota for URL Submission

Additionally, request the remaining daily quota for URL submission from Bing using the GetUrlSubmissionQuota method of the Bing Webmaster API. This method requires an API key. The DailyQuota key is inside the d key of the JSON response.

curl -X GET "https://ssl.bing.com/webmaster/api.svc/json/GetUrlSubmissionQuota?siteUrl=https://example.com/&apikey=$api_key" |
    jq .d.DailyQuota
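
The quota can be captured in a daily_quota variable, the name used by the loop in the next section. The sketch below parses a sample response instead of calling the API; the response body and the value 10 are made up for illustration, and only the d.DailyQuota path comes from the description above. Falling back to 0 when the key is missing or not a number is a defensive choice of this sketch, so that a failed request submits nothing.

```bash
# Made-up sample response; a real one comes from GetUrlSubmissionQuota.
response='{"d":{"DailyQuota":10}}'

# Use 0 when the key is absent (for example, in an error response).
daily_quota=$(echo "$response" | jq -r '.d.DailyQuota // 0')

# Guard against anything that is not a non-negative integer.
case "$daily_quota" in
    ''|*[!0-9]*) daily_quota=0 ;;
esac

echo "$daily_quota"
```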

Add Newer Entries to URL List

Next, assign the URLs of the newer entries to a url_list variable as a comma-separated list for the JSON request. The number of URLs must not exceed the daily quota obtained in the previous section; here, newer_length and daily_quota hold the array length and the quota obtained above. In each iteration, the loc variable is assigned the URL stored in the loc key at the current index, and the lastmod variable is updated so that it ends up holding the date of the newest entry in the current submission. The -r option removes the surrounding quotes from the lastmod value, whereas loc keeps its quotes because it is embedded in the JSON request.

index=0
url_list=""
while [ "$index" -lt "$newer_length" ] && [ "$index" -lt "$daily_quota" ]; do
    # Keep the JSON quotes on loc; strip them from lastmod with -r.
    loc=$(echo "$newer_list" | jq ".[$index].loc")
    lastmod=$(echo "$newer_list" | jq -r ".[$index].lastmod")
    if [ -z "$url_list" ]; then
        url_list=$loc
    else
        url_list="$url_list, $loc"
    fi
    index=$((index + 1))
done

The resulting value of the url_list variable is:

"https://example.com/page-1", "https://example.com/page-2"

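As an alternative to the loop, the same values can probably be produced by jq alone. This is a sketch under the assumption that newer_list is already sorted and daily_quota is a non-negative integer; it is not the approach the script in this post takes. The slice .[:$n] keeps at most $n entries, and tojson re-adds the JSON quotes around each URL.

```bash
# Sample inputs from the previous sections.
newer_list='[{"loc":"https://example.com/page-1","lastmod":"2021-01-01T00:00:00Z"},
             {"loc":"https://example.com/page-2","lastmod":"2021-01-02T00:00:00Z"}]'
daily_quota=10

# Take at most daily_quota entries and join their quoted URLs with ", ".
url_list=$(echo "$newer_list" |
    jq -r --argjson n "$daily_quota" '.[:$n] | map(.loc | tojson) | join(", ")')

# Date of the newest entry in this submission; empty when nothing is submitted.
lastmod=$(echo "$newer_list" |
    jq -r --argjson n "$daily_quota" '.[:$n] | last | .lastmod // empty')

echo "$url_list"
echo "$lastmod"
```
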
Submit URL List and Store Submitted Entry

Finally, submit the value of the url_list variable above to Bing using the SubmitUrlBatch method of the Bing Webmaster API.

curl -d "{\"siteUrl\": \"https://example.com/\", \"urlList\": [$url_list]}" \
    -H 'Content-Type: application/json; charset=utf-8' -X POST \
    "https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlBatch?apikey=$api_key"

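Escaping quotes by hand in the request body is easy to get wrong, so one option is to let jq build the whole body. The following sketch assumes that newer_list has already been limited to the daily quota; the site_url and payload variable names are introduced here for illustration. The curl call is left commented out so that the sketch runs without network access or an API key.

```bash
site_url="https://example.com/"
newer_list='[{"loc":"https://example.com/page-1","lastmod":"2021-01-01T00:00:00Z"},
             {"loc":"https://example.com/page-2","lastmod":"2021-01-02T00:00:00Z"}]'

# Build the request body; jq takes care of quoting and escaping.
payload=$(echo "$newer_list" |
    jq -c --arg site "$site_url" '{siteUrl: $site, urlList: map(.loc)}')

echo "$payload"

# Same request as above, with the generated body:
# curl -d "$payload" \
#     -H 'Content-Type: application/json; charset=utf-8' -X POST \
#     "https://ssl.bing.com/webmaster/api.svc/json/SubmitUrlBatch?apikey=$api_key"
```
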
Additionally, store the value of the lastmod variable as the date of the last submitted entry, so that the next submission can repeat the process from the beginning of this post.
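
One simple way to keep this date between runs is a small state file. The sketch below is only an illustration: the state_file path (a temporary file here), the STATE_FILE environment variable, and the 1970-01-01T00:00:00Z default for the first run are all assumptions, not part of the published script.

```bash
# Hypothetical location of the state file; a temporary file is used for this demo.
state_file="${STATE_FILE:-$(mktemp)}"

# Before filtering: read the stored date, or fall back to a very old one.
if [ -s "$state_file" ]; then
    last_submitted=$(cat "$state_file")
else
    last_submitted="1970-01-01T00:00:00Z"
fi
echo "last submitted: $last_submitted"

# After a successful submission: lastmod holds the newest submitted date.
lastmod="2021-01-02T00:00:00Z"
if [ -n "$lastmod" ]; then
    printf '%s\n' "$lastmod" > "$state_file"
fi

echo "stored: $(cat "$state_file")"
```

Writing the file only when lastmod is non-empty avoids overwriting the stored date on a run that submits nothing.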

Python Script Example

Combining the steps above, you can submit the URLs of updated pages without specifying them manually. While this post describes the process in Bash, the same principles apply to a Python implementation. I have published the submit_urls.py Python script, which was originally a Bash script, on GitHub.
