<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.29 (Ruby 3.4.4) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-illyes-aipref-cbcp-00" category="info" submissionType="independent" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.29.0 -->
  <front>
    <title abbrev="cbcp">Crawler best practices</title>
    <seriesInfo name="Internet-Draft" value="draft-illyes-aipref-cbcp-00"/>
    <author initials="G." surname="Illyes" fullname="Gary Illyes">
      <organization>Independent</organization>
      <address>
        <email>synack@garyillyes.com</email>
      </address>
    </author>
    <author initials="M." surname="Kuehlewind" fullname="Mirja Kühlewind">
      <organization>Ericsson</organization>
      <address>
        <email>mirja.kuehlewind@ericsson.com</email>
      </address>
    </author>
    <date year="2025" month="July" day="07"/>
    <abstract>
      <?line 38?>

<t>This document describes best practices for web crawlers.</t>
    </abstract>
    <note removeInRFC="true">
      <name>Discussion Venues</name>
      <t>Source for this draft and an issue tracker can be found at
    <eref target="https://github.com/garyillyes/cbcp"/>.</t>
    </note>
  </front>
  <middle>
    <?line 43?>

<section anchor="introduction">
      <name>Introduction</name>
      <t>Automatic clients, such as crawlers and bots, are used to access web resources
for purposes such as indexing for search engines or, more recently, training
models for artificial intelligence (AI) applications. As crawling activity
increases, automatic clients must behave appropriately and respect the
constraints of the resources they access. This includes clearly documenting how
they can be identified and how their behavior can be influenced. Therefore,
crawler operators are asked to follow the best practices for crawling outlined
in this document.</t>
      <t>To further assist website owners, the creation of a central registry where
website owners can look up well-behaved crawlers should also be considered.
Self-declared research crawlers, including privacy and malware discovery
crawlers, as well as contractual crawlers are welcome to adopt these practices.
However, due to the nature of their relationship with sites, they may exempt
themselves from any of the Crawler Best Practices if they provide a rationale.</t>
    </section>
    <section anchor="recommended-best-practices">
      <name>Recommended Best Practices</name>
      <t>The following best practices should be followed and are already applied by the
vast majority of large-scale crawlers on the Internet:</t>
      <ol spacing="normal" type="1"><li>
          <t>Crawlers must support and respect the Robots Exclusion Protocol.</t>
        </li>
        <li>
          <t>Crawlers must be easily identifiable through their user agent string.</t>
        </li>
        <li>
          <t>Crawlers must not interfere with the regular operation of a site.</t>
        </li>
        <li>
          <t>Crawlers must support caching directives.</t>
        </li>
        <li>
          <t>Crawlers must expose the IP ranges they are crawling from in a standardized format.</t>
        </li>
        <li>
          <t>Crawlers must expose a page that explains how the crawled data is used and how the crawler can be blocked.</t>
        </li>
      </ol>
      <section anchor="crawlers-must-respect-the-robots-exclusion-protocol">
        <name>Crawlers must respect the Robots Exclusion Protocol</name>
        <t>All well-behaved crawlers must support the REP as defined in
<xref section="2.2.1" sectionFormat="of" target="REP"/> to allow site owners to opt out from crawling.</t>
        <t>Crawlers also need to respect the <tt>X-Robots-Tag</tt> HTTP response header,
especially if the website chooses not to use a robots.txt file as defined by the REP.</t>
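        <t>As a minimal sketch of the two checks described above, the following Python
example evaluates robots.txt rules with the standard library parser and inspects
<tt>X-Robots-Tag</tt> values. The product token <tt>ExampleBot</tt> and the function
names are assumptions made for illustration. The handling of <tt>X-Robots-Tag</tt> is
deliberately simplified: only the commonly documented <tt>noindex</tt> and <tt>none</tt>
directives are checked, and values scoped to a specific crawler are treated as
applying to every crawler.</t>
        <sourcecode type="python"><![CDATA[
```python
from urllib.robotparser import RobotFileParser

# Hypothetical product token; not defined by this document.
CRAWLER_TOKEN = "ExampleBot"


def allowed_by_robots_txt(robots_txt, url, token=CRAWLER_TOKEN):
    """Evaluate already-fetched robots.txt content for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


def may_index(x_robots_tag_values):
    """Return False if any X-Robots-Tag value carries noindex or none.

    Simplification (assumption): a value such as "otherbot: noindex" is
    treated as if it applied to every crawler, which errs on the side of
    caution. Other directives are ignored.
    """
    for value in x_robots_tag_values:
        for piece in value.split(","):
            # Keep only the text after an optional "name:" prefix.
            directive = piece.rpartition(":")[2].strip().lower()
            if directive in ("noindex", "none"):
                return False
    return True
```
]]></sourcecode>
        <t>A crawler built along these lines would call <tt>allowed_by_robots_txt</tt> before
requesting a URL and <tt>may_index</tt> after receiving the response headers.</t>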
      </section>
      <section anchor="crawlers-must-be-easily-identifiable-through-their-user-agent-string">
        <name>Crawlers must be easily identifiable through their user agent string</name>
        <t>As outlined in <xref section="2.2.1" sectionFormat="of" target="REP"/> (Robots Exclusion Protocol; REP),
the HTTP request header <tt>User-Agent</tt> should clearly identify the crawler,
usually by including the URL of a page that describes the crawler. For example:</t>
        <t><tt>User-Agent: Mozilla/5.0 (compatible; ExampleBot/0.1; +https://www.example.com/bot.html)</tt>.</t>
        <t>This is already a widely accepted practice among crawler operators. To remain
compliant, crawler operators must include a unique identifier for each of their
crawlers in the User-Agent string. The identifier is matched case-insensitively;
for example, a User-Agent string could contain 'googlebot' and 'https://url/...'.
Additionally, the name should clearly identify both the crawler owner and its
purpose as much as reasonably possible.</t>
      </section>
      <section anchor="crawlers-must-not-interfere-with-the-normal-operation-of-a-site">
        <name>Crawlers must not interfere with the normal operation of a site</name>
        <t>Depending on a site's setup (computing resources and software efficiency) and its
size, crawling may slow down the site or even take it offline altogether. Crawler
operators must ensure that their crawlers are equipped with back-off logic that
relies at least on the standard signals defined by <xref section="15.6" sectionFormat="of" target="HTTP-SEMANTICS"/>,
and preferably also on additional heuristics such as a relative change in the
response time of the server.</t>
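        <t>A minimal sketch of such logic in Python follows. It reacts to 5xx status
codes, honors a <tt>Retry-After</tt> value given in seconds, and slows down when the
response time rises well above a baseline. The function name, the multipliers,
the factor-of-two latency threshold, and the delay bounds are all assumptions
chosen for illustration, not values recommended by this document.</t>
        <sourcecode type="python"><![CDATA[
```python
def next_delay(current_delay, status, retry_after=None,
               response_time=None, baseline_time=None,
               min_delay=1.0, max_delay=3600.0):
    """Return the delay (seconds) to apply before the next request.

    All thresholds and multipliers are illustrative assumptions.
    """
    if 500 <= status <= 599:
        # Server error: back off exponentially.
        delay = current_delay * 2
        # Honor Retry-After when given as a number of seconds.
        if retry_after is not None and retry_after.strip().isdigit():
            delay = max(delay, float(retry_after.strip()))
    elif (response_time is not None and baseline_time
          and response_time > 2 * baseline_time):
        # Heuristic: the server responds much slower than usual.
        delay = current_delay * 1.5
    else:
        # Healthy response: recover slowly towards the minimum delay.
        delay = current_delay * 0.9
    return min(max(delay, min_delay), max_delay)
```
]]></sourcecode>
        <t>For instance, <tt>next_delay(2.0, 503, retry_after="120")</tt> yields a delay of
120 seconds, because the server's explicit request exceeds the doubled delay.</t>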
        <t>Therefore, to avoid repeatedly crawling the same resource, crawlers should log
the URLs they have already visited, the number of requests sent to each resource,
and the HTTP status code of each response, especially if errors occur.</t>
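        <t>The bookkeeping described above could be sketched in Python as follows. The
class name <tt>CrawlLog</tt>, the default limit of three attempts, and the rule that
any 2xx or 3xx response counts as "already crawled" are assumptions made for
this example.</t>
        <sourcecode type="python"><![CDATA[
```python
from collections import defaultdict


class CrawlLog:
    """Record the status code of every request, keyed by URL."""

    def __init__(self, max_attempts=3):
        # max_attempts is an illustrative limit, not a recommended value.
        self.max_attempts = max_attempts
        self.statuses = defaultdict(list)

    def record(self, url, status):
        self.statuses[url].append(status)

    def request_count(self, url):
        return len(self.statuses.get(url, []))

    def should_fetch(self, url):
        """Skip URLs already fetched or that failed too often."""
        history = self.statuses.get(url, [])
        if any(200 <= status < 400 for status in history):
            return False
        return len(history) < self.max_attempts
```
]]></sourcecode>
        <t>A production crawler would additionally expire log entries so that resources
can be re-crawled after a suitable interval.</t>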
        <t>Generally, crawlers should avoid sending multiple requests to the same resources
at the same time and should limit the crawling speed to prevent server overload,
following the limits outlined in the REP where possible. Additionally, resources
should not be re-crawled too often. Ideally, crawlers should restrict the depth of
crawling and the number of requests per resource to prevent loops.</t>
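        <t>As an illustration of these limits, the Python sketch below spaces out
requests to the same host and refuses URLs beyond a maximum crawl depth. The
class name <tt>PoliteScheduler</tt>, the one-second interval, and the depth limit of
five are hypothetical choices; timestamps are passed in explicitly so that the
logic stays independent of any particular clock.</t>
        <sourcecode type="python"><![CDATA[
```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Per-host request spacing plus a crawl depth limit."""

    def __init__(self, min_interval=1.0, max_depth=5):
        # Both defaults are illustrative assumptions.
        self.min_interval = min_interval
        self.max_depth = max_depth
        self.next_allowed = {}  # host -> earliest time of next request

    def wait_time(self, url, depth, now):
        """Seconds to wait before fetching url; None if too deep."""
        if depth > self.max_depth:
            return None
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, now) - now)

    def mark_requested(self, url, now):
        """Note that a request to url's host was sent at time now."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = now + self.min_interval
```
]]></sourcecode>
        <t>Keeping one entry per host, rather than per URL, is what prevents several
simultaneous requests from reaching the same server.</t>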
        <t>Crawlers should not attempt to bypass authentication or other access restrictions,
such as when login is required, CAPTCHAs are in use, or content is behind a paywall,
unless explicitly agreed upon with the website owner.</t>
        <t>Crawlers should primarily access resources using HTTP GET requests, resorting to
other methods (e.g., POST, PUT) only if there is a prior agreement with the publisher
or if the publisher's content management system automatically makes those calls when
JavaScript runs. Generally, the load caused by executing JavaScript should be
carefully considered or even avoided whenever possible.</t>
      </section>
      <section anchor="crawlers-must-support-caching-directives">
        <name>Crawlers must support caching directives</name>
        <t>HTTP caching <xref target="HTTP-CACHING"/> removes the need for crawlers to
repeatedly access the same URL.</t>
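        <t>The Python sketch below illustrates two ways a crawler might support caching:
a freshness check based on the <tt>max-age</tt> directive of <tt>Cache-Control</tt>, and
conditional request headers built from stored validators. It is a simplified
example under stated assumptions: the function names and the layout of the
<tt>cached</tt> dictionary are hypothetical, quoted commas in directive arguments
are not handled, and other freshness sources such as <tt>Expires</tt> are ignored.</t>
        <sourcecode type="python"><![CDATA[
```python
def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"')
    return directives


def is_fresh(cache_control, age_seconds):
    """True if a stored response can be reused without a new request."""
    directives = parse_cache_control(cache_control)
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age", "")
    return max_age.isdigit() and age_seconds < int(max_age)


def conditional_headers(cached):
    """Build revalidation headers from stored validators.

    'cached' is a hypothetical dict with optional 'etag' and
    'last_modified' entries copied from the earlier response.
    """
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers
```
]]></sourcecode>
        <t>When a stored response is no longer fresh, sending the conditional headers lets
the server answer with 304 (Not Modified) instead of the full content.</t>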
      </section>
      <section anchor="crawlers-must-expose-the-ip-ranges-they-use-for-crawling">
        <name>Crawlers must expose the IP ranges they use for crawling</name>
        <t>To complement the REP, crawler operators should publish the IP ranges they have
allocated for crawling in a standardized, machine-readable format, and keep this
information reasonably up to date (i.e., it should not be out of date by more than 7 days).</t>
        <t>The object containing the IP addresses must be linked from the page describing the
crawler, and it must also be referenced in the page's metadata for machine
readability. For example:</t>
        <t><tt>&lt;link rel="help" href="https://example.com/crawlerips.json"&gt;</tt></t>
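        <t>This document does not define the layout of the object containing the IP
ranges. Purely as an illustration, the Python sketch below assumes a JSON object
with a hypothetical <tt>updated</tt> timestamp and a hypothetical <tt>prefixes</tt> list of
CIDR blocks, and shows how a site owner might check a client address against it
and detect a list that is more than 7 days old. The addresses used are reserved
documentation ranges.</t>
        <sourcecode type="python"><![CDATA[
```python
import ipaddress
import json
from datetime import datetime, timedelta, timezone

# Hypothetical layout; field names are assumptions for this example.
SAMPLE = json.dumps({
    "updated": "2025-07-01T00:00:00+00:00",
    "prefixes": ["192.0.2.0/24", "2001:db8::/32"],
})


def load_ranges(document):
    """Parse the published prefixes into network objects."""
    data = json.loads(document)
    return [ipaddress.ip_network(prefix) for prefix in data["prefixes"]]


def is_crawler_ip(address, networks):
    """True if the address falls inside any published range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


def is_stale(document, now, max_age=timedelta(days=7)):
    """True if the list was last updated more than max_age ago."""
    updated = datetime.fromisoformat(json.loads(document)["updated"])
    return now - updated > max_age
```
]]></sourcecode>
        <t>A site owner could combine <tt>is_crawler_ip</tt> with the User-Agent check to
confirm that a request claiming to come from a crawler really originates from
the operator's published ranges.</t>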
      </section>
      <section anchor="crawlers-must-explain-how-the-crawled-data-is-used-and-the-crawler-can-be-blocked">
        <name>Crawlers must explain how the crawled data is used and the crawler can be blocked</name>
        <t>Crawlers must be easily identifiable through their <tt>user-agent</tt> string, and they
should explain how the data they collect will be used. In practice, this is usually
done via the documentation page linked in the crawler's user agent. Additionally,
the documentation page should include a contact address for the crawler owner.</t>
        <t>The documentation page should also provide an example robots.txt file that
blocks the crawler, as well as a method for verifying REP files.</t>
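        <t>As a sketch of what such a page might offer, the Python example below
generates rules that block a single crawler from an entire site and verifies
them with the standard library parser. The product token <tt>ExampleBot</tt> and the
function names are assumptions for illustration; operators would substitute
their own token and may provide a different verification method.</t>
        <sourcecode type="python"><![CDATA[
```python
from urllib.robotparser import RobotFileParser


def blocking_rules(token):
    """robots.txt content that blocks one crawler from the whole site."""
    return f"User-agent: {token}\nDisallow: /\n"


def is_blocked(robots_txt, token, url):
    """Verify whether the given rules block the crawler for a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, url)
```
]]></sourcecode>
        <t>With the hypothetical token, <tt>blocking_rules("ExampleBot")</tt> produces a
two-line file that a site owner can paste into robots.txt and then check with
<tt>is_blocked</tt> before deploying it.</t>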
      </section>
    </section>
    <section anchor="conventions-and-definitions">
      <name>Conventions and Definitions</name>
      <t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.</t>
      <?line -18?>

</section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>TODO Security</t>
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>
  </middle>
  <back>
    <references anchor="sec-normative-references">
      <name>Normative References</name>
      <reference anchor="REP">
        <front>
          <title>Robots Exclusion Protocol</title>
          <author fullname="M. Koster" initials="M." surname="Koster"/>
          <author fullname="G. Illyes" initials="G." surname="Illyes"/>
          <author fullname="H. Zeller" initials="H." surname="Zeller"/>
          <author fullname="L. Sassman" initials="L." surname="Sassman"/>
          <date month="September" year="2022"/>
          <abstract>
            <t>This document specifies and extends the "Robots Exclusion Protocol" method originally defined by Martijn Koster in 1994 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers. Specifically, it adds definition language for the protocol, instructions for handling errors, and instructions for caching.</t>
          </abstract>
        </front>
        <seriesInfo name="RFC" value="9309"/>
        <seriesInfo name="DOI" value="10.17487/RFC9309"/>
      </reference>
      <reference anchor="HTTP-SEMANTICS">
        <front>
          <title>HTTP Semantics</title>
          <author fullname="R. Fielding" initials="R." role="editor" surname="Fielding"/>
          <author fullname="M. Nottingham" initials="M." role="editor" surname="Nottingham"/>
          <author fullname="J. Reschke" initials="J." role="editor" surname="Reschke"/>
          <date month="June" year="2022"/>
          <abstract>
            <t>The Hypertext Transfer Protocol (HTTP) is a stateless application-level protocol for distributed, collaborative, hypertext information systems. This document describes the overall architecture of HTTP, establishes common terminology, and defines aspects of the protocol that are shared by all versions. In this definition are core protocol elements, extensibility mechanisms, and the "http" and "https" Uniform Resource Identifier (URI) schemes.</t>
            <t>This document updates RFC 3864 and obsoletes RFCs 2818, 7231, 7232, 7233, 7235, 7538, 7615, 7694, and portions of 7230.</t>
          </abstract>
        </front>
        <seriesInfo name="STD" value="97"/>
        <seriesInfo name="RFC" value="9110"/>
        <seriesInfo name="DOI" value="10.17487/RFC9110"/>
      </reference>
      <reference anchor="HTTP-CACHING">
        <front>
          <title>HTTP Caching</title>
          <author fullname="R. Fielding" initials="R." role="editor" surname="Fielding"/>
          <author fullname="M. Nottingham" initials="M." role="editor" surname="Nottingham"/>
          <author fullname="J. Reschke" initials="J." role="editor" surname="Reschke"/>
          <date month="June" year="2022"/>
          <abstract>
            <t>The Hypertext Transfer Protocol (HTTP) is a stateless application-level protocol for distributed, collaborative, hypertext information systems. This document defines HTTP caches and the associated header fields that control cache behavior or indicate cacheable response messages.</t>
            <t>This document obsoletes RFC 7234.</t>
          </abstract>
        </front>
        <seriesInfo name="STD" value="98"/>
        <seriesInfo name="RFC" value="9111"/>
        <seriesInfo name="DOI" value="10.17487/RFC9111"/>
      </reference>
      <reference anchor="RFC2119">
        <front>
          <title>Key words for use in RFCs to Indicate Requirement Levels</title>
          <author fullname="S. Bradner" initials="S." surname="Bradner"/>
          <date month="March" year="1997"/>
          <abstract>
            <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
          </abstract>
        </front>
        <seriesInfo name="BCP" value="14"/>
        <seriesInfo name="RFC" value="2119"/>
        <seriesInfo name="DOI" value="10.17487/RFC2119"/>
      </reference>
      <reference anchor="RFC8174">
        <front>
          <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
          <author fullname="B. Leiba" initials="B." surname="Leiba"/>
          <date month="May" year="2017"/>
          <abstract>
            <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
          </abstract>
        </front>
        <seriesInfo name="BCP" value="14"/>
        <seriesInfo name="RFC" value="8174"/>
        <seriesInfo name="DOI" value="10.17487/RFC8174"/>
      </reference>
    </references>
    <?line 194?>

<section numbered="false" anchor="acknowledgments">
      <name>Acknowledgments</name>
      <t>TODO acknowledge.</t>
    </section>
  </back>

</rfc>
