<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY mdash  "&#8212;">
  <!ENTITY wj     "&#8288;">
]>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" submissionType="IETF" docName="draft-illyes-repext-00" category="info" ipr="trust200902" obsoletes="" updates="" xml:lang="en" symRefs="true" sortRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.22.0 -->
  <!-- Generated by id2xml 1.5.2 on 2024-08-01T16:00:46Z -->
	<front>
    <title abbrev="Robots Exclusion Protocol Extension">Robots Exclusion Protocol Extension for URI Level Control</title>
    <seriesInfo name="Internet-Draft" value="draft-illyes-repext-00"/>
    <author initials="G." surname="Illyes" fullname="Gary Illyes" role="editor">
      <organization>Google LLC.</organization>
      <address>
        <postal>
          <street>Brandschenkestrasse 110</street>
          <city>Zurich</city>
          <code>8002</code>
          <country>Switzerland</country>
        </postal>
        <email>garyillyes@google.com</email>
      </address>
    </author>
    <date year="2024" month="August" day="1"/>
    <workgroup>Internet Engineering Task Force (IETF)</workgroup>
    <abstract>
      <t>
   This document extends RFC9309 by specifying additional URI level
   controls through an application level header and HTML meta tags
   originally developed in 1996. Additionally, it moves the response
   header out of the experimental header space (i.e., "X-") and defines
   the combinability of multiple headers, which was previously not
   possible.</t>
    </abstract>
    <note>
      <name>About this Document</name>
      <t>
   This note is to be removed before publishing as an RFC.
   TODO(illyes): add commentable reference on github robotstxt repo.</t>
    </note>
  </front>
  <middle>
    <section anchor="sect-1" numbered="true" toc="default">
      <name>Introduction</name>
      <t>
   While the Robots Exclusion Protocol enables service owners to control
   how, if at all, automated clients known as crawlers may access the
   URIs on their services as defined by [RFC8288], the protocol doesn't
   provide controls on how the data returned by their service may be
   used once access is allowed.</t>
      <t>
   Control over such use is instead left to URI level controls,
   originally developed in 1996 and widely adopted since, which are
   implemented in response headers or, in the case of HTML, in the form
   of a meta tag. This document specifies these control tags and, in
   the case of the response header field, brings it into compliance
   with [RFC9110].</t>
      <t>
   Application developers are requested to honor these tags. The tags
   are not a form of access authorization, however.</t>
       <section anchor="sect-1.1" numbered="true" toc="default">
         <name>Requirements Language</name>
         <t>
  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
  "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY",
  and "OPTIONAL" in this document are to be interpreted as described
  in BCP 14 <xref target="RFC2119" format="default"/> [RFC8174] when, 
  and only when, they appear in all capitals, as shown here.</t>
        </section>
    </section>
    <section anchor="sect-2" numbered="true" toc="default">
      <name>Specification</name>
      <section anchor="sect-2.1" numbered="true" toc="default">
        <name>Robots control</name>
        <t>
          The URI level crawler controls are key-value pairs that
          can be specified in two ways:</t>
        <ul empty="true" spacing="normal">
          <li>an application level response header.</li>
          <li>in case of HTML, one or more meta tags as defined by the
              HTML specification.</li>
        </ul>
        <section anchor="sect-2.1.1" numbered="true" toc="default">
          <name>Application Layer Response Header</name>
            <t>
    The application level response header field name is "robots-tag".
    Its value contains rules applicable either to all accessors or to
    specifically named ones. For historical reasons, implementors
    should also support the experimental field name "x-robots-tag".</t>
            <t>
    The value is a list of key-value pairs separated by semicolons
    (";", 0x3B, 0x20). Each pair maps a product token to a comma
    separated list of rules. The rules are specific to a single
    product token as defined by [RFC9309] or to the global identifier
    "*". The global identifier may be omitted. The product token is
    separated from its rules by a "=".</t>
            <t>
    Duplicate product tokens must be merged and the rules deduplicated.</t>
            <artwork name="" type="" align="left" alt=""><![CDATA[
    ; key-values definition for the robots-tag response header.
    robots-tag = "robots-tag" ":" OWS robots-tag-values
    robots-tag-values = *( value ";" OWS )
    value = [ ( global-product-token / product-token ) "=" ] [ rules ]
    rules = rule *( OWS "," OWS rule )
    global-product-token = "*"
    product-token = 1*( %x2D / %x41-5A / %x5F / %x61-7A )
    rule = "noindex" / "nosnippet"
    OWS = *( SP / HTAB )
    ]]></artwork>
            <t>
  For example, the following response header field specifies the
  "noindex" and "nosnippet" rules for all accessors, but specifies
  no rules for the product token "ExampleBot":</t>
            <artwork name="" type="" align="left" alt=""><![CDATA[
    Robots-Tag: *=noindex, nosnippet; ExampleBot=;
    ]]></artwork>
            <t>
    The global product identifier "*" in the value may be omitted; for
    example, this field is equivalent to the previous example:</t>
            <artwork name="" type="" align="left" alt=""><![CDATA[
    Robots-Tag: noindex, nosnippet; ExampleBot=;]]></artwork>
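            <t>
    The parsing described above can be sketched as follows. This is a
    non-normative illustration in Python; the function name
    parse_robots_tag is hypothetical, and the sketch assumes that
    product tokens are compared case-insensitively and that an empty
    product token before "=" is treated like the global identifier.</t>
            <sourcecode name="" type="python"><![CDATA[
def parse_robots_tag(value: str) -> dict[str, set[str]]:
    """Map each product token to its set of rules."""
    result: dict[str, set[str]] = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # empty entry, e.g. after a trailing ";"
        token, sep, rules = pair.partition("=")
        if not sep:
            # No "=": the global identifier "*" was omitted.
            token, rules = "*", pair
        # Assumption: tokens are case-insensitive; "" means "*".
        token = token.strip().lower() or "*"
        parsed = {
            rule.strip().lower()
            for rule in rules.split(",")
            if rule.strip()
        }
        # Duplicate tokens are merged; the set deduplicates rules.
        result.setdefault(token, set()).update(parsed)
    return result
]]></sourcecode>
            <t>
    Under these assumptions, both example field values above parse to
    the same mapping: "*" with the rules "noindex" and "nosnippet", and
    "examplebot" with no rules.</t>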
            <t>
    Implementors should impose a parsing limit on the field value to
    protect their systems. The parsing limit MUST be at least 8
    kibibytes [KiB].</t>
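            <t>
    One possible way to apply such a limit is sketched below in
    Python. This document does not specify how an over-long value is
    to be handled; discarding everything after the last complete
    key-value pair within the limit is an assumption of this sketch,
    as are the names MIN_PARSING_LIMIT and clamp_field_value.</t>
            <sourcecode name="" type="python"><![CDATA[
MIN_PARSING_LIMIT = 8 * 1024  # 8 kibibytes, the required minimum


def clamp_field_value(value: str,
                      limit: int = MIN_PARSING_LIMIT) -> str:
    """Return the part of the field value that will be parsed."""
    if limit < MIN_PARSING_LIMIT:
        raise ValueError("parsing limit must be at least 8 KiB")
    raw = value.encode("utf-8")
    if len(raw) <= limit:
        return value
    head = raw[:limit]
    # Assumption: drop a key-value pair cut off by the limit, so
    # that a truncated rule is never interpreted.
    cut = head.rfind(b";")
    return head[: cut + 1].decode("utf-8", errors="ignore")
]]></sourcecode>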
        </section>
        <section anchor="sect-2.1.2" numbered="true" toc="default">
          <name>HTML meta element</name>
            <t>
    For historical reasons, the robots-tag rules may also be specified
    by service owners as an HTML meta tag. In the case of the meta tag,
    the name attribute is used to specify the product token, and the
    content attribute to specify the comma separated robots-tag rules.</t>
            <t>
    As with the header, the product token may be a global token,
    "robots", which signifies that the rules apply to all requestors,
    or a specific product token applicable to a single requestor. For
    example:</t>
            <artwork name="" type="" align="left" alt=""><![CDATA[
    <meta name="robots" content="noindex">
    <meta name="examplebot" content="nosnippet">]]></artwork>
            <t>
    Multiple robots meta elements may appear in a single HTML document.
    Requestors must obey the union of the negative rules specified for
    their product token and those specified for the global product
    token.</t>
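            <t>
    How the rules of several meta elements combine can be illustrated
    with the following non-normative Python sketch, which uses the
    html.parser module of the Python standard library. The names
    RobotsMetaCollector and rules_for are hypothetical, and comparing
    the name attribute case-insensitively is an assumption.</t>
            <sourcecode name="" type="python"><![CDATA[
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collect meta element rules, keyed by lowercased name."""

    def __init__(self) -> None:
        super().__init__()
        self.rules: dict[str, set[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (val or "") for key, val in attrs}
        name = attributes.get("name", "").strip().lower()
        if not name:
            return
        content = attributes.get("content", "")
        parsed = {
            rule.strip().lower()
            for rule in content.split(",")
            if rule.strip()
        }
        self.rules.setdefault(name, set()).update(parsed)


def rules_for(html: str, product_token: str) -> set[str]:
    """Union of the global rules and the token-specific rules."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    collector.close()
    global_rules = collector.rules.get("robots", set())
    own_rules = collector.rules.get(product_token.lower(), set())
    return global_rules | own_rules
]]></sourcecode>
            <t>
    For the two meta elements shown in the example above, this sketch
    yields "noindex" and "nosnippet" for the product token
    "examplebot", and only "noindex" for any other product token.</t>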
        </section>
      </section>
      <section anchor="sect-2.2" numbered="true" toc="default">
        <name>Robots control rules</name>
        <t>The possible values of the rules are:</t>
          <ul spacing="compact">
            <li>
              <t>noindex - instructs the parser to not store the served
                data in its publicly accessible index.</t>
            </li>
            <li>
              <t>nosnippet - instructs the parser to not reproduce any
                stored data as an excerpt snippet.</t>
            </li>
          </ul>
          <t>
    The values are case insensitive. Unsupported rules must be ignored.</t>
          <t>
    Implementors may support other rules as specified in Section 2.2.4
    of [RFC9309].</t>
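          <t>
    The case-insensitive matching and the handling of unsupported
    rules can be sketched as follows. This is a non-normative Python
    illustration; the names SUPPORTED_RULES and effective_rules are
    hypothetical, and the set of supported rules is limited to the two
    rules listed above.</t>
          <sourcecode name="" type="python"><![CDATA[
SUPPORTED_RULES = frozenset({"noindex", "nosnippet"})


def effective_rules(raw_rules):
    """Lowercase the rules and drop the unsupported ones."""
    normalized = {rule.strip().lower() for rule in raw_rules}
    return normalized & SUPPORTED_RULES
]]></sourcecode>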
      </section>
      <section anchor="sect-2.3" numbered="true" toc="default">
        <name>Caching of values</name>
          <t>
    The rules specified for a specific product token must be obeyed
    until the rules have changed. Implementors MAY use standard cache
    control as defined in [RFC9110] for caching robots-tag rules.
    Implementors SHOULD refresh their caches within a reasonable time
    frame.</t>
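          <t>
    A cache lifetime for robots-tag rules could be derived along the
    following lines. This Python sketch is non-normative: this
    document does not define a "reasonable time frame", so the
    one-day upper bound is purely an assumption, as are the names
    cache_ttl and is_fresh. Only the max-age directive is examined.</t>
          <sourcecode name="" type="python"><![CDATA[
import re

# Assumption: one day stands in for a "reasonable time frame".
DEFAULT_TTL_SECONDS = 24 * 60 * 60

_MAX_AGE = re.compile(
    r"(?:^|,)\s*max-age\s*=\s*(\d+)", re.IGNORECASE
)


def cache_ttl(cache_control=None):
    """Return how long, in seconds, cached rules may be reused."""
    if cache_control:
        match = _MAX_AGE.search(cache_control)
        if match:
            # Never cache longer than the illustrative bound.
            return min(int(match.group(1)), DEFAULT_TTL_SECONDS)
    return DEFAULT_TTL_SECONDS


def is_fresh(fetched_at, ttl, now):
    """True while the cached rules are younger than the TTL."""
    return (now - fetched_at) < ttl
]]></sourcecode>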
      </section>
    </section>
    <section anchor="sect-3" numbered="true" toc="default">
      <name>IANA considerations</name>
      <artwork name="" type="" align="left" alt=""><![CDATA[
TODO(illyes):
https://www.rfc-editor.org/rfc/rfc9110.html#name-field-name-registry
]]></artwork>
    </section>
    <section anchor="sect-4" numbered="true" toc="default">
      <name>Security considerations</name>
      <t>
The robots-tag is not a substitute for valid content security
measures. To control access to the URIs for which robots-tag rules
are served, users of the protocol should employ a valid security
measure relevant to the application layer on which the content is
served; for example, in the case of HTTP, HTTP Authentication as
defined in [RFC9110].</t>
      <t>
The content of the robots-tag header field is not secure, private or
integrity-guaranteed, and due caution should be exercised when using
it. Use of Transport Layer Security (TLS) with HTTP ([RFC9110] and
[RFC2817]) is currently the only end-to-end way to provide such
protection.</t>
      <t>
In the case of a robots-tag specified in an HTML meta element,
implementors should consider only the meta elements specified in the
head element of the HTML document, which is generally only accessible
to the service owner.</t>
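      <t>
Restricting processing to the head element can be sketched as
follows. This non-normative Python illustration uses the html.parser
module of the Python standard library, which does not infer omitted
tags; the sketch therefore assumes that the document contains explicit
head start and end tags. The class name HeadRobotsMeta is
hypothetical.</t>
      <sourcecode name="" type="python"><![CDATA[
from html.parser import HTMLParser


class HeadRobotsMeta(HTMLParser):
    """Record (name, content) of meta elements inside head only."""

    def __init__(self) -> None:
        super().__init__()
        self.in_head = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            content = attributes.get("content") or ""
            self.found.append((name, content))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
]]></sourcecode>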
      <t>
To protect against memory overflow attacks, implementors should
enforce a limit on how much data they will parse; see section N
for the lower limit.</t>

    </section>
  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119" xml:base="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        <reference anchor="RFC2817" target="https://www.rfc-editor.org/info/rfc2817" xml:base="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2817.xml">
          <front>
            <title>Upgrading to TLS Within HTTP/1.1</title>
            <author fullname="R. Khare" initials="R." surname="Khare"/>
            <author fullname="S. Lawrence" initials="S." surname="Lawrence"/>
            <date month="May" year="2000"/>
            <abstract>
              <t>This memo explains how to use the Upgrade mechanism in HTTP/1.1 to initiate Transport Layer Security (TLS) over an existing TCP connection. [STANDARDS-TRACK]</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="2817"/>
          <seriesInfo name="DOI" value="10.17487/RFC2817"/>
        </reference>
      </references>
      <references>
        <name>Informative References</name>
        <reference anchor="KiB" target="https://simple.wikipedia.org/wiki/Kibibyte">
          <front>
            <title>Kibibyte - Simple English Wikipedia, the free encyclopedia</title>
            <author>
              <organization></organization>
            </author>
            <date month="March" year="2006"/>
          </front>
        </reference>
      </references>
    </references>
  </back>
</rfc>
