<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-liu-msr6-problem-statement-01"
     ipr="trust200902">
  <front>
    <title abbrev="Problem Statement of MSR6">Problem Statement of IPv6
    Multicast Source Routing (MSR6)</title>

    <author fullname="Yisong Liu" initials="Y." surname="Liu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>liuyisong@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Tianji Jiang" initials="T." surname="Jiang">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>1525 McCarthy Blvd.</street>

          <city>Milpitas</city>

          <region>CA</region>

          <code>95035</code>

          <country>United States of America</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>tianjijiang@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Toerless Eckert" initials="T." surname="Eckert">
      <organization>Futurewei</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>tte+ietf@cs.fau.de</email>

        <uri/>
      </address>
    </author>

    <author fullname="Zhenbin Li" initials="Z." surname="Li">
      <organization>Huawei Technologies</organization>

      <address>
        <email>lizhenbin@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Gyan Mishra" initials="G." surname="Mishra">
      <organization>Verizon Inc.</organization>

      <address>
        <email>gyan.s.mishra@verizon.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Zhuangzhuang Qin" initials="Z." surname="Qin">
      <organization>China Unicom</organization>

      <address>
        <email>qinzhuangzhuang@chinaunicom.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Changwang Lin" initials="C." surname="Lin">
      <organization>New H3C Technologies</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>linchangwang.04414@h3c.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Xuesong Geng" initials="X." surname="Geng">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>gengxuesong@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="21" month="October" year="2022"/>

    <abstract>
      <t>This document analyzes the gaps in the existing IPv6 multicast
      solutions under discussion in the IETF against the requirements of the
      identified use cases.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Multicast provides efficient point-to-multipoint (P2MP) service
      without wasting bandwidth. The increasing amount of live video traffic
      in the network brings new requirements for multicast solutions. The
      existing multicast solutions require tree building in the control plane
      and end-to-end per-flow tree state, which impacts router state capacity
      and network convergence time. There has been a lot of work in the IETF
      to simplify service deployment, in which source routing, as used in
      SRv6, BIER, etc., is a very important technology. Source routing
      reduces the state held in intermediate nodes and lets the ingress node
      indicate how a multicast packet is forwarded, which could simplify
      multicast deployment. Source routing requires sufficient flexibility in
      the forwarding plane, and IPv6 has the advantage of good scalability.
      Therefore, it is important to simplify multicast deployment and meet
      high-quality service requirements with multicast based on IPv6 source
      routing.</t>

      <t>The MSR6 WG will focus on use cases identified in <xref
      target="I-D.liu-msr6-use-cases"/> with the following set of
      characteristics:</t>

      <t>- Large network scale with numerous multicast services</t>

      <t>- IPv6 multicast flows traversing the Internet that require
      encryption</t>

      <t>- IPv6 Host Initiated or overlay Multicast Transport</t>

      <t>Based on these use cases, this document analyzes the problems of
      the existing IPv6 multicast solutions under discussion in the IETF. To
      solve these problems, MSR6 can be used as a complementary multicast
      solution.</t>
    </section>

    <section title="Problem Statement for Multicast of Large-scale Network">
      <t>In a large-scale network with numerous multicast services, the
      existing multicast solutions run into scalability issues.</t>

      <t>Based on the use case document, 2 typical scenarios are considered
      as examples:</t>

      <t><list style="symbols">
          <t>Multicast for 5G transport, e.g., with 1.5k egress nodes, 10k
          multicast services;</t>

          <t>Multicast for DCN, e.g., with 3k switches, 60k links, 1k
          multicast services;</t>
        </list></t>

      <t>If PIM, mLDP, or P2MP RSVP-TE is used in these cases, a per-flow
      state protocol sets up each multicast tree, which requires periodic
      state refresh and the corresponding protocol messages. Multicast flow
      state is maintained in the intermediate nodes. When there are thousands
      of concurrent multicast services, per-flow state brings scalability
      issues for network devices, especially when the multicast trees are
      dynamic.</t>

      <t>BIER/BIER-TE (<xref target="RFC8279"/>) was introduced to avoid
      explicit multicast tree building and per-flow state in intermediate
      nodes. However, BIER faces challenges in a large-scale network. Bit
      position allocation in BIER depends on the scale of the network
      topology, and the number of bit positions directly affects the BIFT
      size and the bitstring length. When there are too many egress nodes or
      links in the network, the encapsulation overhead and the number of BIFT
      entries could be unacceptable. If the network is divided into several
      SDs or SIs, too many packet copies are sent and the traffic redundancy
      becomes excessive, which degrades toward head-end replication.</t>

      <t>For example, if BIER as defined in <xref target="RFC8279"/> is used
      for P2MP tunnels in the network, a bit position has to be allocated for
      every egress node, i.e., 9k bit positions for all possible leaves. In a
      sparse multicast flow, only a few of the bit positions are set and most
      of them are 0. In this case, the BIER header is inefficient and the
      encapsulation overhead is unacceptable. Because the number of bit
      positions also determines the BIFT size, forwarding speed may also be
      affected.</t>

      <t>There are some possible methods to improve the situation in BIER.
      For example, sets could be used to reduce the number of bit positions
      carried per packet, but multiple packets have to be sent when the
      BFR-ids of the receivers belong to different sets. When the network is
      large, the benefit of sets is limited. In the case shown above, even if
      10 sets are planned, about 900 bit positions are still needed in each
      packet, and each set requires a separate BIFT in each node.</t>
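
      <t>As a rough illustration of the set trade-off described above, the
      following Python sketch applies the BFR-id to (SI, bit position)
      mapping of <xref target="RFC8279"/> and counts how many packet copies
      an ingress would send for a given receiver list. The function names,
      the 1024-bit BitStringLength, and the sample receiver BFR-ids are
      assumptions made for illustration only.</t>

      <t><figure>
          <artwork><![CDATA[
```python
import math


def si_and_bit(bfr_id, bsl):
    """Map a BFR-id to (Set Identifier, bit position), per RFC 8279."""
    if bfr_id < 1:
        raise ValueError("BFR-ids start at 1")
    return (bfr_id - 1) // bsl, (bfr_id - 1) % bsl + 1


def sets_needed(num_bfers, bsl):
    """Number of SIs needed to cover num_bfers consecutive BFR-ids."""
    return math.ceil(num_bfers / bsl)


def packet_copies(receiver_ids, bsl):
    """One packet copy is sent per distinct SI among the receivers."""
    return len({si_and_bit(r, bsl)[0] for r in receiver_ids})


# Hypothetical numbers: 9000 possible leaves, 1024-bit bitstring.
BSL = 1024
RECEIVERS = [5, 1030, 4100, 8999]  # sparse receivers, assumed BFR-ids

print(sets_needed(9000, BSL))         # 9
print(packet_copies(RECEIVERS, BSL))  # 4
```
]]></artwork>
        </figure></t>

      <t>With these assumed numbers, four receivers that fall into four
      different sets cost four packet copies on the ingress link, which is
      no better than head-end replication for that flow.</t>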

      <t>In BIER-TE, the bitstring needs to carry bits that indicate not only
      the receiving BFERs but also the intermediate hops and links across
      which the packet must be sent. In the most common case, a bit position
      has to be allocated for every adjacency, so about 100k bit positions
      are required. The bit positions representing the adjacencies that the
      multicast tree traverses are set, and the rest of the bit positions
      are 0. In the example above, 7 bit positions are set in the bitstring.
      The BIER-TE header is therefore even less efficient than the BIER
      header, and its encapsulation overhead is more significant. In
      addition, the controller has to allocate different BIFTs for 10k
      nodes.</t>
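
      <t>The encapsulation cost can be made concrete with simple arithmetic.
      The Python sketch below assumes, purely for illustration, that all 100k
      adjacencies were carried in one flat bitstring with 7 bits set. The
      function name is hypothetical, and a real BIER-TE deployment would
      split the bit space into sets rather than carry it in one header.</t>

      <t><figure>
          <artwork><![CDATA[
```python
def bitstring_overhead(total_bits, bits_set):
    """Return (bitstring size in bytes, fraction of bits that are set)."""
    if total_bits <= 0 or not 0 <= bits_set <= total_bits:
        raise ValueError("need total_bits > 0 and 0 <= bits_set <= total_bits")
    size_bytes = (total_bits + 7) // 8  # round up to whole bytes
    return size_bytes, bits_set / total_bits


# Assumed flat bit space: 100k adjacencies, 7 of them on the tree.
size_bytes, fill = bitstring_overhead(100_000, 7)
print(size_bytes)  # 12500
print(fill)        # 7e-05
```
]]></artwork>
        </figure></t>

      <t>Under this assumption the bitstring alone would be 12500 bytes, far
      larger than a 1500-byte Ethernet MTU, with fewer than 0.01% of its bits
      set.</t>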

      <t>Some methods defined for BIER-TE improve the situation. Sets could
      also be used, but as analyzed above they are not sufficient. Other
      methods reduce the number of required bits, such as unicast
      (forward_routed()), ECMP(), or flooding (DNC) over "uninteresting"
      sub-parts of the topology, but each of them brings its own limitations
      for path planning.</t>

      <t>Since the existing BIER and BIER-TE cannot satisfy the requirements
      of multicast in a large-scale network, new source-routing-based
      solutions need to be introduced for multicast, including multicast TE.
      Possible solutions are defined in existing drafts. The basic idea is to
      combine a Routing Header (RH) segment list with a bitstring to specify
      the multicast path. The existing BIER header cannot encapsulate such
      information. Instead, an IPv6 Routing Header combined with other IPv6
      extension headers can serve the purpose well. A possible encapsulation
      is shown in the following figure.</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+      ---
     |          IPv6 Header           |       
     +--------------------------------+ IPv6 Multicast TE Tunnel Header
     |IPv6 RH (Segment List/Bitstring)| 
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+
]]></artwork>
        </figure></t>

      <section title="Typical Scenario in DCN">
        <t>To better show the requirements in data centers, we list 3 typical
        potential multicast scenarios with P2MP services: AI training, HPC,
        and storage.</t>

        <t>The large-scale multicast requirements are expressed in 3
        aspects:</t>

        <t>- Network Scale: number of switches, number of links, number of
        hosts</t>

        <t>- Multicast Tree Size: number of intermediate nodes, number of
        receivers</t>

        <t>- Multicast Service Number</t>

        <section title="AI Training">
          <t>The following figure shows a typical RDMA AI training
          scenario.</t>

          <t><figure>
              <artwork><![CDATA[                 PS(Parameter Server) Nodes
               +-------+          +-------+
               |  CPU  |          |  CPU  |
               | Server|          | Server|
               +-+-+-+-+          +-+-+-+-+
    ^            | | |              | | |          |
    |         +--|-|-|--------------+ | |          |
    |       +----+ | +----------------------+      |
    |       | |    +--------+ +-------+ |   |      V
Gradients   | |             | |         |   | Parameters
        +---+-+-+       +---+-+-+     +-+---+-+
        |  GPU  |       |  GPU  |     |  GPU  |
        | Worker|       | Worker|     | Worker|
        +-------+       +-------+     +-------+]]></artwork>
            </figure></t>

          <t>Worker-&gt;PS: Each worker pushes its gradients to the PS
          node.</t>

          <t>PS-&gt;Worker: After aggregation, the PS sends the updated
          parameters back to all workers.</t>

          <t>In this process, the second stage is information distribution:
          the same data content is sent to every worker. With unicast, N
          connections transmit the data separately, so the bandwidth
          efficiency is 1/N; the larger the scale, the lower the
          efficiency.</t>

          <t><figure>
              <artwork><![CDATA[                      +---------------+
                      |     Source    |
                      | +---+   +---+ |
                      | |CPU|   |GPU| |
                      | +-+-+   +-+-+ |
                      |   |       |   |
                      |    \     /    |
                      |   +-V---V-+   |
                      |   |  HCA  |   |
                      |   +-------+   |
                      +--+-+-+-+-+-+--+
                         | | ... | |
                      +--V-V-----V-V--+
                      |     Switch    | 
                      +-+-----------+-+
                       /             \
        +-------------V-+           +-V------------+
        |  Destination  |           |  Destination  |
        |   +-------+   |           |   +-------+   |
        |   |  HCA  |   |           |   |  HCA  |   |
        |   +-V---V-+   |           |   +-V---V-+   |
        |    /     \    |           |    /     \    |
        |   |       |   |           |   |       |   |
        | +-+-+   +-+-+ |           | +-+-+   +-+-+ |
        | |CPU|   |GPU| |           | |CPU|   |GPU| |
        | +---+   +---+ |           | +---+   +---+ |
        +---------------+           +---------------+]]></artwork>
            </figure>If the source sends only 1 copy to the network and the
          switches replicate the packet to the different destinations, the
          use of bandwidth is more efficient and the training is faster.</t>
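
          <t>The difference between the two delivery modes can be sketched
          by counting link transmissions on a small tree. The topology, node
          names, and function names below are hypothetical and chosen only to
          mirror the single-switch figure above.</t>

          <t><figure>
              <artwork><![CDATA[
```python
def path_links(node, parent):
    """Return the (parent, child) links between the root and node."""
    links = []
    while node in parent:
        links.append((parent[node], node))
        node = parent[node]
    return links


def unicast_transmissions(receivers, parent):
    """Each receiver gets its own copy over every link on its path."""
    return sum(len(path_links(r, parent)) for r in receivers)


def multicast_transmissions(receivers, parent):
    """Each link on the distribution tree carries exactly one copy."""
    return len({link for r in receivers for link in path_links(r, parent)})


# Hypothetical topology: one source, one switch, three destinations.
PARENT = {"switch": "source", "dst1": "switch",
          "dst2": "switch", "dst3": "switch"}
RECEIVERS = ["dst1", "dst2", "dst3"]

print(unicast_transmissions(RECEIVERS, PARENT))    # 6
print(multicast_transmissions(RECEIVERS, PARENT))  # 4
```
]]></artwork>
            </figure></t>

          <t>On the source-to-switch link, unicast carries 3 copies and
          multicast carries 1, which matches the 1/N efficiency noted
          above.</t>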

          <t>The large-scale multicast requirements in this scenario are as
          follows:</t>

          <t>- Network Scale: 10-10k GPU</t>

          <t>- Multicast Tree Size: 10-10k receivers</t>

          <t>- Multicast Service Number: depends on the scenario</t>
        </section>

        <section title="HPC">
          <t>The following is an example of MPI in an HPC scenario.</t>

          <t><figure>
              <artwork><![CDATA[      +-------------------------------------------+
      |                Dispatcher                 |
      |                  Master                   |
      +---------------------+---------------------+
                            |
          +-----------------+
          |  
      +---+----+  +--------+             +--------+
      |+--V---+|  |+------+|             |+------+|
      ||Dispa-||  ||Dispa-||             ||Dispa-||
      ||Agent ||  ||Agent ||             ||Agent ||
      |+---+--+|  |+---+--+|             |+---+--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      ||  MPI ||  ||  MPI ||     ...     ||  MPI ||
      ||Proces||  ||Proces||             ||Proces||
      |+---^--+|  |+---^--+|             |+---^--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      || RoCE |<-->| RoCE |<------------->| RoCE ||
      |+------+|  |+------+|             |+------+|
      +--------+  +--------+             +--------+]]></artwork>
            </figure></t>

          <t>Stage 1: The Dispatcher Master senses millions of cores and
          schedules million-rank MPI jobs on demand. The Dispatcher Master
          sends the scheduling results to a Dispatcher Agent.</t>

          <t>Stage 2: The Dispatcher Agents start the million-rank MPI job on
          each node. The Dispatcher Agent that receives the message
          broadcasts it to the other Dispatcher Agents, which perform
          initialization before starting the MPI application.</t>

          <t>Stage 3: The Dispatcher Agent broadcasts the message to start
          the MPI application. After the MPI application is started, MPI
          internal initialization synchronizes the RoCE endpoints in an
          allgather manner.</t>

          <t>The last 2 stages could benefit from multicast, which would
          reduce task completion time.</t>

          <t>The large-scale multicast requirements in this scenario are as
          follows:</t>

          <t>- Network Scale: 1000k CPU/GPU</t>

          <t>- Multicast Tree Size: 10k~100k receivers</t>

          <t>- Multicast Service Number: 1~100</t>
        </section>

        <section title="Storage">
          <t>Ceph is an open-source distributed storage platform. It mainly
          focuses on scale-out file systems, including storage distribution
          and availability, and is widely used in storage.</t>

          <t>Ceph Object Storage Daemons (OSDs) are responsible for storing
          objects on a local file system on behalf of Ceph clients. Ceph
          OSDs also use the CPU, memory, and networking of Ceph cluster nodes
          for data replication, erasure coding, recovery, monitoring, and
          reporting functions.</t>

          <t>The following process requires P2MP service:</t>

          <t>- The application initiates a "write" operation from a client to
          a server.</t>

          <t>- The client finds the servers to write to, and 3 copies are
          sent to 3 servers.</t>

          <t><figure>
              <artwork><![CDATA[               +-------+          +-------+
               |Client1|          |Client2|
               +---+---+          +---+---+
                   |                  |
                   +---------+--------+
                             |
                     +-------+-------+
                     |     Switch    | 
                     +-------+-------+
                             |
            +----------------+----------------+
            |                |                |            
        +---+---+        +---+---+        +---+---+
        | Server|        | Server|        | Server|
        +-------+        +-------+        +-------+]]></artwork>
            </figure></t>

          <t>The large-scale multicast requirements in this scenario are as
          follows:</t>

          <t>- Network Scale: 3k Server (1 Pod)</t>

          <t>- Multicast Tree Size: 3 receivers</t>

          <t>- Multicast Service Number: 10k</t>
        </section>
      </section>
    </section>

    <section title="Problem Statement for IPv6 Multicast with IPsec">
      <t>In a typical scenario such as IPv6-based SD-WAN, multicast traffic
      may traverse the Internet through an IPv6-based multicast tunnel. At
      the same time, the traffic must be encrypted for security. IPsec can be
      adopted for encryption.</t>

      <t>The independent layer design of BIER brings the following
      challenges:</t>

      <t>Option 1: If the IPv6 IPsec extension header is used for security
      (shown in the following figure), the BIER header is encrypted and the
      BIER nodes cannot read the traffic steering information. That is, BIER
      cannot work in this option.</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+      ---
     |          IPv6 Header           |       ^
     +--------------------------------+       |
     |  IPv6 IPSec Header (ESP & AH)  | IPv6 Multicast Tunnel Header
     +--------------------------------+       |
     |           BIER Header          |       |
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+
]]></artwork>
        </figure>Option 2: For the BIER header to work while the security
      function is implemented, a new security header may have to be
      introduced for the BIER layer (shown in the following figure). This
      means that 1) the existing IPv6 IPsec extension header cannot be
      reused, and 2) there can be conflicting functions in the two layers,
      the IPv6 layer and the BIER layer.</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+      ---
     |          IPv6 Header           |       ^
     +--------------------------------+       |
     |          BIER Header           | IPv6 Multicast Tunnel Header
     +--------------------------------+       |
     |       New Security Header      |       |
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+
]]></artwork>
        </figure></t>

      <t>MSR6, which is designed on native IPv6, can reuse the IPv6
      Authentication Header and Encapsulating Security Payload header. If
      MSR6 is used in this case, the packet is encapsulated as follows to
      implement end-to-end multicast security:</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+      ---
     |          IPv6 Header           |       ^
     +--------------------------------+       |
     |  IPv6 EH (MSR6 EH or Options)  | IPv6 Multicast Tunnel Header
     +--------------------------------+       |
     |  IPv6 IPSec Header (ESP & AH)  |       |
     +--------------------------------+      ---
     |            Payload             |
     +--------------------------------+
]]></artwork>
        </figure></t>
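
      <t>The effect of header ordering in the encapsulations above can be
      sketched with a simplified model. The Python below assumes that ESP is
      used with encryption, so a transit node without the keys can read only
      the headers up to and including the ESP header. AH is not modeled
      because it authenticates without encrypting. The header labels and
      function names are illustrative, not a defined MSR6 format.</t>

      <t><figure>
          <artwork><![CDATA[
```python
def visible_headers(stack):
    """Headers a transit node can read: all headers up to and including ESP.

    Assumption: ESP is used with encryption, so everything after the ESP
    header is ciphertext to nodes that do not hold the keys.
    """
    visible = []
    for header in stack:
        visible.append(header)
        if header == "ESP":
            break
    return visible


def can_steer(stack, steering_header):
    """True if transit nodes can read the multicast steering header."""
    return steering_header in visible_headers(stack)


# Header stacks, outermost first (labels are illustrative only).
BIER_OPTION_1 = ["IPv6", "ESP", "BIER", "Payload"]
MSR6 = ["IPv6", "MSR6 EH", "ESP", "Payload"]

print(can_steer(BIER_OPTION_1, "BIER"))  # False
print(can_steer(MSR6, "MSR6 EH"))        # True
```
]]></artwork>
        </figure></t>

      <t>In this model the steering information stays readable only when it
      is carried in an extension header placed before ESP, which is the
      ordering MSR6 relies on.</t>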

      <t>As with IPsec, other existing functionalities that the IETF has
      defined on IPv6, for example fragmentation, network slicing, and IOAM,
      could all be reused in MSR6, which is based on the IPv6 data plane. In
      comparison, these functions and headers would have to be defined again
      to be used in BIER, which brings redundancy.</t>
    </section>

    <section title="Problem Statement for IPv6 Host-initiated Multicast">
      <t>In IPv6 host-initiated multicast scenarios, the host originates the
      IPv6 packet that is to be replicated to the different leaf hosts. The
      packet originated by the host may have the format shown in the
      following figure, with IP layer and transport layer encapsulation.</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+       ---
     |          IPv6 Header           |    IP Layer
     +--------------------------------+       --- 
     |          UDP Header            | Transport Layer
     +--------------------------------+       ---
     |            Payload             |
     +--------------------------------+
]]></artwork>
        </figure></t>

      <t>If BIER is adopted for multicast traffic steering, the independent
      layer design of BIER may give the packet originated by the host the
      following format. This violates the layered architecture of the
      Internet because it introduces an extra layer (the BIER layer), which
      does not work in the host.</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+       ---
     |          IPv6 Header           |    IP Layer
     +--------------------------------+       --- 
     |          BIER Header           |   BIER Layer
     +--------------------------------+       ---
     |          UDP Header            | Transport Layer
     +--------------------------------+       ---
     |            Payload             |
     +--------------------------------+
]]></artwork>
        </figure></t>

      <t>For MSR6, the multicast traffic steering information is encapsulated
      in an IPv6 extension header, as shown in the following figure. This
      maintains the layered architecture of the Internet.</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+       ---
     |          IPv6 Header           |
     +--------------------------------+    IP Layer
     |   IPv6 EH (MSR6 EH or Options) |
     +--------------------------------+       ---
     |          UDP Header            | Transport Layer
     +--------------------------------+       ---
     |            Payload             |
     +--------------------------------+
]]></artwork>
        </figure></t>

      <t>In addition, multicast source routing requires no explicit multicast
      tree setup protocol. The network device replicates and forwards the
      packet based solely on the MSR6 header encapsulated by the host.</t>
    </section>

    <section title="Summary">
      <t>In summary, the use cases considered have the following
      characteristics:</t>

      <t>- Large network scale with numerous multicast services</t>

      <t>- IPv6 multicast flows traversing the Internet that require
      encryption</t>

      <t>- IPv6 Host Initiated or overlay Multicast Transport</t>

      <t>According to the analysis of the problems of the existing multicast
      solutions, an MSR6 solution should be introduced to satisfy these
      requirements. It takes advantage of the IPv6 extension header to
      encapsulate extensible multicast traffic steering information and
      reuses existing IPv6 encapsulations such as IPsec. The encapsulation
      can be unified for the IPv6 tunneled packet and the IPv6 host-initiated
      packet. The abstract MSR6 header is shown in the following figure:</t>

      <t><figure>
          <artwork><![CDATA[     +--------------------------------+
     |          IPv6 Header           |
     +--------------------------------+
     |IPv6 RH (Segment List/Bitstring)|
     +--------------------------------+
     |    IPv6 EH (MCAST Options)     |
     +--------------------------------+
     |  IPv6 IPSec Header (ESP & AH)  |
     +--------------------------------+
]]></artwork>
        </figure></t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include='reference.RFC.8279'?>

      <?rfc include='reference.RFC.8296'?>

      <?rfc include='reference.I-D.cheng-spring-ipv6-msr-design-consideration'?>

      <?rfc ?>

      <?rfc include='reference.I-D.liu-msr6-use-cases'?>

      <?rfc include='reference.RFC.8663'?>

      <?rfc ?>
    </references>
  </back>
</rfc>
