Internet-Draft | PIM Upstream Deterministic ECMP | March 2025 |
Fenner & Kumar | Expires 3 September 2025 |
In densely interconnected networks, a PIM node may have many choices as to what upstream neighbor to send a JOIN message to, for a given source and group. This document describes a mechanism for multiple nodes (e.g., leaf nodes in a data center) to pick the same upstream node (e.g., spine node) to avoid redundant traffic flows.¶
This note is to be removed before publishing as an RFC.¶
The latest revision of this draft can be found at https://fenner.github.io/pim-deterministic-ecmp/draft-fenner-pim-deterministic-ecmp.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-fenner-pim-deterministic-ecmp/.¶
Discussion of this document takes place on the Protocols for IP Multicast Working Group mailing list (mailto:pim@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/pim/. Subscribe at https://www.ietf.org/mailman/listinfo/pim/.¶
Source for this draft and an issue tracker can be found at https://github.com/fenner/pim-deterministic-ecmp.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 September 2025.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In a densely interconnected network, there may be many equal-cost paths to a given source or RP. RFC7761 is silent on the issue of how to choose among these, just indicating that RPF_interface(S) and RPF_interface(RP) have a single answer. If different leaf routers make different choices, then traffic can flow over extra paths.¶
This document introduces two mechanisms: one for two-tier networks and one for arbitrary multi-tier networks, to allow routers to make the same decision of which neighbor to use in an ECMP scenario. This eliminates undesired redundant traffic flow.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
In this document, the hash algorithm used is Bob Jenkins's one-at-a-time hash. This is a fast hash function of very high quality. Wikipedia has one description of the algorithm. This hash function is defined on sequences of octets; it is performed across all of the addresses given, in network byte order.¶
Pseudocode like hash( address1, address2, address3 ) conceptually lays out these addresses adjacent to each other in memory in network byte order, and performs a single hash operation across all 12 octets (three 4-octet IPv4 addresses).¶
+---+---+---+---+---+---+---+---+---+---+---+---+
|   address1    |   address2    |   address3    |
+---+---+---+---+---+---+---+---+---+---+---+---+
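As a minimal sketch of the layout above, assuming IPv4 addresses and the commonly published form of the one-at-a-time hash, the octets can be packed and hashed as follows. The names `one_at_a_time` and `hash_addresses` are illustrative and do not come from this document, and the sketch has not been checked against the example hash values given in this document.

```python
import ipaddress

MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def one_at_a_time(data: bytes) -> int:
    """One-at-a-time hash over a sequence of octets, returning a 32-bit value."""
    h = 0
    for octet in data:
        h = (h + octet) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    # Final mixing steps.
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


def hash_addresses(*addresses: str) -> int:
    """Concatenate IPv4 addresses in network byte order and hash them once."""
    octets = b"".join(ipaddress.IPv4Address(a).packed for a in addresses)
    return one_at_a_time(octets)
```

Hashing the concatenation once, rather than hashing each address separately and combining the results, is what the single 12-octet layout implies.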
We use the [RFC6395] Hello Option to allow multiple routers to hash a given (S,G) join to the same RPF neighbor. The procedure consists of two phases. First, we compute hash( S, G, routerId ) for each eligible RPF neighbor, and select the highest hash value among this list. Second, if there are multiple entries with the highest hash value, we re-hash among this sub-list using hash( S, G, local-information ), and use the highest single resulting hash value. If multiple hops still hash to the same value, we simply take the first one in the list. In that last case the choice is not coordinated between nodes, since each node may have a different order for the list.¶
The local-information is a value local to the router that can be influenced by the deployment to have the same result between multiple peers. For example, it could be an interface name, where the deployment on multiple routers uses the same interface to connect to the same peer. It could also be a locally-configured value on each interface, which results in higher configuration overhead but more deployment flexibility.¶
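viaMultipathRouterId( source, group, vias )
   bestHash = 0
   bestVias = []
   # Phase 1: highest hash over the neighbor's router ID.
   for via in vias:
      routerId = getRouterId( via )
      curHash = hash( source, group, routerId )
      if curHash > bestHash:
         bestVias = [ via ]
         bestHash = curHash
      else if curHash == bestHash:
         bestVias.append( via )
   if len( bestVias ) == 1:
      return bestVias[ 0 ]
   # Phase 2: re-hash the tied vias using the local-information of each via.
   bestHash = 0
   bestVia = bestVias[ 0 ]
   for via in bestVias:
      curHash = hash( source, group, local-information )
      if curHash > bestHash:
         bestVia = via
         bestHash = curHash
   # A remaining tie keeps the first via in the list.
   return bestVia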
pseudocode format TBD¶
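The two-phase selection can also be written as a runnable sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `router_id` and `local_info` lookups, and the injectable `hash_fn` are all hypothetical, and vias are identified by plain strings.

```python
from typing import Callable, List, Sequence


def via_multipath_router_id(
    source: bytes,
    group: bytes,
    vias: Sequence[str],
    router_id: Callable[[str], bytes],
    local_info: Callable[[str], bytes],
    hash_fn: Callable[[bytes], int],
) -> str:
    """Pick one via: highest hash over router ID, then over local information."""

    def best(candidates: Sequence[str], key_of: Callable[[str], bytes]) -> List[str]:
        # Return every candidate that shares the highest hash, in list order.
        scored = [(hash_fn(source + group + key_of(v)), v) for v in candidates]
        top = max(score for score, _ in scored)
        return [v for score, v in scored if score == top]

    tied = best(vias, router_id)
    if len(tied) > 1:
        tied = best(tied, local_info)
    # Any remaining tie is broken by list order, which is not coordinated
    # between routers.
    return tied[0]
```

Passing the hash function as a parameter keeps the selection logic separate from the choice of hash and makes the tie-breaking behaviour easy to exercise with a stub.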
We describe a Hello Option to exchange "Color", an abstract notion of grouping of nodes. For example, in a 3-tier network, the routers in the middle tier could be colored by the spine to which they connect in the top tier. In this way, the color value presented to the leaf routers by the middle tier is a proxy for the routers in the top tier.¶
[Editorial note: Should we come up with a different name for this? It has nothing to do with SR-TE color, for example. However, RFC5512's definition of Color is just as abstract as this. "MultiTier ID"?]¶
This Hello option should only be advertised "downwards" towards the lower levels of the tree.¶
The PIM Hello Option used to exchange Color values is shown below.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Type = TBD           |          Length = 4           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Color                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Type: TBD (see Section 8 and Appendix A)¶
Length: The length, in bytes, of the value part of the Type/Length/Value encoding. The Color is 4 bytes long, so Length is 4.¶
Color: The color value being advertised by this router.¶
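A minimal sketch of encoding and decoding this option follows, assuming the 16-bit Type and 16-bit Length fields shown in the figure and a Color carried in network byte order. Because the Type is still TBD, the sketch takes it as a parameter; the function names are illustrative only.

```python
import struct
from typing import Tuple


def encode_color_option(option_type: int, color: int) -> bytes:
    """Encode Type (16 bits), Length = 4 (16 bits), Color (32 bits), big-endian."""
    return struct.pack("!HHI", option_type, 4, color)


def decode_color_option(data: bytes) -> Tuple[int, int]:
    """Return (option_type, color); reject options whose Length is not 4."""
    option_type, length, color = struct.unpack("!HHI", data[:8])
    if length != 4:
        raise ValueError(f"unexpected Color option length {length}")
    return option_type, color
```

The decoder rejects any Length other than 4 rather than guessing how to interpret a longer or shorter value.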
TBD: If not all neighbors advertise color, do we pick from the subset that do, or do we fall back to Section 4?¶
We use the above Hello Option to add an initial round of hashing, falling back to the algorithm in Figure 1 to break ties. With this mechanism, the first round is to calculate hash( S, G, color ) for each eligible RPF neighbor, and select the highest hash value among this list. If there are multiple entries with the highest hash value, we re-hash among this sub-list with viaMultipathRouterId defined above.¶
viaMultipathColor( source, group, vias )
   bestHash = 0
   bestVias = []
   # First round: highest hash over the color advertised by each neighbor.
   for via in vias:
      color = getNeighborColor( via )
      curHash = hash( source, group, color )
      if curHash > bestHash:
         bestVias = [ via ]
         bestHash = curHash
      else if curHash == bestHash:
         bestVias.append( via )
   if len( bestVias ) == 1:
      return bestVias[ 0 ]
   # Ties (including neighbors sharing a color) fall back to router ID.
   return viaMultipathRouterId( source, group, bestVias )
pseudocode format TBD¶
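The color round can be sketched in runnable form as well. This is an illustration only: `via_multipath_color`, `color_of`, `hash_fn`, and `tie_break` are hypothetical names, the color is assumed to be hashed as 4 octets in network byte order, and `tie_break` stands in for the router-ID procedure.

```python
from typing import Callable, Sequence


def via_multipath_color(
    source: bytes,
    group: bytes,
    vias: Sequence[str],
    color_of: Callable[[str], int],
    hash_fn: Callable[[bytes], int],
    tie_break: Callable[[bytes, bytes, Sequence[str]], str],
) -> str:
    """Pick one via: highest hash over the neighbor's color, then tie_break."""
    scored = [
        (hash_fn(source + group + color_of(v).to_bytes(4, "big")), v)
        for v in vias
    ]
    top = max(score for score, _ in scored)
    tied = [v for score, v in scored if score == top]
    if len(tied) == 1:
        return tied[0]
    # Neighbors advertising the same color always tie here, because their
    # hash inputs are identical; the fallback chooses among them.
    return tie_break(source, group, tied)
```

Neighbors that share a color are expected to tie in this round, which is why a second procedure is needed to choose among them.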
TODO Security¶
IANA is requested to allocate a value from the PIM-Hello Options registry as shown:¶
Value | Length | Name | Reference |
---|---|---|---|
TBD | 4 | Color | This Document |
This non-normative appendix describes the Arista Proprietary Color option, for the benefit of compatibility with the deployed base. A standards-compliant implementation SHOULD NOT emit or parse these options by default, but MAY have a configuration option to emit and parse these options on a given interface for interoperability.¶
A pair of PIM Hello options is required for compatibility with the deployed base of Arista EOS. Both types are allocated from the "Private Use" reserved range. The first option, with type 65001, only serves to indicate with a "magic number" that the type 65002 option is indeed the Arista Proprietary Color option (as opposed to some other Private Use).¶
These option formats are shown below.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Type = 65001          |          Length = 4           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          4028514875                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
By including this PIM Hello option, with type 65001 and 32-bit value 4028514875, you indicate that the rest of the PIM Hello options that you are including are Arista-proprietary.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Type = 65002          |          Length = 4           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Color                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Arista Proprietary Color option is identical to the option described in Figure 2, except for the Type field. It is only recognized if the Arista Proprietary Hello options are enabled by the type 65001 option above.¶
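The recognition rule can be sketched as a small parser over the option block of a received Hello. This is a hedged illustration, not a description of Arista EOS behaviour: it assumes the options arrive as a concatenation of Type/Length/Value entries, that both values are carried in network byte order on the wire, and that the relative order of the two options does not matter. The function and constant names are hypothetical.

```python
import struct
from typing import Optional

ARISTA_MAGIC_TYPE = 65001
ARISTA_MAGIC_VALUE = 4028514875
ARISTA_COLOR_TYPE = 65002


def arista_color(options: bytes) -> Optional[int]:
    """Return the Arista Proprietary Color from a Hello option block, or None.

    The type 65002 option is only honoured when the type 65001 option
    carrying the magic value is also present.
    """
    found = {}
    offset = 0
    while offset + 4 <= len(options):
        opt_type, length = struct.unpack_from("!HH", options, offset)
        value = options[offset + 4 : offset + 4 + length]
        if len(value) < length:
            break  # truncated option; stop parsing
        found[opt_type] = value
        offset += 4 + length
    magic = found.get(ARISTA_MAGIC_TYPE)
    color = found.get(ARISTA_COLOR_TYPE)
    if magic is None or color is None or len(magic) != 4 or len(color) != 4:
        return None
    if struct.unpack("!I", magic)[0] != ARISTA_MAGIC_VALUE:
        return None
    return struct.unpack("!I", color)[0]
```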
The Arista-compatible hash algorithm stores the color in little-endian byte order when hashing.¶
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|       address1        |       address2        | color(little-endian)  |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
When all routers in a set of vias are exchanging color information via the RFC-specified COLOR option, they MUST use the standard hash with the color in network byte order. When at least one router in a set of vias is exchanging color information via the Arista Proprietary Color option, they MUST use the Arista-compatible hash algorithm to compare colors.¶
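The difference between the two hash inputs is only the byte order of the final four octets. A minimal sketch, assuming IPv4 source and group addresses and using illustrative function names, is:

```python
import ipaddress
import struct
from typing import Iterable


def hash_input(source: str, group: str, color: int, arista_compatible: bool) -> bytes:
    """Build the 12 octets fed to the hash for an (S, G, color) triple."""
    # Little-endian color for the Arista-compatible hash, network order otherwise.
    fmt = "<I" if arista_compatible else "!I"
    return (
        ipaddress.IPv4Address(source).packed
        + ipaddress.IPv4Address(group).packed
        + struct.pack(fmt, color)
    )


def use_arista_hash(via_uses_arista_option: Iterable[bool]) -> bool:
    """True when at least one via in the set uses the Arista Proprietary Color option."""
    return any(via_uses_arista_option)
```

Deciding the byte order once per set of vias, rather than per neighbor, keeps every hash in a comparison computed the same way.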
Hashing with Router-ID:¶
hash( 192.0.0.2, 224.1.1.1, 10.0.0.1 ) = 3391492512
hash( 192.0.0.2, 224.1.1.1, 10.0.0.2 ) = 3391493567
hash( 192.0.0.2, 224.1.1.1, 10.0.0.3 ) = 3391498574
In this case, the neighbor with Router-ID 10.0.0.3 would be chosen.¶
Hashing with Color, Arista-compatible:¶
hash( 192.0.0.2, 224.1.1.1, 10 ) = 3391495569
hash( 192.0.0.2, 224.1.1.1, 20 ) = 797801633
hash( 192.0.0.2, 224.1.1.1, 30 ) = 2218733534
In this case, the neighbor with color 10 would be chosen.¶
Hashing with Color, RFC-compatible:¶
hash( 192.0.0.2, 224.1.1.1, 10 ) = 4240030715
hash( 192.0.0.2, 224.1.1.1, 20 ) = 4240032301
hash( 192.0.0.2, 224.1.1.1, 30 ) = 4240042647
In this case, the neighbor with color 30 would be chosen.¶
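The selections in the three examples above follow directly from taking the highest value in each list. The short sketch below only restates that comparison using the hash values quoted in this appendix; it does not recompute them, and the `winner` helper is an illustrative name.

```python
router_id_hashes = {
    "10.0.0.1": 3391492512,
    "10.0.0.2": 3391493567,
    "10.0.0.3": 3391498574,
}
arista_color_hashes = {10: 3391495569, 20: 797801633, 30: 2218733534}
rfc_color_hashes = {10: 4240030715, 20: 4240032301, 30: 4240042647}


def winner(hashes):
    """Return the key (router ID or color) with the highest hash value."""
    return max(hashes, key=hashes.get)
```

The same three colors select a different neighbor under the two byte orders, which is why all routers comparing a set of vias need to agree on which hash they are using.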
TODO acknowledge.¶