Coding Relic: CPE


Friday, June 15, 2018

CPE WAN Management Protocol: transaction flow

Technical Report 069 (TR-069) from the Broadband Forum defines a management protocol called the CPE WAN Management Protocol (CWMP). First published in 2004 and revised a number of times since, it was originally aimed at the operation of DSL modems placed in customer homes. Over time it has broadened to support more types of devices which an Internet Service Provider might operate outside of its own facilities, in the residences and businesses of its customers.

There are a few key points about CWMP:

It was defined during the peak popularity of the Simple Object Access Protocol (SOAP). CWMP messages are encoded as SOAP XML.
Like SNMP and essentially every other network management protocol, it separates definition of the protocol from definition of the variables it manages. SNMP calls them MIBs, CWMP calls them data models.
It recognizes that firewalls will be present between the customer premises and the ISP, and that the ISP can expect to control its own firewall but not necessarily other firewalls between it and the customer.
It makes a strong distinction between the Customer Premises Equipment (CPE) being managed, and the Auto Configuration Server (ACS) which does the managing. It does not attempt to be a generic protocol which can operate bidirectionally; it exists specifically to allow an ACS to control CPE devices.

A few years ago I helped write an open source tr-69 agent called catawampus. The name was chosen based mainly on its ability to contain the letters C W M P in the proper order. I’d like to write up some of the things learned from working on that project, in one or more blog posts.

Connection Lifecycle

One unusual thing about CWMP is connection management between the ACS and CPE. Connections are initiated by the CPE, but RPC commands are then sent by the ACS. Keeping with the idea that it is not a general purpose bidirectional protocol, all commands are sent by the ACS and responded to by the CPE.

tr-69 runs atop an HTTP (usually HTTPS) connection. The CPE has to know the URL of its ACS. There are mechanisms to tell a CPE device what ACS URL to use, for example via a DHCP option from the DHCP server, but honestly in almost all cases the URL of the ISP’s ACS is simply hard-coded into the firmware of devices supplied by the ISP.

Thus:

The CPE device in the customer premises initiates a TCP connection to the ACS, and starts the SSL/TLS handshake. Once the connection is established, the CPE sends an Inform message to the ACS using an HTTP POST. This is encoded using SOAP XML, and tells the ACS the serial number and other information about the CPE in the <DeviceId> stanza.

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:cwmp="urn:dslforum-org:cwmp-1-2"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:soap-enc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap:Header>
    <cwmp:ID soap:mustUnderstand="1">catawampus.1529004153.967958</cwmp:ID>
  </soap:Header>
  <soap:Body>
    <cwmp:Inform>
      <DeviceId>
        <Manufacturer>CatawampusDotOrg</Manufacturer>
        <OUI>ABCDEF</OUI>
        <ProductClass>FakeCPE</ProductClass>
        <SerialNumber>0123456789abcdef</SerialNumber>
      </DeviceId>
      <Event soap-enc:arrayType="EventStruct[1]">
        <EventStruct>
          <EventCode>0 BOOTSTRAP</EventCode>
        </EventStruct>
      </Event>
      <CurrentTime>2018-06-14T19:34:47.297063Z</CurrentTime>
      <ParameterList soap-enc:arrayType="cwmp:ParameterValueStruct[1]">
        <ParameterValueStruct>
          <Name>InternetGatewayDevice.ManagementServer.ConnectionRequestURL</Name>
          <Value xsi:type="xsd:string">http://[redacted]:7547/ping/7fd86a7302ec5f</Value>
        </ParameterValueStruct>
      </ParameterList>
    </cwmp:Inform>
  </soap:Body>
</soap:Envelope>

Two fields in the message above are worth calling out. The EventCode tells the ACS why the CPE device is connecting: it might have just booted, it might be a periodic connection at a set interval, or it might be because of an exceptional condition. The ParameterList is a list of parameters the CPE can include to tell the ACS about exceptional conditions.
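To make the structure of the Inform concrete, here is a minimal sketch of how the receiving side might pull out those fields using only the Python standard library. The function name `summarize_inform` and the trimmed `SAMPLE_INFORM` message are made up for illustration; this is not how catawampus or any particular ACS does it.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the Inform message above.
NS = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "cwmp": "urn:dslforum-org:cwmp-1-2",
}


def summarize_inform(xml_text):
    """Pull out the fields an ACS would likely look at first (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    inform = root.find("soap:Body/cwmp:Inform", NS)
    if inform is None:
        raise ValueError("not a cwmp:Inform message")
    return {
        "cwmp_id": root.findtext("soap:Header/cwmp:ID", namespaces=NS),
        # DeviceId and its children carry no namespace prefix in the message.
        "oui": inform.findtext("DeviceId/OUI"),
        "serial": inform.findtext("DeviceId/SerialNumber"),
        "events": [e.text for e in inform.iter("EventCode")],
        "parameters": {
            p.findtext("Name"): p.findtext("Value")
            for p in inform.iter("ParameterValueStruct")
        },
    }


# A trimmed copy of the message above, used only to exercise the function.
SAMPLE_INFORM = """\
<soap:Envelope xmlns:cwmp="urn:dslforum-org:cwmp-1-2"
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
    <cwmp:ID soap:mustUnderstand="1">catawampus.1529004153.967958</cwmp:ID>
  </soap:Header>
  <soap:Body>
    <cwmp:Inform>
      <DeviceId>
        <OUI>ABCDEF</OUI>
        <SerialNumber>0123456789abcdef</SerialNumber>
      </DeviceId>
      <Event>
        <EventStruct><EventCode>0 BOOTSTRAP</EventCode></EventStruct>
      </Event>
    </cwmp:Inform>
  </soap:Body>
</soap:Envelope>
"""

print(summarize_inform(SAMPLE_INFORM))
```

The cwmp:ID is included in the summary because the InformResponse is expected to echo it back, which is how the two messages are matched up.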

The ACS sends back an InformResponse in response to the POST.

<soapenv:Envelope
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cwmp="urn:dslforum-org:cwmp-1-2"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Header>
    <cwmp:ID soapenv:mustUnderstand="1">catawampus.1529004153.967958</cwmp:ID>
    <cwmp:HoldRequests>0</cwmp:HoldRequests>
  </soapenv:Header>
  <soapenv:Body>
    <cwmp:InformResponse>
      <MaxEnvelopes>1</MaxEnvelopes>
    </cwmp:InformResponse>
  </soapenv:Body>
</soapenv:Envelope>

If the CPE has other conditions to communicate to the ACS, such as successful completion of a software update, it performs additional POSTs containing those messages. When it has run out of things to send, it does a POST with an empty body. At this point the ACS takes over. The CPE continues sending HTTP POST transactions with an empty body, and the ACS sends a series of RPCs to the CPE in the response. There are RPC messages to get/set parameters, schedule a reboot or software update, etc. All transactions are sent by the ACS and the CPE responds.

<soapenv:Envelope
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cwmp="urn:dslforum-org:cwmp-1-2"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Header>
    <cwmp:ID soapenv:mustUnderstand="1">TestCwmpId</cwmp:ID>
  </soapenv:Header>
  <soapenv:Body>
    <cwmp:SetParameterValues>
      <ParameterList>
        <ns2:ParameterValueStruct xmlns:ns2="urn:dslforum-org:cwmp-1-2">
          <Name>StringParameter</Name>
          <Value xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:type="xs:string">param</Value>
        </ns2:ParameterValueStruct>
      </ParameterList>
      <ParameterKey>myParamKey</ParameterKey>
    </cwmp:SetParameterValues>
  </soapenv:Body>
</soapenv:Envelope>

The ACS can send multiple RPCs in one session with the CPE. Only one RPC can be outstanding at a time: the ACS has to wait for a response from the CPE before sending the next.
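The shape of a session is easier to see in code than in prose. The following is a toy model, not a real implementation: plain strings stand in for SOAP bodies, `FakeAcs` is an in-memory stand-in for the HTTP server, and it assumes the CPE's reply to each RPC rides in its next POST and that an empty response from the ACS ends the session.

```python
from collections import deque


class FakeAcs:
    """In-memory stand-in for an ACS: hands out one queued RPC per response."""

    def __init__(self, rpcs):
        self.pending = deque(rpcs)
        self.received = []  # every POST body the CPE sent, in order

    def post(self, body):
        """Handle one HTTP POST from the CPE and return the response body."""
        self.received.append(body)
        if body.startswith("Inform"):
            return "InformResponse"
        # An empty POST or an RPC response both free the ACS to send its
        # next RPC, so only one RPC is ever outstanding.
        if self.pending:
            return self.pending.popleft()
        return ""  # nothing left to ask: the session is over


def run_session(acs, handle_rpc):
    """Drive one CPE-initiated session; return the RPCs that were handled."""
    handled = []
    if acs.post("Inform") != "InformResponse":
        raise RuntimeError("ACS did not acknowledge the Inform")
    reply = acs.post("")  # empty POST: the CPE has nothing more to say
    while reply:
        handled.append(reply)
        # Post the result of this RPC; the response carries the next one.
        reply = acs.post(handle_rpc(reply))
    return handled


acs = FakeAcs(["GetParameterValues", "Reboot"])
print(run_session(acs, lambda rpc: rpc + "Response"))
print(acs.received)
```

Every request on the wire is a POST from the CPE, yet every command originates at the ACS. That inversion is the whole trick.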

When the session ends, it is up to the CPE to re-establish it. One of the parameters in a management object is the PeriodicInformInterval, the amount of time the CPE should wait between initiating sessions with the ACS. By default it is supposed to be infinite, meaning the CPE will only check in once at boot and the ACS is expected to set the interval to whatever value it wants during that first session. In practice we found that not to work very well and set the default interval to 15 minutes. It was too easy for something to go wrong and result in a CPE which would be out of contact with the ACS until the next power cycle.
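The fallback described above amounts to a few lines of code. This is only a sketch: how an unset or "infinite" interval is represented is an assumption here (`None` or a non-positive number), and the function and constant names are invented for the example.

```python
# Fallback used when the ACS has never set PeriodicInformInterval.
# 15 minutes matches the default described above.
DEFAULT_INFORM_INTERVAL_SECS = 15 * 60


def effective_inform_interval(configured_secs):
    """Return how long to wait before the CPE starts its next session.

    Assumption: an unset ("infinite") interval shows up as None or a
    non-positive value. Either way, fall back to the default so the device
    cannot go silent until its next power cycle.
    """
    if configured_secs is None or configured_secs <= 0:
        return DEFAULT_INFORM_INTERVAL_SECS
    return configured_secs


print(effective_inform_interval(None))
print(effective_inform_interval(3600))
```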

There is also a mechanism by which the ACS can connect to the CPE on port 7547 and do an HTTP GET. The CPE responds with an empty payload, but is supposed to immediately initiate an outgoing session to the ACS. In practice, this mechanism doesn't work very well because intervening firewalls, like the ISP's own residential gateway within the home, will often block the connection. This is an area where the rest of the industry has moved on: we now routinely have a billion mobile devices maintaining a persistent connection back to their notification service. CPE devices could do something similar, perhaps even using the same infrastructure.
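For illustration, the connection request mechanism can be sketched with the Python standard library. The names `make_connection_request_server`, `send_connection_request` and `start_session`, and the `/ping/example` path, are hypothetical. A real CPE would also authenticate the request, which is omitted here.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_connection_request_server(start_session, port=7547,
                                   path="/ping/example"):
    """CPE side: answer a GET with an empty body, then call start_session."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != path:
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
            # Start the outgoing session without holding up this response.
            threading.Thread(target=start_session, daemon=True).start()

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    # Bound to loopback for the sketch; a real CPE would listen on its WAN side.
    return HTTPServer(("127.0.0.1", port), Handler)


def send_connection_request(url):
    """ACS side: an HTTP GET whose only purpose is to wake the CPE."""
    # Skip any proxy settings in the environment; the GET goes straight
    # to the CPE.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as resp:
        return resp.status, resp.read()
```

To try it, run `serve_forever()` on the returned server in a background thread and point `send_connection_request` at the matching URL. The GET carries no command at all. It only prompts the CPE to open the usual outbound session, so the firewall problem described above is about reaching the port in the first place.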

Saturday, October 21, 2017

On CPE Release Processes

Datacenter software is deployed frequently. Push daily! Push hourly! Push on green whenever the tests pass! This works even at extremely large scale: new versions of facebook.com are deployed multiple times each day (much of the site functionality is packaged in a single deployable unit).

CPE device software tends to not be deployed so often, not even close. There are several reasons for this:

Test practices are different.

Embedded systems is one of the oldest niches in software development and does not have a strong tradition even of unit testing, let alone the level of automated testing which makes concepts like push-on-green possible. One can definitely get good unit test coverage of code which the team developed, but the typical system will include a much larger amount of open source code which rarely has unit tests and is daunting for the team to try to add tests to. Much of the code in the system is only going to be tested at the system level. With effort and effective incentives one can develop a level of automated system test coverage... but it still won’t be close to 95%. System level testing never is, the combinatorial complexity is too high.

Additionally, with datacenter software, the build system creating the release is often somewhat similar to the production system which will run the release. It may even be the same, if the development team uses prod systems to farm out builds. A reasonable fraction of the system functionality can be run in tests on the builder.

With CPE devices, the build system is almost never a CPE tasked with compiling everything. It is an x86 server with a cross-compiler, and will likely lack much of the hardware which is key to the CPE device functionality, like network interfaces or DRM keystores or video decoders. Large portions of the system may not be testable on the builder.

The scale is different.

Having a million servers in datacenters is a lot: that is one or more very large computing facilities capable of serving hundreds of millions of customers.

Having a million CPE devices is not a lot. There are typically multiple devices within the home (modem, router, maybe some set top boxes), so that is a couple hundred thousand customers.

It can simply take longer to push that amount of software to the much larger number of systems whose network connections will generally be slower than those within the datacenter. Multiple days is typical.

The impact of a problem in deployment is different.

If you have a serious latent bug which is noticed at the 3% point of a rollout within a datacenter, that is probably a survivable event. Customers may be impacted and notice, but you can generally quarantine those 3% of servers from further traffic to end the problem. The servers can be rolled back and restored to service later, even if remediation steps are required, without further impacting customers.

If you have a serious latent bug which is noticed at the 3% point of a rollout within a CPE fleet, you now have a crisis. 3% of the customer base is impacted by a serious bug, and will feel the impact until you finish all of the remediation steps.

If the remediation steps in 3% of a datacenter rollout require manual intervention, that will be a significant cost. If the remediation steps in 3% of a CPE fleet deployment require manual intervention, it will have a material impact on the business.
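Stopping at "the 3% point" presumes the rollout is staged in the first place. One common way to do that, shown here as a hedged sketch and not a description of how any particular fleet did it, is to hash each device's serial number into a stable bucket and compare it against the current rollout percentage. The function names are invented for the example.

```python
import hashlib


def rollout_bucket(serial_number):
    """Map a device serial number to a stable bucket from 0 through 99."""
    digest = hashlib.sha256(serial_number.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_update(serial_number, rollout_percent):
    """True if this device falls inside the current rollout percentage."""
    return rollout_bucket(serial_number) < rollout_percent


# Made-up serial numbers, only to show the gate in action.
fleet = ["cpe-%04d" % i for i in range(1000)]
early = [s for s in fleet if should_update(s, 3)]
print(len(early), "of", len(fleet), "devices would update at the 3% stage")
```

Because the bucket depends only on the serial number, a device admitted at 3% is still admitted at 10% and beyond, so pausing the rollout means simply not raising the percentage. Getting the affected devices back to a good release is a separate problem.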

We’ll jump straight to the punchline: How often should one deploy software updates to a CPE fleet?

In my opinion: exactly as often as it takes to not feel terrified at the prospect of the next release, no more and no less often than that.

Releasing infrequently allows requirements and new development to build up, making the release heavier and with more opportunities for accidental breakage. It also results in developer displeasure at having to wait so long for their work to make it to customers, and corresponding rush to get not-quite-baked features in to avoid missing the release.
Releasing too frequently can leave not enough time to fully test a release. Though frequent releases have the advantage of having a much smaller set of changes in each, there does still need to be a reasonable confidence in testing.

In the last CPE fleet I was involved in, we tried a number of different cadences: every 6 weeks, then weekly, then quarterly. I believe the 6 week cadence worked best. The weekly cadence resulted in a number of bugs being pushed to the fleet and subsequent rollbacks simply due to the lack of time to test. The quarterly cadence led to developers engaging in bad behavior to avoid missing a release train, by submitting their feature even in terrible shape. The release cadence became even slower, and the quality of the product noticeably lower. I think six weeks was a reasonable compromise, and left enough headroom to do minor releases at the halfway point as needed where a very small number of changes which were already tested for the next release could be delivered to customers early.

One other bit of advice: no matter what the release cadence is, once it has been going on long enough, developers will begin griping about it and the leadership may begin to question it (Maxim #4). Leadership interference is what led to the widely varying release processes in the last CPE fleet I was involved in. My only advice there is to manage upwards: announce every release, and copy your management, to keep it fresh in their minds that the process works and delivers updates regularly.

Friday, September 15, 2017

On CPE Cost

When it comes to the cost of hardware, volume matters more than anything else. To a large extent, volume matters more than everything else put together. A cost efficient hardware design produced in low volume will be considerably more expensive than an inefficient and sloppy design produced in high volume. Plus, for a high volume product, the Contract Manufacturer will have engineering teams to help tighten the design for a moderate fee.

If your own sales volume is sufficient to get deep volume discounting, you can stop reading now (more honestly, you aren't reading this in the first place). Otherwise, if you are building a product for a new market or you are building for a niche, read on.

What does this mean? It means you should work very, very hard to use hardware which is produced in high volume. The compromises you would make in terms of RAM or other capabilities in order to get your own custom design down to a price you can tolerate will cost you far more than you saved in terms of updating the software and capabilities throughout the service lifetime. Using an existing, high volume design may bring other compromises, but it is a good tradeoff to make.

If you want to have your branding on the box: many commercial off the shelf (COTS) devices are available in unbranded white-box versions. It is simple and easy to add silkscreening or design flourishes, often a one-time design fee and a tiny line item on the Bill of Materials.

If you want to add RAM, Flash, moderately faster CPU, etc: most of those white-box products allow customization of specs which do not require changes in the board design. RAM and Flash suppliers offer different capacities in the same pinout, and CPU vendors offer multiple speed-bins of their chips. There will be a sweet spot in the market where the industry is buying the most volume, with a reasonable standard deviation such that you can moderately increase the capability without substantially increasing the cost. The converse is also true: moderate reductions in RAM/Flash/CPU don’t substantially decrease cost and may not be a good tradeoff.

If you want to have a unique industrial design: many ODMs will customize a product for you, including a new casing. It will need to fit the existing board, and will cost a few hundred thousand dollars for design, tooling, and emissions testing, but that is still cheaper than taking it all on in-house as you get the volume pricing for the board and other components.

Corollaries:

Mobile ate the world. You shouldn’t shy away from using mobile chipsets, even if your product will never operate on battery. Volume drives cost down, and mobile has the volume. Also, mobile chipsets with good power management are less in need of active cooling, and fanless is a huge win for consumer products.
RAM does cost money, but RAM is your future proofing. Greatly reducing RAM to lower cost is usually a bad tradeoff. Raspberry Pi Zero has 512 MBytes of RAM and costs US $10. Moderate amounts of RAM do not add much cost.
Many modern CPUs have configurable endianness, but seriously: little endian won. I hate that it won, but it did. If you’re considering a big endian toolchain, think carefully about the life choices that led you to that dark place. You’ll be taking endianness bugs onto your own plate for no benefit.

Monday, September 11, 2017

Musings On Customer Premises Equipment

I spent most of the period 2011 - 2017 building Customer Premises Equipment (CPE) for an Internet and television service provider. CPE gear is equipment which is installed in a customer’s residence/business/etc in order to provide an endpoint for whatever service the provider offers. For an Internet Service Provider this will be things like cable/DSL modems and Wi-Fi routers. For an electric utility, the remotely accessible meter could be considered the CPE. There are very few services which don’t require any sort of equipment on premises.

CPE devices are built very much like other consumer electronics, using the same components and software techniques. The biggest distinction is in the mindset for its development: the device you are building is not the product. The device you are building is an enabler, the actual product is the service being offered. It might seem a minor distinction but, done right, it shouldn’t be minor. It should influence the entire implementation.

For example, comparing consumer electronics built for sale to end-users to CPE devices used by service providers:

Devices for retail will build several variants, leveraging the same development and codebase but providing multiple price points for different market segments. They may also have slightly differentiated SKUs for large retailers to not compete directly with each other on price. For a service provider CPE, the incentive is the opposite: just one model to avoid inventory cost and CPE replacement for customers who upgrade. Multiple price points are provided by the service plan, which means the CPE will have capabilities which are not enabled for customers on the lower tier plans.
Devices sold to end-users will often cater to power users, as getting good reviews and positive press leads to increased sales in the vastly larger mid-tier market. They will have long lists of features, and try to check as many boxes as possible. For a service provider CPE, obscure features add support costs without attracting much additional market share. The CPE will usually target the 80% point of features which will be widely used. The kind of customer who really wants the power-user features tends to also be the most likely to try to Bring Their Own Device and make minimal use of the provided CPE, anyway.
For a CPE device, remote serviceability is a hugely important feature. Ability to resolve a customer escalation without having to send a technician is important for OpEx (Operational Expense). Even if the service provider charges for technician visits, it is at best defraying costs and not enough to incentivize wanting to send techs.
Though upfront capital cost is important, the service lifetime of CPE in the field is equally important. The value of CPE devices is in the ongoing service revenue they enable. It makes sense to leave headroom in the initial device to be able to support upgraded services in the future, and is financially preferable to having to retire the device earlier. Retirement either means mass replacement of customer devices, or writing off a portion of the customer base for any future offerings and just leaving them as is.

I found the work on CPE devices to be interesting. I hope to write about a few areas of it.