At the SFU Satellite Design Team, we are taking part in the Canadian Satellite Design Challenge. As part of the challenge, we were offered the chance to put some of our hardware in a proton beam at the TRIUMF facility, Canada’s national lab for nuclear and particle physics. This was a really unique opportunity, and one we were excited to use to the best of our ability.
We designed a test for our onboard computer (OBC), which has the following features:
- TMS570 microcontroller (ECC flash & RAM, lockstep architecture)
- external flash memory
- real time clock
- external watchdog
Primarily, we were interested in a qualitative understanding of how the system performed under the proton beam. The test software we developed had the following functions:
- MCU self-tests (RAM, self-test controller, various peripherals)
- initialization routine and RTOS control
- reason for reset reporting
- external flash reads and writes
- time stamping of all data in flash
- serial communication of all flash reads/writes for logging on computer
- frequent watchdog kicks
Radiation can affect hardware in two different ways. First, single event upsets (SEUs) are simple bit flips caused by a high-energy particle. They don’t typically cause permanent damage, but do result in corrupted data. They are fixed by rewriting the affected bit with its correct value. ECC memory, for example, is an appropriate way to deal with SEU effects.
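To make the ECC idea concrete, here is a minimal sketch of single-bit correction using a textbook Hamming(7,4) code. This is only an illustration of the principle and is not the ECC scheme the TMS570 actually uses; the function names are made up for this example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (parity bits at positions 1, 2 and 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant data bit
    p1 = d[0] ^ d[1] ^ d[3]  # parity over positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]  # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))


def hamming74_correct(codeword):
    """Return (data nibble, syndrome). A non-zero syndrome is the position that was flipped."""
    bits = [(codeword >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1  # flip the upset bit back
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


# Simulate an SEU: flip one bit of a stored codeword, then read it back.
stored = hamming74_encode(0b1011)
upset = stored ^ (1 << 4)  # the particle flips position 5
data, where = hamming74_correct(upset)
print(bin(data), where)
```

The important property is that any single flipped bit is both located and repaired without the software ever seeing bad data, which is why counting corrected errors would have been a useful metric.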
Single event latchups (SELs) are a more serious issue. In these cases, a high-energy particle triggers a parasitic P-N-P-N structure that already exists in the CMOS silicon. This is known as a parasitic thyristor. Once it is conducting, it can’t be turned off by any means except removing power (there’s no gate we can reach). In practice, this means that a chip that has SEL’d needs a power-on (hard) reset. If this reset happens fast enough, damage can be avoided. SEL events essentially create a short inside the chip, meaning current consumption will increase significantly. Therefore, we can attempt to protect against latchup by monitoring the current draw and cycling the power if the current suddenly spikes. On the SFUSat OBC, we use the INA301 current sense amplifier and a FET to do this.
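The decision logic behind that kind of protection can be sketched in a few lines. On the real board this job is done in hardware, so the following is only a software model of the idea; the function name, the 250 mA trip point, and the off-time are all assumptions made for illustration.

```python
def latchup_power_states(samples_ma, trip_ma=250.0, off_samples=3):
    """Model an overcurrent trip: return the power switch state (True = on) per sample.

    When a sample exceeds trip_ma, power is cut for off_samples samples
    (including the one that tripped), then restored.
    """
    states = []
    off_remaining = 0
    for current in samples_ma:
        if off_remaining > 0:
            # Still holding power off so the parasitic thyristor can turn off.
            states.append(False)
            off_remaining -= 1
        elif current > trip_ma:
            # Sudden spike: treat it as a latchup and open the FET right away.
            states.append(False)
            off_remaining = off_samples - 1
        else:
            states.append(True)
    return states


# Normal draw around 80 mA, then a latchup-like spike to 600 mA.
print(latchup_power_states([80, 85, 600, 0, 0, 82]))
```

Holding the power off for a fixed interval before restoring it matters because the latched path only clears once the supply has actually collapsed.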
The main idea is that the computer will be running fairly standard functionality (reading and writing to flash). We write, then read flash to determine if any upsets have happened due to radiation, and we print the values (at time of read) to the serial port for logging. All writes to flash are time tagged in the flash itself, and writes happen at a reliable frequency. This allows us to determine whether the RTC is still operational.
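The two checks described here (comparing read-back data against what was written, and checking that time tags arrive at a steady cadence) could be done on the logged data roughly as follows. This is a host-side sketch under assumed data shapes, not our flight code; both function names are hypothetical.

```python
def count_bit_flips(written: bytes, read_back: bytes) -> int:
    """Count the bits that differ between what was written and what was read back."""
    if len(written) != len(read_back):
        raise ValueError("buffers must be the same length")
    return sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))


def irregular_gaps(timestamps, period_s, tolerance_s):
    """Return (index, gap) pairs where consecutive RTC time tags are not about one period apart."""
    problems = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if abs(gap - period_s) > tolerance_s:
            problems.append((i, gap))
    return problems


# Two flipped bits in the read-back data.
print(count_bit_flips(b"\x00\xff", b"\x01\xfd"))
# Writes expected every second; the RTC appears to jump between the third and fourth.
print(irregular_gaps([0, 1, 2, 7, 8], period_s=1, tolerance_s=0))
```

A gap that is too long can mean either a stalled RTC or a missed write, so it is worth cross-checking against the reset reports before blaming the clock.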
The watchdog is external, although there is an internal one on the MCU. Should the system lock up due to an SEU or a latchup, the watchdog will power-on reset the MCU. This triggers the self-tests, which use the same diagnostic structures that are used to test the chip at the fab, made accessible to software. The idea is that these self-tests are a very fast and efficient way to toggle nearly all of the transistors on the chip, hopefully clearing any temporary upsets that may have triggered the reset in the first place. Since the reason for reset is saved and reported at startup, we can see when this happens on the monitoring computer.
Overall, it was a simple test, but one that was representative of the typical things the OBC will need to do. There were a few limitations, mostly features we did not implement due to time constraints:
- no checking of number of errors automatically corrected
- no self-test of CPU core compare module
- no verification of flash integrity before writing
- no use of internal watchdog
- no test of command/response from the host computer
- no monitoring of power consumption for silicon degradation effects
- no dedicated latchup detection or mitigation
It would have been great to get to these features. In particular, seeing how many errors (in RAM, for example) had been automatically corrected by the chip’s ECC would have been very interesting. Overall, the intent was to test macro functionality and determine whether our OBC design could operate in a radiation-rich environment, not to determine exactly what the effects are. The good news is that we can functionally test most of these items without the need for a cyclotron. For example, TI provides an API to inject errors into ECC, which will allow us to test the ECC mechanism itself. Silicon degradation is very interesting to me, and it was suggested that the power consumption will increase with the total dose of radiation received. It would be very interesting and valuable to study this, and I hope we can in the future!
Despite the limitations on the scope of our test, it will give us valuable information. Primarily, we aim to establish whether the OBC can perform in low Earth orbit for a period of 1 year, and correct any errors that should arise from radiation effects. Since the reliability features used during the test are only a subset of those to be used on the actual satellite, we can reasonably expect higher reliability under the same test conditions once those features are implemented.
To monitor the data coming back from the OBC during the test, I developed a simple Python script to take in the serial stream and time-tag each message with the UNIX epoch. This allowed us to view the data in real time and determine if anything had happened to the OBC. Time-tagging with a very accurate epoch might also allow us to see if the RTC drifted significantly. However, the system is not deterministic, so we would most likely only be able to detect very large drifts.
Of course, all of the incoming data was saved into a file as well. I will do a tutorial on writing a script like this one, as it’s a pretty quick and easy way to do a comprehensive test of an embedded system like our OBC. Also, it featured automatic reconnection, which was done in a clever way thanks to Python.
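Until that tutorial exists, here is a minimal sketch of what such a logger can look like. It is not the script we ran at TRIUMF, and the reconnection approach shown (just retry opening the port in a loop) is an assumption on my part. The port name, baud rate, and function names are hypothetical; `open_pyserial` relies on the third-party pyserial package, while everything else is standard library.

```python
import time


def open_pyserial(port="/dev/ttyUSB0", baud=115200):
    """Open a serial port with pyserial (imported lazily; port and baud are placeholders)."""
    import serial
    return serial.Serial(port, baud, timeout=1)


def log_serial(open_port, sink, clock=time.time, sleep=time.sleep,
               retry_delay_s=1.0, max_lines=None):
    """Copy lines from a serial port to sink, prefixing each with the UNIX epoch.

    open_port is any callable returning an object with readline() and close().
    If opening or reading raises OSError (pyserial's SerialException is a
    subclass of it), wait and reconnect. Stops after max_lines if given.
    """
    written = 0
    while max_lines is None or written < max_lines:
        try:
            port = open_port()
        except OSError:
            sleep(retry_delay_s)  # port not there yet, try again shortly
            continue
        try:
            while max_lines is None or written < max_lines:
                raw = port.readline()
                if not raw:
                    continue  # read timed out with no data
                # Garbled bytes are kept as replacement characters, not dropped.
                text = raw.decode("ascii", errors="replace").rstrip()
                sink.write(f"{clock():.3f} {text}\n")
                sink.flush()
                written += 1
        except OSError:
            sleep(retry_delay_s)  # link dropped mid-read, reconnect
        finally:
            port.close()
    return written
```

With pyserial installed, something like `log_serial(open_pyserial, open("beam_log.txt", "a"))` would run until interrupted. Passing the port opener, clock, and sleep function in as arguments also makes the loop easy to test against a fake port without any hardware attached.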
Going into the test, I really did not know what to expect. I mounted everything to a piece of plastic: a TMS570LS043 LaunchPad, the OBC demo board (for the RTC and watchdog), and an external breakout I made for some new flash memory, since the chips on the demo board had been EOL’d by Cypress.
It turned out that we were testing in TRIUMF’s proton therapy facility, which is typically used to treat eye cancer. They had removed the patient chair, and there was an X-Y stage that we could mount our boards to. After selecting an appropriately sized collimator, we aimed the beam at the MCU. We decided to primarily irradiate the MCU since we were mostly interested in its reliability features. If we had had more time, it would have been interesting to test the RTC and flash more specifically.
I really got the sense that the cyclotron is a living being - and one that straddles the line between being under our control, and out of it. Earlier that day, they had to shut the system down because of a power supply failure. And as they were bringing the beam up to start our test, I watched the numbers in 4 quadrants change, as an operator steered the beam to centre it. You can almost imagine someone driving this beam around, pulling levers and things. It’s not really like that, but all of the older looking control consoles give that kind of vibe. It’s interesting, having cutting edge research (and our simple test), and very well engineered (though old) equipment working together.
Digression aside, watching the beam come up to energy was really cool. When they opened the shutter, I was initially nervous that the OBC would kick the bucket very early on in the test. That didn’t happen, and it kept going, and going, and going, as more protons were channeled towards it. About two-thirds of the way into the 20 minute test, we had what looked like an SEU event, as we got back some garbled data. More on this in another post after I analyze things further.
Aside from the SEU event though, the system performed flawlessly, and we noticed no data corruption and no resets during the test. It was also interesting to note the difference in capability between our TMS570 and a standard STM32 part. The initial dose is measured in an arbitrary calibration unit they call MC. Our test finished around 351,000 MC (3 krads), and the system was still going strong. The STM32 tested by another team was only able to handle 6,000 MC (about 0.05 krads).
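For anyone who wants to check the numbers, the conversion is simple arithmetic. This sketch assumes the dose scales linearly with MC, using our end-of-test figures as the calibration point; the constant and function names are mine.

```python
# Calibration point from our run: roughly 351,000 MC corresponded to 3 krads.
KRAD_AT_END = 3.0
MC_AT_END = 351_000


def mc_to_krad(mc):
    """Convert monitor counts (MC) to krads, assuming a linear relationship."""
    return mc * KRAD_AT_END / MC_AT_END


stm32_krad = mc_to_krad(6_000)
print(f"STM32 stopped at about {stm32_krad:.4f} krads")  # 0.0513
print(f"TMS570 took at least {MC_AT_END / 6_000:.1f}x the dose")  # 58.5x
```

Since our board was still running when the test ended, that ratio is a lower bound on the difference rather than a measured failure point.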
This test was an extremely valuable opportunity for the team, and I’m very happy with how it went. Clearly, there are more things we should implement for the next time a radiation test rolls around, which would make it that much more valuable. I’d also like to extend thanks to the CSDC management team, and the folks at TRIUMF who stayed up late and volunteered to help us run our tests.