Jul 19, 2017

DevOps Days Austin Video and Streaming Postmortem

I’m not going to sugarcoat this: the streaming and video capture for DevOps Days Austin 2017 was poor. It was also my responsibility. So I wanted to write this for two reasons: to apologize, and to explain what I am doing to fix it for next time.

Background

For 2016, DevOps Days Austin moved to Darrell K Royal Memorial Stadium at the University of Texas at Austin. We also expanded to three tracks. We didn’t have great access to the facility before the event that first year, so we went super conservative and hired a third party to handle all the AV. It worked out fine, but it was very expensive and the end result was a lot of work for me to get the videos on-line. I approached the organizers about making things better this year, and we started planning.

We decided we would be able to do all the AV in-house with a fair amount of volunteer help. After some basic testing at home, I put together lists of the equipment we would need. The total came out several thousand dollars less than what we paid to rent and outsource in 2016, and we wouldn’t have to spend it every year: a big spend this year, much less in later years as we upgrade and add equipment. So far, so good.

Side note: I’m putting together a complete inventory of hardware, software, configs, docs, etc., and all of it will be posted on GitHub so others can learn from my mistakes and replicate what works for their own events. Do you have an event in Austin with up to three stages and want AV? Talk to me! This post will be updated with links once it all goes on-line.

This is where we hit our first problem: budgeting and logistics. We ordered most of our gear from Amazon, and a couple of the organizers wound up fronting the cost (around $15k) on their credit cards until we could get reimbursed from the conference funds. That…slowed things down. We also wound up getting the equipment very close to the show because that’s when funds were available. The original plan was to dry-run the whole mess at a DevOps Austin meetup three weeks before the conference.

Your Amazon shipment has been delivered

In order to simplify things, we wound up swapping out some components for what we could find in stock at Amazon. For most things, this was fine. For a few (specifically the SDI converters) it was not. We also discovered Amazon doesn’t like you ordering more than about $5k at a time. And the joy of Amazon meant we got over 40 individual packages delivered. We congregated all the boxes in a spare room at my place of employment one weekend and started assembling and testing.

Remember I said swapping the SDI converters was a bad idea? This is when we first encountered it, one business day before the meetup where I had planned a full-scale test. Needless to say, we didn’t do that test, and that was my second major problem: we only tested at my place of employment on weekends, and that environment didn’t reflect the venue in some very specific ways.

The cheap Chinese SDI converters were sent back, and Blackmagic Design ones were ordered as replacements (we were also using Blackmagic Design capture cards, so compatibility was guaranteed; they carried a bit of a premium, but in the end playing it a little conservative caused fewer problems). Testing continued and we could now reliably source all the inputs in OBS. Some basic testing was done, but this is another place where I didn’t test enough: live streaming from a place of work is difficult. So I had a test screen on a Raspberry Pi and the camera pointed out the window, with all audio muted. I had looked at the video streams from my desktop, but never thought to check the audio (in my previous experience, systems tend to prioritize audio over video, because if you can hear fine but the video is a slideshow, for the most part you don’t care).

Training

The weekend before the conference, we did some training, again at my place of employment. We never got a chance to look at the streams from an “outside” view (even if “outside” was defined as “from my desk with headphones on”). I did see some stuttering on the main stage system once it was fully wired up, but we were dealing with (yet more) SDI converter issues (we didn’t upgrade the ones we were using for the preview projectors, and should have) and making sure the local PA was OK. The conference was that week.

Setup Day

The day before the conference was setup day. We were able to get everything in and mostly set up within a few hours, and we were finally able to do some testing in situ, as it were. There were problems. Our documentation about certain aspects of the venue was off, so some adjustments needed to be made there (the location of the projector inputs for one of the stages was the biggest issue). We tested all the RF equipment without issues (but in hindsight the room was missing several hundred attendees with various WiFi and Bluetooth devices). Most importantly, CPU utilization on the recording/streaming systems jumped quite a bit. From some investigation after the fact, I think this was the massive number of WiFi beacons causing interrupt requests from the USB3 WiFi dongles to spike outrageously. The main stage was showing its encoder was overloaded and running at only 7-8fps. That wasn’t going to be acceptable, so I went to our local Fry’s with a plan to upgrade, but as usual what I wanted was out of stock. I grabbed what I could (in hindsight I should have gone overboard just to be safe) and built upgrades that night. It took me so long to get everything together that I didn’t do enough testing before getting a few hours of sleep.

Day one

The new system for the main stage was installed before we went live, and it wound up running at 8-9fps with the encoder still overloaded. I tried several things to improve it but didn’t get much out of them. Tweets were coming in that the stream was unwatchable, and I finally turned the stream off in the afternoon and just recorded to disk from the main room. We had good reports from the second and third tracks, but I didn’t have much else to go on.

We also had some audio issues in the morning because our PA testing the day before had been for an empty room. Adding the attendees made the wireless mics drop out quite a bit. We wound up moving the receivers up to the stage and at least those issues went away.

Because the recording encoder defaults to the same settings as the streaming encoder, the recordings from the afternoon were no better than the streams in the morning. The same overloaded encoder was still dropping frames all over the place: VLC reported that the full afternoon of recording yielded a 15-minute file. That was really disappointing.

That evening at the happy hour, YouTube notifications about poor audio quality started pouring into my phone. I figured these were for the main stage, but they were for tracks 2 and 3. It turns out my assumption that the audio is used as a baseline and the video added on top was incorrect: the video is used as the baseline and the audio drops to match the video. That was surprising and truly unfortunate.

Day two

Because of how poorly the previous day had gone in the main room (and since I needed much more sleep), I went overboard and just brought in my personal gaming rig to capture on for the main track. Amazingly, adding a really overpowered machine optimized for gaming made all the problems on the main stage go away. I made adjustments on the 2nd and 3rd tracks to reduce the frame rate from 60fps (which was optimistic, but matched all the inputs; I thought this would reduce CPU and GPU load) to 30fps, and things looked better. I wasn’t able to check sound because of a lack of headphones, but otherwise things looked good. Streams from the second day had some quality issues on the 2nd and 3rd tracks, but the main track was flawless.

The day went much better for all three tracks. Then we tore everything down, packed up, and went home.

What went wrong and how we are fixing it:

Here’s the complete list of what went wrong and how we’re approaching fixes.

Way too many WiFi beacons for the poor RTL8812 based USB3 network cards

The WiFi cards we purchased are based on the Realtek 8812 chipset. It’s a cheap device, and it performs accordingly. It caused a lot of CPU load on the systems compared to wired Ethernet. My gaming rig has an ASUS USB3 WiFi adapter, which had no issues at the event. I think it would be even better to move all the machines behind a dedicated WiFi bridge for each track (the same sort of device you use to connect an old Xbox to your WiFi, for example). Real wired Ethernet would be best, but I know that is not going to happen for lots of reasons (mostly that the venue cannot support it). I’m going to do some testing with hardware bridges so that everything is still on WiFi, which will also allow adding devices that don’t have WiFi (like various HDMI switchers). The overhead of a busy WiFi network gets eaten by the bridge and only IP traffic moves back and forth to the PCs. At least in theory. Otherwise, we’ll just replace everything with non-USB WiFi adapters as part of the equipment upgrades below…
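To make that testing concrete, a small helper like the sketch below (my own hypothetical tooling, not part of the current kit) can log CPU load and interrupts-per-second while a test stream runs, so the USB dongle, a hardware bridge, and wired Ethernet can be compared with actual numbers. It assumes Python with the psutil package installed.

    # Hypothetical test helper: log CPU load and interrupts-per-second so
    # different network setups (USB dongle vs. bridge vs. wired) can be
    # compared while OBS is streaming. Requires `pip install psutil`.
    import csv
    import time
    import psutil

    def log_system_load(path="network_test_load.csv", seconds=300):
        last_interrupts = psutil.cpu_stats().interrupts
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["time", "cpu_percent", "interrupts_per_sec"])
            for _ in range(seconds):
                time.sleep(1)
                stats = psutil.cpu_stats()
                writer.writerow([
                    time.strftime("%H:%M:%S"),
                    psutil.cpu_percent(interval=None),
                    stats.interrupts - last_interrupts,
                ])
                last_interrupts = stats.interrupts

    if __name__ == "__main__":
        log_system_load()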

Bugs in the release of the AMD Advanced Media Framework hardware encoding stack for OBS

I discovered a new release of the AMD drivers and the OBS AMD encoder halfway through day 1 of the conference, and applied them at the beginning of the 2nd day on the Stage 2 and 3 machines. These things happen. Windows 10 automatic driver “upgrades” hurt us here as well, because it turns out they replaced some of the core files needed for the Advanced Media Framework with versions that had compatibility issues. The fix is to run DDU once (which removes all the broken driver pieces) and then to disable Windows “updates” for graphics drivers again after every major Windows update. I had Windows updates disabled by marking the conference WiFi as metered, which worked on site, but Windows 10 helpfully replaces drivers without asking, and with the AMD stack that can leave them in a bizarre half-installed state. I learned while debugging after the conference that this was part of our problem. Running DDU and installing the drivers from AMD caused CPU load to nosedive and the encoder to be stable in the same monitor configuration we used at the conference. Lesson learned there.
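For the “stop replacing my graphics drivers” half of that fix, one option (a sketch of my own, not the exact steps we used) is to set the Windows Update policy value that excludes drivers from quality updates; it corresponds to the “Do not include drivers with Windows Updates” group policy and needs an elevated prompt on Windows 10 Pro or better. It is still worth re-checking driver versions after every major Windows upgrade.

    # Sketch: exclude drivers from Windows Update by setting the
    # ExcludeWUDriversInQualityUpdate policy value. Run from an elevated
    # Python prompt on Windows.
    import winreg

    KEY_PATH = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"

    def exclude_driver_updates():
        # Create (or open) the policy key and set the DWORD that maps to the
        # "Do not include drivers with Windows Updates" group policy.
        key = winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                                 winreg.KEY_WRITE)
        try:
            winreg.SetValueEx(key, "ExcludeWUDriversInQualityUpdate", 0,
                              winreg.REG_DWORD, 1)
        finally:
            winreg.CloseKey(key)

    if __name__ == "__main__":
        exclude_driver_updates()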

Not enough hardware for projecting the OBS preview to a second monitor

I had hoped to keep the boxes physically small so they would be easier to transport and store. This also let us use less expensive hardware to keep overall costs down, which meant we were using the AMD AM1 platform. AMD is also in the middle of a massive hardware refresh, and I didn’t want to spend a bunch of money on something we were going to need to replace next year anyway. There were no ITX boards with Nvidia GPUs, so I went AMD instead. (The Intel GPUs don’t seem to support encoder offloading for OBS, so I stuck with the AMD ecosystem.) I also did my initial testing on AM1-based systems, and since the GPU was doing most of the work they were fine. I knew they were underpowered, but again, the GPU was doing all the heavy lifting, and there was lots of headroom in Resource Monitor. All of that headroom was eaten by some combination of the second-monitor preview, WiFi, and mismatched driver components. I’m in the middle of massive testing here for improvements, which leads to…

Not enough hardware for encoding at 1080p60

Again, trying to keep costs and physical space requirements down meant the hardware was a little on the light side. The day 1 replacement hardware I got for the main stage from Fry’s was a near-top-of-the-range FM2 APU-based ITX system. At first, that didn’t cut it either; later I found that was mostly due to driver issues. My gaming rig (used on the main stage on day 2) is an 8-core FX-8130 with a GTX 980 GPU, and it worked flawlessly. But if future hardware needs both a dedicated GPU and the capture card, that means moving to Micro-ATX boards: a much larger form factor and higher cost. Information about the new Ryzen APU parts (in the form of Linux kernel driver patches) is out, and they are much more than I expected in terms of performance. I have done some research with an FM2 760K and an RX 460, and things are rock solid in the same display configuration as the conference main track. The Ryzen APU should be better still. Research on this is ongoing, and finding lots of settings to tweak is helping (it’s amazing what happens when you take OBS out of basic mode).

Lack of audio testing for Tracks/Stages 2 and 3

This is a stupid-simple fix: adding headphones to all the streaming machines. I facepalmed when I realized how stupid it was that I didn’t add them earlier.

1080p60 was optimistic and we should have been shooting for 1080p30

Pretty self-explanatory, and it’s an easy fix. We did this on day 2 for stages 2 and 3; stage 1 was just fine at 1080p60 with the massively overpowered hardware I had there. I went this route because the notebooks we were getting slides from were all 60Hz and the cameras were also 60Hz, and I had falsely assumed that keeping everything at the same refresh rate would reduce load on the GPU. It looks better and all the equipment wants to be at 60Hz, so we’ll shoot for it on the new hardware and see what happens, and fall back if we need headroom on the OBS nodes.
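For a rough sense of what that fallback buys, here is some back-of-envelope math (my own illustration, not numbers from the event): halving the frame rate halves the raw pixel rate the encoder has to process each second, before any compression even happens.

    # Raw pixel throughput at 1080p for 60fps vs 30fps; dropping the frame
    # rate halves the work the encoder is fed, which is why it helps
    # underpowered encoders so much.
    WIDTH, HEIGHT = 1920, 1080

    for fps in (60, 30):
        mpix_per_sec = WIDTH * HEIGHT * fps / 1_000_000
        print(f"1080p{fps}: {mpix_per_sec:.1f} megapixels/second into the encoder")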

Recordings were no better than the streams

We had the recording mode in OBS set to “same as stream,” which meant whatever the streams looked like is what we got on disk. This turned out to be fatal for actually using the recordings as a backup to the stream. Ideally I would like the recordings to be lossless versions of the source streams. OBS does support this, but I don’t know what the resource requirements for the encoders are or what the drive space requirements are. Testing continues here as well, and adding drive space should be easy if we can pull off having the streams encoded two different ways simultaneously.
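As a starting point for the drive-space question, here is a quick estimate (the bitrates are my own guesses for illustration, not measured numbers from the conference): even a very high-bitrate recording stays within reach of an ordinary hard drive over a conference day.

    # Rough disk-space math for recording at a higher quality than the stream.
    GiB = 1024 ** 3

    def recording_size_gib(bitrate_mbps, hours):
        """Approximate file size in GiB for a given bitrate and duration."""
        return bitrate_mbps * 1_000_000 / 8 * hours * 3600 / GiB

    for label, mbps in [("stream-quality (6 Mbps)", 6),
                        ("high-quality mezzanine (50 Mbps)", 50),
                        ("near-lossless (150 Mbps)", 150)]:
        print(f"{label}: ~{recording_size_gib(mbps, 8):.0f} GiB for an 8-hour day")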

Changes in framing

I had originally designed the system for a tight shot on the podium, but in reality we needed a wide area for panels and for people walking around. This meant zooming out. The awesome volunteers in tracks 2 and 3 had issues where the camera was adjusting the iris to expose for the projector screen above the speakers (for slides) and ignoring the presenter. They figured out how to change the crop and things got much better. We need to do this on Stage 1 too, but that camera was off to the side for logistics reasons. I hope to correct this next year by hanging the camera from the ceiling so it doesn’t block doors and can be center-aligned.

Operating system

As you may have noticed, everything was running on Windows 10. I had hoped to run Linux, but the machines were too underpowered to handle the workload there. As much as I dislike Windows, its media framework is superior to what is in Linux and OBS right now. Simply launching OBS and putting two Blackmagic video streams on the canvas caused the CPU to peg at 100% in Linux; doing the same thing in Windows caused only about a 4% increase in CPU utilization. My hope is that the new hardware will have enough headroom to allow running Linux instead of Windows (which would make a lot of issues related to background processes, driver compatibility, and auto-update processes go away). Once the Ryzen APU parts start shipping, this is one of the first things I plan on testing.

Moving forward

Next time will be better. I can’t say how much better yet. I have a lot of testing ongoing between now and the next conference/meetup/whatever where we use this gear again; I am planning on using it in a variety of environments at least once a quarter between now and DevOps Days Austin 2018. This may also mean things don’t get much better until the next DevOps Days Austin, when the budget opens up and we can buy upgrades. We’ll see how it goes. It’s going to be better. Incremental improvements and testing. Stay tuned for updates.
