记一次失败的转播

众所周知我推莫寒。关于结果并没有什么可说的；江东子弟多才俊，卷土重来未可知，聊以自慰。不论成绩如何，S队还是S队，莫寒还是莫寒，八月五日依旧会开开心心地看公演。

这篇博文我只想记录我转播总选时的一些教训，以供将来借鉴。进入技术话题，切换至英文模式。

So, yesterday I streamed the official YouTube broadcast onto my very own livestream. The motivation was to have a readily available backup VOD immediately after the event, in case the official channel did anything stupid to their version. Turns out this went horribly wrong. The official channel didn’t do anything stupid to their archived stream, and the last four hours were immediately available after the broadcast; some hours later, processing on the complete 7:16:03 version was done and made fully available. Meanwhile, my archive is still being processed after 24 hours, and only the last 1:35:01 is available at the moment; the total length shown in Video Manager also dropped from ~7:17 to 5:50:31 for whatever reason. (As for why mine is still being processed, I suppose YouTube priotizes popular streams.) In short, my backup was unnecessary and a disaster.

But my livestream went further than a backup. I didn’t expect to hit the top in Search, which attracted 5–10% of the total viewership. My stream peaked at ~400 concurrent viewers, compared to maybe ~6000 (or ~8000?) on the official stream. Also, the chatroom was sort of overtaken by Vietnamese fans — unfortunately I didn’t have the slightest clue of what they were talking about; I guess someone shared a link to my stream among them? Anyway, despite my effort to move people over to the official stream, especially after my stream started falling apart (which will be discussed in detail later), some people still stayed and I ended up with >100 at the lowest point.

Time to put on technical gear. Duplicating a YouTube stream is trivial with FFmpeg. A basic one-liner is

1	ffmpeg -re -i "$(youtube-dl -f 96 https://www.youtube.com/user/ChinaSNH48/live)" -bsf:a aac_adtstoasc -c copy -f flv rtmp://a.rtmp.youtube.com/live2/xxxx-xxxx-xxxx-xxxx

where 96 is the format code for 1080p HLS, and xxxx-xxxx-xxxx-xxxx is the secret key available from either the Live Dashboard or ingestion settings for an event with its separate stream.

Realistically, though, I needed to keep a local copy, and I was prepared to restart the job as quickly as possible in case the process died or had to be killed for whatever reason, so here’s the script I used in production:

#!/usr/bin/env zsh
# Source
channel=https://www.youtube.com/user/ChinaSNH48/live
format=96  # 1080p HLS stream
# Destination
endpoint=rtmp://a.rtmp.youtube.com/live2
key=xxxx-xxxx-xxxx-xxxx  # Actual key redacted
title='20170729 “我心翱翔”SNH48 Group第四届偶像年度人气总决选演唱会'
# Fetch stream URL
urlfile=/tmp/stream_url  # Persist stream URL in between sessions, in case the script needs to be restarted
poll_interval=10  # Interval for polling the live stream when waiting for the stream to begin
stream=
while [[ -z $stream ]]; do
    if [[ -f $urlfile ]]; then
        stream="$(<$urlfile)"
        [[ -z $stream ]] && rm $urlfile
    else
        stream="$(youtube-dl -g -f $format $channel)"
        [[ -n $stream ]] && echo -n $stream >$urlfile || sleep $poll_interval
    fi
done
# Stream to YouTube while keeping a local copy
index=0
while :; do
    # Do not overwrite any saved segment
    while :; do
        file="$title $index.ts"
        if [[ -f $file ]]; then
            (( index++ ))
        else
            break
        fi
    done
    ffmpeg -re -i $stream -bsf:a aac_adtstoasc -c copy -f flv $endpoint/$key -c copy $file
done

All went well for about five hours.^[1] The latency was below ten seconds. Then the official stream was interrupted for at most a few seconds, and all hell broke loose. The download seems to have picked up just fine, with FFmpeg showing a healthy speed of 1x; I’m not sure if that speed is instantaneous or average — I always thought it’s instantaneous, but when I experienced a similar interruption during the test stream a day earlier (the July 28 warm-up event), the speed gradually climbed from ~0.8x back up to 1x over the course of twenty minutes or so, so the speed may be more complex than just instantaneous, although it doesn’t seem to be a global average as well. I would read the relevant parts of FFmpeg source code if I were really curious. Anyway, whatever that speed means, the RTMP stream simply fell apart. The status on the Live Dashboard begins to cycle through a gray “offline — stream complete”, a red “live — video output low” (technically videoIngestionStarved), and the occasional yellow or green that couldn’t be maintained for more than a couple seconds. Reflected on the served stream, it was buffering all the time. I have no idea why a gap of few seconds could be so disruptive, considering network I/O isn’t remotely at capacity — I would imagine it should be easy to catch up. I would occasionally see a “last message repeated 1 times” message in FFmpeg’s output, but I never saw the actual messages…^[2] Maybe the messages were about dropped segments from M3U8 playlists? If that’s the case, and given I/O shouldn’t be a bottleneck, I would guess it’s -re's fault. -re is an input option for reading input at native frame rate; it’s necessary for streaming an archived stream (the first thing I tested), but I doubt its necessity when streaming another livestream — after all, the input can only be read at the native framerate, give or take a little. Whether dropping -re would be stability issues is something to investigate, but there’s a non-zero chance this could explain why FFmpeg just couldn’t catch up after the interruption.

The stream chugged along in the painful cycle for the better part of an hour. Then out of nowhere (segments were still being uploaded normally; I was monitoring my I/O) it got so bad that after three rounds of stopping and buffering the stream did not advance a single bit. I knew it was time to kill the process and accept defeat for my offline recording which was tied to the same process (a restart would apparently cause a short gap in the on-disk version). Somewhat ironically yet predictably, the stream went back to normal immediately. I should have made the hard decision much earlier.

The drama did not end there. YouTube actually expires their HLS feed URL after six hours. Therefore, after my stream went back to normal and my attention shifted back to the official stream (for the results, apparently), at some point FFmpeg just quitted with 403, and went on to fail in an infinite loop (see my script). The Live Dashboard was on my secondary machine by the side and I was only glancing at it from time to time, so I didn’t notice it until maybe twenty seconds later. I removed the expired /tmp/stream_url and kicked off the script again. In hindsight this seemed inevitable for a 7+ hour livestream,^[3] but at least the gap could have been shortened if (1) I was refreshing the cached stream URL, say, every hour in the background; (2) I stayed vigilant by issuing a notification whenever ffmpeg quits.

That’s pretty much it. To summarize, here are the lessons I learned:

The -re flag should probably be dropped (make sure to test the output stability before the next production run);
Decouple offline recording from streaming for flexibility. My bandwidth is more than enough to handle two simultaneous downloads;^[4]
Be decisive. If the stream can’t keep up, immediately kill and restart the transmission.
Refresh the cached stream URL every hour in the background so that we don’t scramble to fetch a new URL when the six hour expiration mark is hit;
Issue a notification with e.g. terminal-notifier whenever ffmpeg quits.

By the way, as a side effect, the channel’s number of subscribers saw crazy growth on July 29:

Growth in number of subscribers.

I’m currently entertaining the idea of livestreaming Theater performances from live.bilibili.com/48 too, in the future. If I want to do that though, I need to make sure it works completely unattended.

Time and durations are all approximate. I did not keep timestamped logs, and have since removed the botched local copy; in addition, analytics for the livestream isn’t available yet. ↩
I suppose either the messages didn’t end in a newline, or there was an output race condition. ↩
I don’t know a way to feed another input URL, unbeknownst to me in the beginning, to a running ffmpeg process. Maybe there’s a way with ffserver? No idea since I never touched ffserver. ↩
There might be a small cost in terms of latency. ↩