Data Consumption Delay and Display Issue

Two weeks ago, our systems experienced a significant delay in data consumption, which resulted in a display issue for users. Specifically, when a user was composing a post and considering adding a hashtag, the view count for that hashtag would show as 0 views, even when the hashtag actually had billions of views.

This display issue affected a total of 225,611 unique hashtags, but it had the potential to impact any searched hashtag. The root cause was an inefficiency in the design of our data stream flushing service.

We want to provide clarity on what exactly happened, how it happened, and why it happened. We will explain how our system returns information on hashtags to users, the steps we took to fix the issue in the short term, and our plans to prevent this problem from occurring again.

We became aware of the issue on May 28th, 2020, when users reported that view counts for numerous hashtags, including #georgefloyd and #blacklivesmatter, were showing as 0 views in the Compose screen.

Upon learning about the issue, we investigated two potential causes: a code issue or a data loading/lagging issue. Our investigation found that the code was bug-free and provided eventual consistency; however, there was a lag in the data stream responsible for counting the number of times a video is viewed.

When we refer to a lag in the data stream, we mean that some data was backed up and had to wait in a queue for consumption. Normally, we provision enough computing power to consume the queued data with minimal delay. In this case, however, the queue of unprocessed hashtag data grew to millions of records, creating a significant backlog. As a result, the data could not be pushed online or back to the user in a timely manner. Even when the server eventually returned a value to the user, it would have been several days old.
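
To make the backlog concrete, here is a minimal sketch of how consumer lag on a partitioned queue can be measured and how catch-up time can be estimated. This is an illustration, not our production code; the offsets, rates, and the Kafka-style queue model are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PartitionState:
    latest_offset: int      # newest record written to the partition
    committed_offset: int   # last record the consumer has processed

def consumer_lag(p: PartitionState) -> int:
    """Number of queued, unprocessed records."""
    return p.latest_offset - p.committed_offset

def catch_up_seconds(p: PartitionState,
                     produce_rate: float,
                     consume_rate: float) -> float:
    """Estimated time to drain the backlog at steady rates.

    If the consumer is not strictly faster than the producer,
    the backlog never drains and the lag grows without bound.
    """
    if consume_rate <= produce_rate:
        return float("inf")
    return consumer_lag(p) / (consume_rate - produce_rate)

# Illustrative numbers: millions of queued hashtag records.
state = PartitionState(latest_offset=12_000_000, committed_offset=4_000_000)
print(consumer_lag(state))  # 8,000,000 records behind
print(catch_up_seconds(state, produce_rate=5_000, consume_rate=6_000))  # 8000.0 s
```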

The graph provided shows repeated peaks and valleys in the data consumption lag. These fluctuations may give the impression that the issue was resolving itself periodically, but they are actually a self-protection mechanism of the data stream infrastructure: when the lag becomes too severe, the system abandons all lagged data and starts fresh, producing a regular peak followed by a sudden drop to 0 lag. This behavior prolongs the overall lag time.
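
The reset behavior can be sketched as a simple threshold rule. The threshold value below is hypothetical; the real mechanism lives inside the data stream infrastructure itself.

```python
MAX_LAG = 10_000_000  # hypothetical threshold; the real value is not disclosed

def resume_offset(latest_offset: int, committed_offset: int) -> int:
    """Self-protection: if the backlog is too deep, skip to the head.

    Returns the offset the consumer should resume from. Jumping to
    latest_offset drops every queued record, which is why the lag
    graph shows a sharp peak followed by an instant fall to zero.
    """
    lag = latest_offset - committed_offset
    if lag > MAX_LAG:
        return latest_offset   # abandon all lagged data, start fresh
    return committed_offset    # keep consuming in order
```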

To understand why the hashtag view count data piled up, we need to examine how hashtag view count lookups normally work and how the data is consumed. Under normal circumstances, the system uses a server proxy and a data stream flushing service to process hashtag requests quickly, ideally returning a value immediately. In this case, however, the data flushing service failed, which caused the data lag and led the system to assume there was no value to return (hence the 0 views).
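
The following simplified sketch shows why a lagging flushing service surfaces as "0 views": the proxy's lookup misses because the value was never flushed, and the miss is rendered as zero. The store and function names are illustrative, not our actual services.

```python
from typing import Optional

class FlushedViewStore:
    """In-memory stand-in for the store the flushing service populates."""
    def __init__(self) -> None:
        self._views: dict[str, int] = {}

    def put(self, hashtag_id: str, views: int) -> None:
        self._views[hashtag_id] = views

    def get(self, hashtag_id: str) -> Optional[int]:
        return self._views.get(hashtag_id)

def handle_compose_request(store: FlushedViewStore, hashtag_id: str) -> int:
    """Server-proxy lookup for the Compose screen.

    If the flushing service has not landed a value yet (for example,
    because the stream is lagging), the lookup misses and the proxy
    falls back to 0 -- exactly the display bug users saw.
    """
    views = store.get(hashtag_id)
    return views if views is not None else 0

store = FlushedViewStore()
print(handle_compose_request(store, "blacklivesmatter"))  # 0: never flushed
store.put("blacklivesmatter", 2_000_000_000)
print(handle_compose_request(store, "blacklivesmatter"))  # 2000000000
```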

The failure of the data flushing service was caused by unnecessary flushing of global hashtag attributes. The system flushed every attribute of a hashtag roughly 150 times, once for each country, even though most attributes are identical across all countries. This redundancy caused lags such as the one behind this issue.
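
In pseudocode terms, the old behavior amounted to the loop below: every attribute fans out across every country, so even a hashtag with a handful of attributes generates hundreds of writes. The attribute names and view counts are made up for illustration.

```python
COUNTRIES = [f"country_{i}" for i in range(150)]  # ~150 markets, illustrative

def flush_naive(hashtag: dict, flush) -> int:
    """Old behavior: every attribute is flushed once per country,
    even though most attributes are identical everywhere."""
    writes = 0
    for country in COUNTRIES:
        for attr, value in hashtag.items():
            flush(country, attr, value)
            writes += 1
    return writes

tag = {"id": "12345", "title": "blacklivesmatter", "view_count": 2_000_000_000}
print(flush_naive(tag, lambda c, a, v: None))  # 450 writes for 3 attributes
```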

To fix the issue, our engineering team bypassed the data stream flushing and implemented a thorough search procedure. This involved normalizing the query, searching back through the hashtag library, and pulling the data in plaintext. By having the server proxy communicate directly with the hashtag library, we were able to retrieve the proper hashtag ID and view count. Although this method is slower, it effectively resolved the display issue for hashtags in the Compose screen. However, it did not solve the problem for uppercase or mixed-case versions of hashtags, because we store all plaintext hashtags in lowercase.
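
A simplified sketch of the fallback path, with a made-up in-memory library, shows both why it works and where the casing gap comes from: the library is keyed by lowercase plaintext, but the query was not lowercased.

```python
# Hypothetical in-memory hashtag library, keyed by lowercase plaintext.
HASHTAG_LIBRARY = {
    "blacklivesmatter": {"id": "12345", "view_count": 2_000_000_000},
    "georgefloyd":      {"id": "67890", "view_count": 1_500_000_000},
}

def normalize(query: str) -> str:
    """Strip the leading '#' and surrounding whitespace before lookup."""
    return query.strip().lstrip("#")

def thorough_search(query: str) -> int:
    """Bypass the lagging flushed store and search the hashtag library
    directly in plaintext. Slower, but returns the real view count."""
    record = HASHTAG_LIBRARY.get(normalize(query))
    return record["view_count"] if record else 0

print(thorough_search("#blacklivesmatter"))  # 2000000000
print(thorough_search("#BlackLivesMatter"))  # 0 -- the casing gap
```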

To address the casing issue, we normalized all plaintext, case-sensitive permutations to lowercase before counting them in the hashtag library. We also implemented a final check and accounting step before returning results to the user. This ensures that every time a user enters a hashtag in the Compose screen, they see an accurate view count for that hashtag.
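
Conceptually, the fix folds every case permutation onto one canonical lowercase key, both when counting and when answering a query, as in this sketch (the event data is invented for illustration):

```python
from collections import Counter

def normalize_tag(tag: str) -> str:
    """Fold every case permutation onto one canonical lowercase key."""
    return tag.strip().lstrip("#").lower()

# Raw view events may arrive under any case permutation of a hashtag.
events = ["#BlackLivesMatter", "#blacklivesmatter", "#BLACKLIVESMATTER"]
counts = Counter(normalize_tag(e) for e in events)

def lookup_views(query: str) -> int:
    """Final check: normalize the user's query the same way before
    returning, so every casing of the hashtag sees the same count."""
    return counts[normalize_tag(query)]

print(lookup_views("#BlackLivesMatter"))  # 3 -- all permutations combined
```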

Moving forward, we have examined the design of our data stream flushing, identified its inefficiencies, and are upgrading the design to eliminate this issue. The new system will recognize identical attributes and avoid redundant flushing: attributes that are the same across all countries will be flushed once instead of once per country (roughly 150 times). This will improve processing speed and reduce computing resource usage, preventing a similar lag issue from occurring.
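
Continuing the earlier sketch, the upgraded design might look like the following: global attributes are written once, and only genuinely country-specific attributes fan out. Which attributes count as global is an assumption here.

```python
COUNTRIES = [f"country_{i}" for i in range(150)]  # ~150 markets, illustrative
GLOBAL_ATTRS = {"id", "title"}  # assumed identical across all countries

def flush_deduplicated(hashtag: dict, flush) -> int:
    """Upgraded design: global attributes are flushed once; only
    country-specific attributes fan out across all markets."""
    writes = 0
    for attr, value in hashtag.items():
        if attr in GLOBAL_ATTRS:
            flush("global", attr, value)   # one write instead of ~150
            writes += 1
        else:
            for country in COUNTRIES:
                flush(country, attr, value)
                writes += 1
    return writes

tag = {"id": "12345", "title": "blacklivesmatter", "view_count": 2_000_000_000}
print(flush_deduplicated(tag, lambda c, a, v: None))  # 152 writes vs. 450
```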

Additionally, we will enhance our monitoring capabilities to detect data stream processing delays early on. This will allow us to address any issues promptly as they arise.
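
A monitoring hook for this could be as small as the sketch below: periodically compare the head of the stream with the consumer's position, and page the on-call team when the gap crosses a threshold. The threshold is an illustrative value, not our actual alerting configuration.

```python
LAG_ALERT_THRESHOLD = 100_000  # records; illustrative value

def check_stream_lag(get_latest_offset, get_committed_offset, alert) -> None:
    """Page the on-call team as soon as the backlog crosses the
    threshold, instead of waiting for users to notice 0-view hashtags."""
    lag = get_latest_offset() - get_committed_offset()
    if lag > LAG_ALERT_THRESHOLD:
        alert(f"data stream lag is {lag:,} records")

# Example wiring with stub callbacks:
check_stream_lag(lambda: 12_000_000, lambda: 4_000_000, print)
```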