Clearing Fog and Mapping Clouds

Analyzing the Threat Posed by Mega CDNs in Federated Networks

CloudFlare is a CDN-like service whose perils I have written about before, including when it was made the default DNS over HTTPS resolver in Firefox. It acts as an intermediary for sites and intercepts incoming web traffic. Most importantly, it decrypts that traffic before it reaches the actual destination (so CloudFlare gets access to all the supposedly secure data), and it provides other "services" which conveniently place it in a position to intercept private data. So in the context of a supposedly fairly decentralized network like the Fediverse, how much do we really need to worry about "secure" connections passing through CDNs like CloudFlare on the way to their actual destinations and revealing private data to these sketchy mega-corporations? This post is a preliminary attempt to find out.

Process of identifying the extent of the problem

I do not currently know of an easy way to get a large list of instances in plain text, so here I will use the instances of all the people I follow, which comes to a total of 271 instances. Mastodon allows exporting a CSV file in the format user@instance.tld,true where true is whether to display their boosts (never knew that was an option). Several users share the same instance, and I only want the instance part of each line, so the file has to be trimmed down to a list of instances. Here is where my nerdiness shows (if it somehow didn't hit you already). I used two (stock) emacs commands to trim the file down to a deduplicated list of instance domains, so here they are for reference:

M-x replace-regexp <RET> .*?@\(.*\),true <RET> \1 <RET>
M-x delete-duplicate-lines <RET>
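
For anyone who would rather stay in the shell, roughly the same trimming can be sketched with awk. This is only an illustration and not the method used above: the function name extract_instances and the file name following_accounts.csv are made up for the example, and it assumes the first line is the column header and the account address is the first column. Unlike the regexp above, it also keeps rows whose second column is not true.

```shell
#!/usr/bin/env bash
# Reads a Mastodon follow export on stdin and prints each instance once,
# in order of first appearance (similar to delete-duplicate-lines).
extract_instances() {
  awk -F, 'NR > 1 {              # skip the header row
    sub(/^[^@]*@/, "", $1)       # drop everything up to the first @
    if (!seen[$1]++) print $1    # print an instance only the first time
  }'
}

# Hypothetical usage:
#   extract_instances < following_accounts.csv > instances.txt
```

Dropping the seen check would keep one line per followed account, which is the input needed for a per-user count.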

Then, after deleting the top line, which held the column names, the file was ready. I needed a bash script that could take a bunch of domains as arguments and check whether they used (were used by) CloudFlare. This is basically just checking for empty results from grep, but results from what? After a little digging I found that CloudFlare announces itself in the HTTP response headers as Server: cloudflare, so I crafted a curl command that outputs the headers and simply grepped for "cloudflare" to check each instance. To run it on the list from the file, I just ran ./cfinder.sh $(cat instances.txt) (the $(command) syntax executes that command in place so its output can be used as part of a larger command). At least they nicely announce themselves instead of forcing a check against their known IP ranges to identify them.

Here is the bash script, which takes any number of domains as its arguments and determines whether they use CloudFlare. It checks for CloudFlare's signature in the headers using curl and grep, then outputs a result and even keeps a total count for you at the end.

The heart of the script is mostly just the line below.


curl -sSL -D - -o /dev/null "$i" | grep cloudflare

cfinder.sh script. This script is also very slow, so it is not for use on large datasets, just good enough for quick checks.
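
To give an idea of the shape of such a checker, here is a minimal sketch. It is not the original cfinder.sh: the function names, the 15-second timeout, the forced https:// prefix and the output wording are all assumptions made for illustration. It matches the Server header specifically rather than any occurrence of "cloudflare", and because curl follows redirects it will also flag a domain if any hop along the way answers with that header.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a CloudFlare header checker (not the original cfinder.sh).

# Succeeds if the HTTP headers on stdin contain a "Server: cloudflare" line.
# tr strips the carriage returns that end HTTP header lines.
has_cloudflare_header() {
  tr -d '\r' | grep -qiE '^server:[[:space:]]*cloudflare[[:space:]]*$'
}

# Fetch only the response headers for one domain and test them.
# An unreachable domain yields no headers and therefore counts as negative.
check_domain() {
  curl -sSL --max-time 15 -D - -o /dev/null "https://$1" | has_cloudflare_header
}

positive=0
total=0
for domain in "$@"; do
  total=$((total + 1))
  if check_domain "$domain"; then
    positive=$((positive + 1))
    printf '\033[31m%s: CloudFlare\033[0m\n' "$domain"            # red
  else
    printf '\033[32m%s: no CloudFlare header\033[0m\n' "$domain"  # green
  fi
done

# Print the tally only when at least one domain was given.
if [ "$total" -gt 0 ]; then
  echo "$positive/$total positive"
fi
```

It would be started the same way as before, with the domains as arguments.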

I added some coloring for terminals that support it as well. The output from my first run was actually quite green overall (only 14/271 positive for CloudFlare disease, ~5%)! But maybe it would be more helpful to weight the results by user instead, so I did the same thing except that I skipped the deduplication step on the input in emacs. This time it took much longer to run (if I had the patience I'd code a better solution to this, but eh, whatever). Of the accounts on my follow list, 82 out of 1339 are hosted on CloudFlare (~6%). An increase of a mere one percentage point would be heartening if it weren't my own follow list: I generally tend not to follow people who would be on CloudFlare instances to begin with, and a few times it has made me reconsider following people or accepting their follow requests.
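
Since every account on the same instance gives the same answer, the per-user figure does not strictly need a second slow run: it can be derived from the list of positive instances. A rough sketch of that idea, assuming two hypothetical files (positive.txt with the CloudFlare-positive instances, accounts.txt with one instance per followed account, duplicates kept); the function names are made up for the example.

```shell
#!/usr/bin/env bash
# $1: file of CloudFlare-positive instances, one per line
# $2: file with one instance per followed account, duplicates kept
# Prints how many accounts sit on a positive instance.
# -F fixed strings, -x whole-line matches only, -f patterns from file, -c count.
count_affected_accounts() {
  grep -cFxf "$1" "$2"
}

# Prints part/whole as a percentage with one decimal place.
percent() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

# Hypothetical usage:
#   affected=$(count_affected_accounts positive.txt accounts.txt)
#   percent "$affected" "$(wc -l < accounts.txt)"
```

This way each instance is only contacted once, no matter how many followed accounts live on it.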

How do we interpret this? (skip this if you don't like math)

We can analyze the data to look for patterns in terms of how many nodes out of the total need to be compromised for 50% to 100% of a network's connections to be compromised. This is graph theory, so if you aren't familiar with it then maybe watch a quick Numberphile video.

The Fediverse can be modeled as an undirected complete graph, meaning every node connects to every other node, both for simplicity of the math and because at the moment federating openly is the default behavior. To count the connections (edges) in such a graph we can use what are called triangle numbers, which should be familiar to computer scientists at least, even if not by name. The formula is n(n+1)/2, but mathematicians have their own special notation which is on the Wikipedia page. (Strictly speaking, a complete graph on n nodes has n(n-1)/2 edges; the triangle number n(n+1)/2 also counts one connection from each node to itself, which makes no practical difference at the sizes used here.) You can think of the basic formula as behaving like a factorial, except that instead of the terms getting multiplied, n! = 1*2*3*...*n, the terms are added, n? = 1+2+3+...+n (Knuth suggested the n? notation, but nobody seems to use it).

There are two types of compromised connections we need to consider: the connections between two compromised nodes and the connections between a compromised node and an uncompromised node. Both are just as bad, so we should add them together to get the number of compromised connections out of the total connections. We get the total connections with the above formula n? where n is the number of nodes (I'm using Knuth's notation or else this will look overwhelming). To get the connections between the compromised nodes and the uncompromised ones, we can denote the number of compromised nodes as c and the number of uncompromised nodes as u and simply multiply them together: c*u. Lastly, to get the connections between the compromised nodes we will use c? because the compromised nodes are their own complete subgraph of the larger graph. All together, this comes out as (c?+cu)/n?, but we can also describe it as (c?+cu)/(c+u)? (removing n) and ((c(c+1))/2+cu)/(((c+u)((c+u)+1))/2) if you hate yourself. The output of these functions is between 0 and 1, with 0 being fully secure and 1 being fully compromised.

To give you an idea of how bad this is, the output of this function for my two runs (unweighted and weighted) is about 0.10 and 0.12, roughly double the 5% and 6% from the raw fractions before, because a connection counts as compromised when either end of it is. But now that we have a function, we can plot it! With a function plot we can judge how the problem of compromised connections scales as the number of compromised nodes grows. But what would be a good size for our Fediverse as the total number of nodes for our function (we can vary the compromised and set the uncompromised accordingly)? Fediverse.network at the time of writing has 3,128 Mastodon and Pleroma instances listed, so let's go with that.

Before I move on, though, there is an assumption I have to admit I've hidden in the math. If I write a function like this with a constant number of instances, where the number of compromised nodes determines the number of uncompromised ones, the assumption is that going forward pre-existing instances will either switch to using CloudFlare or be replaced by CloudFlare instances, rather than the Fediverse getting only new compromised instances. Keep in the back of your mind that because of this the function I am making is a worst-case scenario, which I am using to demonstrate the very worst number of compromised connections possible for a given number of compromised nodes. So, to conclude, I'm finally reducing the function to one variable so I can plot it. Behold this absolute monstrosity:

f(c) = ((c(c+1))/2+c(3128-c))/(((c+(3128-c))((c+(3128-c))+1))/2)

Now that's what I call a function.

Unfortunately, since this is a function going from x=0 to x=3128 and y=0 to y=1, the proper viewing window would almost be a horizontal line, and since I don't really know how to scale graph plots in KAlgebra (if somebody can plot this in a tool with more flexibility let me know) I will just give you some points. When 782 nodes are compromised (25% of the total), about 44% of connections are compromised; at 50% of the nodes, about 75% of connections are compromised; and at 90%, about 99%. In other words, this is not linear: the curve is a downward-opening parabola that rises quickly at first, so the share of compromised connections is always larger than the share of compromised nodes, and roughly double it while that share is small. Half of all connections are compromised once about 29% of the nodes are. It may have already been clear to some of you who know how to determine the shape of a function from a formula's structure, but as long as only a small minority of instances use CloudFlare (well under that 29%), only a minority of connections pass through CloudFlare. All we have to do is keep it that way if we are concerned about our security. In reality, new CloudFlare instances will probably not replace pre-existing instances, and each one will be accompanied by several new non-CloudFlare instances. Okay, I'm done nerding for now, I promise.
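
For anyone who wants more points than the few given here, the fraction (c?+cu)/n? is easy to evaluate in a few lines of bash. This is a minimal sketch: the function names triangle and compromised_fraction and the argument order are my own choices for illustration, and it counts connections with the n? convention from above (so one self-connection per node is included).

```shell
#!/usr/bin/env bash
# Triangle number n? = 1 + 2 + ... + n, as used in the text.
triangle() {
  echo $(( $1 * ($1 + 1) / 2 ))
}

# Fraction of compromised connections, (c? + c*u) / n? with u = n - c.
# $1: number of compromised nodes c, $2: total number of nodes n.
compromised_fraction() {
  local c=$1 n=$2
  local u=$(( n - c ))
  local bad=$(( $(triangle "$c") + c * u ))   # compromised connections
  local total
  total=$(triangle "$n")                      # all connections
  # bash only does integer math, so awk handles the final division.
  awk -v bad="$bad" -v total="$total" 'BEGIN { printf "%.3f\n", bad / total }'
}

# Sample points for a network of 3128 nodes (25%, 50% and 90% compromised).
for c in 782 1564 2815; do
  printf '%s compromised nodes -> %s\n' "$c" "$(compromised_fraction "$c" 3128)"
done
```

Feeding the output of a loop over every c from 0 to 3128 into a plotting tool would give the full curve.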

Closing Thoughts

So it seems that instance-to-instance connections, overall, are not that susceptible to CloudFlare for now. But before you consider your private content safe, consider that the Fediverse is not a platform where messages simply go from sender to receiver; they are usually broadcast, as in the case of follower-private posts. With this in mind, the above math and conclusion apply only to private messages, and not to much else. Follower posts and other content relying on TLS between instances, which propagate on the back of following lists, are much more likely to end up with CloudFlare (obviously including public/unlisted posts as well). If you publish a follower-only post and have a follower on a CloudFlare instance like mastodon.cloud, then CloudFlare will get your message's content. Until Fediverse software actually enforces the privacy of posts somehow, or people stop federating with instances using CloudFlare, this is a risk. I would urge developers of Fediverse software to take one of two courses for the time being (preferably both, actually):

  1. Let admins of large instances which unfortunately rely on CDNs configure private posts, and anything else carrying sensitive information, to go through a separate connection, rather than encouraging users to go elsewhere.
  2. Actually get around to securing ActivityPub posts themselves, so we are not stuck hoping that TLS connections secure them for us.

Also, CloudFlare is not the only large CDN being used on the Fediverse. Others have found that while the Fediverse is very logically decentralized, it is not very physically decentralized. I have not looked into the security arrangements of other hosting providers, but my fixation on CloudFlare is primarily because I think it is largely unnecessary, and because from a security/privacy standpoint it is simply a proxy which decrypts HTTPS connections before they reach their proper destination, without users knowing that CloudFlare has access to that data, even if the server itself is self-hosted or on another VPS service. That is not all, and I'd once again refer back to my original and thorough post about CloudFlare, but I think that should be enough to understand my concerns.

As long as CloudFlare has access to the unencrypted connections that our supposedly "private" messages are sent over, anyone sending from or receiving on an affected instance has yet another third party to trust besides the instance administrators at both ends. And unlike with administrators, there is no social trust; it is just another corporation with shady governmental research origins. It has a good PR department, but ultimately brands are just trying to optimize for maximum profit, and you should not take anything they say at their word.