Introducing: Panoramax NODE

Introduction:
A few people at OSM-NL and I are raising awareness and researching the possibilities of setting up a Dutch instance of Panoramax.

Panoramax natively supports instances, so technically it should not prove a problem to set up the Dutch instance.

The goal is to create a durable and scalable basis for panoramax.nl. In creating the roadmap we are hitting a few bumps in the road:

The biggest challenge (at least in the Netherlands) of setting up a good and durable instance is the storage needed for the (potentially) huge amount of images! Physical servers are rare and relatively expensive here, and most hosting providers offer virtual servers, where it is impossible to simply extend the disk space with a bunch of hard disks (HDDs).

There are quite a few (semi-)government and commercial organisations that are interested and/or are considering setting up something for which panoramax.nl would be the answer. Some already have huge amounts of imagery that would (over)load panoramax.nl from day one in terms of bandwidth, CPU power and disk space.

But, as my father said, we don’t have problems, we have challenges! So I’ve been thinking…
And I think the solution is:

Panoramax HUB!
A (big) participant will donate a serious amount of disk space that is connected to the internet and is freely accessible by anyone.
The participant installs the Panoramax-HUB software, which does the following:

  • downloads all of the imagery of the instance (or a selection, with a minimum % of imagery managed by the instance)
  • keeps the data in sync
  • via the sync process, registers & tracks the performance of the HUB, resulting in an uptime score and a performance score: [bandwidth] / ([response time] x 3) (x3 => making bandwidth more important…); a rough sketch follows this list
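
A minimal sketch of that scoring, assuming the instance records the measured bandwidth and response time at each sync; the function names and units are illustrative, not part of Panoramax:

```python
# Hypothetical sketch of the proposed node scoring, assuming the instance
# records bandwidth (MB/s) and response time (s) for each sync.
def performance_score(bandwidth_mbps: float, response_time_s: float) -> float:
    """[bandwidth] / ([response time] x 3), as proposed above."""
    return bandwidth_mbps / (response_time_s * 3)

def uptime_score(successful_checks: int, total_checks: int) -> float:
    """Fraction of availability checks the node answered."""
    return successful_checks / total_checks if total_checks else 0.0
```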

The HUB software on the instance side does the following:

  • tracks which HUB has which image, with the goal of having every image available on at least 3 HUBs
  • based upon the uptime and performance scores, balances all image requests across the various HUBs; a rough sketch follows this list
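
As a rough illustration of that balancing (not actual Panoramax code), the instance could weight each node that holds the requested image by its combined uptime and performance scores:

```python
import random

# Hypothetical sketch: pick a node for an image request, weighted by the
# combined uptime x performance score of the nodes that hold a copy.
def pick_node(nodes_with_image: list[dict]) -> str:
    """nodes_with_image: [{'url': ..., 'uptime': 0..1, 'performance': float}, ...]"""
    weights = [n["uptime"] * n["performance"] for n in nodes_with_image]
    if not any(weights):
        raise RuntimeError("no healthy node holds this image")
    return random.choices(nodes_with_image, weights=weights, k=1)[0]["url"]
```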

Because the instance controls the database, a HUB account is needed. With that account the participant can freely query the database, and all the image results can be served from their own HUB, greatly improving their response times and greatly reducing the load on the instance. The added benefit for the instance is that the images on the HUB are not only a backup, but also a way to reduce the bandwidth load on the main system.

Furthermore, the backup location can be “HUB one”, providing the basis for image access and distribution. In this approach the instance mainly needs CPU & memory power.

And so providing an answer for:

  • the “Dutch disk challenge”
  • (multiple) backups
  • bandwidth distribution
  • broad access for large participants
  • basic donation module for large participants (by donating space and bandwidth)

Inspiration & Perspiration
As a wise man once said: genius is 1% inspiration and 99% perspiration
What you have just read is the 1% part… who’s up for the other 99%?

[UPDATE] please read ‘node’ where ‘HUB’ was written; it’s a better word for it


When we started thinking about decentralization/federation, we discussed distributing storage and adding redundancy in some way, for example having instances also act as backups for each other.

Limited resources are (in decreasing priority, from my current point of view):

  • storage
  • bandwidth
  • CPU

Storage grows with the quantity of pictures shared, which is constantly increasing even at the start.
Bandwidth grows with the quantity of pictures viewed or used.

One possible first step would be to make the Panoramax instance backend able to manage an array of S3 storage nodes (instead of a single local or S3 storage) and distribute storage over them, with or without redundancy (a rough sketch below).
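
As a rough illustration (not actual Panoramax code), the backend could loop over a configured array of S3 endpoints and write each picture to a subset of them; the endpoints, bucket name and replication factor below are assumptions:

```python
import boto3

# Hypothetical sketch of an instance backend writing each picture to several
# S3 storage nodes for redundancy. Endpoints, credentials, bucket name and
# the replication factor are illustrative, not the actual configuration.
STORAGE_NODES = [
    {"endpoint": "https://node1.panoramax.nl", "key": "...", "secret": "..."},
    {"endpoint": "https://node2.panoramax.nl", "key": "...", "secret": "..."},
    {"endpoint": "https://node3.panoramax.nl", "key": "...", "secret": "..."},
]
REPLICATION_FACTOR = 2

def store_picture(picture_id: str, data: bytes) -> list[str]:
    """Upload the picture to REPLICATION_FACTOR nodes and return their endpoints."""
    stored_on = []
    for node in STORAGE_NODES:
        if len(stored_on) >= REPLICATION_FACTOR:
            break
        s3 = boto3.client(
            "s3",
            endpoint_url=node["endpoint"],
            aws_access_key_id=node["key"],
            aws_secret_access_key=node["secret"],
        )
        s3.put_object(Bucket="panoramax-pictures", Key=picture_id, Body=data)
        stored_on.append(node["endpoint"])
    return stored_on
```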

Distributing bandwidth seems more complex because you need to measure it and move your content depending on which pictures are accessed and maybe how frequently.

In all cases, I think some orchestration is needed to balance storage and/or bandwidth, deal with nodes being down, etc…

Where you write ‘node’ I wrote ‘HUB’; I am assuming we both mean the same thing.

What I mean by a HUB is very different from an instance! Distributing data between instances would only make the “data challenge” bigger for each instance! A HUB is not much more than a data storage location for an instance.

S3
That was my initial thought also, but if I remember correctly you pointed out to me that “HDD” was much cheaper. So for an initial start S3 could be a basis, as it is the best runner-up solution.
But as far as I know you (FR) do not use it because it costs quite a bit when it adds up… and the roadmap I’m working on needs to be “future-proof” if we’re going to get those participants I wrote about on board.

Balancing:
The instance-HUB-controller will need to regularly update imagery and check the availability of the participating HUBs. In doing that, a score can be kept of how fast those updates go. When the uptime score becomes too low, the HUB will be cut off from the “access list” (and syncing will slow down and lose the benefits until the uptime score is good enough again).
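
A minimal sketch of such a controller loop, assuming each node exposes a simple health endpoint; the URL path, threshold and interval are assumptions for illustration:

```python
import requests

# Hypothetical sketch of the instance-side controller: probe each node,
# keep a rolling uptime score, and drop nodes below a threshold from the
# access list until they recover.
UPTIME_THRESHOLD = 0.95  # illustrative cut-off
TIMEOUT_S = 5

def check_nodes(nodes: dict[str, dict]) -> list[str]:
    """nodes: {url: {'ok': int, 'total': int}}; returns the current access list."""
    access_list = []
    for url, stats in nodes.items():
        stats["total"] += 1
        try:
            if requests.get(f"{url}/health", timeout=TIMEOUT_S).ok:
                stats["ok"] += 1
        except requests.RequestException:
            pass
        if stats["ok"] / stats["total"] >= UPTIME_THRESHOLD:
            access_list.append(url)
    return access_list
```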

Images/tracks that are accessed relatively often will be part of the percentage of images that the instance may put on a HUB, even when they are outside the configured selection.

I own an ICT company and maintain several servers.
Some of those servers have a few dozen GB of space available, just asking to be used as a Panoramax-HUB.

If the Panoramax-HUB software could be installed relatively easily on such a server, it would become much easier to get a Panoramax instance up and running.

And naturally it is quite necessary that the participating hubs are relatively reliable…

AD: I think “node” is indeed a better word than HUB.
In my case the primary instance would be accessible via https://panoramax.nl
and the primary / first node via https://node1.panoramax.nl
and the second via https://node2.panoramax.nl etc.

For me a hub is something you plug several other things into… (e.g. a USB hub)

Maybe I did not understand the articulation between an instance, its storage nodes, and the instances together, as we have 3 concepts here:

  • decentralization (instances),
  • federation (through the metacatalog)
  • and the new one: distribution (storage/bandwidth), which in my understanding sits at the level of each instance.

Currently each instance takes care of its own storage (local HDD or S3) and backups.
The first step I see is to make a change only at this level, to allow one instance to distribute its storage over several nodes. The distribution logic is controlled by the instance, pictures are stored on one or several storage nodes, and when someone needs to access a picture it is read directly from the storage node using HTTP/S3, exactly as we do it on the IGN instance, but in that case with a single S3 storage service.

The metacatalog harvests this instance catalog like any other. When someone finds a picture using the metacatalog, the URL to access it points to the instance, which sends back a temporary redirect to the final location where the picture can actually be read. Bandwidth balancing could be orchestrated at that level.
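
A minimal sketch of that redirect step, assuming a Flask-style handler; the route, node URLs and database lookup are illustrative, not the actual Panoramax API:

```python
import random
from flask import Flask, abort, redirect

app = Flask(__name__)

def nodes_for_picture(picture_id: str) -> list[dict]:
    """Stub: in reality this would query the instance database (illustrative data)."""
    return [
        {"url": "https://node1.panoramax.nl", "uptime": 0.99, "performance": 40.0},
        {"url": "https://node2.panoramax.nl", "uptime": 0.97, "performance": 25.0},
    ]

@app.route("/api/pictures/<picture_id>/hd.jpg")
def get_picture(picture_id: str):
    nodes = nodes_for_picture(picture_id)
    if not nodes:
        abort(404)
    # Weighted pick, reusing the uptime x performance idea from earlier in the thread
    node = random.choices(nodes, weights=[n["uptime"] * n["performance"] for n in nodes])[0]
    # Temporary redirect: the client downloads the picture directly from the node
    return redirect(f"{node['url']}/pictures/{picture_id}/hd.jpg", code=302)
```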

Regarding S3, I see it as a protocol to use, not as a service you rent (because of the recurring cost). It is only a very common protocol to use storage remotely (mostly HDDs when you’re at the TB scale).

Setting up an S3 service on your own hardware is quite simple and standard, for example by deploying MinIO, or Ceph for a larger multi-server setup.


Exactly!

And when a “node participant” can log in, so that the redirect goes to their own node, the bandwidth usage and response times will be ideal for all parties involved!

PS: a 302 redirect is indeed a nice way to manage BW balancing on that level!

It introduces some latency, which is marginal compared to the picture download time for the full-res ones, given their size.

Maybe the low-res versions and thumbnails could remain on the instance storage, and only the full-res and tiled versions be stored on nodes…

Or simply available on a default node (set in a JS variable, or an array when multiple are available, picking one node at random)

Related: Panoramax being featured on the home page of Hacker News could bring other international instance ideas :slight_smile:


This kind of balancing and data distribution is best done at the storage level.

As Christian said, you can host an S3 server yourself. The nice thing is that it decouples the application server and the storage server(s): you can easily add more servers, more disks, move data around, without impacting the application. And the other way around, you can upgrade/move your application server without migrating the data.

Take a look at Garage, for instance.

AFAIK the people at https://www.fediversity.eu/ are planning to set up a very large Garage S3 cluster for federated community projects; maybe they could host your Panoramax data for free.


Garage is really interesting and could be the way to distribute storage.

Garage has a gateway mechanism which can hide node availability in a cluster.
It is not perfect, as it looks like in such a case all traffic has to go through a single point, so while the storage is distributed, the bandwidth is not.

The architecture could look like this (a rough code sketch follows the list):

  • Panoramax instance:
    • local PG database
    • local temp storage
    • S3 Garage-based storage for full-res pictures and derivates (they can already be split)
      • cluster of garage nodes
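
A rough sketch of what that could look like from the instance’s point of view, assuming boto3 pointed at a Garage S3 gateway; the endpoint and bucket names are assumptions:

```python
import boto3

# Hypothetical sketch of the architecture above: the instance keeps its PG
# database and temp storage locally, and talks to a Garage cluster through
# one of its S3 gateway nodes, with separate buckets for full-res pictures
# and derivates. Endpoint, credentials and bucket names are illustrative.
garage = boto3.client(
    "s3",
    endpoint_url="https://s3.garage.panoramax.nl",  # assumed Garage S3 gateway
    aws_access_key_id="...",
    aws_secret_access_key="...",
)

def archive_picture(picture_id: str, fullres: bytes, derivates: dict[str, bytes]):
    """Push the full-res picture and its derivates to their respective buckets."""
    garage.put_object(Bucket="panoramax-fullres", Key=f"{picture_id}.jpg", Body=fullres)
    for name, data in derivates.items():  # e.g. 'sd.jpg', 'thumb.jpg', tiles
        garage.put_object(Bucket="panoramax-derivates", Key=f"{picture_id}/{name}", Body=data)
```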

I can set up a couple of Garage nodes in different locations to do some tests (ping @Eesger).

Interesting! I read that NLnet is part of Fediversity. I have sent them a mail and am now anxiously awaiting a response.