AWS S3 for storing image data

Reading here and there, I get the impression that Panoramax does not use AWS S3?

The cost “per gigabyte” for AWS S3 is a fraction (I expect about 1/10) of the price of HDD space (which in turn is cheaper, about half the price of SSD space?)

I get the impression that (most of?) the code involved is written in Python. I (mostly) work in PHP, so contributing parts of my code won’t be helpful, I’m afraid…

The “downside” of AWS S3 storage is that you can only store data, no executable scripts and such. But we’re talking here about huge amounts of images… so I think it would be a huge advantage for Panoramax if it could work with AWS S3 storage?

AD: For “pro data management” I have “hacked” into Nextcloud (https://nextcloud.com/). I use their platform for the data management and built ‘hooks’ to access the data. This gives me the advantages of their platform (like a cloud network disk and such), which I expand upon for GeoArchive…

AWS? No.
S3? Yes, partly: the IGN instance stores pictures on the OVH S3 service.

Panoramax backend code supports local storage or S3.
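
As an illustration of that local/S3 abstraction: a common pattern in Python is to hide the backend behind a filesystem URL, for example with the PyFilesystem2 library. This is only a hedged sketch of the general idea; the bucket name, paths and helper function are made up, and this is not necessarily how the Panoramax backend is actually wired.

```python
# Illustrative sketch only: supporting either local disk or S3 behind the same
# interface, using PyFilesystem2 (pip install fs fs-s3fs).
# Bucket name and paths are invented for the example.
import fs


def open_picture_store(storage_url: str):
    """Open a storage backend from a URL, e.g.
    - local disk: "osfs:///data/panoramax/pictures"
    - S3 bucket:  "s3://my-panoramax-bucket" (credentials via env vars)
    """
    # create=True lets a local directory be created if it does not exist yet
    return fs.open_fs(storage_url, create=True)


store = open_picture_store("osfs:///tmp/panoramax-pictures")
store.makedirs("collection-123", recreate=True)
with store.open("collection-123/pic-0001.jpg", "wb") as f:
    f.write(b"...jpeg bytes...")
```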

I’m absolutely not sure that AWS S3 costs 1/10th of the price of HDD space…

The last HDD we installed in the OSM-FR Panoramax server cost us less than 8€/TB. If you add redundancy (ZFS) and backup (ZFS too), we are around 20€/TB… and this is for the disks' lifetime (several years). Just add some monthly electricity on top of that.

The AWS S3 calculator tells me that hosting 1TB costs $23/month… without additional traffic, and it is not very clear about backups.
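
To make that comparison concrete, here is a rough back-of-envelope calculation in Python using the figures quoted above (20€/TB one-time for redundant local HDD storage vs. roughly $23/TB/month on S3). Electricity, traffic/egress fees and currency conversion are ignored, and the disk lifetime is an assumed value, so treat it only as an order-of-magnitude sketch:

```python
# Rough back-of-envelope using the figures quoted in this thread.
# Electricity, traffic/egress fees and currency conversion are ignored;
# the disk lifetime is an assumed value.
local_hdd_eur_per_tb = 20        # one-time, with ZFS redundancy + backup
s3_usd_per_tb_month = 23         # AWS S3 calculator estimate
disk_lifetime_years = 5          # assumption

s3_usd_per_tb_lifetime = s3_usd_per_tb_month * 12 * disk_lifetime_years
print(f"Local HDD: ~{local_hdd_eur_per_tb} EUR/TB over {disk_lifetime_years} years")
print(f"AWS S3:    ~{s3_usd_per_tb_lifetime} USD/TB over the same period")
# Under these assumptions S3 comes out roughly an order of magnitude more
# expensive per stored TB, not cheaper.
```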

About the OSM-FR instance, I like to say “There is no cloud, only our own computers” :wink:

S3? Yes: great!

“AWS” (Amazon) are the developers of the standard; I also use OVH :slight_smile: (I want everything hosted within the EU, so no Amazon :wink: )

Ah, that is an approach I hadn’t thought of. In my experience virtualization has more or less become the standard way of doing things… But in this case I do understand the benefits of your approach…

I have a 22TB backup NAS locally… I have thought about using that, but the quality of my office internet connection is not as good as where I host my servers…

If I may ask, how many TB of data does the Panoramax/OSM-FR instance use?

  • 34TB for the pictures
  • 11TB for the derivates (thumbnails, tiles, etc.)

On SSD we also have 200GB for the database.

This is for a total of 15.5M pictures, so roughly 3TB per million pictures (most of them not 360).
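
As a quick sanity check, that ratio is just the arithmetic on the figures above (a trivial sketch, not Panoramax code):

```python
# Quick sanity check on the "roughly 3TB per million pictures" figure.
pictures_tb = 34      # originals
derivates_tb = 11     # thumbnails, tiles, etc.
million_pictures = 15.5

print(f"{(pictures_tb + derivates_tb) / million_pictures:.1f} TB per million pictures")
# -> 2.9, i.e. roughly 3TB per million pictures
```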

I checked OVH S3 pricing: 1TB is at 7 or 18€/month (+ VAT) depending on the I/O speed level you select.
For the IGN instance, we had to switch to the high speed level; the basic one was not great at all.

Thank you for this info.

“Originals”
Reading this, 3/4 of the data is needed for the base data. That information is not mandatory for ‘online usage’.

Could it be possible to link that data to a ‘local NAS structure’? My local NAS solution (Synology) works with 4 drives of 7.3TB each. That gives me 21TB, with 25% set aside so that one drive can die. I think that cost me something between €25 and €30 per TB. The advantage of having this locally is that the data does not need to be “on site”. It doesn’t even need to be available 24/7. This could potentially reduce running costs immensely. If the data is ever needed, it could simply be mapped as a cloud drive to the bare-metal server. Speeds wouldn’t be ‘top of the bill’, but is that necessary?

I don’t have a complete view of the running costs yet, and I don’t know how high the need is for data certainty (in case my house goes up in flames and the NAS doesn’t survive… what if disaster strikes the hosting location? Do you have a “disaster backup”?)
Synology also has the ability to back up to another NAS. That could make it possible to use 100% of the capacity of the NAS and also back up the data to another NAS at another location?

How far do we want/need to go?

I think we’re not in sync somewhere…

To provide online access to the pictures, you need:

  • access to the pictures themselves, and their derivates
  • access to the API, which depends on the PostgreSQL database

For the OSM-FR instance, this means access to 45TB of HDD storage (pictures + derivates) and the running API with the DB behind it (200GB on SSD).

Yes, in my basement, pictures + API container (including the PG database), all done using ZFS sync.

I think we’re not in sync somewhere…

Oh dear…
Why? For the data stored in EXIF/IPTC? I expected that info to be available in SQL? Or does the zooming go to the original at some point?

When the thumbnail, the SD version, and the tiles for the three zoom levels have been generated (if I’ve got that correctly), we don’t need the original anymore, do we?

On that other part, ‘derivatives’:
Derivates
If I have seen it correctly, it uses JPEG. Why not WebP? I have done extensive tests, and even in the most pessimistic calculation WebP at comparable compression produces files at least 20% smaller!

I have been part of a similar discussion at Nextcloud. They also use JPEG as the “preview format”, which in my opinion is outdated… all browsers that are not archaic can handle WebP, and to my knowledge so does PSV. That could shave some 20 to 30% off that 11TB?

We keep the original (blurred) pictures so as not to lose details that may be useful beyond simply “viewing” the pictures. We keep a copy of the EXIF/XMP metadata in the PG database.

Even if Panoramax is a “street view”-like project, it is not “street” limited, nor “view” limited (in a browser) :wink:

So for each picture we have:

  • the original (blurred) one
  • a 2048px wide version
  • a thumbnail
  • and a tiled version for 360 ones

The last three are what we call “derivates”, and they all use JPEG.

WebP has been considered; it may be a good option for the derivates to reduce storage needs and bandwidth, and we could also provide JPEG compatibility with on-the-fly recompression.
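
To give an idea of what such on-the-fly recompression could look like, here is a minimal sketch in Python using Pillow. It is only an illustration of the idea: the function name, quality setting and content negotiation are assumptions, not Panoramax code.

```python
# Illustrative sketch: serve a stored JPEG derivate as WebP when the client
# advertises support, otherwise return the JPEG untouched.
# The quality value is an arbitrary example, not a Panoramax setting.
from io import BytesIO
from PIL import Image


def serve_derivate(jpeg_bytes: bytes, accept_header: str) -> tuple[bytes, str]:
    """Return (image bytes, content type) for the given HTTP Accept header."""
    if "image/webp" not in accept_header:
        return jpeg_bytes, "image/jpeg"          # client cannot handle WebP
    out = BytesIO()
    Image.open(BytesIO(jpeg_bytes)).save(out, format="WEBP", quality=80)
    return out.getvalue(), "image/webp"          # recompressed on the fly
```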

Another storage optimization is possible for 360 pictures, where we could generate the tiles by cutting JPEG blocks (no recompression) and rebuild a full-size “original” from them with no loss in the process (I did some tests for that).
This could save around 50% of the storage needed for 360 pictures.
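
For readers wondering how tiles can be cut out of a JPEG without recompression: lossless cropping is possible as long as the cuts fall on MCU block boundaries (typically multiples of 16 pixels), for example with jpegtran from libjpeg/libjpeg-turbo. The sketch below only illustrates that general technique; the paths and tile size are made up, and this is not how Panoramax currently generates its tiles.

```python
# Minimal sketch of lossless JPEG tiling via jpegtran (libjpeg/libjpeg-turbo).
# Crops are lossless as long as offsets fall on MCU boundaries, typically
# multiples of 16 pixels. Paths and tile size are illustrative only.
import subprocess


def cut_tile(src: str, dst: str, x: int, y: int, size: int = 512) -> None:
    """Losslessly extract a size x size tile at (x, y) from src into dst."""
    subprocess.run(
        ["jpegtran", "-crop", f"{size}x{size}+{x}+{y}", "-outfile", dst, src],
        check=True,
    )


# Example: the first two tiles of the top row of an equirectangular 360 picture
cut_tile("equirectangular.jpg", "tile_0_0.jpg", 0, 0)
cut_tile("equirectangular.jpg", "tile_1_0.jpg", 512, 0)
```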

We must certainly keep the originals, no discussion there, but once all the derivatives have been generated, Panoramax only keeps the original (blurred) images as a sort of backup, does it not? Or are there reasons within the project where one would need the original image by design?

I recently visited an archival institution in my region and they use JPEG 2000 with its capability of multiple embedded resolutions. I was surprised they adopted this standard so enthusiastically. As for file size… it “simply” combines multiple images into one container… so personally I am not extremely impressed with JPEG 2000. The only interesting part, in my opinion, is the newer compression technique. But in my experience WebP has the best results.

“Cutting up” the original for the tiles and being able to put them back together without loss would be a very original way to ‘multi-use’ data without any loss of functionality!

If we don’t keep the original, then for non-360 pictures the best resolution we would have would be 2048 pixels wide.

The goal is to keep the best quality/resolution; that is also why we avoid recompression, even during the blurring process.

This may be needed for some forms of reuse that are not just humans looking at pictures on a screen.

People who contributed to Mapillary were not very happy that they could not retrieve their original quality and only got recompressed, usually lower-resolution, versions of their uploads.

JPEG 2000 has its fans in archiving departments. The standard is very complex, with lots of options, which makes it easy to generate but quite hard to consume. Most open-source JPEG 2000 implementations are not efficient and need a lot of CPU for a minimal gain in storage in many cases.
Browsers generally do not support JPEG 2000.

We stay with JPEG because this is what most cameras and smartphones produce.