Hard Light Productions Forums

Off-Topic Discussion => Programming => Topic started by: Goober5000 on February 28, 2010, 11:36:13 pm

Title: aaaaaaaaaaaargh (7zip rant)
Post by: Goober5000 on February 28, 2010, 11:36:13 pm
As those of you who have SCP SVN access may know, I'm currently working on a rewrite of the FSO Installer.  One of the high priority feature requests is 7zip support.

Unfortunately, the 7zip SDK is atrocious.  No in-line source comments, very little documentation, and all you get is a Decoder and an Encoder class.  That only works if you have a simple stream of bytes you want to compress using LZMA, and it's useless if you have a 7zip container with a bunch of files.

That's not the subject of the rant though, because some kind and benevolent soul figured out how to use the JNI (Java Native Interface) to wrap the actual 7zip application libraries using Java classes.  It's here at 7-Zip-JBinding (http://sevenzipjbind.sourceforge.net/), and it has bindings available for all the major platforms - Windows, OSX, and Linux.  Which is A-1 SUPAR.

The problem I have is that 7-zip puts the table of contents AT THE END OF THE FILE.  Good old vanilla Zip puts it at the beginning.  This means that you have to seek to the end of the file before you know what the file contains.

Turey was clever enough to figure out that Java allows you to unzip regular Zip files on-the-fly, which is to say that you're downloading and extracting in the same operation.  And you're only downloading the out-of-date files.  This is extremely handy because downloading a stream of bytes from the internet is a strictly sequential operation.
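For anyone who hasn't seen the on-the-fly approach, here is a minimal sketch of it using the JDK's own java.util.zip.ZipInputStream.  This is not the actual installer code; the class name StreamingUnzip and the idea of passing in a set of wanted file names are assumptions made purely for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: pulls selected entries out of a Zip stream in one forward pass. */
final class StreamingUnzip {

    /**
     * Reads the Zip data from source strictly front to back and returns the
     * contents of the entries whose names appear in wanted.
     * Unwanted entries are read past but never stored.
     */
    static Map<String, byte[]> extract(InputStream source, Set<String> wanted) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(source)) {
            ZipEntry entry;
            // getNextEntry() parses the next local file header; no seeking is involved.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !wanted.contains(entry.getName())) {
                    continue; // the next getNextEntry() call skips the rest of this entry
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                result.put(entry.getName(), out.toByteArray());
            }
        }
        return result;
    }
}
```

In real use the source would be the stream returned by a URL connection rather than something held in memory, and the extracted bytes would go to disk instead of a map.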

Guess what... 7zip was apparently designed on the assumption that random-access would be the norm.  So not only does 7zip require you to seek to the end of the file, it seeks back and forth a lot once it's there.  Unfortunately, in a forward-only stream, seeking backwards means that you have to start reading all over again and seek forward to that location from the beginning.
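To make the cost of a backward seek concrete, here is a rough sketch of random access faked on top of a forward-only stream.  The class ForwardOnlySeeker and its opener callback are hypothetical names, not part of 7-Zip-JBinding or the installer; the point is only that a backward seek reopens the source and re-reads everything up to the target.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Supplier;

/** Hypothetical wrapper that fakes random access on top of a forward-only stream. */
final class ForwardOnlySeeker {
    private final Supplier<InputStream> opener; // reopens the source at byte 0
    private InputStream in;
    private long position = 0;
    private long bytesConsumed = 0; // total bytes pulled from the source so far

    ForwardOnlySeeker(Supplier<InputStream> opener) {
        this.opener = opener;
        this.in = opener.get();
    }

    /** Moves to an absolute offset; a backward seek restarts from the beginning. */
    void seek(long target) throws IOException {
        if (target < position) {
            in.close();
            in = opener.get();
            position = 0;
        }
        // Byte-at-a-time reading keeps the sketch simple; real code would use a buffer.
        while (position < target) {
            if (in.read() == -1) {
                throw new IOException("seek past end of stream");
            }
            position++;
            bytesConsumed++;
        }
    }

    /** Reads one byte at the current position, or returns -1 at end of stream. */
    int read() throws IOException {
        int b = in.read();
        if (b != -1) {
            position++;
            bytesConsumed++;
        }
        return b;
    }

    long bytesConsumed() {
        return bytesConsumed;
    }
}
```

With a 100-byte source, seeking to offset 90, reading a byte, then seeking back to offset 10 and reading a byte consumes 102 bytes in total, which is more than the whole file.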

Using predictive buffering and some smart seeking strategies, I was able to smooth everything out so that only one seek, from beginning to end, is required to get all the necessary information.  Unfortunately, that one seek is necessary due to the inherent design of 7zip; it's as if you have to download the file before you can download the file.  On the 7.4 MB file I used for testing, that takes about 12 seconds.


So that's the current story.  There are several paths I can take from here, and I'd like to know what people would prefer:

1) Drop 7zip support, and enforce Zip only.  (I'm not seriously considering this option.  First, I've already done a lot of work on it; second, 7zip allows the extraction of not only .7z files but also .rar, .bzip, .gzip, and a bunch of others.)

2) Leave the behavior as it is, and tell people they'll just have to live with the delay that occurs prior to downloading.

3) Change the behavior so that the installer will always download 7zip files before checking whether they need to be extracted.  I'm not sure that this will save any time because in both cases the installer has to first move across the internet connection from start to finish, and then (optionally) extract something from it.  The only significant difference is that in the current case the extraction happens over an internet connection, whereas in this case it would occur on the local hard drive, saving bandwidth.  However, the user would have to live with a potentially large, potentially unnecessary file being downloaded to, and immediately deleted from, a temporary folder.
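Option 3 could look roughly like the sketch below, which copies the remote stream to a temporary file, lets a callback work on the local copy, and deletes the file afterwards.  TempDownload, FileAction, and the "installer-" prefix are made-up names for illustration; only the java.nio.file calls are real JDK API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for the download-first approach. */
final class TempDownload {

    /** Callback that works on the downloaded file while it still exists. */
    interface FileAction<T> {
        T apply(Path file) throws IOException;
    }

    /**
     * Copies the whole remote stream to a temporary file, runs the action on
     * that file (where random access is cheap), then always deletes the file.
     */
    static <T> T withTempCopy(InputStream remote, FileAction<T> action) throws IOException {
        Path temp = Files.createTempFile("installer-", ".7z");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(remote, temp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(temp);
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

The action is where the archive library would open the local file and decide whether anything needs extracting.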
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Bobboau on March 01, 2010, 02:13:31 am
I would say 3 or 2.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: portej05 on March 01, 2010, 02:17:32 am
This is seriously a case of design THEN code (and coding should take < 50% of the total time).
It's also a case of not understanding what the format is supposed to be doing.

I suggest 1. The other two are hacks. Just because you've spent a lot of time coding something doesn't mean it shouldn't be dropped. It just means you didn't think ahead, and worked with features rather than design.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: headdie on March 01, 2010, 02:57:41 am
Does this re-reading affect how much data is being sent down the connection?
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Iss Mneur on March 01, 2010, 10:33:44 am
Personally I am not entirely sure why this is an issue.  That is, why do you need to read the TOC of the archive before you download the rest of the archive?

The only answer that I can see, and that Goober5000 sort of hints at, is that the Installer is using the zip's TOC as a sort of manifest to see what files have changed.

This strikes me as a clever hack that could be very problematic, as it relies on the stored file modified date and/or file size changing whenever the file contents change.  Unfortunately this is not always true.  I have personally run into zip archives that do not have any file times set (though I can't say that I have noticed this with any FS related files, if only because I have not looked).

I have to agree with Zacam's suggestion on #scp when Goober5000 brought this thread up, about external manifests.  I think the external manifests would be a better solution to the various inadequacies of the different archive formats.  It would also allow for other improvements to the mod system that have been proposed over the last few months.

That being said, you could do number 1 and just not support .7z itself if you want to have other file formats.  Though as we talked about on #scp, the tar based formats would have a similar issue: because the TOC information is interspersed with the files, they would require downloading the entire file as well.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Mika on March 01, 2010, 02:20:53 pm
I think this is a case of #1. Building stuff on top of that is going to be shaky at best and might become impossible later. After seeing this, I no longer wonder why 7ZIP never became popular, as RAR and ZIP still seem to dominate.

The old software wisdom goes that sometimes it is simply better to abandon the old stuff despite the work that has been done on it, start from scratch, and rebuild it for a better standard, as in the end it usually turns out you have still saved time and lots of headaches. Of course, this being a rather small thing, it might not turn out to be so, but you can also think that the work you did will help you accomplish something related in real life work.

I think you are lucky as this is more like a hobby project. I have managed to waste thousands of euros trying to construct something on top of a piece of software that ultimately turned out to be incapable of doing what we supposed it could do. And that really was at my work. Not a very happy feeling once I realized it, but we overcame it.

Good luck
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Goober5000 on March 01, 2010, 02:36:54 pm
Quote
This is seriously a case of design THEN code (and coding should take < 50% of the total time).
It's also a case of not understanding what the format is supposed to be doing.

I suggest 1. The other two are hacks. Just because you've spent a lot of time coding something doesn't mean it shouldn't be dropped. It just means you didn't think ahead, and worked with features rather than design.
You've managed to completely miss the entire problem.

This is not a case of design-then-code, this is a case of taking two incompatible designs and trying to find the best way to make them work together.  The Java stream API is well documented, extensively tested, and heavily used in all manner of applications.  The canonical FSO Installer code accesses and downloads the contents of Zip files using a standard algorithm, one which is also heavily used, and one for which the Zip file format is well suited.  It's so well suited, in fact, that Zip support is included in the core JDK.  The problem at hand is how to take the idiosyncratic 7Zip API and make it work with the established standard.

Now I incorrectly stated that the Zip format puts the table of contents at the beginning; in fact, it puts it at the end, just like 7Zip.  So if it's possible to access the contents of a Zip file sequentially, using the local file headers instead of the TOC, then the same should be true of 7Zip.  Unfortunately, the 7Zip API doesn't provide a method for doing this.  If the API could be enhanced or modified, then this would provide a fourth solution to the problem.
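For reference, the local file header that precedes each entry in a Zip file is what makes sequential reading possible.  Below is a minimal sketch that decodes the fixed part of one such header, with field offsets taken from the published Zip format (signature at 0, flags at 6, sizes at 18 and 22, name length at 26, name at 30).  The class LocalFileHeader is hypothetical, Zip64 is ignored, and the name is assumed to be UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical reader for the fixed part of a Zip local file header (no Zip64). */
final class LocalFileHeader {
    static final int SIGNATURE = 0x04034b50; // the bytes "PK\3\4", little-endian

    final String name;
    final long compressedSize;
    final long uncompressedSize;
    final boolean sizesDeferred; // flag bit 3: sizes are stored after the data instead

    private LocalFileHeader(String name, long compressedSize, long uncompressedSize,
                            boolean sizesDeferred) {
        this.name = name;
        this.compressedSize = compressedSize;
        this.uncompressedSize = uncompressedSize;
        this.sizesDeferred = sizesDeferred;
    }

    /** Parses the header that starts at the given offset in data. */
    static LocalFileHeader parse(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(offset) != SIGNATURE) {
            throw new IllegalArgumentException("not a local file header");
        }
        int flags = buf.getShort(offset + 6) & 0xFFFF;
        long compressed = buf.getInt(offset + 18) & 0xFFFFFFFFL;
        long uncompressed = buf.getInt(offset + 22) & 0xFFFFFFFFL;
        int nameLength = buf.getShort(offset + 26) & 0xFFFF;
        // Assumption: the name is UTF-8 (plain ASCII names decode the same way).
        String name = new String(data, offset + 30, nameLength, StandardCharsets.UTF_8);
        return new LocalFileHeader(name, compressed, uncompressed, (flags & 0x08) != 0);
    }
}
```

One caveat: when flag bit 3 is set, the sizes in the local header are zero and the real values only appear in a data descriptor after the entry's data, so a size check against the local header is not always possible up front.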

And you ought to know that I've happily dropped projects that I've worked on for much longer than this.  It should be clear from my post that the reason I want to keep 7Zip is because of the support it offers for different file formats, not because of the time I spent working on it.


Quote
Does this re-reading affect how much data is being sent down the connection?
It depends on the server.  But in general, seeking to the end of the file counts as one read, and extracting the file counts as another read.  So you're traversing the file twice.


Quote
Personally I am not entirely sure why this is an issue.  That is, why do you need to read the TOC of the archive before you download the rest of the archive?
You don't, as I learned today.  But 7Zip reads it anyway, as soon as you open the archive, and doesn't provide a method to skip this step.

Quote
The only answer that I can see, and that Goober5000 sort of hints at, is that the Installer is using the zip's TOC as a sort of manifest to see what files have changed.

This strikes me as a clever hack that could be very problematic, as it relies on the stored file modified date and/or file size changing whenever the file contents change.  Unfortunately this is not always true.  I have personally run into zip archives that do not have any file times set (though I can't say that I have noticed this with any FS related files, if only because I have not looked).
Actually, the installer checks the file size, not the modification date.  I was skeptical about this strategy too, but it has worked for several years without problems.


Quote
I have to agree with Zacam's suggestion on #scp when Goober5000 brought this thread up, about external manifests.  I think the external manifests would be a better solution to the various inadequacies of the different archive formats.  It would also allow for other improvements to the mod system that have been proposed over the last few months.

That being said, you could do number 1 and just not support .7z itself if you want to have other file formats.  Though as we talked about on #scp, the tar based formats would have a similar issue: because the TOC information is interspersed with the files, they would require downloading the entire file as well.
External manifests are a good idea, but they cause problems with maintenance and backward compatibility.  They also might be redundant, especially if this API problem can be solved.

And judging by the sense of the forum, 7Zip support is pretty much a requirement in any Installer upgrade.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: portej05 on March 02, 2010, 01:10:10 am
I still don't get why 7Zip is a requirement - just because popular consensus says 'we want it' doesn't mean they can have it.
It's an inappropriate technology for the task. That is why #1 would be the only acceptable solution for high-quality software.

Additionally, you know as well as I do that file sizes are absolutely inappropriate for this use.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Goober5000 on March 02, 2010, 02:00:13 am
Quote
I still don't get why 7Zip is a requirement - just because popular consensus says 'we want it' doesn't mean they can have it.
While this is true in principle (take GeoMod for example), this is a far cry from such an extreme case.  First of all, 7Zip support is, if not the very top, one of the top three requested features for the installer upgrade.  Second of all, 7Zip use is widespread on the forum and regularly used for thread downloads; it would be highly inconvenient for a project to maintain both a preferred 7Zip download and a Zip download for the installer.  Third of all, 7Zip support doesn't merely enable extraction of .7z files, it also enables extraction of .rar and a ton of additional formats, all in one single library.  (A couple of years ago, RAR was as popular on the forum as 7Zip is now.)  Fourth of all, and most importantly, 7zip extraction actually works now; it's just somewhat inconvenient.

Quote
It's an inappropriate technology for the task. That is why #1 would be the only acceptable solution for high-quality software.
This is, I think, another case where you're focusing entirely on the academic and theoretical arguments, and ignoring or dismissing the practical requirements.  It's basically a recipe for shooting yourself in the foot.  Don't worry, experience will help with that. :)

Quote
Additionally, you know as well as I do that file sizes are absolutely inappropriate for this use.
As I said above, it has worked for many years.  There have been absolutely no complaints about the installer spuriously redownloading files based on an incorrect file size comparison.  So while it's something to keep in mind, it's an extremely low priority.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Fury on March 02, 2010, 02:04:35 am
7Zip is far superior in compression to zip or other older algorithms. Where zip compressed BP would be 451 MB, 7z is only 357 MB. I, for one, will never make zip downloads for Blue Planet, simply because of the large difference in download sizes. 7z is more convenient for both uploader and downloader.

I really don't care at all about 7z's inability to extract on the fly. The time saved by downloading significantly smaller files outweighs the cost of downloading it all first and only then extracting. HDD space is much less of an issue than internet speed.

Zip sucks as a compression algorithm for binary files.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Aardwolf on March 02, 2010, 02:34:49 am
Quote
I still don't get why 7Zip is a requirement - just because popular consensus says 'we want it' doesn't mean they can have it.
While this is true in principle (take GeoMod for example), this is a far cry from such an extreme case.  First of all, 7Zip support is, if not the very top, one of the top three requested features for the installer upgrade.  Second of all, 7Zip use is widespread on the forum and regularly used for thread downloads; it would be highly inconvenient for a project to maintain both a preferred 7Zip download and a Zip download for the installer.  Third of all, 7Zip support doesn't merely enable extraction of .7z files, it also enables extraction of .rar and a ton of additional formats, all in one single library.  (A couple of years ago, RAR was as popular on the forum as 7Zip is now.)  Fourth of all, and most importantly, 7zip extraction actually works now; it's just somewhat inconvenient.

I agree... the launcher should feature GeoMod.  :drevil:
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: chief1983 on March 02, 2010, 04:00:56 am
Since when did this SDK support other compression formats besides LZMA?

And I would much rather see the launcher not rely on file sizes.  What about checking the files for corruption?  A flipped bit won't change the file size but it'll corrupt the entire thing.  Hashing is a requested change right up there with getting rid of file size only verification.
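A quick sketch of the flipped-bit point, using the JDK's MessageDigest with SHA-256.  The class and method names are made up for illustration, and the choice of SHA-256 is an assumption; nothing in this thread says which hash the installer would use.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical comparison helpers: size check versus hash check. */
final class IntegrityCheck {

    /** The size-only test: true whenever the lengths match, whatever the contents. */
    static boolean sameSize(byte[] a, byte[] b) {
        return a.length == b.length;
    }

    /** The hash test: true only if the SHA-256 digests of both arrays match. */
    static boolean sameDigest(byte[] a, byte[] b) {
        return Arrays.equals(sha256(a), sha256(b));
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Flip a single bit in a copy of a file and sameSize still reports a match, while sameDigest does not.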
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Goober5000 on March 02, 2010, 12:39:39 pm
This new feature doesn't use the SDK (which is next to useless); it uses a Java binding of the entire 7Zip extraction mechanism.  So it can extract anything 7Zip can.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: chief1983 on March 02, 2010, 01:04:01 pm
So you're including a chunk of 7-zip itself in the installer then?  Ok.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: portej05 on March 03, 2010, 05:15:52 am
Don't forget that 7zip is LGPL/unRAR, so not only will you need to make sure you use the dynamic library, you will also have to distribute it.

You're not thinking about the whole ecosystem architecture. I'd also take this opportunity to point out that if you're going the direction I think you're going with this, you're introducing a rather large security vulnerability into FSO (and we _KNOW_ that code execution vulnerabilities have been found in FSO).
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Goober5000 on March 03, 2010, 01:38:58 pm
You continue to make unsupported assumptions.  I'm well aware of the licensing issues; the FSO Installer will be (and is already, actually) licensed under the GPL, and its source code will be (and is already) freely downloadable.

The FSO Installer is not the same application as FSO itself.  It's not even written in the same language as FSO.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: chief1983 on March 03, 2010, 02:27:15 pm
Instead of using the 7-zip header why not store the info we need about the files elsewhere?  Using some hashing and a table or something.
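Something like the sketch below is what I have in mind.  The manifest format (one "hash filename" pair per line) and the class ManifestDiff are assumptions for the sake of the example, not an existing installer format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical external manifest: one "hash filename" pair per line. */
final class ManifestDiff {

    /** Parses manifest text into a map from file name to hash, keeping line order. */
    static Map<String, String> parse(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int space = line.indexOf(' ');
            if (space < 0) {
                throw new IllegalArgumentException("bad manifest line: " + line);
            }
            entries.put(line.substring(space + 1).trim(), line.substring(0, space));
        }
        return entries;
    }

    /** Returns the remote file names whose hash is missing or different locally. */
    static List<String> outOfDate(Map<String, String> remote, Map<String, String> local) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, String> entry : remote.entrySet()) {
            if (!entry.getValue().equals(local.get(entry.getKey()))) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

The manifest is tiny compared to the archives, so it can be fetched first and the archive only touched when something is actually stale.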
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: jr2 on March 03, 2010, 04:22:33 pm
Quote
Instead of using the 7-zip header why not store the info we need about the files elsewhere?  Using some hashing and a table or something.

And is it possible to download just the last part of a file first and then read it?  I know download managers can resume files, so why not get the file size, download the last xx bytes (I don't really know if this is feasible, so please forgive me if it's out of the question), and read the TOC?
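Download managers resume files with the HTTP Range header, so the same trick might work here.  A minimal sketch, assuming the server honours range requests (it answers 206 Partial Content when it does) and that the file size is already known, for example from a HEAD request.  TailFetcher and its methods are hypothetical names; the HttpURLConnection calls are standard JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper that downloads only the end of a remote file. */
final class TailFetcher {

    /** Builds a Range header value covering the last tailLength bytes of the file. */
    static String tailRange(long fileSize, long tailLength) {
        if (fileSize <= 0 || tailLength <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long start = Math.max(0, fileSize - tailLength);
        return "bytes=" + start + "-" + (fileSize - 1);
    }

    /** Fetches the tail of the file, failing if the server ignores the Range header. */
    static byte[] fetchTail(URL url, long fileSize, long tailLength) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Range", tailRange(fileSize, tailLength));
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) {
                throw new IOException("server did not return partial content");
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

For a 1000-byte file and a 100-byte tail, tailRange produces "bytes=900-999".  Whether the archive library can then be pointed at just that tail is a separate question.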
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: chief1983 on March 03, 2010, 05:15:38 pm
It's probably as feasible as how it was being done.  But the actual file structure of the rest of the archive might still be a problem.  You can't just get one file out of a 7-zip.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Galemp on March 03, 2010, 07:12:06 pm
I feel your pain. Just slap a "Verifying contents of package..." progress indicator on it and let it take as long as it likes.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Mongoose on March 03, 2010, 10:02:48 pm
Yeah, from the standpoint of a normal end-user, #2 seems like the best option, all things considered.  #1 isn't really practical, since there are a number of mods that take advantage of 7Zip's superior compression abilities, and #3 potentially raises bandwidth/HD space/fragmentation issues.  I doubt anyone who's relying on the installer is going to mind, or even notice, a bit of extra delay.

(One has to wonder why the 7Zip format is so ass-backwards in the first place, though...)
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Goober5000 on March 03, 2010, 11:53:05 pm
Quote
And is it possible to download just the last part of a file first and then read it?  I know download managers can resume files, so why not get the file size, download the last xx bytes (I don't really know if this is feasible, so please forgive me if it's out of the question), and read the TOC?
This is actually what the 7Zip guy and I are discussing now.  I'm going to try some things this weekend.


Quote
I feel your pain. Just slap a "Verifying contents of package..." progress indicator on it and let it take as long as it likes.
:yes: Win.
Title: Re: aaaaaaaaaaaargh (7zip rant)
Post by: Tomo on March 06, 2010, 09:16:45 am
"Verifying..." blah is how Windows Installer (MSI etc) starts anyway.

It shows an endless progress bar for that, so if you've got some indication from the API of how much work there is to be done and can show a real progress bar, then you are doing it so much better than Microsoft.