scheduling of builds in get-assignment

deep42thought · 2019-07-19 08:54:44

Currently, get-assignment, which is responsible for distributing build assignments to the build slaves, is somewhat slow. Additionally, it is written in a way that prevents it from running concurrently for different build slaves.

I'd like to brainstorm some ideas to improve performance on that part.

Which steps it currently performs:

trivial/fast tasks: return assignments which are already scheduled for that slave or which are manually forced onto that slave (not to be confused with "preferred by that slave")
create temporary list of buildable packages (e.g. ones where all dependencies are met and which fit the architecture of the build slave)
remove all "wrong" packages from that list (packages that are currently being built by another build slave, and all non-toolchain packages iff a toolchain package is on that list)
order that list by different criteria (e.g. toolchain-build-order, priority, architecture, commit time, previous errors/build trials, etc.)
hand out the first package of that (ordered) list
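
To make the steps above concrete, here is a minimal Python sketch of the filter, prune, order and hand-out sequence. It is not the actual get-assignment code (which is a bash script working against a database). The `Package` fields, the `pick_assignment` name and the sort key are all hypothetical, and the sketch assumes that a toolchain package on the list blocks everything else even while it is being built.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    # All fields are hypothetical stand-ins for columns in the real database.
    name: str
    arch: str
    priority: int = 0
    is_toolchain: bool = False
    unmet_deps: set = field(default_factory=set)
    being_built: bool = False
    failed_builds: int = 0


def pick_assignment(packages, slave_arch):
    """Return the package to hand out to a slave of slave_arch, or None."""
    # Buildable: all dependencies met and the architecture fits the slave.
    candidates = [
        p for p in packages if not p.unmet_deps and p.arch == slave_arch
    ]
    # If a toolchain package is on the list, drop all non-toolchain packages.
    # Assumption: this check happens before in-progress builds are removed.
    if any(p.is_toolchain for p in candidates):
        candidates = [p for p in candidates if p.is_toolchain]
    # Drop packages that another slave is already building.
    candidates = [p for p in candidates if not p.being_built]
    # Order the list. Assumption: higher priority first, then fewer failures.
    candidates.sort(key=lambda p: (-p.priority, p.failed_builds, p.name))
    # Hand out the first package of the ordered list.
    return candidates[0] if candidates else None
```

With this ordering of the checks, a toolchain package that is currently being built leaves the list empty, so the slave gets nothing until it is done.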

proposed improvement (ideas by abaumann):

make the temporary table permanent
hand out the k-th build assignment to the k-th slave
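
One way to picture the proposal is a permanent, pre-ordered queue table from which each connecting slave claims the first unclaimed row. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for the real database. The `build_queue` table, its columns and both function names are hypothetical, and the toolchain rules are deliberately left out.

```python
import sqlite3


def create_queue(conn, ordered_packages):
    """Create the permanent queue table and fill it in hand-out order."""
    conn.execute(
        "CREATE TABLE build_queue ("
        "position INTEGER PRIMARY KEY, package TEXT NOT NULL, slave TEXT)"
    )
    conn.executemany(
        "INSERT INTO build_queue (position, package) VALUES (?, ?)",
        list(enumerate(ordered_packages, start=1)),
    )


def claim_next(conn, slave):
    """Give the first unclaimed row to slave; return its package or None."""
    # BEGIN IMMEDIATE takes the write lock up front, so two slaves cannot
    # both read the same unclaimed row and then both claim it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT position, package FROM build_queue "
            "WHERE slave IS NULL ORDER BY position LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE build_queue SET slave = ? WHERE position = ?",
                (slave, row[0]),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[1] if row is not None else None
```

The connection is expected to be opened with `isolation_level=None`, so that the explicit `BEGIN`/`COMMIT` statements control the transaction. The k-th slave to call `claim_next` then receives the k-th row.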

there are some problems I see arising with this:

How do we handle rebuilds of the toolchain? All other packages should be blocked if the toolchain is being rebuilt. Also, we should only ever build one toolchain package per architecture at a time. Ordering will also be interesting ...
When do we update the permanent table? On return of packages? Then we need a lock or something ...

The pressure to optimize get-assignment is not that high: it currently takes around 30 seconds to complete, which is not that long, but could be optimized.
On the other hand, it is very easy to break things there, and I have already invested a lot of time in getting the scheduling logic right.

levi · 2019-07-19 18:57:56

What programming language is get-assignment written in, out of interest? What is its architecture? Does a single long-lived server process run on one VM/machine, with the clients talking to it over the network, or have I misunderstood the required topology? Is there a git repo I could browse or clone, or some other way of looking at the code to answer some of these questions (and others arising) myself?

deep42thought · 2019-07-19 19:05:43

it's a bash script that gets called each time a slave connects and wants some assignment.

levi · 2019-07-19 22:27:58

Okay, so it uses a mysql database and currently CREATEs a TEMPORARY TABLE for the orders and DROPs it at the end. The idea is presumably to make that a permanent table and instead shuffle things off the front and pop them on the end when their dependencies are settled. That works provided (and it may be a big proviso, I suppose) that the toolchain is a dependency of everything else. Personally, I'd suggest checking things as you pull them off the table, and if it turns out they've actually got an unfulfilled dependency, discard them and try the next one.
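
The "check as you pull" suggestion can be sketched in a few lines of Python. This only illustrates the idea and is not code from get-assignment: the in-memory `deque` stands in for the permanent table, and `unmet_deps` is a hypothetical mapping from package name to the set of dependencies that are still missing.

```python
from collections import deque


def next_buildable(queue, unmet_deps):
    """Pop entries off the front until one has no unfulfilled dependency."""
    while queue:
        pkg = queue.popleft()
        if not unmet_deps.get(pkg):
            # No missing dependencies recorded: hand this one out.
            return pkg
        # Stale entry: it still has an unfulfilled dependency, so it is
        # discarded and the next entry is tried.
    return None
```

The appeal of validating at hand-out time is that the queue may go stale between updates without a package being handed out too early. The cost is one dependency check per popped entry.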

I'm less clear on the locking situation. There seem to be locks already implemented, but I'm not sure exactly what else you need.

deep42thought · 2019-07-22 05:26:10

the funny thing about the toolchain is not that everything depends on it, but that certain packages are built twice before being released (though, I have to admit, I'm not 100% sure why we need that)

regarding locks: we currently lock on file level for some of the scripts - e.g. when a sanity check runs, no one should write to the package database, otherwise the sanity check will return false positives. However, some of the locks we have seem to be overcautious - but otoh we had problems with the database in the past when we relied on proper locking of stuff by the database (e.g. when we allowed simultaneously returning packages).
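
For readers unfamiliar with file-level locking, here is a small Python sketch of the general technique on Linux using `fcntl.flock`. It is not how the project's scripts take their locks (that is not shown in this thread). The `file_lock` helper and the lock file path are hypothetical.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def file_lock(path, blocking=True):
    """Hold an exclusive advisory lock on path for the duration of the block.

    With blocking=False, BlockingIOError is raised if the lock is taken.
    """
    # Append mode creates the lock file if needed without truncating it.
    with open(path, "a") as handle:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(handle, flags)
        try:
            yield handle
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A sanity check could wrap its run in `with file_lock(...)` on the same path that the writers use, so that writers wait (or fail fast with `blocking=False`) until the check has finished.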

andreas_baumann · 2019-07-22 13:02:11

Actually, this toolchain locking is over-paranoid; it guards against intermediate updates of the toolchain producing funny artifacts. In theory, every package in the glibc/binutils/gcc combo should be upgradeable on its own. But only certain combinations are known to work for all software. So it could be that you can update glibc/binutils/gcc, but as soon as you use a mixed-version combination for something else you can get spurious errors.

levi · 2019-07-22 16:10:12

deep42thought wrote:

the funny thing about the toolchain is not that everything depends on it, but that certain packages are built twice before being released (though, I have to admit, I'm not 100% sure why we need that)

Circular dependencies, I would assume.

Yes, assuming you're using the package dependencies, most things won't depend explicitly on the toolchain, as that only really documents runtime dependencies in the main. Presumably you have to have some kind of build dependencies list or logic to ensure the relevant toolchain gets built before the package that needs it to be built?

regarding locks: we currently lock on file level for some of the scripts - e.g. when a sanity check runs, no one should write to the package database, otherwise the sanity check will return false positives. However, some of the locks we have seem to be overcautious - but otoh we had problems with the database in the past when we relied on proper locking of stuff by the database (e.g. when we allowed simultaneously returning packages).

I'd much prefer locks to be overcautious than under, so long as they never deadlock. To be honest, I'm having a little trouble seeing why new locks are needed here. Perhaps I just need to re-read the OP, but for my sake at least, it would help if you could capture a use case where the tools should lock but currently don't. It seems to me that before the required toolchain has been built, the can-be-built queue should be empty other than some bits of toolchain, so as long as locking currently stops queue adds, queue SELECTs and other classes of dangerous SQL, that's okay.

andreas_baumann · 2019-07-22 16:15:37

Binaries can be built with different versions of the toolchain; only the toolchain itself should basically be built in one transaction. This currently means older binaries have been built with gcc 8 or even 7, newer ones with gcc 9. This should not be a big problem (unless things like "lack of CET-support" happen). :-)
