The comments in fd.c have long claimed that all file allocations should go through that module, but in reality that's not always practical. fd.c doesn't supply APIs for invoking some FD-producing syscalls like pipe() or epoll_create(); and the APIs it does supply for non-virtual FDs are mostly insistent on releasing those FDs at transaction end; and in some cases the actual open() call is in code that can't be made to use fd.c, such as libpq.
This has led to a situation where, in a modern server, there are likely to be seven or so long-lived FDs per backend process that are not known to fd.c. Since NUM_RESERVED_FDS is only 10, that meant we had *very* few spare FDs if max_files_per_process is >= the system ulimit and fd.c had opened all the files it thought it safely could. The contrib/postgres_fdw regression test, in particular, could easily be made to fall over by running it under a restrictive ulimit.
To improve matters, invent functions Acquire/Reserve/ReleaseExternalFD that allow outside callers to tell fd.c that they have or want to allocate a FD that's not directly managed by fd.c. Add calls to track all the fixed FDs in a standard backend session, so that we are honestly guaranteeing that NUM_RESERVED_FDS FDs remain unused below the EMFILE limit in a backend's idle state. The coding rules for these functions say that there's no need to call them in code that just allocates one FD over a fairly short interval; we can dip into NUM_RESERVED_FDS for such cases. That means that there aren't all that many places where we need to worry. But postgres_fdw and dblink must use this facility to account for long-lived FDs consumed by libpq connections. There may be other places where it's worth doing such accounting, too, but this seems like enough to solve the immediate problem.
Internally to fd.c, "external" FDs are limited to max_safe_fds/3 FDs. (Callers can choose to ignore this limit, but of course it's unwise to do so except for fixed file allocations.) I also reduced the limit on "allocated" files to max_safe_fds/3 FDs (it had been max_safe_fds/2). Conceivably a smarter rule could be used here --- but in practice, on reasonable systems, max_safe_fds should be large enough that this isn't much of an issue, so KISS for now. To avoid possible regression in the number of external or allocated files that can be opened, increase FD_MINFREE and the lower limit on max_files_per_process a little bit; we now insist that the effective "ulimit -n" be at least 64.
This seems like pretty clearly a bug fix, but in view of the lack of field complaints, I'll refrain from risking a back-patch.
3d475515a1 Account explicitly for long-lived FDs that are allocated outside fd.c.
contrib/dblink/dblink.c | 53 ++++++++++++
contrib/postgres_fdw/connection.c | 30 ++++++-
src/backend/access/transam/xlog.c | 13 +++
src/backend/postmaster/pgstat.c | 3 +
src/backend/postmaster/postmaster.c | 33 +++++++-
src/backend/postmaster/syslogger.c | 5 ++
src/backend/storage/file/fd.c | 116 +++++++++++++++++++++++---
src/backend/storage/ipc/dsm_impl.c | 15 +++-
src/backend/storage/ipc/latch.c | 44 ++++++++++
src/backend/utils/misc/guc.c | 2 +-
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/include/storage/fd.h | 11 ++-
12 files changed, 304 insertions(+), 23 deletions(-)