Status of dCache problems (2005/01/07)
======================================

Versions tested:
----------------

    d-cache-client-1.0-34
    d-cache-core-1.5.2-33
    d-cache-opt-1.5.3-15
    pnfs-3.1.10-12

                | Major (*)  | Normal     | Total
----------------+------------+------------+------------
Fixed since     |     3      |     1      |     4
2004/10/14      |            |            |
================+============+============+============
In progress (@) |     1      |    10      |    11
No progress     |     2      |     9      |    11
New             |            |     2      |     2
----------------+------------+------------+------------
Total open      |     3      |    21      |    24

-----------------------------------------------------------------------------

@   4) perror in dcap_url.c (2004/02/24)
       --> 2004/10/08: symbol "perror" not referenced by libdcap*.so,
           but when dc_opendir fails, some library routine still prints
           a message to stderr:

           Breakpoint 3, dc_opendir (path=0xbffff5aa "gsidcap://lxb0740:22128/pnfs/cern.ch/data/foo")
               at dcap_dirent.c:104
           [...]
           (gdb) n
           128     in dcap_dirent.c
           (gdb) n
           Command failed!
           Server error message for [1]: "can't get pnfsId (not a pnfsfile)" (errno 666).
           130     in dcap_dirent.c
           (gdb) n
           131     in dcap_dirent.c

           Furthermore, symbol "perror" is defined (!) by libgsiTunnel.so:

           $ nm /opt/d-cache/dcap/lib/libgsiTunnel.so | grep perror
           00154c1c T globus_fatal_perror
           00154ba0 T globus_perror
           001624c4 T perror

           That does not seem right.  It also defines Globus routines,
           which should rather come from dynamic libraries, or else this
           library might not be usable by a program that uses Globus already.
           Compare with the situation for write() and related routines:

           $ nm /opt/d-cache/dcap/lib/libgsiTunnel.so | grep -w write
           001749d0 W write

@   5) gfalfs/fuse/dCache integration (2004/02/24)
       --> to be tested

@   6) O_TRUNC or overwrite of existing file (2004/02/25)
       --> 2004/10/12: "truncate=true" in dCacheSetup does not work
           (neither for SRM-based writes, nor globus-url-copy, nor dccp)

    7) dcau (2004/02/25)

    8) pinning to be tested (2004/02/24)

@  10) error message when missing VO directory (2004/03/10)
       --> see item 46 (2004/10/07)

*  11) hang when writing a file and the disk is full (2004/03/12)
       --> 2004/10/12: this problem is still present; a hanging write
           of a large file is observed to fail with "Connection reset
           by peer" as soon as one or more small files are written
           successfully in parallel.

           We observe various inconsistent and unwanted behaviors on a
           system consisting of an admin node lxb0740 with a pool, and
           a pool node lxb1752:

           1. On lxb0740 the pool became completely full:

              /dev/hda4              4032124   3985104         0 100% /data

              Since the daemons run as root, they can still write there,
              but this situation should be avoided; an error should
              already be returned, say, at 99% usage (and any partially
              written data should be deleted).

           2. On lxb1752 we have this:

              /dev/hda4              4032124   3052356    774940  80% /data

           3. Yet the information provider says this:

              $ /opt/d-cache/srm/bin/srm-storage-element-info \
                    -x509_user_proxy=/opt/lcg/hostproxy \
                    https://lxb0740.cern.ch:8443/srm/infoProvider1_0.wsdl
              StorageElementInfo :
                  totalSpace     =7516192768 (7340032 KB)
                  usedSpace      =7052824559 (6887523 KB)
                  availableSpace =440035031 (429721 KB)

           4. The web page http://lxb0740:2288/usageInfo has yet another
              idea of the situation:

              lxb0740.cern.ch_1   lxb0740Domain   4096   241   3730
              lxb1752.cern.ch_1   lxb1752Domain   3072   126   2944

              So, there should be 241 MB free on lxb0740, 126 on lxb1752.

           5. Trying to store a file of 128 MB hangs, while a file of
              only 666 bytes is still stored successfully.
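
           The 99% cut-off suggested in point 1 of item 11 could be
           checked along the following lines.  This is only a sketch
           under assumptions, not dCache code: pool_usage_percent() and
           pool_should_reject() are hypothetical names, the usage figure
           is assumed to be computed the way df does it (used blocks
           over used plus blocks available to non-root users, rounded
           up), and the threshold is whatever the caller passes in.

```c
/* Sketch only: hypothetical helpers, not part of dCache. */
#include <assert.h>
#include <sys/statvfs.h>

/* Usage percentage in the style of df: used / (used + available to
 * non-root users), rounded up.  All arguments are block counts. */
unsigned pool_usage_percent(unsigned long long total,
                            unsigned long long free_root,
                            unsigned long long avail_user)
{
    unsigned long long used = total - free_root;
    unsigned long long denom = used + avail_user;

    if (denom == 0)
        return 100;
    return (unsigned)((used * 100 + denom - 1) / denom);
}

/* Returns 1 if storing `incoming` bytes under `path` would bring the
 * file system to `threshold_percent` or beyond, 0 if the write fits,
 * and -1 if the file system cannot be examined. */
int pool_should_reject(const char *path, unsigned long long incoming,
                       unsigned threshold_percent)
{
    struct statvfs vfs;
    unsigned long long need;

    assert(threshold_percent <= 100);
    if (statvfs(path, &vfs) != 0 || vfs.f_frsize == 0)
        return -1;

    /* Blocks the new file would occupy, rounded up. */
    need = incoming / vfs.f_frsize + (incoming % vfs.f_frsize != 0);
    if (need > vfs.f_bavail)
        return 1;               /* does not fit for a non-root user */

    return pool_usage_percent(vfs.f_blocks, vfs.f_bfree - need,
                              vfs.f_bavail - need) >= threshold_percent;
}
```

           A pool could call pool_should_reject("/data", size, 99)
           before accepting a transfer and return an error instead of a
           TURL when the result is non-zero.  Because the check uses the
           space available to non-root users, it would also trip in the
           lxb0740 case above, where root can still write but df already
           reports 100%.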

   14) relocatable RPMs (2004/03/17)

   15) file naming (2004/03/18)

   18) srmcp and X509_USER_PROXY (currently needs complicated command
       line options)

   20) IOTunnel library for kdcap + port number for kdcap

@  21) core dump when port not specified (2004/03/31)
       in some version we got an obscure error message instead:
       "Failed to create a control line"
       --> 2004/10/12: now we get a core dump again:
       --> 2004/12/12: still happens

           $ dccp /etc/group gsidcap://lxb1753:22128/pnfs/cern.ch/data/dteam/test.1
           666 bytes in 0 seconds
           $ dccp /etc/group gsidcap://lxb1753/pnfs/cern.ch/data/dteam/test.2
           Segmentation fault (core dumped)

   23) manual garbage collection (2004/04/02)

*@ 24) missing entry points: dc_chmod, dc_mkdir, dc_rename, dc_rmdir
       and dc_unlink (2004/04/02)
       --> dc_rename still absent (2004/10/07), the others are OK

*  26) dCache SRM returns a TURL even if no space available (2004/04/02)
       --> 2004/10/12: see item 11: the TURL is returned and the write
           hangs.

@  30) getFileMetaData srm://lxshare0282.cern.ch:8443/pnfs/cern.ch/data/cms
       gives java exception while the directory exists
       --> 2004/10/08: java exception gone, but various problems remain

           $ gfal_teststat srm://lxb1753.cern.ch/pnfs/cern.ch/data/dteam/non-existent
           gfal_stat: Communication error on send

           --> SRM error string does not include "No such file or directory":

               "could not get storage info by path :
                CacheException(rc=666;msg=Pnfs error :
                can't get pnfsId (not a pnfsfile))"

           $ gfal_teststat gsidcap://lxb1753.cern.ch:22128/pnfs/cern.ch/data/dteam/non-existent
           Command failed!
           Server error message for [1]: "can't get pnfsId (not a pnfsfile)" (errno 666).
           gfal_stat: No such file or directory

           --> OK, but library must be silent and only set errno

           $ gfal_teststat srm://lxb1753.cern.ch/pnfs/cern.ch/data/dteam/myfile
           stat successful
           mode = 100644
           nlink = 1
           uid = 2
           gid = 2
           size = 666

           --> GFAL uses default values for uid and gid, because none
               were returned (see next)

           $ gfal_teststat gsidcap://lxb1753.cern.ch:22128/pnfs/cern.ch/data/dteam/myfile
           stat successful
           mode = 100644
           nlink = 0
           uid = 18118
           gid = 2688
           size = 666

           --> wrong nlink

           $ gfal_teststat srm://lxb1753.cern.ch/pnfs/cern.ch/data/dteam
           gfal_stat: Communication error on send

           --> SRM returned this obscure error string:

               "could not get storage info by path :
                CacheException(rc=35;msg=Pnfs error :
                OSM info not found in
                000000000000000000001080(type=--I--d-----))"

           $ gfal_teststat gsidcap://lxb1753.cern.ch:22128/pnfs/cern.ch/data/dteam
           stat successful
           mode = 40755
           nlink = 0
           uid = 18118
           gid = 2688
           size = 512

           --> wrong nlink

   31) getFileMetaData does not return ownership
       --> 2004/10/08: uid and gid are not returned, GFAL uses defaults:

           $ gfal_teststat srm://lxb1753.cern.ch/pnfs/cern.ch/data/dteam/myfile
           stat successful
           mode = 100644
           nlink = 1
           uid = 2
           gid = 2
           size = 666

           # ls -ln /pnfs/cern.ch/data/dteam/myfile
           -rw-r--r--    1 18118    2688          666 Oct  8 23:04 /pnfs/cern.ch/data/dteam/myfile

@  32) Admin Guide + Installation Guide
       --> situation has significantly improved

@  33) dcap User Guide (only a few APIs are currently documented,
       protocols and port numbers should also be documented)
       --> minimally documented in dcache-user-instructions.txt

@  34) We propose that a hierarchy is implemented to set port numbers:
       user specified, environment variable, /etc/services, default
       set at compile time
       --> to be tested

   44) logfiles must be cleaned up: time stamps and request parameters
       must be added, harmless errors must be removed

@  46) specifying an incorrect SRM path can make the client hang
       --> it hangs for /pnfs/foo, /pnfs/cern.ch, /pnfs/cern.ch/foo,
           /pnfs/cern.ch/data, /pnfs/cern.ch/data/foo
           (2004/10/07);
       --> 2004/12/12: it no longer hangs, but the returned errors are
           not quite right:

           $ ./test-dcache.sh /opt/lcg/bin/gfal_teststat \
                 srm://lxb1753.cern.ch/pnfs/cern.ch/data/dteam/foo
           stat successful
           mode = 100644
           nlink = 1
           uid = 2
           gid = 2
           size = 666
           $ ./test-dcache.sh /opt/lcg/bin/gfal_teststat \
                 srm://lxb1753.cern.ch/pnfs/cern.ch/data/dteam/
           gfal_stat: No such file or directory
           $ ./test-dcache.sh /opt/lcg/bin/gfal_teststat \
                 srm://lxb1753.cern.ch/pnfs/cern.ch/data/dteam
           gfal_stat: No such file or directory
           $ ./test-dcache.sh /opt/lcg/bin/gfal_teststat \
                 srm://lxb1753.cern.ch/pnfs/cern.ch/data/
           gfal_stat: No such file or directory
           [...]
           $ ./test-dcache.sh /opt/lcg/bin/gfal_teststat \
                 srm://lxb1753.cern.ch/pnfs
           gfal_stat: No such file or directory
           $ ./test-dcache.sh /opt/lcg/bin/gfal_teststat \
                 srm://lxb1753.cern.ch/
           gfal_stat: Communication error on send

           globus-url-copy returns a misleading error message instead, e.g.

           "553 553 /pnfs/cern.ch/data/dteam: No such file or directory.
            : /pnfs/cern.ch/data"

   47) dc_opendir/dc_readdir robustness [remainders from bug 25]
       (2004/12/12)

       When the directory itself does not exist, dc_opendir does not
       return NULL.
       Here is an example program to demonstrate the problem:

----------------------------------------------------------------------
#define _LARGEFILE64_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include "dcap.h"

int
main(int argc, char **argv)
{
	struct dirent64 *d;
	DIR *dir;

	if (argc != 2) {
		fprintf (stderr, "usage: %s filename\n", argv[0]);
		exit (1);
	}
	/* expected to return NULL when the directory does not exist */
	if ((dir = dc_opendir (argv[1])) == NULL) {
		perror ("dc_opendir");
		exit (1);
	}
	/* print one entry name per line */
	while ((d = dc_readdir64 (dir)) != NULL) {
		printf ("%s\n", d->d_name);
	}
	if (dc_closedir (dir) < 0) {
		perror ("dc_closedir");
		exit (1);
	}
	exit (0);
}
----------------------------------------------------------------------

       On RH7.3 the results are as follows:

----------------------------------------------------------------------
$ ./testdcapdir \
      gsidcap://lxb0730:22128/pnfs/cern.ch/data/dteam/generated
2005-01-01
2005-01-03
2004-12-30
$ ./testdcapdir \
      gsidcap://lxb0730:22128/pnfs/cern.ch/data/dteam/non-existent
Command failed!
Server error message for [1]: "can't get pnfsId (not a pnfsfile)" (errno 666).
dc_closedir: Bad file descriptor
----------------------------------------------------------------------

       This shows that dc_opendir returned non-NULL, upsetting the
       logic of the program.  On SL3 the program occasionally hangs in
       dc_readdir64; when used by GFAL, the result is a segmentation
       fault.

       For short paths, dc_readdir always reports a single entry "data":

----------------------------------------------------------------------
$ ./test-dcache.sh /opt/lcg/bin/gfal_testdir \
      gsidcap://lxb1753.cern.ch:22128/pnfs/cern.ch
data
$ ./test-dcache.sh /opt/lcg/bin/gfal_testdir \
      gsidcap://lxb1753.cern.ch:22128/pnfs/
data
$ ./test-dcache.sh /opt/lcg/bin/gfal_testdir \
      gsidcap://lxb1753.cern.ch:22128/
data
$ ./test-dcache.sh /opt/lcg/bin/gfal_testdir \
      gsidcap://lxb1753.cern.ch:22128/pnfs/data
Command failed!
Server error message for [1]: "Path doesn't match" (errno 7).
Segmentation fault (core dumped)

(gdb) where
#0  0x402c66fe in readdir64@@GLIBC_2.2 () from /lib/libc.so.6
#1  0x4069447e in system_readdir64 (dir=0x80842e8) at system_io.c:492
#2  0x4068b32c in dc_readdir64 (dir=0x80842e8) at dcap_dirent.c:177
#3  0x4068b265 in dc_readdir (dir=0x80842e8) at dcap_dirent.c:137
#4  0x400265e5 in gfal_readdir (dir=0x80842e8) at gfal.c:496
#5  0x080487ab in main (argc=2, argv=0xbfffd354) at gfal_testdir.c:31
#6  0x402361c4 in __libc_start_main () from /lib/libc.so.6
----------------------------------------------------------------------

       When the user does not have a proxy, an obscure error is printed:

           $ ./testdcapdir gsidcap://lxb0740:22128/pnfs/cern.ch/data/foo
           Error on control line [3]
           Failed to create a control line
           dc_closedir: Bad file descriptor

       --> 2004/12/12: now it fails slightly differently:

           Error ( POLLIN POLLERR POLLHUP) (with data) on control line [3]
           Failed to create a control line
           Segmentation fault (core dumped)

   48) there is no easy way to add a door node (2004/12/17)

       There should be a script that only starts the gridftp door;
       /opt/d-cache/bin/dcache-opt should refer to that script.
       Can there be multiple (gsi)dcap doors as well?

-----------------------------------------------------------------------------
Fixed:
-----------------------------------------------------------------------------

*@ 25) non-working dc_opendir

   29) pnfs mountd incompatible with normal mountd
       --> will not be fixed for current pnfs version

*  37) many (> ~15) parallel clients cause SRM to hang

*@ 41) libgsiTunnel.so needs globus_module_activate/deactivate gssapi
       module to work around a Globus bug (patch available) (2004/05/18)

-----------------------------------------------------------------------------