DFSc Working Group



Meeting Date

  • TEAMS: 2022-04-27

Invitees - Attendees

  • Anthony, Guoxiang, Lori, Nathan, Nick, Lawrence

Review and accept previous meeting minutes.

Notes:

Proposed Agenda Items

Old business

Action items from the previous meeting (2022-03-17)

  • ceph-adm upgrade on the 902s (ldpaniak/nfish)
    • no action
    • Lori will look at it and plan to have Nathan rebuild
    • pilot step to the full upgrade/rebuild
    • high-fidelity test-bed for the real system
    • the 902 cluster was down; it has since been brought back up - is anything still connected to it?
      • plan is to re-install with the same setup of the production cluster and run the upgrade process
      • how can we tell what is or has been connecting to it? (see the sketch after this list)
      • Lori will investigate with Nathan
      • the whole process needs to happen during the summer and it will take a while
      • seems to be that we need to break out the /uNs into separate Ceph filesystems
  • Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
    • Anthony will discuss available hardware with Fraser to determine next steps - updated ticket
    • no obvious updates
    • purchased 8 spinning (spindle) drives
    • Omar to follow up with Fraser
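
For the question above about what is still connecting to the 902 cluster, a minimal sketch of one way to list active CephFS client sessions on an MDS. It assumes admin access to the cluster; the MDS name "test-902" is a placeholder, and the metadata fields present in the output vary by client type.

    # Sketch: list active CephFS client sessions to see which hosts are
    # still mounting the test cluster. "test-902" is a placeholder MDS name.
    import json
    import subprocess

    def client_sessions(mds_name: str) -> list:
        out = subprocess.run(
            ["ceph", "tell", f"mds.{mds_name}", "session", "ls"],
            check=True, capture_output=True, text=True,
        ).stdout
        return json.loads(out)

    if __name__ == "__main__":
        for s in client_sessions("test-902"):
            meta = s.get("client_metadata", {})
            print(s.get("id"), meta.get("hostname"), meta.get("kernel_version"))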

New business

  • Access to DFSc performance counters, RT#1211673 dlgawley
    • Dave's not here - leave for next meeting

  • Diagnostics a2brenna - also RT#1211673
    • debug symbols
    • Anthony wants them installed
    • Lori wants to know what happens with these debug symbols
    • Anthony: symbolic information used in stack traces for debugging; it sits inert on disk and takes space, but is never executed as code
    • the symbols were split into separate packages back when disk space was very limited
    • why install them on the servers? Meaningful profiling cannot be done without them on the server itself
    • could maybe be used on another host if all other code lined up the same
    • but cannot do any profiling of a running system if debug symbols are not on the host itself
    • Lori - what problems are we solving?
    • Anthony - has profiled the client side and found no problems, but cannot profile the server side
    • Anthony - a sampling profiler is non-invasive, unlike strace (see the profiling sketch after this list)
    • Lori - current software is EOL in 6-8 weeks, is it worth doing on this version?
    • Anthony - nothing in the next release's notes seems to indicate major fixes, so he will want to start reviewing now
    • this ticket has now been resolved
  • Any plans to deal with cephfs client crashes on teaching systems? - yc2lee/gxshen - RT#1214857
    • several teaching systems have crashed in the past couple months
    • seems to be a Linux kernel/Ceph issue - not clear if anyone in those groups is working on it
    • eg: https://forum.proxmox.com/threads/bug-kernel-null-pointer-dereference-address-0000000000000402.106067/
    • process for now?
    • the problem does not appear on kernel 5.11 or earlier
    • Anthony: concerned this is due to running the HWE kernels. Should only be running stock kernels
    • agreed at this meeting that the stock kernel should be put back onto systems as they are rebooted
    • noted in ticket RT#1217576:
      • At today's DFSC meeting, we agreed that Anthony will replace the current HWE kernels with the stock (5.4) kernel when rebooting the infrastructure systems over the next few days.
      • If you have any objections - note them here. Otherwise we assume you are all supportive of this plan.
      • If there are any noted problems, Anthony assures us that the change can be rolled back easily.
      • Lori's concern is about potential impacts on performance. Would not be supportive of the change if we were not experiencing crashes.
      • Anthony has diagnostic data but may require some re-working of his diagnostic tools based on the kernel change.
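
On the debug-symbols discussion above, a minimal sketch of the kind of non-invasive sampling Anthony described, assuming perf (linux-tools) and the matching Ceph debug symbol packages are installed on the server. The daemon name, sampling rate, and duration are placeholders.

    # Sketch: sample a running ceph-osd with perf and print a report.
    # perf interrupts the target at the given frequency and records stack
    # traces, so the daemon keeps serving I/O while being profiled.
    import subprocess

    def sample(pid: int, seconds: int = 30, freq: int = 99) -> None:
        subprocess.run(
            ["perf", "record", "-g", "-F", str(freq), "-p", str(pid),
             "--", "sleep", str(seconds)],
            check=True,
        )
        # Function names in the report are only meaningful if debug symbols
        # for the exact Ceph build are present on this host.
        subprocess.run(["perf", "report", "--stdio"], check=True)

    if __name__ == "__main__":
        osd_pid = int(subprocess.run(["pidof", "-s", "ceph-osd"], check=True,
                                     capture_output=True, text=True).stdout)
        sample(osd_pid)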

Upgrades

Update on status of the Ceph Dashboard RT# 973431 dlgawley

  • Lori: will follow from pending upgrades
  • Anthony: that is basically the Prometheus data
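
Since the dashboard data is basically the Prometheus data, a minimal sketch of pulling the raw counters straight from the ceph-mgr prometheus module. It assumes the module is enabled and exporting on its default port (9283); the hostname and the example counter prefix are placeholders.

    # Sketch: fetch raw Ceph performance counters from the mgr prometheus
    # module ("ceph mgr module enable prometheus", default port 9283).
    from urllib.request import urlopen

    def fetch_metrics(mgr_host: str, port: int = 9283) -> list:
        with urlopen(f"http://{mgr_host}:{port}/metrics", timeout=10) as resp:
            return resp.read().decode().splitlines()

    if __name__ == "__main__":
        # "ceph-mgr.example.uwaterloo.ca" is a placeholder hostname.
        for line in fetch_metrics("ceph-mgr.example.uwaterloo.ca"):
            if line.startswith("ceph_osd_op_r_latency"):  # example counter
                print(line)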

Start with 902 systems. Practice upgrades, work out bugs nfish/a2brenna/ldpaniak

  • in progress
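
A minimal sketch of the kind of before/after check that could go with the practice upgrades: "ceph versions" reports which release each daemon type is running, so a cluster left in a mixed state mid-upgrade is easy to spot. Assumes admin access on the 902 cluster; not something decided at the meeting.

    # Sketch: confirm all daemons report the same Ceph release after a
    # practice upgrade. "ceph versions" prints JSON with per-daemon-type
    # version counts plus an "overall" summary.
    import json
    import subprocess

    def daemon_versions() -> dict:
        out = subprocess.run(["ceph", "versions"], check=True,
                             capture_output=True, text=True).stdout
        return json.loads(out)

    if __name__ == "__main__":
        overall = daemon_versions().get("overall", {})
        if len(overall) == 1:
            print("cluster is uniform:", next(iter(overall)))
        else:
            print("mixed versions still running:", overall)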

Server side

  • Want to upgrade to Pacific by end of summer at the latest; Octopus is out of support 2022-06-01, so aim for early May?
    • reasons: strays, upgrade path, less OSD spillover from RocksDB (sharding; level sizes currently 3, 30, 300 GB, ...), mclock scheduler, Grafana daemons
  • One MDS problem
  • Remove all snapshots? (see the snapshot sketch after this list)
    • in preparation for the upgrade
    • Anthony: we should ensure we have complete backups before that
    • Guoxiang: running low on disk space for index files; backups are slow and the new hardware has problems, so he does not know when he can do the full backup (pretty soon?)
  • Real downtime with low/no cluster load
  • ceph-deploy is deprecated: migrate to ceph-adm (Docker containerization), then upgrade
    • maybe not as deprecated as originally thought
    • probably can't use ceph-adm
    • will try on the 902s
  • splitting cs-teaching into multiple real filesystems: u0-u19(?)
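
For the snapshot clean-up item above: CephFS snapshots live under a hidden ".snap" directory at the level where they were taken, and deleting a snapshot is an rmdir of its entry there. A minimal sketch with a made-up mount path that lists what would be removed before touching anything.

    # Sketch: list (and optionally remove) CephFS snapshots ahead of the
    # upgrade. The mount path below is a placeholder.
    import os

    def list_snapshots(path: str) -> list:
        snapdir = os.path.join(path, ".snap")
        return sorted(os.listdir(snapdir)) if os.path.isdir(snapdir) else []

    def remove_snapshots(path: str, dry_run: bool = True) -> None:
        for name in list_snapshots(path):
            target = os.path.join(path, ".snap", name)
            print(("would remove " if dry_run else "removing ") + target)
            if not dry_run:
                os.rmdir(target)  # rmdir on a .snap entry deletes the snapshot

    if __name__ == "__main__":
        remove_snapshots("/mnt/cs-teaching/u0", dry_run=True)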

Client side

Scratch(ish) drives on 211 systems (fhgunn)

  • ZFS sends for sync
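
A minimal sketch of the "ZFS sends for sync" idea: snapshot the scratch dataset and stream it to a second pool with zfs send/recv. The dataset names are placeholders; the actual layout for the 211 systems was not worked out at the meeting.

    # Sketch: one-shot sync of a scratch dataset using zfs send/recv.
    # "scratch/fast" and "backup/scratch" are placeholder dataset names.
    import subprocess
    from datetime import datetime, timezone

    def sync_dataset(src: str, dst: str) -> None:
        snap = f"{src}@sync-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
        subprocess.run(["zfs", "snapshot", snap], check=True)
        # Pipe the snapshot stream into a receive on the target dataset.
        send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
        subprocess.run(["zfs", "recv", "-F", dst], stdin=send.stdout, check=True)
        send.stdout.close()
        if send.wait() != 0:
            raise RuntimeError("zfs send failed")

    if __name__ == "__main__":
        sync_dataset("scratch/fast", "backup/scratch")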

Upcoming maintenance

  • Ongoing failed/flaky hard drive replacement (gxshen). The 421 drives are 5 years old; expect failures.
    • updates in progress

MDS instances holding strays

  • Anthony: what happens to the strays if the machine holding the MDS goes away, crashes, etc.?
  • the MDS is a data service that is triplicated, so one can go away
  • the strays will move
  • Anthony: why not just turn off the MDS?
  • the cluster will automatically start up a new one
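
On the stray discussion above, a minimal sketch of how the per-MDS stray count could be checked from the daemon's perf counters. It has to run on the host where that MDS lives (admin socket access); the daemon name is a placeholder, and the "mds_cache"/"num_strays" counter path is from memory, so treat it as a starting point.

    # Sketch: read the stray count from one MDS's perf counters via its
    # admin socket. "cs-teaching-a" is a placeholder daemon name.
    import json
    import subprocess

    def num_strays(mds_name: str) -> int:
        out = subprocess.run(
            ["ceph", "daemon", f"mds.{mds_name}", "perf", "dump"],
            check=True, capture_output=True, text=True).stdout
        return json.loads(out).get("mds_cache", {}).get("num_strays", 0)

    if __name__ == "__main__":
        print("strays:", num_strays("cs-teaching-a"))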

Action items for next meeting

  • ldpaniak/nfish: Ceph upgrade on the 902s
  • Omar: Fraser - create a ticket for the plans for local storage option -> https://rt.uwaterloo.ca/Ticket/Display.html?id=1209206
  • Anthony: Any plans to deal with cephfs client crashes on teaching systems? - yc2lee/gxshen - RT#1214857 (revert to 5.4 kernel)
  • Clayton: upgrade NFS servers to Ganesha 4.0
  • Guoxiang: need to get a full backup before the Ceph upgrade
  • Lori: disable/reduce new snapshots in anticipation of Ceph upgrade (?)