Log Manager and the Archive Array

Let me give a warm up before jumping into how this was understood.

A very generic introduction to the environment, Long story short -

  1. 2 databases in Environment
  2. 1 for live application, 1 for reporting
  3. live application database gets restored to the reporting database everyday before the business starts every day.
  4. Soon after the restore happened we started seeing log archival failures all of a sudden from no where
    2016-02-11-06.22.26.070851-360 I100204A452        LEVEL: Warning  
    PID     : 11469034             TID  : 3678        PROC : db2sysc 0
    INSTANCE: db2             NODE : 000                         
    EDUID   : 3678                 EDUNAME: db2logmgr (SAMPLE) 0    
    FUNCTION: DB2 UDB, data protection services,                      
    sqlpgRetryFailedArchive, probe:4760                                    
    MESSAGE :
    Unable to find log file 9543 for LOGARCHMETH1 using     
    method 4 and<                                                          
              target .  Assume log was archived and continue.

Few things/pattern we noticed with this issue are

  1. These errors started soon when the Rollforward phase of the restore started- quite obvious as the logging activity starts right from the Rollforward.
  2. Every time the archival failure is happening, db2logmgr is failing trying to find different log files to get them archived like below.

FUNCTION: DB2 UDB, data protection services,                      
sqlpgRetryFailedArchive, probe:4760                                    
MESSAGE : Unable to find log file 9543 for LOGARCHMETH1 using     
method 4 and<                                                          
          target .  Assume log was archived and continue.
   
FUNCTION: DB2 UDB, data protection services,                      
sqlpgRetryFailedArchive, probe:4760                                    
MESSAGE : Unable to find log file 9544 for LOGARCHMETH1 using     
method 4 and<                                                          
          target .  Assume log was archived and continue.
FUNCTION: DB2 UDB, data protection services,                      
sqlpgRetryFailedArchive, probe:4760                                    
MESSAGE : Unable to find log file 9545 for LOGARCHMETH1 using     
method 4 and<                                                          
          target .  Assume log was archived and continue.
FUNCTION: DB2 UDB, data protection services,                      
sqlpgRetryFailedArchive, probe:4760                                    
MESSAGE : Unable to find log file 9546 for LOGARCHMETH1 using     
method 4 and<                                                          
          target .  Assume log was archived and continue.
 

We did a force archive to see if this will any way sort the issue


db2 archive log for db SAMPLE

And it did resolved the issue and log archivals are getting completed successfully thereafter, but how long? till the next day restore.

Yes, this issue came up again in the next day daily restore, although the issue is being solved by doing a force archive we are more interested in RCA of this and raised an SR with IBM to find the dark secrets and internals of log manager and how and why is db2logmgr is assuming some log file which does not even exist to archive?

Initially IBM said that’s a warning message and be completely ignored, but we do have alerting system quite well integrated in the environment and serious though in place we decided to further work with IBM on why is this popping up rather than suppressing the alert.

So when further worked with IBM on this, there came some interesting conceptual facts of the db2logmgr.

Below are the set of Questions from us and answers from IBM specialist.

Question - How long will the below warning messages appear


Log Manger first calls sqlpgArchivePendingLogs to archive any log file 
that is pending to be archived. If it fails to archive any of the log  
files, the log file will be put into the archive array.                
                                                                       
2015-12-10-06.19.59.222222-360 E86409A379         LEVEL: Info     
PID     : 11469034             TID  : 3678        PROC : db2sysc 0
INSTANCE: db2             NODE : 000                         
EDUID   : 3678                 EDUNAME: db2logmgr (SAMPLE) 0    
FUNCTION: DB2 UDB, data protection services, sqlpgArchiveLogFile, 
probe:3108                                                             
1830 DATA #1 : <preformatted>                                          
Started archive for log file S0009543.LOG.                        
                                                                  
2015-12-10-06.19.59.222549-360 E86789A391         LEVEL: Warning  
PID     : 11469034             TID  : 3678        PROC : db2sysc 0
INSTANCE: db2             NODE : 000                         
EDUID   : 3678                 EDUNAME: db2logmgr (SAMPLE) 0    
FUNCTION: DB2 UDB, data protection services, sqlpgArchiveLogFile, 
probe:3150                                                             
MESSAGE : ADM1848W  Failed to archive log file "S0009543.LOG" to ""
from "".                                                               
                                                                   
2015-12-10-06.19.59.222873-360 E87181A447         LEVEL: Warning  
PID     : 11469034             TID  : 3678        PROC : db2sysc 0
INSTANCE: db2             NODE : 000                         
EDUID   : 3678                 EDUNAME: db2logmgr (SAMPLE) 0    
FUNCTION: DB2 UDB, data protection services, sqlpgArchiveLogFile, 
probe:3155                                                             
DATA #1 : <preformatted>                                          
Failed to archive log file S0009543.LOG to  from                  
/var/udb/logs1/db2/SAMPLE/ return code -2045837302.             
                                                                   
2015-12-10-06.19.59.223014-360 I87629A385         LEVEL: Warning  
PID     : 11469034             TID  : 3678        PROC : db2sysc 0
INSTANCE: db2             NODE : 000                         
EDUID   : 3678                 EDUNAME: db2logmgr (SAMPLE) 0    
FUNCTION: DB2 UDB, data protection services,                      
sqlpgArchivePendingLogs, probe:1500                                    
MESSAGE : Log archive failed with rc -2045837302 for LOGARCHMETH1.

                                                                       
Afterwards, Log Manager will scan the archive array through function   
sqlpgScanArchArray and retry archiving any failed attempts. The time   
between the first failed attempt and retry is determined by the config 
parameter archretrydelay.                                              
At this time, archiving is handled by function sqlpgRetryFailedArchive.
SQLO_FNEX will be returned if Log Manager could not find the file and  
will assume the log file has already been archived. Therefore no further
archive attempts will be made.                                         
                                                                       
2015-12-10-06.20.26.070851-360 I100204A452        LEVEL: Warning  
PID     : 11469034             TID  : 3678        PROC : db2sysc 0
INSTANCE: db2             NODE : 000                         
EDUID   : 3678                 EDUNAME: db2logmgr (SAMPLE) 0    
FUNCTION: DB2 UDB, data protection services,                      
sqlpgRetryFailedArchive, probe:4760                                    
MESSAGE : Unable to find log file 9543 for LOGARCHMETH1 using     
method 4 and<                                                          
          target .  Assume log was archived and continue.


Question - We noticed that once we issue force archive we see no warning popping up again – what I'm wondering here is how DB2 logmgr suddenly came to know that the LOG file is already existing.


Once Log Manager could not find the log files while it is in           
sqlpgRetryFailedArchive, it will update firstFailedArch1 or            
firstFailedArch2 (depends on which archive method you are using) to a  
higher log extent number.                                              
firstFailedArch is used for determining the first log file Log Manger  
failed to archive. Later on when you forces the log files to be        
archived, Log Manager will not check for the missing logs since they are
not pending to be archived and                                         
firstFailedArch1 or firstFailedArch2 already moved up. 


Thanks to that IBM Specialist who just revealed some interesting notes on how db2 internally handles the log array.

No comments:

Post a Comment