MA5800 OLT Board H903GPHF reset issue
Dec 26, 2024
Moka
Issue Description
The customer observed that an H903GPHF board installed in one of their MA5800 OLTs reset automatically.
According to the alarm history, a memory-abnormal alarm was raised on the service board before the alarm for communication failure between the service board and the control board.
The customer needs assistance because this event caused a service outage, and a resolution is required to prevent it from recurring.
Alarm Information
The following alarms were received (newest first):
2541607 01/06/2023 03:17:15+05:00 The board hardware abnormity recovers. FrameID: 0, SlotID: 14, Parameter1: 0, Parameter2: 10, Hardware location: Board, Fault name: The memory of the service path is abnormal, Fault effect: Memory data is incorrect
2541582 01/06/2023 03:16:15+05:00 The communication of the board with the control board recovers. FrameID: 0, SlotID: 14, Board Name: H903GPHF
2541580 01/06/2023 03:14:34+05:00 The communication between the board and the control board fails. FrameID: 0, SlotID: 14, Board Name: H903GPHF
2541574 01/06/2023 03:09:56+05:00 The board hardware is abnormal. FrameID: 0, SlotID: 14, Parameter1: 0, Parameter2: 10, Hardware location: Board, Fault name: The memory of the service path is abnormal, Fault effect: Memory data is incorrect
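To make the order of events easier to read, the alarm entries above can be parsed and sorted by time. The following is a minimal sketch, not a Huawei tool: the function name `parse_alarms` and the regular expression are illustrative, and the dates are assumed to be in DD/MM/YYYY format.

```python
import re
from datetime import datetime

# Alarm text copied from the list above. Line wrapping does not matter
# because whitespace is normalized before parsing.
RAW_ALARMS = """
2541607 01/06/2023 03:17:15+05:00 The board hardware abnormity recovers.
  FrameID: 0, SlotID: 14, Parameter1: 0, Parameter2: 10, Hardware location: Board,
  Fault name: The memory of the service path is abnormal, Fault effect: Memory data is incorrect
2541582 01/06/2023 03:16:15+05:00 The communication of the board with the control board recovers.
  FrameID: 0, SlotID: 14, Board Name: H903GPHF
2541580 01/06/2023 03:14:34+05:00 The communication between the board and the control board fails.
  FrameID: 0, SlotID: 14, Board Name: H903GPHF
2541574 01/06/2023 03:09:56+05:00 The board hardware is abnormal.
  FrameID: 0, SlotID: 14, Parameter1: 0, Parameter2: 10, Hardware location: Board,
  Fault name: The memory of the service path is abnormal, Fault effect: Memory data is incorrect
"""

# Serial number, date, time, UTC offset (the offset is matched but not used).
ENTRY_RE = re.compile(r"\b(\d+) (\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}:\d{2})[+-]\d{2}:\d{2} ")


def parse_alarms(raw):
    """Split the raw alarm text into records sorted oldest first."""
    text = " ".join(raw.split())
    matches = list(ENTRY_RE.finditer(text))
    alarms = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Assumption: dates are DD/MM/YYYY.
        when = datetime.strptime(f"{m.group(2)} {m.group(3)}", "%d/%m/%Y %H:%M:%S")
        alarms.append({"serial": int(m.group(1)), "time": when,
                       "text": text[m.end():end].strip()})
    return sorted(alarms, key=lambda a: a["time"])


def find(alarms, fragment):
    """Return the first alarm whose description contains the fragment."""
    return next(a for a in alarms if fragment in a["text"])


alarms = parse_alarms(RAW_ALARMS)
first_time = alarms[0]["time"]
for a in alarms:
    offset_s = int((a["time"] - first_time).total_seconds())
    print(f"+{offset_s:>4}s  {a['serial']}  {a['text'].split('.')[0]}")

# Duration of the communication loss between service board and control board.
outage_s = int((find(alarms, "control board recovers")["time"]
                - find(alarms, "control board fails")["time"]).total_seconds())
print(f"Communication outage: {outage_s}s")
```

Sorted this way, the memory alarm is the earliest event and the communication failure follows a few minutes later, which matches the sequence described in the issue description.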
Handling Process
Ask the customer for more information:
- Is this a new board, or was it working normally before the issue suddenly occurred? If it occurred suddenly, please state the date and time.
- Was any action performed on the NE recently (software upgrade, patch update, or hardware replacement)?
- Have you tried inserting the board into a different slot? What was the result?
- Have you tried resetting the board? What was the result?
- Have you tried replacing the board? What was the result?
- Please provide the basic logs from the following commands:
(config)#scroll
(config)#display time
(config)#display sysuptime
(config)#display patch all
(config)#display io-packetfile information
(config)#display board 0
(config)#display board 0/x // x is the slot of the main control board
(config)#display version
(config)#display current-configuration
(config)#display alarm history all
(config)#display event history all
(config)#display interface
(config)#display log all
[Reset/Reboot History Information]
(config)#diagnose
(diagnose)%%display reboot-record active
(diagnose)%%display reboot-record standby
(diagnose)%%display reset-record
(diagnose)%%display elabel 0 // this command needs about 3 minutes to finish, so please wait until the output appears
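If the same collection is requested often, the command list above can be generated with the control board slot filled in. This is only a convenience sketch: `build_collection_script` and the `control_slot` parameter are hypothetical names, the commands are taken unchanged from the list above, and the slot value must come from the actual chassis.

```python
# Commands run in config mode, in the order listed above.
CONFIG_COMMANDS = [
    "scroll",
    "display time",
    "display sysuptime",
    "display patch all",
    "display io-packetfile information",
    "display board 0",
    "display board 0/{control_slot}",  # main control board
    "display version",
    "display current-configuration",
    "display alarm history all",
    "display event history all",
    "display interface",
    "display log all",
]

# Commands run after entering diagnose mode.
DIAGNOSE_COMMANDS = [
    "display reboot-record active",
    "display reboot-record standby",
    "display reset-record",
    "display elabel 0",  # slow: wait for the output before continuing
]


def build_collection_script(control_slot):
    """Return the full command sequence with the control board slot filled in."""
    lines = [cmd.format(control_slot=control_slot) for cmd in CONFIG_COMMANDS]
    lines.append("diagnose")
    lines.extend(DIAGNOSE_COMMANDS)
    return lines


# "x" is kept as a placeholder here; replace it with the real slot number.
script = build_collection_script("x")
print("\n".join(script))
```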
The customer replied as follows:
1- Is this a new board, or was it working normally before the issue suddenly occurred? If it occurred suddenly, please state the date and time.
This is not a new board. The issue occurred suddenly last night. The complete alarms, with dates and times, were already shared in the first email.
2- Was any action performed on the NE recently (software upgrade, patch update, or hardware replacement)?
No new action was performed on the NE.
3- Have you tried inserting the board into a different slot? What was the result?
No, we did not try this. The board recovered by itself, so it was not needed.
4- Have you tried resetting the board? What was the result?
No, we did not try this. The board recovered by itself, so it was not needed.
5- Have you tried replacing the board? What was the result?
No, we did not try this. The board recovered by itself, so it was not needed.
6- Please provide the basic logs from the following commands:
Logs attached.
Root Cause
- The service board did not respond in time to the heartbeat messages sent by the control board (its link fault processing timer timed out). In this case, the control board resets the service board in an attempt to recover it.
- A hardware alarm was generated for the memory chip on the service board.
- The issue did not occur only once; it has occurred multiple times recently.
- The first alarm in the system started on 01/06/2023 10:50:41, as shown below:
According to the board reset-record, the last board reset occurred on 06-01 03:14:33 with the reason “The link fault processing timer of the board times out”. This was not a single occurrence; the reset was repeated many times, as shown below:
- This reset-record reason means that the service board did not respond in time to the heartbeat packets sent by the control board, which interrupted the communication between the control board and the service board.
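The reset behavior described above can be pictured with a toy model. The case does not document the actual heartbeat protocol or its timer values, so this sketch is illustrative only: the class `HeartbeatMonitor`, the 3-second timeout, and the one-second polling are all assumptions, not the MA5800 implementation.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Toy control-board view of one service board's heartbeat."""
    timeout_s: int          # hypothetical timeout, not a real MA5800 value
    last_reply_s: int = 0   # time of the last heartbeat reply

    def on_reply(self, now_s):
        """Record a heartbeat reply from the service board."""
        self.last_reply_s = now_s

    def should_reset(self, now_s):
        """Return True when the reply is overdue and a reset is triggered."""
        if now_s - self.last_reply_s > self.timeout_s:
            self.last_reply_s = now_s  # restart the timer after the reset
            return True
        return False


monitor = HeartbeatMonitor(timeout_s=3)
reset_times = []
for now_s in range(11):
    # The service board replies for the first few seconds, then goes silent,
    # as a board with a faulty memory path might.
    if now_s <= 3:
        monitor.on_reply(now_s)
    if monitor.should_reset(now_s):
        reset_times.append(now_s)

print("Resets triggered at:", reset_times)
```

In this model the reset is a consequence of the missed replies rather than a fault of its own, which is why the memory alarm on the service board is treated as the underlying problem.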
Solution
A hardware alarm related to the memory chip was also reported on the board, as shown below:
Suggestions
Replace the service board with an identical spare, then monitor the new board for some time to confirm that the resets do not recur.