Electronic Phenotype for Advanced Chronic Kidney Disease in a Veteran Health Care System Clinical Database: Systems-Based Strategy for Model Development and Evaluation

Background: Identifying advanced (stages 4 and 5) chronic kidney disease (CKD) cohorts in clinical databases is complicated and often unreliable. Accurately identifying these patients can allow targeting this population for their specialized clinical and research needs. Objective: This study was conducted as a system-based strategy to identify all prevalent Veterans with advanced CKD for subsequent enrollment in a clinical trial. We aimed to examine the prevalence and accuracy of conventionally used diagnosis codes and estimated glomerular filtration rate (eGFR)-based phenotypes for advanced CKD in an electronic health record (EHR) database. We sought to develop a pragmatic EHR phenotype capable of improving the real-time identification of advanced CKD cohorts in a regional Veterans health care system. Methods: Using the Veterans Affairs Informatics and Computing Infrastructure services, we extracted the source cohort of Veterans with advanced CKD based on a combination of the latest eGFR value ≤30 ml·min⁻¹·1.73 m⁻² or existing International Classification of Diseases (ICD)-10 diagnosis codes for advanced CKD (N18.4 and N18.5) in the last 12 months. We estimated the prevalence of advanced CKD using various prior published EHR phenotypes


Introduction
Advanced chronic kidney disease (CKD) progressing to end-stage kidney disease (ESKD) is a substantial burden for the US health care system [1]. Patients with advanced CKD are at increased risk for adverse outcomes, including progression to ESKD and death. Prior studies show that providing pre-ESKD nephrology care and comprehensive pre-ESKD education improves clinical outcomes; reduces health care costs; and increases home dialysis, transplantation utilization, and patient survival [2-6]. Despite these positive outcomes, approximately 40% of patients with incident ESKD in the United States have either limited (less than 6 months) or no access to nephrology care before initiating dialysis, and even fewer (<1%) receive kidney disease education services [7,8]. Accurately identifying the advanced (stages 4 and 5) CKD population at risk for ESKD can facilitate targeted needs assessment studies to improve pre-ESKD nephrology care and provide comprehensive pre-ESKD education for this high-risk population [9].
Clinically, CKD is diagnosed by sustained alterations in the structure or function of the kidney for more than 3 months with implications for health. The Kidney Disease: Improving Global Outcomes (KDIGO) Work Group recommends staging CKD based on cause, estimated glomerular filtration rate (eGFR), and albuminuria [10]. Unfortunately, the asymptomatic nature of CKD creates a lack of awareness for patients and providers alike [1,11]. Investigators conventionally use the International Classification of Diseases (ICD)-based diagnosis codes or electronic health record (EHR)-based phenotypes according to the eGFR to identify patients with CKD in clinical databases [12]. These phenotypes recommend using two eGFR values below 60 ml·min⁻¹·1.73 m⁻², obtained more than 90 days apart, to identify a population with CKD of stage 3 or higher in the databases [12]. However, similar guidance is not available to identify an advanced CKD population within clinical databases, and epidemiological investigations frequently use a single latest eGFR value while ascertaining the advanced CKD burden within the database [3,13,14]. Variability in the frequency of measurement, fluctuations in the serum creatinine value, and intervening acute kidney injury (AKI) episodes can all cause errors in classifying a patient's CKD stage [15]. Thus, there is a need to establish an optimal EHR-based method capable of identifying patients with advanced CKD within clinical databases in real time to improve kidney disease care and research.
Using the clinical database of the North Florida/South Georgia (NF/SG) Veterans Health System (VHS), we sought to estimate the prevalence of advanced CKD in real time using various EHR-recorded advanced CKD phenotypes within the Veterans Health Administration (VHA) [14,16]. We further examined the accuracy of different EHR phenotypes for advanced CKD by prospectively following the cohorts for 6 months and assessing the number of Veterans remaining in the advanced CKD stage after the initial classification. Furthermore, considering the lack of consensus on EHR phenotyping for identifying an advanced CKD cohort within clinical databases, we also sought to explore a new tiered pragmatic method for estimating the Veteran cohort with advanced CKD in real time.

Data Source and Cohort Selection
This study was conducted as a system-based strategy to identify all prevalent Veterans with advanced (stages 4 and 5) nondialysis CKD. The identified participants were then approached for enrollment in the Trial to Evaluate and Assess the effects of Comprehensive pre-ESKD education on Home dialysis among Veterans (TEACH-VET), which aims to assess the impact of a universal approach for comprehensive pre-ESKD education for all patients with advanced CKD on various clinical, patient-reported, and health services outcomes [17]. We used the Veterans Affairs (VA) Corporate Data Warehouse (CDW) and VA Informatics and Computing Infrastructure (VINCI) to identify the advanced CKD cohort. In brief, the VINCI services initially queried the VA CDW in April 2021 to identify all Veterans registered for service at NF/SG VHS during the 12 months prior to the data extraction (source cohort). The Veterans with an active laboratory value of creatinine were identified and their eGFR was calculated by applying the Modification of Diet in Renal Disease (MDRD) equation [18]. The use of the MDRD equation was determined by the then-prevalent method of eGFR estimation for the VINCI services. We then created a source cohort of Veterans with advanced CKD who either had the latest eGFR value ≤30 ml·min⁻¹·1.73 m⁻² (index eGFR) or an existing ICD-10 diagnosis code for advanced CKD (ICD-10 codes: N18.4 and N18.5) within the last 12 months (Figure 1). Patients on dialysis were excluded using the ICD-10 and Current Procedural Terminology (CPT) codes for dialysis (see Table S1 in Multimedia Appendix 1).
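As an illustration of the eGFR step, the following is a minimal sketch of the 4-variable MDRD equation. The text above states only that the MDRD equation [18] was applied; the function name, the argument names, and the default coefficient of 175 (the re-expressed form for standardized creatinine assays, as opposed to 186 in the original form) are assumptions made for illustration and may not match the VINCI implementation.

```python
def egfr_mdrd(serum_creatinine_mg_dl: float, age_years: float,
              female: bool, black: bool, coefficient: float = 175.0) -> float:
    """Estimate GFR (ml/min/1.73 m^2) with the 4-variable MDRD equation.

    `coefficient` is an assumption: 175 for standardized (IDMS-traceable)
    creatinine, 186 for the original equation. The study text does not say
    which form was used.
    """
    if serum_creatinine_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")
    egfr = coefficient * serum_creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742  # sex adjustment in the MDRD equation
    if black:
        egfr *= 1.212  # race adjustment in the MDRD equation
    return egfr


# Hypothetical patient: 75-year-old man with a serum creatinine of 2.5 mg/dL.
index_egfr = egfr_mdrd(2.5, 75, female=False, black=False)
meets_index_criterion = index_egfr <= 30  # source-cohort threshold from the text
```

The threshold comparison mirrors the source-cohort rule (latest eGFR ≤30); the diagnosis-code arm of that rule and the dialysis exclusions are not shown here.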
The prevalence of advanced CKD was estimated in real time using various methods described in the literature: ICD-10 advanced CKD diagnosis codes; a single (index) eGFR <30 ml·min⁻¹·1.73 m⁻²; and two eGFR values 90 days apart, with the index eGFR <30 ml·min⁻¹·1.73 m⁻² and the 90-day prior eGFR <60 ml·min⁻¹·1.73 m⁻² [14,16,19]. The cumulative prevalence of advanced CKD was calculated by combining the data extracted over 6 months. Patient-level data included age, sex, race, ethnicity, religion, marital status, Veteran era, and residential zip codes, which were used to define rurality by applying Rural-Urban Commuting Area codes. Statistical analyses were performed using R software version 4.0.4 (R Core Team, 2021) [20].
The source cohort (ie, April 2021 cohort) was divided into high-, intermediate-, and low-risk advanced CKD cohorts utilizing the latest (index) eGFR, the 90-day prior eGFR, and diagnostic codes (Table 1). Patients with both eGFR values below 30 ml·min⁻¹·1.73 m⁻² were considered to have a high risk of advanced CKD, whereas those with one of the two eGFR values less than 30 ml·min⁻¹·1.73 m⁻² but with the other value ≥30 but <60 ml·min⁻¹·1.73 m⁻² were considered to have an intermediate risk of having advanced CKD. The intermediate-risk cohort with an index eGFR below 30 ml·min⁻¹·1.73 m⁻² was further refined by excluding patients diagnosed with AKI within the 90 days prior to their latest eGFR values using ICD-10 codes. Veterans with both eGFR values ≥30 ml·min⁻¹·1.73 m⁻² but with diagnosis codes for advanced CKD were regarded as having a low risk of advanced CKD (Table 1). The source cohort was followed prospectively for 6 consecutive months until September 2021 using similar queries to examine the eGFR laboratory behavior of the patients with advanced CKD.
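The tiering rules described above can be sketched as a small classification function. This is a minimal sketch, not the study's query logic: the function name, argument names, and return labels are hypothetical, and the handling of cases the text does not define (a missing eGFR value, or one value <30 with the other ≥60) is an assumption, here returned as "unclassified".

```python
from typing import Optional


def classify_advanced_ckd_risk(index_egfr: Optional[float],
                               prior_egfr: Optional[float],
                               has_advanced_ckd_code: bool,
                               aki_in_prior_90_days: bool = False) -> str:
    """Assign a risk tier from the index eGFR and the 90-day prior eGFR."""
    # Assumption: the text does not define a tier when a value is missing.
    if index_egfr is None or prior_egfr is None:
        return "unclassified"

    # High risk: both eGFR values below 30.
    if index_egfr < 30 and prior_egfr < 30:
        return "high"

    # Intermediate risk: one value below 30, the other >=30 and <60.
    lower, upper = sorted((index_egfr, prior_egfr))
    if lower < 30 and 30 <= upper < 60:
        # Refinement from the text: when the index value is the one below 30,
        # patients with AKI in the prior 90 days are excluded.
        if index_egfr < 30 and aki_in_prior_90_days:
            return "excluded_aki"
        return "intermediate"

    # Low risk: both values >=30 but an advanced CKD diagnosis code exists.
    if index_egfr >= 30 and prior_egfr >= 30 and has_advanced_ckd_code:
        return "low"

    # Assumption: anything else (eg, one value <30 and the other >=60).
    return "unclassified"


tier = classify_advanced_ckd_risk(25.0, 45.0, has_advanced_ckd_code=False)
```

Returning a separate "excluded_aki" label rather than silently dropping the patient keeps the exclusion auditable, which may be useful when a list is reviewed manually.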

Outcomes
The primary goal of this study was to assess the prevalence and accuracy of various EHR phenotypes for extraction of an advanced CKD cohort in a clinical database utilizing diagnosis codes and eGFR models (ie, ICD-10 advanced CKD diagnosis codes; a single latest [index] eGFR <30 ml·min⁻¹·1.73 m⁻²; and two eGFR values 90 days apart, with the index eGFR <30 ml·min⁻¹·1.73 m⁻² and the 90-day prior eGFR <60 ml·min⁻¹·1.73 m⁻²) and our tiered EHR phenotype (high, intermediate, and low risk). Considering that nearly one-third of Veterans do not regularly obtain laboratory testing from within the VA, the denominator population for estimating the prevalence of advanced CKD was restricted to Veterans with a valid creatinine value measured over the prior 12 months.
Treating the laboratory-based eGFR EHR phenotypes as the reference standard for identifying patients with advanced CKD, we assessed the cross-sectional accuracy of identification using only ICD-10 codes by calculating the sensitivity and positive predictive value (PPV). A manual chart review was conducted in a small randomly selected sample to identify errors related to automated advanced nondialysis CKD identification. Prospective accuracy of all EHR phenotypes, including our pragmatic tiered approach of high-, intermediate-, and low-risk advanced CKD cohorts, was assessed by following laboratory values longitudinally and identifying the likelihood of remaining in the advanced CKD stage at the end of the 6-month follow-up.
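The sensitivity and PPV comparison can be illustrated with a short sketch that treats an eGFR phenotype as the reference and the diagnosis codes as the test. The function name and the toy patient identifiers are hypothetical and carry no study data; the study's analyses were performed in R, so this Python version is only an illustration of the calculation.

```python
from typing import Hashable, Set, Tuple


def sensitivity_and_ppv(code_positive: Set[Hashable],
                        reference_positive: Set[Hashable]) -> Tuple[float, float]:
    """Return (sensitivity, PPV) of diagnosis codes against a reference phenotype.

    code_positive: patients flagged by ICD-10 codes (the test).
    reference_positive: patients flagged by the eGFR phenotype (the reference).
    """
    if not code_positive or not reference_positive:
        raise ValueError("both patient sets must be non-empty")
    true_positives = len(code_positive & reference_positive)
    sensitivity = true_positives / len(reference_positive)  # TP / (TP + FN)
    ppv = true_positives / len(code_positive)                # TP / (TP + FP)
    return sensitivity, ppv


# Toy example with made-up identifiers.
coded = {"p1", "p2", "p3", "p4"}
egfr_phenotype = {"p3", "p4", "p5", "p6", "p7", "p8"}
sens, ppv = sensitivity_and_ppv(coded, egfr_phenotype)
```

Specificity and negative predictive value are omitted because they require the full denominator of unflagged patients, which this sketch does not model.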

Ethical Approval
The regulatory approvals for the study were obtained from the institutional review board of the University of Florida (201900870). The study data are stored in secured systems at NF/SG VHS as per the institutional guidelines.

Results
We identified 133,756 active enrollees, of whom 93,216 had at least one value of measured creatinine during an outpatient or inpatient visit at NF/SG VHS in the prior 12 months. After excluding the Veterans with ESKD by additional ICD and CPT codes, a source cohort of 1759 Veterans was identified as either having the latest eGFR ≤30 ml·min⁻¹·1.73 m⁻² or an existing ICD-10 diagnosis code for advanced CKD (ICD-10 codes N18.4 and N18.5) within the last 12 months (Figure 1). The overall cohort had a mean age of 75 (SD 11.1) years and consisted of a predominantly male (95.8%) and white (67.8%) population. These Veterans lived a mean of 126.3 (SD 229.5) miles from the nephrology service-providing VA center, with rural Veterans constituting a significant proportion (751/1759, 42.7%) of the cohort (Table 3). The cumulative cohort over the 6 months yielded 1840 Veterans with high and intermediate risk (2% cumulative prevalence). The sensitivity of diagnosis codes was only 55%-65% compared to the eGFR phenotypes, and the PPV of ICD-10 diagnosis codes for advanced CKD varied between 55% and 74% (Table 4).
The source cohort was followed prospectively for 6 months to examine the variations and likelihood of a sustained reduced eGFR <30 ml·min⁻¹·1.73 m⁻² across various EHR phenotypes. A total of 981 (55.8%) of the 1759 Veterans in the initial April cohort had at least one subsequent eGFR measurement (Table 5). The probability of any subsequent eGFR measurement above 30 ml·min⁻¹·1.73 m⁻² after the index eGFR in the cohort defined by ICD codes was 38.3%, and was 12.7% and 12.8% in the cohorts defined by the index eGFR <30 ml·min⁻¹·1.73 m⁻² and by the two-eGFR phenotype with index eGFR <30 ml·min⁻¹·1.73 m⁻² and 90-day prior eGFR <60 ml·min⁻¹·1.73 m⁻², respectively. Similarly, the probability of having any subsequent eGFR value above 30 ml·min⁻¹·1.73 m⁻² after the index eGFR measurement was 7.1%, 35.7%, and 90% in the high-, intermediate-, and low-risk groups, respectively. The probability of Veterans remaining in an advanced CKD stage (stages 4 and 5), as indicated by the most recent eGFR <30 ml·min⁻¹·1.73 m⁻² at the end of follow-up, was 65.3% in the group identified by the ICD codes, whereas the probability improved to 90% in the group defined by a single (index) eGFR <30 ml·min⁻¹·1.73 m⁻² and in the group defined by the index eGFR and 90-day prior eGFR method. Similarly, the probability of Veterans remaining in an advanced CKD stage at the end of the follow-up period was 94.2%, 71.0%, and 16.1% for the high-, intermediate-, and low-risk groups, respectively (Figure 2, Table 5, and Table S2 in Multimedia Appendix 1).
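The two follow-up measures reported above (any subsequent eGFR above 30, and the most recent eGFR below 30 at the end of follow-up) can be sketched as follows. The data layout, function name, and toy values are hypothetical, and restricting the denominator to patients with at least one follow-up measurement is an assumption about how the proportions were computed.

```python
from datetime import date
from typing import Dict, List, Tuple

Measurement = Tuple[date, float]  # (collection date, eGFR in ml/min/1.73 m^2)


def follow_up_summary(follow_up: Dict[str, List[Measurement]]) -> Dict[str, float]:
    """Summarize eGFR behavior among patients with >=1 follow-up measurement."""
    # Keep only patients with follow-up values, ordered by collection date.
    measured = {pid: sorted(vals) for pid, vals in follow_up.items() if vals}
    n = len(measured)
    if n == 0:
        raise ValueError("no patient has a follow-up measurement")
    # Patients with any follow-up eGFR above 30.
    any_above_30 = sum(any(egfr > 30 for _, egfr in vals)
                       for vals in measured.values())
    # Patients whose most recent follow-up eGFR is below 30.
    remaining_advanced = sum(vals[-1][1] < 30 for vals in measured.values())
    return {
        "n_with_follow_up": n,
        "p_any_above_30": any_above_30 / n,
        "p_remaining_advanced": remaining_advanced / n,
    }


# Toy data with made-up identifiers and values.
toy = {
    "A": [(date(2021, 5, 1), 25.0), (date(2021, 8, 1), 28.0)],
    "B": [(date(2021, 9, 1), 27.0), (date(2021, 6, 1), 35.0)],
    "C": [(date(2021, 7, 1), 40.0)],
    "D": [],  # no follow-up measurement, so not counted in the denominator
}
summary = follow_up_summary(toy)
```

Running the same summary separately on each phenotype or tier would give per-group proportions comparable in form to those reported in the text.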

Principal Findings
Accurate identification of an advanced CKD cohort within a clinical database can allow large health care organizations to provide targeted evidence-based clinical care, conduct system-wide needs assessment studies, and facilitate clinical and epidemiological outcome studies. Several EHR-based models to identify CKD using ICD codes and laboratory values have been published [12,21,22]. While there is a reasonable consensus regarding the EHR-based strategies to define CKD within a clinical database, no targeted study has examined the feasibility of extracting an advanced CKD cohort in such databases. Exploring the clinical database of one of the largest regional Veterans health care systems in the country, we identified several coding, identification, and accuracy-related concerns in extracting an advanced CKD cohort.
Researchers have conventionally used the provider diagnosis codes to identify and stage patients with CKD in clinical databases. Using the more accurate eGFR-based definitions, several investigators have shown that identifying CKD cohorts purely by diagnostic codes underestimates the true prevalence of CKD [23]. For example, Diamantidis et al [24] showed that the clinical recognition of CKD utilizing diagnostic codes was only 11.8% among Medicare beneficiaries. In a systematic review of studies primarily conducted on non-VHA health care databases, Grams et al [23] found that the coding accuracy for CKD varies widely between 8% and 83%, depending on providers' awareness, and rises with the comorbidity burden and severity of CKD.
Few investigators have evaluated the use and accuracy of CKD diagnosis codes in the VHA clinical database. In a recent analysis of the national VHA database, Saran et al [16] estimated the burden and cost of CKD care on VHA among over 6 million VHA-registered Veterans. While the investigators did not examine the coding accuracy, they found the overall use of diagnosis codes to be very low (3.2%) compared to much higher estimates (8.02%-27%) obtained using laboratory values [16]. Similar results were recently obtained by Bansal et al [19] in a selective cohort of Veterans with diabetes/hypertension at Veteran Integrated Service Network 17. They found that the laboratory-based prevalence of CKD was approximately 36%, but only 44% of these Veterans had diagnosis codes for CKD [19]. Similarly, Norton et al [25] found that 63% of entries lacked CKD codes in a military health system. Consistent with these reports, our analysis showed that the sensitivity and PPV of diagnosis codes for identifying advanced CKD, when compared to the eGFR-based phenotypes, are low, in the range of 55%-65% and 55%-74%, respectively. Our study further shows that when prospectively followed, nearly one-third of the cohort defined by diagnosis codes had an eGFR value over 30 ml·min⁻¹·1.73 m⁻² at the end of the 6-month study. Overall, our findings confirm that the utility and accuracy of diagnosis codes for identifying advanced CKD cohorts in the VHA clinical database are poor.
There are also concerns about using an eGFR-based staging system in clinical databases. EHR-based phenotypes require laboratory measurements of creatinine; however, the regular and periodic availability of creatinine may be inconsistent in the clinical databases. For example, Norton et al [14] showed that only 55% of the study sample had eGFR measurements while validating their CKD EHR phenotype. Similarly, a study examining the VA database showed that only 65% of the VA users had any measurements of eGFR during the study period [16]. This lack of availability of eGFR measures can generate errors in the measurement of disease burden. Further, while the definition of CKD requires the demonstration of a persistent reduction of renal function, many studies report CKD staging statistics using a single eGFR value, with a significant fraction of the cohort lacking the second reported eGFR value. For example, in an analysis performed by the National Kidney Disease Education Program Workgroup, 31% of patients with stage-4 CKD and 36% of patients with stage-5 CKD did not have a prior eGFR <60 ml·min⁻¹·1.73 m⁻² value available [14]. Similarly, in the analysis by Saran et al [16] examining the burden of CKD in the VA database, only approximately 27% of Veterans had two eGFR measurements more than 90 days apart, raising concerns about the accuracy of the disease burden. However, in our analysis, focusing on the advanced stages of CKD, we found that 1723 (98%) of the 1759 Veterans in the initial source cohort had two eGFR values reported, substantially increasing the reliability of screening for advanced CKD.
Additionally, we noticed that over 55% (n=981) of the source cohort had subsequent measurements of eGFR over the prospective 6 months (Table 5), further supporting the overall reliability of our advanced CKD estimates.
While using eGFR-based phenotypes improves the identification of CKD, staging CKD into stages 3, 4, and 5 can be complex in a clinical database due to physiologic variability in creatinine levels, performance of biochemical tests, frequency of measurements, and intercurrent illness and volume status [13]. Examining such variations in repeat estimations over 3-6 months in the VHA database, Shahinian et al [26] reported that nearly 30% of patients with stage-4 CKD and 6% of patients with stage-5 CKD had eGFR values ≥30 ml·min⁻¹·1.73 m⁻² in the repeat measurements and were thus misclassified as having advanced CKD rather than stage-3 CKD. These inaccuracies can lead to the misidentification of patients with advanced CKD, resulting in misallocation of clinical resources or errors in research outcomes for studies that target a specific advanced CKD population.
Considering these inherent limitations of eGFR and diagnostic codes, we sought to refine the predictive accuracy of isolating an advanced CKD cohort for TEACH-VET by categorizing our EHR-derived source cohort into high-, intermediate-, and low-risk advanced CKD cohorts using the two latest eGFR values obtained 90 days apart. Assessing the cohort prospectively for 6 months, we found a very high and graded level of stability with our tiered approach, with 94% and 71% of Veterans in the high-risk and intermediate-risk groups having an eGFR less than 30 ml·min⁻¹·1.73 m⁻² at the study end point, thus remaining in an advanced CKD stage. These findings suggest that such an operational definition can significantly improve clinical and research decision-making and optimize resource allocation; the definition is currently used to prioritize and enroll Veterans in a clinical study targeting advanced CKD [17]. At the same time, we show that approximately 16% of those with a low risk for advanced CKD had an eGFR below 30 ml·min⁻¹·1.73 m⁻² at the 6-month follow-up, highlighting that high-risk individuals exist even among those with apparent inaccuracies in diagnosis codes.
Our study explored various available methods to provide a more optimal method to obtain the population statistics for an advanced CKD burden and stratified this cohort based on their longitudinal probability of requiring stage-specific care. Examining real-time data and restricting the denominator to only those with an available eGFR estimation within the prespecified 12-month period, we found that the prevalence of advanced CKD (high and intermediate risk) was 1.5%, which is 2-3 times higher compared to the US general population estimates (0.5%) derived from National Health and Nutrition Examination Survey (NHANES) enrollees [1,27], but is less than VHA estimates (1.62%) provided by Saran et al [16]. Even based on the conservative estimates and accounting for all the VA users as the denominator, the prevalence of advanced CKD seems to be higher than that of the general population (Table 3). Recently, VHA has implemented a clinical tool for identifying a CKD cohort based on a single eGFR measurement [28]. Further refinements in the tool by implementing the proposed tiered risk approach to identify an advanced CKD population can allow the VHA to implement judicious allocation of care and resources to those in the highest need. A manual chart review showed an error rate of 11%, mainly attributed to the Veterans being on dialysis. Although the VA database can be linked to the United States Renal Data System (USRDS) database to help exclude dialysis patients, there is a lag in the USRDS data; hence, this linkage might not be helpful when advanced CKD needs to be identified in real time, as intended in our study for enrollment into a clinical trial [8,17]. In the VHS, using the community care dialysis list can further increase the sensitivity of the screened list and reduce the error rate by excluding the Veterans who are currently receiving dialysis.

Limitations
Our study has a few limitations. First, investigators have recently described advanced EHR algorithms to identify patients with CKD [29]. However, such phenotypes require complex machine-learning algorithms and validation for the target population, and their application in staging CKD is even further away. This study aimed to explore a pragmatic model for identifying Veterans with high, intermediate, and low risk of advanced CKD in real time that can be easily implemented in routine practice and across a large health care system. Second, we did not incorporate the presence or severity of albuminuria within our parsimonious risk model. However, we believe that it is unlikely to improve upon the model for several reasons. Measurement of albuminuria or even proteinuria is uncommon in clinical databases, including the VHA database, and frequently requires the use of proteinuria categorization on routine urinalysis. The risk for complications and adverse outcomes is significantly high for advanced CKD, as highlighted in the KDIGO classification, irrespective of the degree of albuminuria. Considering the unreliable availability of urine protein measurement, it is likely to be of limited additional value, if any [10]. We acknowledge that the true significance of our parsimonious approach will require studies examining longitudinal clinical outcomes. Third, our eGFR values are based on creatinine values and the MDRD equation, according to the then-prevalent practices of the VA CDW at the time of the study. Since the overall intention of the study was to evaluate the methodologies for identifying advanced CKD cohorts within a health care system such as VHA, this is unlikely to change the outcome of the study. Future analyses will need to consider the updated CKD-Epidemiology Collaboration equations incorporating creatinine and cystatin C values for more accurate staging of CKD.
Finally, our results are applicable only to active VHA users rather than all VHA-registered Veterans, and thus may misrepresent the true burden of advanced CKD among the entire Veteran population. EHR phenotypes, in general, may exclude people with reduced access to care.

Conclusion
We found that the prevalence of advanced CKD at NF/SG VHS is higher than that in the general population as per various EHR phenotypes, including our EHR model. There is significant discordance between coding and laboratory parameters for the identification of advanced CKD, consistent with other studies.
EHR phenotypes based on CKD diagnosis codes alone are insufficient for identification of an advanced CKD cohort in a clinical database. We report a simplified and pragmatic EHR-based model to identify advanced CKD within a regional VHS in real time with a tiered approach that allows allocation of resources to the groups that require immediate attention and are at risk of progression to ESKD. Further testing of this model is needed to determine its broader applicability across the VHA. If validated, similar models can be tested across the non-VHA databases to identify the true burden of advanced CKD and target clinical care in real time.