Links: original article · original paper · partial open-source code · dataset introduction
Reproduction notes. In `from CLHLS.get_3year_data import get_period_data`, `CLHLS` is the folder name and the part after the dot is the file inside it (see reference). To process the data in Python, the .sav files must first be converted to .csv. Method 1: export from SPSS with UTF-8 encoding; the exported file does not look entirely standard, but the output matches the author's, so it should be fine. Method 2: use R's foreign package:
```r
# Convert .sav to .csv with the foreign package.
# to.data.frame = TRUE makes read.spss return a data frame instead of a list.
library(foreign)
write.table(read.spss("inFile.sav", to.data.frame = TRUE),
            file = "outFile.csv", quote = TRUE, sep = ",")
```
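Alternatively, the conversion can be done entirely in Python; a minimal sketch assuming the pyreadstat package (my suggestion, not part of the author's pipeline):

```python
# Minimal .sav -> .csv conversion; pyreadstat is an assumption,
# not something the author's code uses.
import pyreadstat

df, meta = pyreadstat.read_sav('inFile.sav')   # meta holds variable labels
df.to_csv('outFile.csv', index=False, encoding='utf-8')
```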
The file Descriptive Data.ipynb handles data preprocessing. Some Python/pandas notes used throughout: `enumerate()` yields index-value pairs; the `.columns` attribute returns a DataFrame's column labels. The csv files have the same layout as the sav files, so the sav is more intuitive to inspect (csv is just easier to process programmatically); SPSS's variable view shows each question asked and its answer codes, and variables with the suffix `_5` come from the 2005 follow-up. Although some files span 2002-2018 and others 2005-2018, they presumably cover different survey cohorts. `dict.keys()` returns a dictionary's keys, not its values, e.g. `list({'Chinasoft':'China', 'Microsoft':'USA'}.keys())` gives `['Chinasoft', 'Microsoft']`. In pandas, `value_counts` counts how many distinct values a column contains and how often each occurs, optionally sorted (see reference). `map(function, iterable)` applies the function to every element of the sequence and returns the results. An explanation of the following snippet:
```python
if start_year != 2011:
    three_years_later_columns = list(
        map(lambda x: x + year_index if x != 'trueage' else 'vage' + year_index,
            basic_columns))
else:
    three_years_later_columns = list(map(lambda x: x + year_index, basic_columns))
```
This derives the variable names used at the three-year follow-up: for waves starting before 2011 the follow-up age field is named vage_&lt;year&gt;, while for the 2011 wave it is trueage_&lt;year&gt;; the value of x here ties back to the `if start_year != 2011:` branch. Because personal-information questions are recorded only at baseline and not at follow-up, the follow-up copies of those questions are removed, keeping the follow-up outcome questions plus the baseline information, as below:
```python
for personal_stable_variable in list(map(lambda x: x + year_index,
                                         ['a2', 'f1', 'a1', 'a41', 'a43'])):
    three_years_later_columns.remove(personal_stable_variable)
total_columns = basic_columns + three_years_later_columns
```
Next, keep only elders who were still alive and not lost to follow-up three years later; for example dth11_14 covers the 2011 baseline through 2014, where 0 means still alive in 2014. The code below handles questionnaire variables with missing values: the questionnaire uses 8 as a missing code, but some variables can legitimately take the value 8, so these are separated out first:
```python
questions_may_contain8 = ['vage', 'trueage', 'a2', 'a41', 'f1',
                          'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                          'd75', 'd86', 'd12',
                          'c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                          'c21a', 'c21b', 'c21c',
                          'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                          'c41a', 'c41b', 'c41c',
                          'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
later_questions_may_contain8 = list(map(lambda x: x + year_index, questions_may_contain8))
total_questions_may_contain8 = questions_may_contain8 + later_questions_may_contain8
```
For the remaining variables, an 8 (or 88, 888, ...) may well be a missing code; the replace() calls below rewrite these numeric missing codes as NaN:
```python
variable_needed_processed = list(set(df.columns) - set(total_questions_may_contain8))
new_df = df.copy()
for var in variable_needed_processed:
    # numeric codes that mark a missing answer in the questionnaire
    new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)
```
The code below computes each variable's missing rate: df.isnull() flags missing cells, df.isnull().any() flags which columns contain any missing value, and new_df.loc[:, new_df.isnull().any()].isnull().sum() sums the missing entries in each such column; dividing by the total number of rows gives the missing rate.
```python
print('Missing values situation')
print((new_df.loc[:, new_df.isnull().any()].isnull().sum() / len(df))
      .apply(lambda x: str('%.2f' % (x * 100)) + '%'))

# Fill each incomplete column with its mode (most frequent value).
for col in new_df.columns[new_df.isnull().any()]:
    mode = new_df[col].value_counts().sort_values(ascending=False).index[0]
    new_df[col].fillna(mode, inplace=True)
```
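As an aside, `value_counts(...).index[0]` is simply the column's most frequent value; a small equivalence check (illustrative, not from the repository):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 2, None])
# the hand-rolled mode and Series.mode() agree (up to tie-breaking)
assert s.value_counts().sort_values(ascending=False).index[0] == s.mode()[0] == 2
```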
Remove respondents who could answer none of the MMSE questions (23 items in total), i.e. drop those rows; `new_df = new_df[new_df.isnull().sum(axis=1) < 23]` keeps anyone who could answer at least one item. The block finally prints the sample sizes before and after dropping the MMSE-missing respondents:
```python
MMSE_questions = ['c11', 'c12', 'c13', 'c14', 'c15',
                  'c21a', 'c21b', 'c21c',
                  'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                  'c41a', 'c41b', 'c41c',
                  'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
MMSE_questions_later = list(map(lambda x: x + year_index, MMSE_questions))

for var in MMSE_questions:
    new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)

print('Samples before drop current MMSE missing : ', len(new_df))
# keep rows that answered at least one of the 23 MMSE items
new_df = new_df[new_df.isnull().sum(axis=1) < 23]
print('Samples after drop current MMSE missing : ', len(new_df))

for var in MMSE_questions_later:
    new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)

print('Samples before drop later MMSE missing : ', len(new_df))
new_df = new_df[new_df.isnull().sum(axis=1) < 23]
print('Samples after drop later MMSE missing : ', len(new_df))
```
Count those aged 60 and over and compare with the counts before filtering:
```python
print('Before drop age<60 : ')
print('Wave 1: ', len(first_wave_new))
print('Wave 2: ', len(second_wave_new))
print('Wave 3: ', len(third_wave_new))
print('Wave 4: ', len(fourth_wave_new))

first_wave_new = first_wave_new[first_wave_new.trueage >= 60]
second_wave_new = second_wave_new[second_wave_new.trueage >= 60]
third_wave_new = third_wave_new[third_wave_new.trueage >= 60]
fourth_wave_new = fourth_wave_new[fourth_wave_new.trueage >= 60]

print('After drop age<60 : ')
print('Wave 1: ', len(first_wave_new))
print('Wave 2: ', len(second_wave_new))
print('Wave 3: ', len(third_wave_new))
print('Wave 4: ', len(fourth_wave_new))
```
Overview of data preprocessing. Stage one includes four non-overlapping three-year waves of survey data between 2002 and 2014, together with the required column variables. A detailed participant-selection flowchart is given in Figure 1 of the paper. All participants were followed up every three years (one wave).
```python
# Excerpts from the repository; they assume the usual imports:
# import numpy as np; import pandas as pd; import statsmodels.api as sm
# from scipy.stats import chi2_contingency, ttest_ind
# from sklearn.preprocessing import StandardScaler

def get_data_between_start_to_end(df, start_year):
    # Suffix of the follow-up year, e.g. '_5' for 2005 or '_11' for 2011.
    short_year = str(start_year + 3)[-1] if start_year + 3 < 2010 else str(start_year + 3)[-2:]
    variable_list = []
    index = 0
    minux_index = len(short_year) + 1
    # Keep all baseline columns up to the first follow-up column.
    for i, variable in enumerate(df.columns):
        if variable[-minux_index:] == '_' + short_year:
            index = i
            break
        variable_list.append(variable)
    print(index)
    # Keep follow-up columns until the next period's death indicator
    # or a later wave's suffix appears.
    for j in range(index, len(df.columns)):
        if df.columns[j] == 'dth' + str(start_year + 3)[-2:] + '_' + str(start_year + 6)[-2:] \
                or df.columns[j][-3:] == '_' + str(start_year + 7)[-2:]:
            break
        variable_list.append(df.columns[j])
    print(variable_list)
    return df.loc[:, variable_list]
```
Data preprocessing and statistics: from variables such as dth02_05, count how many survived, died, or were lost to follow-up during each three-year period:
```python
"""Data preprocessing and statistics."""
def get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14):
    data11to14.dth11_14 = pd.to_numeric(data11to14.dth11_14, errors='coerce')
    # dth08_11 == 2011 apparently marks survivors in that file; recode to 0.
    data08to11.loc[data08to11[data08to11.dth08_11 == 2011].index, 'dth08_11'] = 0
    observations = []
    survivor = []
    death = []
    missing = []
    df_dict = {'df1': data02to05, 'df2': data05to08, 'df3': data08to11, 'df4': data11to14}
    df_var_dict = {'df1': 'dth02_05', 'df2': 'dth05_08', 'df3': 'dth08_11', 'df4': 'dth11_14'}
    for df in df_dict.keys():
        print(df)
        print(df_dict[df])
        print(len(df_dict[df]))
        counter = df_dict[df][df_var_dict[df]].value_counts()
        observations.append(len(df_dict[df]))
        survivor.append(counter[0])    # 0  = alive at follow-up
        death.append(counter[1])       # 1  = died
        missing.append(counter[-9])    # -9 = lost to follow-up
    return pd.DataFrame({'start year observations': observations,
                         'survivor': survivor,
                         'death': death,
                         'missing': missing},
                        index=['2002 to 2005', '2005 to 2008', '2008 to 2011', '2011 to 2014'])

situation = get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14)
print('Number of investigated elders situations : ', '\n', situation)
```
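As a small illustration (not from the repository) of the value_counts indexing above, using the survey's coding 0 = alive, 1 = died, -9 = lost to follow-up:

```python
import pandas as pd

# Toy status column with the same coding as dth02_05 etc.
s = pd.Series([0, 0, 1, -9, 0, 1])
counter = s.value_counts()
print(counter[0], counter[1], counter[-9])  # 3 2 1
```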
Filter the respondents who are still alive, then take the needed columns, removing the baseline-only personal-information questions from the three-years-later set:
```python
def get_wave_data(df, start_year):
    basic_columns = ['a1', 'trueage', 'a2', 'f1', 'f41', 'f34', 'a41', 'a43', 'a51',
                     'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                     'd71', 'd72', 'd75', 'd81', 'd82', 'd86', 'd91', 'd92',
                     'd31', 'd32',
                     'd11b', 'd11c', 'd11d', 'd11e', 'd11f', 'd11g', 'd11h', 'd12',
                     'e1', 'e2', 'e3', 'e4', 'e5', 'e6',
                     'g15a1', 'g15b1', 'g15c1', 'g15d1', 'g15e1', 'g15f1', 'g15g1',
                     'g15h1', 'g15i1', 'g15j1', 'g15k1', 'g15l1', 'g15m1', 'g15n1', 'g15o1',
                     'c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                     'c21a', 'c21b', 'c21c',
                     'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                     'c41a', 'c41b', 'c41c',
                     'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
    # Before the 2011 wave the follow-up age variable is vage_<yy>; in 2011 it is trueage_<yy>.
    if start_year != 2011:
        three_years_later_columns = list(
            map(lambda x: x + year_index if x != 'trueage' else 'vage' + year_index, basic_columns))
        print(three_years_later_columns)
    else:
        three_years_later_columns = list(map(lambda x: x + year_index, basic_columns))
        print(three_years_later_columns)
    # Personal information is recorded only at baseline; drop its follow-up duplicates.
    for personal_stable_variable in list(map(lambda x: x + year_index, ['a2', 'f1', 'a1', 'a41', 'a43'])):
        three_years_later_columns.remove(personal_stable_variable)
    total_columns = basic_columns + three_years_later_columns
    # Keep only respondents who were still alive at the 3-year follow-up.
    death_index_var = 'dth' + str(start_year)[-2:] + '_' + str(start_year + 3)[-2:]
    live_df = df[df[death_index_var] == 0]
    print(live_df)
    wave_data = live_df[total_columns]
    print(wave_data)
    for column in wave_data.columns:
        wave_data[column] = pd.to_numeric(wave_data[column], errors='coerce')
    return wave_data
```
Compute each variable's missing rate, fill missing values with the mode, and drop respondents who could answer none of the MMSE questions:
```python
def replace_abnormal_and_fill_nan(df, start_year):
    # Variables that can legitimately take the value 8 and must not be recoded.
    questions_may_contain8 = ['vage', 'trueage', 'a2', 'a41', 'f1',
                              'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                              'd75', 'd86', 'd12',
                              'c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                              'c21a', 'c21b', 'c21c',
                              'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                              'c41a', 'c41b', 'c41c',
                              'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
    later_questions_may_contain8 = list(map(lambda x: x + year_index, questions_may_contain8))
    total_questions_may_contain8 = questions_may_contain8 + later_questions_may_contain8

    variable_needed_processed = list(set(df.columns) - set(total_questions_may_contain8))
    new_df = df.copy()
    for var in variable_needed_processed:
        new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)

    print('Missing values situation')
    print((new_df.loc[:, new_df.isnull().any()].isnull().sum() / len(df))
          .apply(lambda x: str('%.2f' % (x * 100)) + '%'))
    # Fill each incomplete column with its mode.
    for col in new_df.columns[new_df.isnull().any()]:
        mode = new_df[col].value_counts().sort_values(ascending=False).index[0]
        new_df[col].fillna(mode, inplace=True)

    MMSE_questions = ['c11', 'c12', 'c13', 'c14', 'c15',
                      'c21a', 'c21b', 'c21c',
                      'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                      'c41a', 'c41b', 'c41c',
                      'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    MMSE_questions_later = list(map(lambda x: x + year_index, MMSE_questions))
    for var in MMSE_questions:
        new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)
    print('Samples before drop current MMSE missing : ', len(new_df))
    new_df = new_df[new_df.isnull().sum(axis=1) < 23]
    print('Samples after drop current MMSE missing : ', len(new_df))
    for var in MMSE_questions_later:
        new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)
    print('Samples before drop later MMSE missing : ', len(new_df))
    new_df = new_df[new_df.isnull().sum(axis=1) < 23]
    print('Samples after drop later MMSE missing : ', len(new_df))
    return new_df
```
Combine everything and keep respondents aged 60 and over:
```python
def get_period_data():
    print('Reading data ...')
    data_02to18 = pd.read_csv('./data/clhls_2002_2018_utf8.csv', low_memory=False)
    data_05to18 = pd.read_csv('./data/clhls_2005_2018_utf8.csv', low_memory=False)
    data_08to18 = pd.read_csv('./data/clhls_2008_2018_utf8.csv', low_memory=False)
    data_11to18 = pd.read_csv('./data/clhls_2011_2018_utf8.csv', low_memory=False)

    print('Getting 3 year periods data ...')
    data02to05 = get_data_between_start_to_end(data_02to18, 2002)
    data05to08 = get_data_between_start_to_end(data_05to18, 2005)
    data08to11 = get_data_between_start_to_end(data_08to18, 2008)
    data11to14 = get_data_between_start_to_end(data_11to18, 2011)
    situation = get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14)
    print('Number of investigated elders situations : ', '\n', situation)

    print('Selecting needed variables ...')
    first_wave = get_wave_data(data02to05, 2002)
    second_wave = get_wave_data(data05to08, 2005)
    third_wave = get_wave_data(data08to11, 2008)
    fourth_wave = get_wave_data(data11to14, 2011)

    print('Filling missing values ...')
    first_wave_new = replace_abnormal_and_fill_nan(first_wave, 2002)
    second_wave_new = replace_abnormal_and_fill_nan(second_wave, 2005)
    third_wave_new = replace_abnormal_and_fill_nan(third_wave, 2008)
    fourth_wave_new = replace_abnormal_and_fill_nan(fourth_wave, 2011)

    print('Before drop age<60 : ')
    print('Wave 1: ', len(first_wave_new))
    print('Wave 2: ', len(second_wave_new))
    print('Wave 3: ', len(third_wave_new))
    print('Wave 4: ', len(fourth_wave_new))
    first_wave_new = first_wave_new[first_wave_new.trueage >= 60]
    second_wave_new = second_wave_new[second_wave_new.trueage >= 60]
    third_wave_new = third_wave_new[third_wave_new.trueage >= 60]
    fourth_wave_new = fourth_wave_new[fourth_wave_new.trueage >= 60]
    print('After drop age<60 : ')
    print('Wave 1: ', len(first_wave_new))
    print('Wave 2: ', len(second_wave_new))
    print('Wave 3: ', len(third_wave_new))
    print('Wave 4: ', len(fourth_wave_new))
    return first_wave_new, second_wave_new, third_wave_new, fourth_wave_new
```
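Putting part one together, a hedged usage sketch matching the import shown at the top of these notes:

```python
from CLHLS.get_3year_data import get_period_data

first_wave_new, second_wave_new, third_wave_new, fourth_wave_new = get_period_data()
```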
The second part derives three scores: the CI score, the depression score, and the ADL score. First locate the variables behind them:
```python
new_df = df.copy()
current_questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                     'c21a', 'c21b', 'c21c',
                     'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                     'c41a', 'c41b', 'c41c',
                     'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
three_year_questions = list(map(lambda x: x + year_index, current_questions))
questions = current_questions + three_year_questions
```
The apply() method is used with calculate_CI_score, which computes both the baseline-year CI score and the three-years-later CI score. The function:
```python
def calculate_CI_score(x, periods='current', start_year=None):
    if periods == 'current':
        questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                     'c21a', 'c21b', 'c21c',
                     'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                     'c41a', 'c41b', 'c41c',
                     'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    elif periods == 'three_years_later':
        year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
        questions = list(map(lambda q: q + year_index,
                             ['c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                              'c21a', 'c21b', 'c21c',
                              'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                              'c41a', 'c41b', 'c41c',
                              'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']))
    CI_score = 0
    for question in questions:
        if question[0:3] != 'c16':
            # Ordinary MMSE items score 1 point for a correct answer (coded 1).
            if x[question] == 1:
                CI_score += 1
        else:
            # c16 (naming foods) scores the number named, capped at 7 points.
            food_score = x[question] if x[question] <= 7 else 7
            CI_score += food_score
    return CI_score
```
Compute the current CI score and the CI score three years later:
```python
new_df['current_CI_score'] = new_df.apply(lambda x: calculate_CI_score(x), axis=1)
new_df['later_CI_score'] = new_df.apply(
    lambda x: calculate_CI_score(x, periods='three_years_later', start_year=start_year), axis=1)
```
After the score columns are added, the underlying question columns are dropped. Methods that modify an object and return a new one usually take an optional inplace parameter; if set to True (the default is False), the original object is replaced directly, i.e. the data behind the original name changes in place. Using drop to delete rows or columns:
```python
print(frame.drop(['a']))             # drop row 'a' (axis=0 is the default)
print(frame.drop(['Ohio'], axis=1))  # drop column 'Ohio'
```
The paper predicts cognitive impairment, defined as an MMSE score below 18, so a function assigns the impairment label (0 = no CI, 1 = CI):
```python
def get_CI_label(x, periods):
    if periods == 'current':
        return 0 if x.current_CI_score >= 18 else 1
    if periods == 'three_years_later':
        return 0 if x.later_CI_score >= 18 else 1
```
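A hedged usage sketch (the exact call site is in the notebook; the apply pattern mirrors the CI-score computation above):

```python
new_df['current_CI'] = new_df.apply(lambda x: get_CI_label(x, 'current'), axis=1)
new_df['later_CI'] = new_df.apply(lambda x: get_CI_label(x, 'three_years_later'), axis=1)
```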
In the code below rows are cleaned out; the key line for the sample count is new_df.dropna(subset=['depression_score', 'depression_score_later'], axis=0, inplace=True):
```python
print('Before drop Depression missing : ', len(new_df))
new_df.drop(questions, axis=1, inplace=True)
new_df.dropna(subset=['depression_score', 'depression_score_later'], axis=0, inplace=True)
print('After drop Depression missing : ', len(new_df))
```
The depression score is computed only for samples that answered every relevant question; for the rest, depression_score is filled with NaN, and the dropna call above then removes those rows. The variable-recoding statement:
```python
# Recode gender: a1 == 2 -> 0 (female), otherwise 1 (male).
df['a1'] = df['a1'].apply(lambda x: 0 if x == 2 else 1)
```
Based on the cell value x of a1, this recodes a1 to 0 = female, 1 = male (probably the author's convention); def process_personal_variant_variable(df, start_year) continues recoding in the same spirit. Overall, the questionnaire variable codes are renamed to something easier to understand:
```python
def get_training_df(df1, df2, df3, df4):
    drop_list = ['d72', 'd75', 'd82', 'd86', 'd92']
    rename_dict_base = {'a1': 'gender', 'a43': 'residence', 'a51': 'co-habitation',
                        'd71': 'smoke', 'd81': 'alcohol', 'd91': 'exercise',
                        'd31': 'eat_fruits', 'd32': 'eat_vegs',
                        'd11b': 'outdoor_activity', 'd11c': 'plant_flower_bird',
                        'd11d': 'read_news_books', 'd11e': 'raise_domestic_animals',
                        'd11f': 'majong_or_cards', 'd11g': 'tv_or_radio',
                        'd11h': 'social_activity', 'd12': 'tour_times',
                        'g15a1': 'hypertension', 'g15b1': 'diabetes', 'g15c1': 'heart_disease',
                        'g15d1': 'stoke_CVD', 'g15e1': 'trachea_or_lung', 'g15f1': 'tuberculosis',
                        'g15g1': 'cataracts', 'g15h1': 'glaucoma', 'g15i1': 'cancer',
                        'g15j1': 'prostate', 'g15k1': 'stomach_ulcer', 'g15l1': 'Parkinson',
                        'g15m1': 'bedsore', 'g15n1': 'arthritis', 'g15o1': 'dementia'}
    later_year = [2005, 2008, 2011, 2014]
    for df, year in zip([df1, df2, df3, df4], later_year):
        year_index = '_' + str(year)[-1] if year < 2011 else '_' + str(year)[-2:]
        drop_list_later = list(map(lambda x: x + year_index, drop_list))
        rename_dict = rename_dict_base.copy()
        for key in rename_dict_base.keys():
            later_key = key + year_index
            rename_dict[later_key] = rename_dict[key] + '_3years_later'
        df.drop(drop_list + drop_list_later, axis=1, inplace=True)
        df.rename(columns=rename_dict, inplace=True)
        # Normalize the remaining year-suffixed columns to a common
        # '_3years_later' suffix so the four waves can be stacked.
        ind = -2 if year < 2011 else -3
        for col in df.columns:
            if col[ind:] == year_index:
                df.rename(columns={col: col[:ind] + '_3years_later'}, inplace=True)
    return pd.concat([df1, df2, df3, df4], axis=0)
```
Construct the CI-score change; numerical variables are standardized first. The handling of sample imbalance in the feature engineering is a key point of this code; a sketch of one common approach follows the function below.
```python
def get_total_df(wave1, wave2, wave3, wave4):
    df1 = wave1.copy()
    df2 = wave2.copy()
    df3 = wave3.copy()
    df4 = wave4.copy()
    total_df = get_training_df(df1, df2, df3, df4)
    # Exclude anyone with dementia at baseline or follow-up.
    total_df = total_df[total_df.dementia == 0]
    total_df = total_df[total_df.dementia_3years_later == 0]
    del total_df['dementia']
    del total_df['dementia_3years_later']
    # Standardize the numerical variables.
    scaler1, scaler2, scaler3, scaler4 = StandardScaler(), StandardScaler(), StandardScaler(), StandardScaler()
    total_df['depression_score'] = scaler1.fit_transform(np.array(total_df['depression_score']).reshape(-1, 1))
    total_df['depression_score_later'] = scaler2.fit_transform(np.array(total_df['depression_score_later']).reshape(-1, 1))
    total_df['tour_times'] = scaler3.fit_transform(np.array(total_df['tour_times']).reshape(-1, 1))
    total_df['tour_times_3years_later'] = scaler4.fit_transform(np.array(total_df['tour_times_3years_later']).reshape(-1, 1))
    total_df['CI_diff'] = total_df['later_CI_score'] - total_df['current_CI_score']
    return total_df
```
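The imbalance handling itself is not visible in this excerpt; as a hedged sketch of one common approach, random oversampling of the minority class with plain pandas (my assumption, not the author's code):

```python
import pandas as pd

def oversample_minority(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    counts = df[label].value_counts()
    minority = counts.idxmin()
    extra = df[df[label] == minority].sample(
        n=counts.max() - counts.min(), replace=True, random_state=seed)
    return pd.concat([df, extra], axis=0).reset_index(drop=True)

# e.g. balanced = oversample_minority(total_df, 'later_CI')
```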
The code below counts, within an attribute such as gender, how many people do and do not have cognitive impairment, and their shares (groupby usage, see reference; data.iloc[:, 1] reads the second column). It tabulates the overall CI split for a given feature var:
```python
def get_categorical_describe(total_df, var):
    # count() yields the group sizes in every column; iloc[:, 1] picks one.
    grouped = total_df.groupby(['later_CI', var]).count().iloc[:, 1]
    not_CI = grouped[list(map(lambda x: True if x[0] == 0 else False, list(grouped.index)))]
    CI = grouped[list(map(lambda x: True if x[0] == 1 else False, list(grouped.index)))]
    res_notCI = [0] * len(total_df[var].unique())
    res_CI = [0] * len(total_df[var].unique())
    for i in range(len(not_CI)):
        res_notCI[i] = str(not_CI.iloc[i]) + '(' + str('%.2f' % (not_CI.iloc[i] / (not_CI.iloc[i] + CI.iloc[i]) * 100)) + ')'
        res_CI[i] = str(CI.iloc[i]) + '(' + str('%.2f' % (CI.iloc[i] / (not_CI.iloc[i] + CI.iloc[i]) * 100)) + ')'
    df = pd.concat([pd.DataFrame(res_notCI), pd.DataFrame(res_CI)], axis=1)
    df.columns = ['not_CI', 'CI']
    return df
```
Compute the Mean(SD) of a variable var by later_CI group; the key line is mean_ = total_df.groupby('later_CI').mean()[var]. Journal papers report standard deviations in the form "mean (SD)", so the first number output is the mean and the parenthesized one the SD:
```python
def get_numerical_describe(total_df, var):
    mean_ = total_df.groupby('later_CI').mean()[var]
    std_ = total_df.groupby('later_CI').std()[var]
    res = [0] * len(mean_)
    for i in range(len(mean_)):
        res[i] = str('%.4f' % mean_[i]) + '(' + str('%.4f' % std_[i]) + ')'
    return pd.DataFrame({'not_CI': res[0], 'CI': res[1]}, index=['Mean(SD)'])
```
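A hedged usage sketch of the two describe helpers, assuming the merged total_df from above:

```python
print(get_categorical_describe(total_df, 'gender'))          # counts (%) by CI status
print(get_numerical_describe(total_df, 'depression_score'))  # Mean(SD) by CI status
```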
By default pandas keeps output within the screen: with many rows it truncates the middle and shows only part of the table. display.max_rows controls the maximum number of rows shown, e.g. pd.set_option('display.max_rows', 200). Next, the function that tabulates each variable:
```python
def get_var_describe_test(df, var):
    population = list(df.groupby(var).count().iloc[:, 1].sort_index())
    population_rate = list(map(lambda x: str(x) + '(' + str('{:.2f}'.format(x / sum(population) * 100)) + '%' + ')', population))
    value_index = list(df.groupby(var).count().iloc[:, 1].sort_index().index)

    current_CI_count = df.groupby([var, 'current_CI']).count().iloc[:, 1]
    CI_now_index = list(map(lambda x: True if x[1] == 1 else False, current_CI_count.index))
    not_CI_now_index = list(map(lambda x: True if x[1] == 0 else False, current_CI_count.index))
    CI_now_population = list(current_CI_count[CI_now_index])
    not_CI_now_population = list(current_CI_count[not_CI_now_index])
    for i, value in enumerate(not_CI_now_population):
        # If everyone with this value is CI-free, the CI group has no row; insert a zero.
        if value / population[i] == 1:
            CI_now_population.insert(i, 0)
        not_CI_now_population[i] = str(not_CI_now_population[i]) + '(' + str('{:.2f}'.format(not_CI_now_population[i] / population[i] * 100)) + '%' + ')'
    for i, value in enumerate(CI_now_population):
        CI_now_population[i] = str(CI_now_population[i]) + '(' + str('{:.2f}'.format(CI_now_population[i] / population[i] * 100)) + '%' + ')'

    later_CI_count = df.groupby([var, 'later_CI']).count().iloc[:, 1]
    CI_later_index = list(map(lambda x: True if x[1] == 1 else False, later_CI_count.index))
    not_CI_later_index = list(map(lambda x: True if x[1] == 0 else False, later_CI_count.index))
    CI_later_population = list(later_CI_count[CI_later_index])
    not_CI_later_population = list(later_CI_count[not_CI_later_index])
    for i, value in enumerate(not_CI_later_population):
        if value / population[i] == 1:
            CI_later_population.insert(i, 0)
        not_CI_later_population[i] = str(not_CI_later_population[i]) + '(' + str('{:.2f}'.format(not_CI_later_population[i] / population[i] * 100)) + '%' + ')'
    for i, value in enumerate(CI_later_population):
        CI_later_population[i] = str(CI_later_population[i]) + '(' + str('{:.2f}'.format(CI_later_population[i] / population[i] * 100)) + '%' + ')'

    return pd.DataFrame({'value': value_index, 'population': population_rate,
                         'CI now': CI_now_population, 'not CI now': not_CI_now_population,
                         'CI 3 years later': CI_later_population, 'not CI 3 years later': not_CI_later_population})
```
At the end of part one, hypothesis tests are also run for each variable; the built-in dict.values() returns all values of a dictionary.
```python
def get_pvalues(total_df):
    var_list = ['agegroup', 'nation_group', 'education', 'martial_status', 'economy_status', 'ADL']
    rename_dict_base = {'a1': 'gender', 'a43': 'residence', 'a51': 'co-habitation',
                        'd71': 'smoke', 'd81': 'alcohol', 'd91': 'exercise',
                        'd31': 'eat_fruits', 'd32': 'eat_vegs',
                        'd11b': 'outdoor_activity', 'd11c': 'plant_flower_bird',
                        'd11d': 'read_news_books', 'd11e': 'raise_domestic_animals',
                        'd11f': 'majong_or_cards', 'd11g': 'tv_or_radio', 'd11h': 'social_activity',
                        'g15a1': 'hypertension', 'g15b1': 'diabetes', 'g15c1': 'heart_disease',
                        'g15d1': 'stoke_CVD', 'g15e1': 'trachea_or_lung', 'g15f1': 'tuberculosis',
                        'g15g1': 'cataracts', 'g15h1': 'glaucoma', 'g15i1': 'cancer',
                        'g15j1': 'prostate', 'g15k1': 'stomach_ulcer', 'g15l1': 'Parkinson',
                        'g15m1': 'bedsore', 'g15n1': 'arthritis'}
    descrptive_vars = var_list + list(rename_dict_base.values())
    numerical_vars = ['depression_score', 'tour_times']
    var_name = []
    pvalues = []
    for var in descrptive_vars:
        # Chi-square test of independence on the later_CI x var contingency table.
        kf_data = np.array(total_df.groupby(['later_CI', var]).count().iloc[:, 0]).reshape(2, len(total_df[var].unique()))
        kf = chi2_contingency(kf_data)
        var_name.append(var)
        pvalues.append(kf[1])
    for var in numerical_vars:
        # Independent-samples t-test between the two CI groups.
        a = total_df[total_df.later_CI == 0][var]
        b = total_df[total_df.later_CI == 1][var]
        t_res = ttest_ind(a, b)
        var_name.append(var)
        pvalues.append(t_res.pvalue)
    df_out = pd.DataFrame({'Variable': var_name, 'p-values': pvalues})
    return df_out
```
For the p-value tests, descrptive_vars and numerical_vars use different methods: the former the chi-square test via chi2_contingency, the latter the independent-samples t-test via ttest_ind.
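A minimal standalone ttest_ind sketch on synthetic data (illustrative only):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=100)  # e.g. depression_score of the not-CI group
b = rng.normal(0.3, 1.0, size=100)  # e.g. depression_score of the CI group
print(ttest_ind(a, b).pvalue)
```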
End of part one. Part two: first reuse part one to merge the data into a single total df, then compute the diff between the three-years-later variables and the current variables, as the following function shows:
```python
def get_current_diff_variable_list(total_df):
    current_variables = []
    later_variables = []
    for col in total_df.columns:
        if col[-12:] == '3years_later':
            later_variables.append(col)
        if col[-12:] != '3years_later':
            current_variables.append(col)
    for var in ['current_CI_score', 'later_CI_score', 'current_CI', 'later_CI',
                'depression_score_later', 'ADL_later', 'CI_diff']:
        current_variables.remove(var)
    later_variables.extend(['ADL_later', 'depression_score_later'])
    person_basic_variables = ['gender', 'residence', 'education', 'borned_prov', 'nation_group']
    current_variables_without_personal = current_variables.copy()
    for var in person_basic_variables:
        current_variables_without_personal.remove(var)
    current_variables_without_personal.sort()
    later_variables.sort()
    print('Creating diff variables based on current variables and 3 years later variables ...')
    # After sorting, the two lists line up pairwise (x, x_3years_later).
    for x, y in zip(current_variables_without_personal, later_variables):
        total_df[x + '_diff'] = total_df[y] - total_df[x]
    print('Done!')
    diff_var = list(map(lambda x: x + '_diff', current_variables_without_personal))
    current_variables.remove('dementia')
    diff_var.remove('dementia_diff')
    return current_variables, diff_var
```
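A hedged usage sketch (note that total_df is mutated in place to gain the *_diff columns):

```python
current_variables, diff_variables = get_current_diff_variable_list(total_df)
```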
Logistic regression is then invoked: model = sm.Logit(y, x.astype(float)).fit() comes from the statsmodels library, which differs from the sklearn used in earlier tutorials and is somewhat more involved; worth a closer look for future work in this area. What is fitted here is in fact a multivariable logistic regression.
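A minimal standalone sm.Logit example on synthetic data, just to show the statsmodels API used here (illustrative, not the paper's data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(float)  # binary outcome
X = sm.add_constant(X)                                   # add an intercept column
model = sm.Logit(y, X).fit(disp=0)                       # maximum-likelihood fit
print(model.params)                                      # coefficients
print(np.exp(model.conf_int(0.05)))                      # 95% CI on the odds-ratio scale
```

Back in the notebook, the feature columns are first coerced to numeric: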
```python
total_df_for_logit = total_df.copy()
for col in total_df_for_logit.columns:
    total_df_for_logit[col] = pd.to_numeric(total_df_for_logit[col], downcast='float')
```
The full logistic-regression function:
```python
def logistic_regression(input_df, label_column, categorical_variable_list=[], CI_alpha=0.05, standardized_diff=False):
    df = input_df.copy()
    if standardized_diff:
        diff_var = list(filter(lambda x: True if x[-4:] == 'diff' else False, df.columns))
        if diff_var:
            print('Standardizing diff variables ...')
            for var in diff_var:
                # Collapse each diff to its sign: -1 decreased, 0 unchanged, 1 increased.
                df[var] = df[var].apply(lambda x: x if x <= 0 else 1)
                df[var] = df[var].apply(lambda x: x if x >= 0 else -1)
            print('Standardize diff variables done!')
    x_col = [column for column in df.columns if label_column not in column and column not in categorical_variable_list]
    y_col = [column for column in df.columns if column == label_column]
    if 'borned_prov' in x_col: x_col.remove('borned_prov')
    if 'agegroup_3years_later' in x_col: x_col.remove('agegroup_3years_later')
    if 'agegroup_diff' in x_col: x_col.remove('agegroup_diff')
    x = np.array(df[x_col])
    if categorical_variable_list:
        # One-hot encode the categorical variables, dropping the *_0 reference level.
        onehot_cat_var = pd.get_dummies(df[categorical_variable_list[0]], prefix=categorical_variable_list[0])
        drop_ref_col = list(filter(lambda x: True if x[-2:] == '_0' or x[-4:] == '_0.0' else False, onehot_cat_var.columns))
        onehot_cat_var.drop(drop_ref_col, axis=1, inplace=True)
        for variable in categorical_variable_list[1:]:
            tmp = pd.get_dummies(df[variable], prefix=variable)
            drop_ref_col = list(filter(lambda x: True if x[-2:] == '_0' or x[-4:] == '_0.0' else False, tmp.columns))
            tmp.drop(drop_ref_col, axis=1, inplace=True)
            onehot_cat_var = pd.concat([onehot_cat_var, tmp], axis=1)
        onehot_cat_var_values = np.array(onehot_cat_var)
        x = np.hstack([x, onehot_cat_var_values])
    y = np.array(df[label_column].astype(float)).reshape(-1)
    model = sm.Logit(y, x.astype(float)).fit()
    # Exponentiate coefficients and confidence bounds to get odds ratios.
    para = pd.DataFrame(np.exp(model.conf_int(CI_alpha)), columns=['95%lower', '95%upper'])
    para['coef'] = model.params
    para['OddRatio'] = np.exp(para.coef)
    x_name = x_col.copy()
    if categorical_variable_list:
        x_name.extend(onehot_cat_var.columns)
    para['Variable'] = x_name
    para['pvalues'] = model.pvalues
    # Count the samples behind each variable / dummy level.
    var_value_count = []
    for vari in x_name:
        if vari[-1] == '0':
            if vari[-4] == '-':
                var_value = float(vari[-4:])
                var_name = vari[:-5]
                print(vari)
                var_value_count.append(len(df[df[var_name] == var_value]))
            else:
                var_value = float(vari[-3:])
                var_name = vari[:-4]
                var_value_count.append(len(df[df[var_name] == var_value]))
        else:
            var_value_count.append(len(df[df[vari] == 1]))
    para['Num of samples'] = var_value_count
    para = para[['Variable', 'Num of samples', 'coef', 'OddRatio', '95%lower', '95%upper', 'pvalues']]
    # Refit on standardized features to rank variable importance by |coef|.
    std_x = np.array((df[x_col] - np.mean(df[x_col])) / np.std(df[x_col]))
    if categorical_variable_list:
        std_x = np.hstack([std_x, onehot_cat_var_values])
    model2 = sm.Logit(y, std_x.astype(float)).fit()
    importance_df = pd.DataFrame(para['Variable'])
    importance_df['coef'] = model2.params
    importance_df.sort_values(by='coef', ascending=False, key=np.abs, inplace=True)
    importance_df['importance'] = range(1, len(importance_df) + 1)
    importance_df.drop('coef', axis=1, inplace=True)
    importance_df.set_index('importance', inplace=True)
    res = {'summary': model.summary(),
           'params': para,
           'feature_importance': importance_df}
    return res
```
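The function returns a dict with the fitted summary, the OR table, and a standardized-coefficient importance ranking. A hypothetical call (the column lists and categorical variables here are assumptions, not copied from the notebook):

```python
# Hypothetical invocation; the exact feature/label selections are assumptions.
res = logistic_regression(total_df_for_logit[current_variables + ['later_CI']],
                          label_column='later_CI',
                          categorical_variable_list=['education', 'agegroup'])
print(res['params'])
print(res['feature_importance'])
```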
The 95% confidence interval computed in the logistic regression above is then used in get_OR(). The OR (odds ratio) is an important concept in epidemiology: it measures the strength of the association between an exposure and an outcome (see the reference on logistic regression and OR values). In studying "the effect of an exposure on an outcome", the "outcome" in the textbook example is "having diabetes", which plays the role of the dependent variable Y. The "odds" can be read as an exposure ratio: among subjects with diabetes, the exposure odds are the proportion of smokers divided by the proportion of non-smokers, 24/16 = 1.50; among subjects without diabetes, the same ratio is 18/22 ≈ 0.82. Dividing the two gives the OR for the smoking-diabetes association: OR = 1.50/0.82 ≈ 1.83 > 1, indicating a positive association. The OR is thus a measure of association, whereas the p-value speaks more to whether a difference exists. [Important: the OR values in this code must be read together with the logistic-regression code; the paper estimates ORs and 95% confidence intervals (CI) from the logistic model, and an OR below 1 means the factor lowers the risk of cognitive impairment.]
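To make the arithmetic concrete, a minimal sketch using the example's own numbers (illustrative only):

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 table from the example: rows = diabetes yes/no, cols = smoker yes/no
table = np.array([[24, 16],
                  [18, 22]])
odds_cases = 24 / 16       # exposure odds among diabetics: 1.50
odds_controls = 18 / 22    # exposure odds among non-diabetics: ~0.82
OR = odds_cases / odds_controls            # ~1.83, i.e. (24*22)/(16*18)
chi2, p, dof, expected = chi2_contingency(table)
print('OR = %.2f, chi-square p = %.3f' % (OR, p))
```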
```python
def get_OR(x):
    return str('%.2f' % x['OddRatio']) + '(' + str('%.2f' % x['95%lower']) + ',' + str('%.2f' % x['95%upper']) + ')'
```
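Continuing the hypothetical call above, get_OR is applied row-wise to the params table to produce the usual "OR (95% CI)" column:

```python
para = res['params']
para['OR(95%CI)'] = para.apply(get_OR, axis=1)
```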
The code also examines the associations of the diff variables:
```python
output2.loc[list(set(output2.index) - set(output1.Variable)), :].sort_values(by='Variable')
```
Select the diff variables with p-values below 0.05:
```python
significant_pos_diff = output2.loc[diff_variables_pos, :].query('pvalues<0.05')
significant_pos_diff
```
The remaining steps need further study later:
```python
need_divide_diff = list(map(lambda x: x[:-4], list(significant_pos_diff.index)))
if 'exercise_diff' in need_divide_diff:
    need_divide_diff.remove('exercise_diff')
```
Finally, the paper compares the results of several algorithms, including SVM, on accuracy and other metrics, and also analyzes the association between longitudinal behavioral change and cognitive impairment.