Links: original article · original paper · partial open-source code · dataset introduction
Reproduction notes. In `from CLHLS.get_3year_data import get_period_data`, `CLHLS` is the folder name and the part after the dot is the file inside it (see reference). To process the data in Python, the .sav files must first be converted to .csv. Method 1: export from SPSS with UTF-8 encoding; the exported file does not look entirely standard, but the output matches the author's, so it should be fine. Method 2: use R's foreign package:
```r
# Convert .sav to .csv with the foreign package.
# to.data.frame = TRUE makes read.spss return a data frame instead of a list.
library(foreign)
write.table(read.spss("inFile.sav", to.data.frame = TRUE),
            file = "outFile.csv", quote = TRUE, sep = ",")
```
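Alternatively, the conversion can be done entirely in Python; a minimal sketch assuming the pyreadstat package (my suggestion, not part of the author's pipeline):

```python
# Minimal .sav -> .csv conversion; pyreadstat is an assumption,
# not something the author's code uses.
import pyreadstat

df, meta = pyreadstat.read_sav('inFile.sav')   # meta holds variable labels
df.to_csv('outFile.csv', index=False, encoding='utf-8')
```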
The file Descriptive Data.ipynb handles data preprocessing. Some Python/pandas notes used throughout: `enumerate()` yields index-value pairs; the `.columns` attribute returns a DataFrame's column labels. The csv files have the same layout as the sav files, so the sav is more intuitive to inspect (csv is just easier to process programmatically); SPSS's variable view shows each question asked and its answer codes, and variables with the suffix `_5` come from the 2005 follow-up. Although some files span 2002-2018 and others 2005-2018, they presumably cover different survey cohorts. `dict.keys()` returns a dictionary's keys, not its values, e.g. `list({'Chinasoft':'China', 'Microsoft':'USA'}.keys())` gives `['Chinasoft', 'Microsoft']`. In pandas, `value_counts` counts how many distinct values a column contains and how often each occurs, optionally sorted (see reference). `map(function, iterable)` applies the function to every element of the sequence and returns the results. An explanation of the following snippet:
```python
if start_year != 2011:
    three_years_later_columns = list(
        map(lambda x: x + year_index if x != 'trueage' else 'vage' + year_index,
            basic_columns))
else:
    three_years_later_columns = list(map(lambda x: x + year_index, basic_columns))
```
This derives the variable names used at the three-year follow-up: for waves starting before 2011 the follow-up age field is named vage_&lt;year&gt;, while for the 2011 wave it is trueage_&lt;year&gt;; the value of x here ties back to the `if start_year != 2011:` branch. Because personal-information questions are recorded only at baseline and not at follow-up, the follow-up copies of those questions are removed, keeping the follow-up outcome questions plus the baseline information, as below:
```python
for personal_stable_variable in list(map(lambda x: x + year_index,
                                         ['a2', 'f1', 'a1', 'a41', 'a43'])):
    three_years_later_columns.remove(personal_stable_variable)
total_columns = basic_columns + three_years_later_columns
```
Next, keep only elders who were still alive and not lost to follow-up three years later; for example dth11_14 covers the 2011 baseline through 2014, where 0 means still alive in 2014. The code below handles questionnaire variables with missing values: the questionnaire uses 8 as a missing code, but some variables can legitimately take the value 8, so these are separated out first:
```python
questions_may_contain8 = ['vage', 'trueage', 'a2', 'a41', 'f1',
                          'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                          'd75', 'd86', 'd12',
                          'c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                          'c21a', 'c21b', 'c21c',
                          'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                          'c41a', 'c41b', 'c41c',
                          'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
later_questions_may_contain8 = list(map(lambda x: x + year_index, questions_may_contain8))
total_questions_may_contain8 = questions_may_contain8 + later_questions_may_contain8
```
For the remaining variables, an 8 (or 88, 888, ...) may well be a missing code; the replace() calls below rewrite these numeric missing codes as NaN:
```python
variable_needed_processed = list(set(df.columns) - set(total_questions_may_contain8))
new_df = df.copy()
for var in variable_needed_processed:
    # numeric codes that mark a missing answer in the questionnaire
    new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)
```
The code below computes each variable's missing rate: df.isnull() flags missing cells, df.isnull().any() flags which columns contain any missing value, and new_df.loc[:, new_df.isnull().any()].isnull().sum() sums the missing entries in each such column; dividing by the total number of rows gives the missing rate.
```python
print('Missing values situation')
print((new_df.loc[:, new_df.isnull().any()].isnull().sum() / len(df))
      .apply(lambda x: str('%.2f' % (x * 100)) + '%'))

# Fill each incomplete column with its mode (most frequent value).
for col in new_df.columns[new_df.isnull().any()]:
    mode = new_df[col].value_counts().sort_values(ascending=False).index[0]
    new_df[col].fillna(mode, inplace=True)
```
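As an aside, `value_counts(...).index[0]` is simply the column's most frequent value; a small equivalence check (illustrative, not from the repository):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 2, None])
# the hand-rolled mode and Series.mode() agree (up to tie-breaking)
assert s.value_counts().sort_values(ascending=False).index[0] == s.mode()[0] == 2
```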
Remove respondents who could answer none of the MMSE questions (23 items in total), i.e. drop those rows; `new_df = new_df[new_df.isnull().sum(axis=1) < 23]` keeps anyone who could answer at least one item. The block finally prints the sample sizes before and after dropping the MMSE-missing respondents:
```python
MMSE_questions = ['c11', 'c12', 'c13', 'c14', 'c15',
                  'c21a', 'c21b', 'c21c',
                  'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                  'c41a', 'c41b', 'c41c',
                  'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
MMSE_questions_later = list(map(lambda x: x + year_index, MMSE_questions))

for var in MMSE_questions:
    new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)

print('Samples before drop current MMSE missing : ', len(new_df))
# keep rows that answered at least one of the 23 MMSE items
new_df = new_df[new_df.isnull().sum(axis=1) < 23]
print('Samples after drop current MMSE missing : ', len(new_df))

for var in MMSE_questions_later:
    new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)

print('Samples before drop later MMSE missing : ', len(new_df))
new_df = new_df[new_df.isnull().sum(axis=1) < 23]
print('Samples after drop later MMSE missing : ', len(new_df))
```
Count those aged 60 and over and compare with the counts before filtering:
```python
print('Before drop age<60 : ')
print('Wave 1: ', len(first_wave_new))
print('Wave 2: ', len(second_wave_new))
print('Wave 3: ', len(third_wave_new))
print('Wave 4: ', len(fourth_wave_new))

first_wave_new = first_wave_new[first_wave_new.trueage >= 60]
second_wave_new = second_wave_new[second_wave_new.trueage >= 60]
third_wave_new = third_wave_new[third_wave_new.trueage >= 60]
fourth_wave_new = fourth_wave_new[fourth_wave_new.trueage >= 60]

print('After drop age<60 : ')
print('Wave 1: ', len(first_wave_new))
print('Wave 2: ', len(second_wave_new))
print('Wave 3: ', len(third_wave_new))
print('Wave 4: ', len(fourth_wave_new))
```
Overview of data preprocessing. Stage one includes four non-overlapping three-year waves of survey data between 2002 and 2014, together with the required column variables. A detailed participant-selection flowchart is given in Figure 1 of the paper. All participants were followed up every three years (one wave).
```python
# Excerpts from the repository; they assume the usual imports:
# import numpy as np; import pandas as pd; import statsmodels.api as sm
# from scipy.stats import chi2_contingency, ttest_ind
# from sklearn.preprocessing import StandardScaler

def get_data_between_start_to_end(df, start_year):
    # Suffix of the follow-up year, e.g. '_5' for 2005 or '_11' for 2011.
    short_year = str(start_year + 3)[-1] if start_year + 3 < 2010 else str(start_year + 3)[-2:]
    variable_list = []
    index = 0
    minux_index = len(short_year) + 1
    # Keep all baseline columns up to the first follow-up column.
    for i, variable in enumerate(df.columns):
        if variable[-minux_index:] == '_' + short_year:
            index = i
            break
        variable_list.append(variable)
    print(index)
    # Keep follow-up columns until the next period's death indicator
    # or a later wave's suffix appears.
    for j in range(index, len(df.columns)):
        if df.columns[j] == 'dth' + str(start_year + 3)[-2:] + '_' + str(start_year + 6)[-2:] \
                or df.columns[j][-3:] == '_' + str(start_year + 7)[-2:]:
            break
        variable_list.append(df.columns[j])
    print(variable_list)
    return df.loc[:, variable_list]
```
Data preprocessing and statistics: from variables such as dth02_05, count how many survived, died, or were lost to follow-up during each three-year period:
```python
"""Data preprocessing and statistics."""
def get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14):
    data11to14.dth11_14 = pd.to_numeric(data11to14.dth11_14, errors='coerce')
    # dth08_11 == 2011 apparently marks survivors in that file; recode to 0.
    data08to11.loc[data08to11[data08to11.dth08_11 == 2011].index, 'dth08_11'] = 0
    observations = []
    survivor = []
    death = []
    missing = []
    df_dict = {'df1': data02to05, 'df2': data05to08, 'df3': data08to11, 'df4': data11to14}
    df_var_dict = {'df1': 'dth02_05', 'df2': 'dth05_08', 'df3': 'dth08_11', 'df4': 'dth11_14'}
    for df in df_dict.keys():
        print(df)
        print(df_dict[df])
        print(len(df_dict[df]))
        counter = df_dict[df][df_var_dict[df]].value_counts()
        observations.append(len(df_dict[df]))
        survivor.append(counter[0])    # 0  = alive at follow-up
        death.append(counter[1])       # 1  = died
        missing.append(counter[-9])    # -9 = lost to follow-up
    return pd.DataFrame({'start year observations': observations,
                         'survivor': survivor,
                         'death': death,
                         'missing': missing},
                        index=['2002 to 2005', '2005 to 2008', '2008 to 2011', '2011 to 2014'])

situation = get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14)
print('Number of investigated elders situations : ', '\n', situation)
```
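As a small illustration (not from the repository) of the value_counts indexing above, using the survey's coding 0 = alive, 1 = died, -9 = lost to follow-up:

```python
import pandas as pd

# Toy status column with the same coding as dth02_05 etc.
s = pd.Series([0, 0, 1, -9, 0, 1])
counter = s.value_counts()
print(counter[0], counter[1], counter[-9])  # 3 2 1
```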
Filter the respondents who are still alive, then take the needed columns, removing the baseline-only personal-information questions from the three-years-later set:
```python
def get_wave_data(df, start_year):
    basic_columns = ['a1', 'trueage', 'a2', 'f1', 'f41', 'f34', 'a41', 'a43', 'a51',
                     'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                     'd71', 'd72', 'd75', 'd81', 'd82', 'd86', 'd91', 'd92',
                     'd31', 'd32',
                     'd11b', 'd11c', 'd11d', 'd11e', 'd11f', 'd11g', 'd11h', 'd12',
                     'e1', 'e2', 'e3', 'e4', 'e5', 'e6',
                     'g15a1', 'g15b1', 'g15c1', 'g15d1', 'g15e1', 'g15f1', 'g15g1',
                     'g15h1', 'g15i1', 'g15j1', 'g15k1', 'g15l1', 'g15m1', 'g15n1', 'g15o1',
                     'c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                     'c21a', 'c21b', 'c21c',
                     'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                     'c41a', 'c41b', 'c41c',
                     'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
    # Before the 2011 wave the follow-up age variable is vage_<yy>; in 2011 it is trueage_<yy>.
    if start_year != 2011:
        three_years_later_columns = list(
            map(lambda x: x + year_index if x != 'trueage' else 'vage' + year_index, basic_columns))
        print(three_years_later_columns)
    else:
        three_years_later_columns = list(map(lambda x: x + year_index, basic_columns))
        print(three_years_later_columns)
    # Personal information is recorded only at baseline; drop its follow-up duplicates.
    for personal_stable_variable in list(map(lambda x: x + year_index, ['a2', 'f1', 'a1', 'a41', 'a43'])):
        three_years_later_columns.remove(personal_stable_variable)
    total_columns = basic_columns + three_years_later_columns
    # Keep only respondents who were still alive at the 3-year follow-up.
    death_index_var = 'dth' + str(start_year)[-2:] + '_' + str(start_year + 3)[-2:]
    live_df = df[df[death_index_var] == 0]
    print(live_df)
    wave_data = live_df[total_columns]
    print(wave_data)
    for column in wave_data.columns:
        wave_data[column] = pd.to_numeric(wave_data[column], errors='coerce')
    return wave_data
```
Compute each variable's missing rate, fill missing values with the mode, and drop respondents who could answer none of the MMSE questions:
```python
def replace_abnormal_and_fill_nan(df, start_year):
    # Variables that can legitimately take the value 8 and must not be recoded.
    questions_may_contain8 = ['vage', 'trueage', 'a2', 'a41', 'f1',
                              'b21', 'b22', 'b23', 'b24', 'b25', 'b26', 'b27',
                              'd75', 'd86', 'd12',
                              'c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                              'c21a', 'c21b', 'c21c',
                              'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                              'c41a', 'c41b', 'c41c',
                              'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
    later_questions_may_contain8 = list(map(lambda x: x + year_index, questions_may_contain8))
    total_questions_may_contain8 = questions_may_contain8 + later_questions_may_contain8

    variable_needed_processed = list(set(df.columns) - set(total_questions_may_contain8))
    new_df = df.copy()
    for var in variable_needed_processed:
        new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)

    print('Missing values situation')
    print((new_df.loc[:, new_df.isnull().any()].isnull().sum() / len(df))
          .apply(lambda x: str('%.2f' % (x * 100)) + '%'))
    # Fill each incomplete column with its mode.
    for col in new_df.columns[new_df.isnull().any()]:
        mode = new_df[col].value_counts().sort_values(ascending=False).index[0]
        new_df[col].fillna(mode, inplace=True)

    MMSE_questions = ['c11', 'c12', 'c13', 'c14', 'c15',
                      'c21a', 'c21b', 'c21c',
                      'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                      'c41a', 'c41b', 'c41c',
                      'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    MMSE_questions_later = list(map(lambda x: x + year_index, MMSE_questions))
    for var in MMSE_questions:
        new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)
    print('Samples before drop current MMSE missing : ', len(new_df))
    new_df = new_df[new_df.isnull().sum(axis=1) < 23]
    print('Samples after drop current MMSE missing : ', len(new_df))
    for var in MMSE_questions_later:
        new_df[var] = new_df[var].replace([8, 88, 888, 8888, 9, 99, 9999], np.nan)
    print('Samples before drop later MMSE missing : ', len(new_df))
    new_df = new_df[new_df.isnull().sum(axis=1) < 23]
    print('Samples after drop later MMSE missing : ', len(new_df))
    return new_df
```
Combine everything and keep respondents aged 60 and over:
```python
def get_period_data():
    print('Reading data ...')
    data_02to18 = pd.read_csv('./data/clhls_2002_2018_utf8.csv', low_memory=False)
    data_05to18 = pd.read_csv('./data/clhls_2005_2018_utf8.csv', low_memory=False)
    data_08to18 = pd.read_csv('./data/clhls_2008_2018_utf8.csv', low_memory=False)
    data_11to18 = pd.read_csv('./data/clhls_2011_2018_utf8.csv', low_memory=False)

    print('Getting 3 year periods data ...')
    data02to05 = get_data_between_start_to_end(data_02to18, 2002)
    data05to08 = get_data_between_start_to_end(data_05to18, 2005)
    data08to11 = get_data_between_start_to_end(data_08to18, 2008)
    data11to14 = get_data_between_start_to_end(data_11to18, 2011)
    situation = get_df_peo_3years_situation(data02to05, data05to08, data08to11, data11to14)
    print('Number of investigated elders situations : ', '\n', situation)

    print('Selecting needed variables ...')
    first_wave = get_wave_data(data02to05, 2002)
    second_wave = get_wave_data(data05to08, 2005)
    third_wave = get_wave_data(data08to11, 2008)
    fourth_wave = get_wave_data(data11to14, 2011)

    print('Filling missing values ...')
    first_wave_new = replace_abnormal_and_fill_nan(first_wave, 2002)
    second_wave_new = replace_abnormal_and_fill_nan(second_wave, 2005)
    third_wave_new = replace_abnormal_and_fill_nan(third_wave, 2008)
    fourth_wave_new = replace_abnormal_and_fill_nan(fourth_wave, 2011)

    print('Before drop age<60 : ')
    print('Wave 1: ', len(first_wave_new))
    print('Wave 2: ', len(second_wave_new))
    print('Wave 3: ', len(third_wave_new))
    print('Wave 4: ', len(fourth_wave_new))
    first_wave_new = first_wave_new[first_wave_new.trueage >= 60]
    second_wave_new = second_wave_new[second_wave_new.trueage >= 60]
    third_wave_new = third_wave_new[third_wave_new.trueage >= 60]
    fourth_wave_new = fourth_wave_new[fourth_wave_new.trueage >= 60]
    print('After drop age<60 : ')
    print('Wave 1: ', len(first_wave_new))
    print('Wave 2: ', len(second_wave_new))
    print('Wave 3: ', len(third_wave_new))
    print('Wave 4: ', len(fourth_wave_new))
    return first_wave_new, second_wave_new, third_wave_new, fourth_wave_new
```
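Putting part one together, a hedged usage sketch matching the import shown at the top of these notes:

```python
from CLHLS.get_3year_data import get_period_data

first_wave_new, second_wave_new, third_wave_new, fourth_wave_new = get_period_data()
```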
The second part derives three scores: the CI score, the depression score, and the ADL score. First locate the variables behind them:
```python
new_df = df.copy()
current_questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                     'c21a', 'c21b', 'c21c',
                     'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                     'c41a', 'c41b', 'c41c',
                     'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
three_year_questions = list(map(lambda x: x + year_index, current_questions))
questions = current_questions + three_year_questions
```
The apply() method is used with calculate_CI_score, which computes both the baseline-year CI score and the three-years-later CI score. The function:
```python
def calculate_CI_score(x, periods='current', start_year=None):
    if periods == 'current':
        questions = ['c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                     'c21a', 'c21b', 'c21c',
                     'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                     'c41a', 'c41b', 'c41c',
                     'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']
    elif periods == 'three_years_later':
        year_index = '_' + str(start_year + 3)[-1] if start_year + 3 < 2010 else '_' + str(start_year + 3)[-2:]
        questions = list(map(lambda q: q + year_index,
                             ['c11', 'c12', 'c13', 'c14', 'c15', 'c16',
                              'c21a', 'c21b', 'c21c',
                              'c31a', 'c31b', 'c31c', 'c31d', 'c31e', 'c32',
                              'c41a', 'c41b', 'c41c',
                              'c51a', 'c51b', 'c52', 'c53a', 'c53b', 'c53c']))
    CI_score = 0
    for question in questions:
        if question[0:3] != 'c16':
            # Ordinary MMSE items score 1 point for a correct answer (coded 1).
            if x[question] == 1:
                CI_score += 1
        else:
            # c16 (naming foods) scores the number named, capped at 7 points.
            food_score = x[question] if x[question] <= 7 else 7
            CI_score += food_score
    return CI_score
```
Compute the current CI score and the CI score three years later:
```python
new_df['current_CI_score'] = new_df.apply(lambda x: calculate_CI_score(x), axis=1)
new_df['later_CI_score'] = new_df.apply(
    lambda x: calculate_CI_score(x, periods='three_years_later', start_year=start_year), axis=1)
```
After the score columns are added, the underlying question columns are dropped. Methods that modify an object and return a new one usually take an optional inplace parameter; if set to True (the default is False), the original object is replaced directly, i.e. the data behind the original name changes in place. Using drop to delete rows or columns:
```python
print(frame.drop(['a']))             # drop row 'a' (axis=0 is the default)
print(frame.drop(['Ohio'], axis=1))  # drop column 'Ohio'
```
The paper predicts cognitive impairment, defined as an MMSE score below 18, so a function assigns the impairment label (0 = no CI, 1 = CI):
```python
def get_CI_label(x, periods):
    if periods == 'current':
        return 0 if x.current_CI_score >= 18 else 1
    if periods == 'three_years_later':
        return 0 if x.later_CI_score >= 18 else 1
```
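A hedged usage sketch (the exact call site is in the notebook; the apply pattern mirrors the CI-score computation above):

```python
new_df['current_CI'] = new_df.apply(lambda x: get_CI_label(x, 'current'), axis=1)
new_df['later_CI'] = new_df.apply(lambda x: get_CI_label(x, 'three_years_later'), axis=1)
```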
In the code below rows are cleaned out; the key line for the sample count is new_df.dropna(subset=['depression_score', 'depression_score_later'], axis=0, inplace=True):
```python
print('Before drop Depression missing : ', len(new_df))
new_df.drop(questions, axis=1, inplace=True)
new_df.dropna(subset=['depression_score', 'depression_score_later'], axis=0, inplace=True)
print('After drop Depression missing : ', len(new_df))
```
The depression score is computed only for samples that answered every relevant question; for the rest, depression_score is filled with NaN, and the dropna call above then removes those rows. The variable-recoding statement:
```python
# Recode gender: a1 == 2 -> 0 (female), otherwise 1 (male).
df['a1'] = df['a1'].apply(lambda x: 0 if x == 2 else 1)
```
Based on the cell value x of a1, this recodes a1 to 0 = female, 1 = male (probably the author's convention); def process_personal_variant_variable(df, start_year) continues recoding in the same spirit. Overall, the questionnaire variable codes are renamed to something easier to understand:
```python
def get_training_df(df1, df2, df3, df4):
    drop_list = ['d72', 'd75', 'd82', 'd86', 'd92']
    rename_dict_base = {'a1': 'gender', 'a43': 'residence', 'a51': 'co-habitation',
                        'd71': 'smoke', 'd81': 'alcohol', 'd91': 'exercise',
                        'd31': 'eat_fruits', 'd32': 'eat_vegs',
                        'd11b': 'outdoor_activity', 'd11c': 'plant_flower_bird',
                        'd11d': 'read_news_books', 'd11e': 'raise_domestic_animals',
                        'd11f': 'majong_or_cards', 'd11g': 'tv_or_radio',
                        'd11h': 'social_activity', 'd12': 'tour_times',
                        'g15a1': 'hypertension', 'g15b1': 'diabetes', 'g15c1': 'heart_disease',
                        'g15d1': 'stoke_CVD', 'g15e1': 'trachea_or_lung', 'g15f1': 'tuberculosis',
                        'g15g1': 'cataracts', 'g15h1': 'glaucoma', 'g15i1': 'cancer',
                        'g15j1': 'prostate', 'g15k1': 'stomach_ulcer', 'g15l1': 'Parkinson',
                        'g15m1': 'bedsore', 'g15n1': 'arthritis', 'g15o1': 'dementia'}
    later_year = [2005, 2008, 2011, 2014]
    for df, year in zip([df1, df2, df3, df4], later_year):
        year_index = '_' + str(year)[-1] if year < 2011 else '_' + str(year)[-2:]
        drop_list_later = list(map(lambda x: x + year_index, drop_list))
        rename_dict = rename_dict_base.copy()
        for key in rename_dict_base.keys():
            later_key = key + year_index
            rename_dict[later_key] = rename_dict[key] + '_3years_later'
        df.drop(drop_list + drop_list_later, axis=1, inplace=True)
        df.rename(columns=rename_dict, inplace=True)
        # Normalize the remaining year-suffixed columns to a common
        # '_3years_later' suffix so the four waves can be stacked.
        ind = -2 if year < 2011 else -3
        for col in df.columns:
            if col[ind:] == year_index:
                df.rename(columns={col: col[:ind] + '_3years_later'}, inplace=True)
    return pd.concat([df1, df2, df3, df4], axis=0)
```
Construct the CI-score change; numerical variables are standardized first. The handling of sample imbalance in the feature engineering is a key point of this code; a sketch of one common approach follows the function below.
```python
def get_total_df(wave1, wave2, wave3, wave4):
    df1 = wave1.copy()
    df2 = wave2.copy()
    df3 = wave3.copy()
    df4 = wave4.copy()
    total_df = get_training_df(df1, df2, df3, df4)
    # Exclude anyone with dementia at baseline or follow-up.
    total_df = total_df[total_df.dementia == 0]
    total_df = total_df[total_df.dementia_3years_later == 0]
    del total_df['dementia']
    del total_df['dementia_3years_later']
    # Standardize the numerical variables.
    scaler1, scaler2, scaler3, scaler4 = StandardScaler(), StandardScaler(), StandardScaler(), StandardScaler()
    total_df['depression_score'] = scaler1.fit_transform(np.array(total_df['depression_score']).reshape(-1, 1))
    total_df['depression_score_later'] = scaler2.fit_transform(np.array(total_df['depression_score_later']).reshape(-1, 1))
    total_df['tour_times'] = scaler3.fit_transform(np.array(total_df['tour_times']).reshape(-1, 1))
    total_df['tour_times_3years_later'] = scaler4.fit_transform(np.array(total_df['tour_times_3years_later']).reshape(-1, 1))
    total_df['CI_diff'] = total_df['later_CI_score'] - total_df['current_CI_score']
    return total_df
```
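The imbalance handling itself is not visible in this excerpt; as a hedged sketch of one common approach, random oversampling of the minority class with plain pandas (my assumption, not the author's code):

```python
import pandas as pd

def oversample_minority(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    counts = df[label].value_counts()
    minority = counts.idxmin()
    extra = df[df[label] == minority].sample(
        n=counts.max() - counts.min(), replace=True, random_state=seed)
    return pd.concat([df, extra], axis=0).reset_index(drop=True)

# e.g. balanced = oversample_minority(total_df, 'later_CI')
```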
The code below counts, within an attribute such as gender, how many people do and do not have cognitive impairment, and their shares (groupby usage, see reference; data.iloc[:, 1] reads the second column). It tabulates the overall CI split for a given feature var:
```python
def get_categorical_describe(total_df, var):
    # count() yields the group sizes in every column; iloc[:, 1] picks one.
    grouped = total_df.groupby(['later_CI', var]).count().iloc[:, 1]
    not_CI = grouped[list(map(lambda x: True if x[0] == 0 else False, list(grouped.index)))]
    CI = grouped[list(map(lambda x: True if x[0] == 1 else False, list(grouped.index)))]
    res_notCI = [0] * len(total_df[var].unique())
    res_CI = [0] * len(total_df[var].unique())
    for i in range(len(not_CI)):
        res_notCI[i] = str(not_CI.iloc[i]) + '(' + str('%.2f' % (not_CI.iloc[i] / (not_CI.iloc[i] + CI.iloc[i]) * 100)) + ')'
        res_CI[i] = str(CI.iloc[i]) + '(' + str('%.2f' % (CI.iloc[i] / (not_CI.iloc[i] + CI.iloc[i]) * 100)) + ')'
    df = pd.concat([pd.DataFrame(res_notCI), pd.DataFrame(res_CI)], axis=1)
    df.columns = ['not_CI', 'CI']
    return df
```
Compute the Mean(SD) of a variable var by later_CI group; the key line is mean_ = total_df.groupby('later_CI').mean()[var]. Journal papers report standard deviations in the form "mean (SD)", so the first number output is the mean and the parenthesized one the SD:
```python
def get_numerical_describe(total_df, var):
    mean_ = total_df.groupby('later_CI').mean()[var]
    std_ = total_df.groupby('later_CI').std()[var]
    res = [0] * len(mean_)
    for i in range(len(mean_)):
        res[i] = str('%.4f' % mean_[i]) + '(' + str('%.4f' % std_[i]) + ')'
    return pd.DataFrame({'not_CI': res[0], 'CI': res[1]}, index=['Mean(SD)'])
```
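A hedged usage sketch of the two describe helpers, assuming the merged total_df from above:

```python
print(get_categorical_describe(total_df, 'gender'))          # counts (%) by CI status
print(get_numerical_describe(total_df, 'depression_score'))  # Mean(SD) by CI status
```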
By default pandas keeps output within the screen: with many rows it truncates the middle and shows only part of the table. display.max_rows controls the maximum number of rows shown, e.g. pd.set_option('display.max_rows', 200). Next, the function that tabulates each variable:
```python
def get_var_describe_test(df, var):
    population = list(df.groupby(var).count().iloc[:, 1].sort_index())
    population_rate = list(map(lambda x: str(x) + '(' + str('{:.2f}'.format(x / sum(population) * 100)) + '%' + ')', population))
    value_index = list(df.groupby(var).count().iloc[:, 1].sort_index().index)

    current_CI_count = df.groupby([var, 'current_CI']).count().iloc[:, 1]
    CI_now_index = list(map(lambda x: True if x[1] == 1 else False, current_CI_count.index))
    not_CI_now_index = list(map(lambda x: True if x[1] == 0 else False, current_CI_count.index))
    CI_now_population = list(current_CI_count[CI_now_index])
    not_CI_now_population = list(current_CI_count[not_CI_now_index])
    for i, value in enumerate(not_CI_now_population):
        # If everyone with this value is CI-free, the CI group has no row; insert a zero.
        if value / population[i] == 1:
            CI_now_population.insert(i, 0)
        not_CI_now_population[i] = str(not_CI_now_population[i]) + '(' + str('{:.2f}'.format(not_CI_now_population[i] / population[i] * 100)) + '%' + ')'
    for i, value in enumerate(CI_now_population):
        CI_now_population[i] = str(CI_now_population[i]) + '(' + str('{:.2f}'.format(CI_now_population[i] / population[i] * 100)) + '%' + ')'

    later_CI_count = df.groupby([var, 'later_CI']).count().iloc[:, 1]
    CI_later_index = list(map(lambda x: True if x[1] == 1 else False, later_CI_count.index))
    not_CI_later_index = list(map(lambda x: True if x[1] == 0 else False, later_CI_count.index))
    CI_later_population = list(later_CI_count[CI_later_index])
    not_CI_later_population = list(later_CI_count[not_CI_later_index])
    for i, value in enumerate(not_CI_later_population):
        if value / population[i] == 1:
            CI_later_population.insert(i, 0)
        not_CI_later_population[i] = str(not_CI_later_population[i]) + '(' + str('{:.2f}'.format(not_CI_later_population[i] / population[i] * 100)) + '%' + ')'
    for i, value in enumerate(CI_later_population):
        CI_later_population[i] = str(CI_later_population[i]) + '(' + str('{:.2f}'.format(CI_later_population[i] / population[i] * 100)) + '%' + ')'

    return pd.DataFrame({'value': value_index, 'population': population_rate,
                         'CI now': CI_now_population, 'not CI now': not_CI_now_population,
                         'CI 3 years later': CI_later_population, 'not CI 3 years later': not_CI_later_population})
```
At the end of part one, hypothesis tests are also run for each variable; the built-in dict.values() returns all values of a dictionary.
```python
def get_pvalues(total_df):
    var_list = ['agegroup', 'nation_group', 'education', 'martial_status', 'economy_status', 'ADL']
    rename_dict_base = {'a1': 'gender', 'a43': 'residence', 'a51': 'co-habitation',
                        'd71': 'smoke', 'd81': 'alcohol', 'd91': 'exercise',
                        'd31': 'eat_fruits', 'd32': 'eat_vegs',
                        'd11b': 'outdoor_activity', 'd11c': 'plant_flower_bird',
                        'd11d': 'read_news_books', 'd11e': 'raise_domestic_animals',
                        'd11f': 'majong_or_cards', 'd11g': 'tv_or_radio', 'd11h': 'social_activity',
                        'g15a1': 'hypertension', 'g15b1': 'diabetes', 'g15c1': 'heart_disease',
                        'g15d1': 'stoke_CVD', 'g15e1': 'trachea_or_lung', 'g15f1': 'tuberculosis',
                        'g15g1': 'cataracts', 'g15h1': 'glaucoma', 'g15i1': 'cancer',
                        'g15j1': 'prostate', 'g15k1': 'stomach_ulcer', 'g15l1': 'Parkinson',
                        'g15m1': 'bedsore', 'g15n1': 'arthritis'}
    descrptive_vars = var_list + list(rename_dict_base.values())
    numerical_vars = ['depression_score', 'tour_times']
    var_name = []
    pvalues = []
    for var in descrptive_vars:
        # Chi-square test of independence on the later_CI x var contingency table.
        kf_data = np.array(total_df.groupby(['later_CI', var]).count().iloc[:, 0]).reshape(2, len(total_df[var].unique()))
        kf = chi2_contingency(kf_data)
        var_name.append(var)
        pvalues.append(kf[1])
    for var in numerical_vars:
        # Independent-samples t-test between the two CI groups.
        a = total_df[total_df.later_CI == 0][var]
        b = total_df[total_df.later_CI == 1][var]
        t_res = ttest_ind(a, b)
        var_name.append(var)
        pvalues.append(t_res.pvalue)
    df_out = pd.DataFrame({'Variable': var_name, 'p-values': pvalues})
    return df_out
```
For the p-value tests, descrptive_vars and numerical_vars use different methods: the former the chi-square test via chi2_contingency, the latter the independent-samples t-test via ttest_ind.
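A minimal standalone ttest_ind sketch on synthetic data (illustrative only):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=100)  # e.g. depression_score of the not-CI group
b = rng.normal(0.3, 1.0, size=100)  # e.g. depression_score of the CI group
print(ttest_ind(a, b).pvalue)
```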
End of part one. Part two: first reuse part one to merge the data into a single total df, then compute the diff between the three-years-later variables and the current variables, as the following function shows:
```python
def get_current_diff_variable_list(total_df):
    current_variables = []
    later_variables = []
    for col in total_df.columns:
        if col[-12:] == '3years_later':
            later_variables.append(col)
        if col[-12:] != '3years_later':
            current_variables.append(col)
    for var in ['current_CI_score', 'later_CI_score', 'current_CI', 'later_CI',
                'depression_score_later', 'ADL_later', 'CI_diff']:
        current_variables.remove(var)
    later_variables.extend(['ADL_later', 'depression_score_later'])
    person_basic_variables = ['gender', 'residence', 'education', 'borned_prov', 'nation_group']
    current_variables_without_personal = current_variables.copy()
    for var in person_basic_variables:
        current_variables_without_personal.remove(var)
    current_variables_without_personal.sort()
    later_variables.sort()
    print('Creating diff variables based on current variables and 3 years later variables ...')
    # After sorting, the two lists line up pairwise (x, x_3years_later).
    for x, y in zip(current_variables_without_personal, later_variables):
        total_df[x + '_diff'] = total_df[y] - total_df[x]
    print('Done!')
    diff_var = list(map(lambda x: x + '_diff', current_variables_without_personal))
    current_variables.remove('dementia')
    diff_var.remove('dementia_diff')
    return current_variables, diff_var
```
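A hedged usage sketch (note that total_df is mutated in place to gain the *_diff columns):

```python
current_variables, diff_variables = get_current_diff_variable_list(total_df)
```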
Logistic regression is then invoked: model = sm.Logit(y, x.astype(float)).fit() comes from the statsmodels library, which differs from the sklearn used in earlier tutorials and is somewhat more involved; worth a closer look for future work in this area. What is fitted here is in fact a multivariable logistic regression.
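A minimal standalone sm.Logit example on synthetic data, just to show the statsmodels API used here (illustrative, not the paper's data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(float)  # binary outcome
X = sm.add_constant(X)                                   # add an intercept column
model = sm.Logit(y, X).fit(disp=0)                       # maximum-likelihood fit
print(model.params)                                      # coefficients
print(np.exp(model.conf_int(0.05)))                      # 95% CI on the odds-ratio scale
```

Back in the notebook, the feature columns are first coerced to numeric: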
```python
total_df_for_logit = total_df.copy()
for col in total_df_for_logit.columns:
    total_df_for_logit[col] = pd.to_numeric(total_df_for_logit[col], downcast='float')
```
The full logistic-regression function:
```python
def logistic_regression(input_df, label_column, categorical_variable_list=[], CI_alpha=0.05, standardized_diff=False):
    df = input_df.copy()
    if standardized_diff:
        diff_var = list(filter(lambda x: True if x[-4:] == 'diff' else False, df.columns))
        if diff_var:
            print('Standardizing diff variables ...')
            for var in diff_var:
                # Collapse each diff to its sign: -1 decreased, 0 unchanged, 1 increased.
                df[var] = df[var].apply(lambda x: x if x <= 0 else 1)
                df[var] = df[var].apply(lambda x: x if x >= 0 else -1)
            print('Standardize diff variables done!')
    x_col = [column for column in df.columns if label_column not in column and column not in categorical_variable_list]
    y_col = [column for column in df.columns if column == label_column]
    if 'borned_prov' in x_col: x_col.remove('borned_prov')
    if 'agegroup_3years_later' in x_col: x_col.remove('agegroup_3years_later')
    if 'agegroup_diff' in x_col: x_col.remove('agegroup_diff')
    x = np.array(df[x_col])
    if categorical_variable_list:
        # One-hot encode the categorical variables, dropping the *_0 reference level.
        onehot_cat_var = pd.get_dummies(df[categorical_variable_list[0]], prefix=categorical_variable_list[0])
        drop_ref_col = list(filter(lambda x: True if x[-2:] == '_0' or x[-4:] == '_0.0' else False, onehot_cat_var.columns))
        onehot_cat_var.drop(drop_ref_col, axis=1, inplace=True)
        for variable in categorical_variable_list[1:]:
            tmp = pd.get_dummies(df[variable], prefix=variable)
            drop_ref_col = list(filter(lambda x: True if x[-2:] == '_0' or x[-4:] == '_0.0' else False, tmp.columns))
            tmp.drop(drop_ref_col, axis=1, inplace=True)
            onehot_cat_var = pd.concat([onehot_cat_var, tmp], axis=1)
        onehot_cat_var_values = np.array(onehot_cat_var)
        x = np.hstack([x, onehot_cat_var_values])
    y = np.array(df[label_column].astype(float)).reshape(-1)
    model = sm.Logit(y, x.astype(float)).fit()
    # Exponentiate coefficients and confidence bounds to get odds ratios.
    para = pd.DataFrame(np.exp(model.conf_int(CI_alpha)), columns=['95%lower', '95%upper'])
    para['coef'] = model.params
    para['OddRatio'] = np.exp(para.coef)
    x_name = x_col.copy()
    if categorical_variable_list:
        x_name.extend(onehot_cat_var.columns)
    para['Variable'] = x_name
    para['pvalues'] = model.pvalues
    # Count the samples behind each variable / dummy level.
    var_value_count = []
    for vari in x_name:
        if vari[-1] == '0':
            if vari[-4] == '-':
                var_value = float(vari[-4:])
                var_name = vari[:-5]
                print(vari)
                var_value_count.append(len(df[df[var_name] == var_value]))
            else:
                var_value = float(vari[-3:])
                var_name = vari[:-4]
                var_value_count.append(len(df[df[var_name] == var_value]))
        else:
            var_value_count.append(len(df[df[vari] == 1]))
    para['Num of samples'] = var_value_count
    para = para[['Variable', 'Num of samples', 'coef', 'OddRatio', '95%lower', '95%upper', 'pvalues']]
    # Refit on standardized features to rank variable importance by |coef|.
    std_x = np.array((df[x_col] - np.mean(df[x_col])) / np.std(df[x_col]))
    if categorical_variable_list:
        std_x = np.hstack([std_x, onehot_cat_var_values])
    model2 = sm.Logit(y, std_x.astype(float)).fit()
    importance_df = pd.DataFrame(para['Variable'])
    importance_df['coef'] = model2.params
    importance_df.sort_values(by='coef', ascending=False, key=np.abs, inplace=True)
    importance_df['importance'] = range(1, len(importance_df) + 1)
    importance_df.drop('coef', axis=1, inplace=True)
    importance_df.set_index('importance', inplace=True)
    res = {'summary': model.summary(),
           'params': para,
           'feature_importance': importance_df}
    return res
```
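The function returns a dict with the fitted summary, the OR table, and a standardized-coefficient importance ranking. A hypothetical call (the column lists and categorical variables here are assumptions, not copied from the notebook):

```python
# Hypothetical invocation; the exact feature/label selections are assumptions.
res = logistic_regression(total_df_for_logit[current_variables + ['later_CI']],
                          label_column='later_CI',
                          categorical_variable_list=['education', 'agegroup'])
print(res['params'])
print(res['feature_importance'])
```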
The 95% confidence interval computed in the logistic regression above is then used in get_OR(). The OR (odds ratio) is an important concept in epidemiology: it measures the strength of the association between an exposure and an outcome (see the reference on logistic regression and OR values). In studying "the effect of an exposure on an outcome", the "outcome" in the textbook example is "having diabetes", which plays the role of the dependent variable Y. The "odds" can be read as an exposure ratio: among subjects with diabetes, the exposure odds are the proportion of smokers divided by the proportion of non-smokers, 24/16 = 1.50; among subjects without diabetes, the same ratio is 18/22 ≈ 0.82. Dividing the two gives the OR for the smoking-diabetes association: OR = 1.50/0.82 ≈ 1.83 > 1, indicating a positive association. The OR is thus a measure of association, whereas the p-value speaks more to whether a difference exists. [Important: the OR values in this code must be read together with the logistic-regression code; the paper estimates ORs and 95% confidence intervals (CI) from the logistic model, and an OR below 1 means the factor lowers the risk of cognitive impairment.]
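To make the arithmetic concrete, a minimal sketch using the example's own numbers (illustrative only):

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 table from the example: rows = diabetes yes/no, cols = smoker yes/no
table = np.array([[24, 16],
                  [18, 22]])
odds_cases = 24 / 16       # exposure odds among diabetics: 1.50
odds_controls = 18 / 22    # exposure odds among non-diabetics: ~0.82
OR = odds_cases / odds_controls            # ~1.83, i.e. (24*22)/(16*18)
chi2, p, dof, expected = chi2_contingency(table)
print('OR = %.2f, chi-square p = %.3f' % (OR, p))
```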
```python
def get_OR(x):
    return str('%.2f' % x['OddRatio']) + '(' + str('%.2f' % x['95%lower']) + ',' + str('%.2f' % x['95%upper']) + ')'
```
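Continuing the hypothetical call above, get_OR is applied row-wise to the params table to produce the usual "OR (95% CI)" column:

```python
para = res['params']
para['OR(95%CI)'] = para.apply(get_OR, axis=1)
```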
The code also examines the associations of the diff variables:
```python
output2.loc[list(set(output2.index) - set(output1.Variable)), :].sort_values(by='Variable')
```
Select the diff variables with p-values below 0.05:
```python
significant_pos_diff = output2.loc[diff_variables_pos, :].query('pvalues<0.05')
significant_pos_diff
```
The remaining steps need further study later:
```python
need_divide_diff = list(map(lambda x: x[:-4], list(significant_pos_diff.index)))
if 'exercise_diff' in need_divide_diff:
    need_divide_diff.remove('exercise_diff')
```
Finally, the paper compares the results of several algorithms, including SVM, on accuracy and other metrics, and also analyzes the association between longitudinal behavioral change and cognitive impairment.