Step 5: Data Entry and Data Cleaning

From Akvopedia
Revision as of 22:38, 21 September 2016 by Winona (talk | contribs) (Check skip patterns)

Data Entry

Spreadsheets are a commonly used and easily understood tool that most office personnel can use without extensive, specialized training. They are not the perfect data entry platform, but their widespread use and ability to handle data easily make them an acceptable, low-cost solution.

As questionnaires are returned by enumerators, survey managers should complete the following steps:

Check interview data

As questionnaires are returned by enumerators, the survey manager should first check the interview data on the front page of each questionnaire. If the interview data are complete and legible, the questionnaire has passed the first check for completeness. The survey manager should build the full identification number on the front page of the questionnaire at this point.

If the interview data are not complete, they can sometimes be reconstructed. If the enumerator worked in only one survey area, for example, incomplete location data can be filled in easily. If the enumerator worked in multiple survey areas, but addresses for each interview are being recorded, incomplete location data can often be reconstructed. Missing enumerator identification numbers can always be entered as questionnaires are reviewed with the enumerator. If questionnaires are turned in on a daily basis (as we recommend), missing date information can be easily replaced. Missing interview numbers can usually be replaced by a process of elimination. If no interview numbers are provided for a given enumerator's daily batch of questionnaires, we can simply order the questionnaires according to the enumerator's recollection of when the interviews occurred. Ordering questionnaires based on interviewer recall is less than ideal because it compromises the analysts' ability to look for effects that arise from the order in which interviews occur within the day. This is a very detailed type of analysis, however, and the loss would typically be minimal.
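The process of elimination for a missing interview number can be sketched in a few lines of Python. This is only an illustration: it assumes (the text above does not say so) that each enumerator numbers a day's interviews consecutively from 1, and the function name is hypothetical.

```python
def fill_missing_interview_number(batch):
    """Fill a single missing interview number in one enumerator's daily batch.

    `batch` is a list of interview numbers, with None where the number
    was left blank. Assumption: numbers run consecutively from 1.
    Returns a completed copy, or None if the gap cannot be resolved
    by elimination alone.
    """
    gaps = [i for i, n in enumerate(batch) if n is None]
    if not gaps:
        return list(batch)  # nothing to fill

    expected = set(range(1, len(batch) + 1))
    unused = sorted(expected - {n for n in batch if n is not None})

    # Elimination only works when exactly one blank matches one unused number.
    if len(gaps) == 1 and len(unused) == 1:
        filled = list(batch)
        filled[gaps[0]] = unused[0]
        return filled
    return None  # ambiguous: review with the enumerator instead
```

When the function returns None, the batch has to be ordered from the enumerator's recollection, as described above.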

Scan questionnaires for completeness

In addition to reviewing interview data, survey managers should also check whether questionnaires have been completed. Questionnaires that are missing responses for a substantial portion of questions should not be included in the data set. In particular, the questions that focus on customer satisfaction, customer improvement priorities, willingness to pay for improvements, and other main areas must be complete for the questionnaire to be usable for analysis. Incomplete questionnaires can either be completed on the following day, by a return visit of the enumerator to the relevant interviewee, or discarded.
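If the returned questionnaires have already been keyed in, the same completeness scan can be approximated in code. The sketch below is a minimal Python illustration; the field names and the 20% threshold for a "substantial portion" are assumptions chosen for the example, not values given in this guide.

```python
# Hypothetical names for the questions that must always be answered.
KEY_FIELDS = ["satisfaction", "improvement_priority", "willingness_to_pay"]

# Assumed cut-off: more than 20% blank answers counts as "substantial".
MAX_MISSING_SHARE = 0.2


def is_usable(record, all_fields):
    """Return True if a questionnaire record is complete enough to analyse."""

    def blank(field):
        return record.get(field) in (None, "")

    # Any blank key question makes the questionnaire unusable.
    if any(blank(f) for f in KEY_FIELDS):
        return False

    # Otherwise, tolerate a small share of blank answers.
    missing = sum(blank(f) for f in all_fields)
    return missing / len(all_fields) <= MAX_MISSING_SHARE
```

Records that fail the check would be candidates for a return visit or for exclusion from the data set.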

Data Cleaning

Data cleaning refers to a several-step process used to ensure that survey data have been entered correctly both by the interviewer and by data entry personnel. In the sections that follow, we describe each of these simple steps in detail. We divide data cleaning into two stages: (1) data cleaning that occurs as questionnaires are returned by enumerators during the survey and (2) data cleaning that occurs after all questionnaires have been entered into a spreadsheet file.

Once the entire data file has been entered, the data cleaner (typically one of the survey analysts) should complete the following:

Check the identification number field

Proper checking of questionnaires as they are returned by enumerators should catch the overwhelming majority of problems with interview data. Data entry personnel should catch the few remaining errors as they key in questionnaires. Data cleaners should, however, check identification numbers with particular care. The identification number is the piece of information that is used to group sampling units by geographic area, a critical step for analysts investigating the geographic dimensions of customer characteristics and responses.
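One simple way to check the identification number field after data entry is to look for blank and repeated values. The Python sketch below is an assumption about what such a check might involve; this guide does not prescribe a particular method, and the function name is hypothetical.

```python
from collections import Counter


def check_ids(ids):
    """Return (rows with blank IDs, IDs that appear more than once)."""

    def is_blank(value):
        return value is None or str(value).strip() == ""

    # Row positions (0-based) where the identification number is blank.
    blank_rows = [row for row, value in enumerate(ids) if is_blank(value)]

    # Count each non-blank ID, ignoring surrounding whitespace.
    counts = Counter(str(v).strip() for v in ids if not is_blank(v))
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return blank_rows, duplicates
```

Any row or ID reported here would be checked against the paper questionnaire before the data are grouped by geographic area.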

Perform range checks for all questions (fields)

All closed-ended questions and the vast majority of open-ended questions have a limited range of acceptable responses. Closed-ended questions are quite easy to check. When respondents answer a yes-no question, for example, their responses will be coded as “1” for “yes” and “2” for “no”, so any other value is an error. The data cleaner simply tabulates responses for each question or identifies the largest and smallest response for each question. The paper questionnaire for each record with an out-of-range value is then pulled and the relevant question checked. The response is either corrected or, if the paper questionnaire shows the same out-of-range response, that respondent is excluded from analysis of that question. Alternatively, formulae that perform range checks can be included in the data entry spreadsheet.
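For data exported from the spreadsheet, the tabulation and range check for a closed-ended question might look like the following Python sketch. The function names are hypothetical, and blank answers are represented as None by assumption.

```python
from collections import Counter

YES_NO_CODES = {1, 2}  # 1 = yes, 2 = no, as in the example above


def summarize(responses):
    """Tabulate answered responses and report the smallest and largest."""
    answered = [r for r in responses if r is not None]
    if not answered:
        return Counter(), None, None
    return Counter(answered), min(answered), max(answered)


def out_of_range_rows(responses, allowed):
    """Return row positions (0-based) whose response is not an allowed code."""
    return [
        row
        for row, value in enumerate(responses)
        if value is not None and value not in allowed
    ]
```

The rows returned by `out_of_range_rows` identify which paper questionnaires to pull and check.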

Establishing what constitutes the acceptable response range for open-ended questions is more difficult and requires the use of judgment on the part of survey data cleaners and analysts. In a survey where respondents report the number of people living in their household, for example, it would generally be logical to view a response of 233 as incorrect. What about 27, however? Or 19? Making decisions related to such borderline cases is where analyst knowledge of the particular population under investigation becomes indispensable.
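Borderline cases can be separated from clear errors by using two thresholds rather than one. In the Python sketch below, the cut-offs of 15 and 50 are purely illustrative assumptions; in practice they would be set by analysts who know the population being surveyed.

```python
def classify_household_size(value, review_above=15, reject_above=50):
    """Sort a reported household size into 'ok', 'review', or 'reject'.

    The thresholds are hypothetical defaults, not recommended values.
    """
    if value < 1 or value > reject_above:
        return "reject"  # implausible: treat as an error
    if value > review_above:
        return "review"  # borderline: needs analyst judgment
    return "ok"
```

With these assumed thresholds, a response of 233 is rejected outright, while 27 and 19 are set aside for review rather than being accepted or discarded automatically.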

Check skip patterns

The use of filter, selection, or skip questions is an important tool of the questionnaire designer. Skip patterns will not, unfortunately, always be followed correctly by enumerators. For example, respondents might be instructed to skip questions related to satisfaction with a particular utility company's service if they are not a customer of that company. If a respondent states that they are not a customer but then proceeds to answer questions related to their satisfaction with the company, data cleaners and analysts should make certain that such responses are not used in analysis. These responses would likely be the respondent’s impressions of the satisfaction of customers, which is quite different from their own satisfaction with a service they consume.
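A skip-pattern check on the entered data could be sketched as follows in Python. The field names and the coding of "not a customer" as 2 are assumptions made for this example only.

```python
def enforce_skip(record, filter_field, skip_value, dependent_fields):
    """Blank out answers that should have been skipped.

    Returns a cleaned copy of the record and the list of fields
    that were answered in violation of the skip pattern.
    """
    cleaned = dict(record)
    violated = []
    if record.get(filter_field) == skip_value:
        for field in dependent_fields:
            if cleaned.get(field) is not None:
                violated.append(field)
                cleaned[field] = None  # exclude from analysis
    return cleaned, violated


# Hypothetical record: the respondent is not a customer (code 2)
# but satisfaction questions were answered anyway.
record = {"is_customer": 2, "sat_service": 4, "sat_billing": None}
cleaned, violated = enforce_skip(
    record, "is_customer", 2, ["sat_service", "sat_billing"]
)
```

Keeping the list of violated fields makes it possible to report how often enumerators missed the skip instruction, while the original record is left unchanged.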
